public inbox for gcc-patches@gcc.gnu.org
* [PATCH] Add fopt-info-oacc
@ 2016-01-18 17:27 ` Tom de Vries
  2016-01-18 18:28   ` Sandra Loosemore
  2016-01-21 21:55   ` [gomp4] Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading" (was: [PATCH] Add fopt-info-oacc) Thomas Schwinge
  0 siblings, 2 replies; 25+ messages in thread
From: Tom de Vries @ 2016-01-18 17:27 UTC (permalink / raw)
  To: gcc-patches, Thomas Schwinge, Nathan Sidwell, Allen, Randy

[-- Attachment #1: Type: text/plain, Size: 429 bytes --]

Hi,

This patch introduces an option fopt-info-oacc.

When using the option like this with a kernels region in kernels-loop.c 
that parloops does not manage to parallelize:
...
$ gcc kernels-loop.c -S -O2 -fopenacc -fopt-info-oacc-all
...

we get a message:
...
kernels-loop.c:23:9: note: kernels region executed sequentially. 
Consider mapping it to host execution, to avoid data copy penalty.
...
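
For illustration, a kernels region of the kind described above might look
like the sketch below.  This is a hypothetical example, not the actual
kernels-loop.c; the function name prefix_sum and the data clauses are
assumptions.  The loop carries a dependence from one iteration to the
next, so a parallelizer such as parloops would be expected to leave it
sequential, while the data clauses still cause the copies that the note
warns about.

```c
#include <assert.h>

/* Hypothetical example, not taken from the GCC testsuite.  Each
   iteration reads the value written by the previous one, so the loop
   cannot be run in parallel as written.  Compiled without -fopenacc,
   the pragma is ignored and the loop simply runs on the host.  */
void
prefix_sum (int n, const int *b, int *a)
{
  assert (n > 0);
#pragma acc kernels copyin (b[0:n]) copyout (a[0:n])
  {
    a[0] = b[0];
    for (int i = 1; i < n; i++)
      a[i] = a[i - 1] + b[i];
  }
}
```

Calling prefix_sum (4, b, a) with b = {1, 2, 3, 4} fills a with the
running totals of b.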

Any comments?

Thanks,
- Tom

[-- Attachment #2: 0001-Add-fopt-info-oacc.patch --]
[-- Type: text/x-patch, Size: 2863 bytes --]

Add fopt-info-oacc

---
 gcc/dumpfile.c |  1 +
 gcc/dumpfile.h |  5 +++--
 gcc/omp-low.c  | 30 +++++++++++++++++++++++++++++-
 3 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/gcc/dumpfile.c b/gcc/dumpfile.c
index 144e371..e8aa0e1 100644
--- a/gcc/dumpfile.c
+++ b/gcc/dumpfile.c
@@ -137,6 +137,7 @@ static const struct dump_option_value_info optgroup_options[] =
   {"loop", OPTGROUP_LOOP},
   {"inline", OPTGROUP_INLINE},
   {"vec", OPTGROUP_VEC},
+  {"oacc", OPTGROUP_OACC},
   {"optall", OPTGROUP_ALL},
   {NULL, 0}
 };
diff --git a/gcc/dumpfile.h b/gcc/dumpfile.h
index c168cbf..6e1c657 100644
--- a/gcc/dumpfile.h
+++ b/gcc/dumpfile.h
@@ -97,9 +97,10 @@ enum tree_dump_index
 #define OPTGROUP_LOOP        (1 << 2)   /* Loop optimization passes */
 #define OPTGROUP_INLINE      (1 << 3)   /* Inlining passes */
 #define OPTGROUP_VEC         (1 << 4)   /* Vectorization passes */
-#define OPTGROUP_OTHER       (1 << 5)   /* All other passes */
+#define OPTGROUP_OACC        (1 << 5)   /* Openacc passes */
+#define OPTGROUP_OTHER       (1 << 6)   /* All other passes */
 #define OPTGROUP_ALL	     (OPTGROUP_IPA | OPTGROUP_LOOP | OPTGROUP_INLINE \
-                              | OPTGROUP_VEC | OPTGROUP_OTHER)
+                              | OPTGROUP_VEC | OPTGROUP_OACC | OPTGROUP_OTHER)
 
 /* Define a tree dump switch.  */
 struct dump_file_info
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index a6e3fe3..d5c3484 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -20139,6 +20139,34 @@ execute_oacc_device_lower ()
 	     : fn_level < 0 ? "Function is parallel offload\n"
 	     : "Function is routine level %d\n", fn_level);
 
+#if defined ACCEL_COMPILER
+  bool is_kernels = oacc_fn_attrib_kernels_p (attrs);
+  if (is_kernels)
+    {
+      bool all_one = true;
+      tree pos = TREE_VALUE (attrs);
+      for (unsigned ix = 0; ix != GOMP_DIM_MAX; ix++)
+	{
+	  tree tree_val = TREE_VALUE (pos);
+	  unsigned HOST_WIDE_INT val = (tree_val
+					? TREE_INT_CST_LOW (tree_val)
+					: 1);
+	  if (val != 1)
+	    {
+	      all_one = false;
+	      break;
+	    }
+	  pos = TREE_CHAIN (pos);
+	}
+
+      if (all_one)
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, cfun->function_start_locus,
+			 "Kernels region executed sequentially.  Consider"
+			 " mapping it to host execution, to avoid data copy"
+			 " penalty.\n");
+    }
+#endif
+
   unsigned outer_mask = fn_level >= 0 ? GOMP_DIM_MASK (fn_level) - 1 : 0;
   unsigned used_mask = oacc_loop_partition (loops, outer_mask);
   int dims[GOMP_DIM_MAX];
@@ -20312,7 +20340,7 @@ const pass_data pass_data_oacc_device_lower =
 {
   GIMPLE_PASS, /* type */
   "oaccdevlow", /* name */
-  OPTGROUP_NONE, /* optinfo_flags */
+  OPTGROUP_OACC, /* optinfo_flags */
   TV_NONE, /* tv_id */
   PROP_cfg, /* properties_required */
   0 /* Possibly PROP_gimple_eomp.  */, /* properties_provided */
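
The check added to execute_oacc_device_lower above walks the launch
dimensions and reports the region as sequential if all of them are 1.
Stripped of the tree accessors, the idea can be sketched as follows.  The
names SKETCH_DIM_MAX and all_dims_one are hypothetical, and representing
an unspecified dimension as 0 is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed for this sketch: three partitioning levels (gang, worker,
   vector), mirroring GOMP_DIM_MAX in the patch.  */
#define SKETCH_DIM_MAX 3

/* Return true if every launch dimension is 1, that is, the region would
   run as a single thread.  A value of 0 stands for "not specified" here
   and is treated like 1, as the patch does for a missing TREE_VALUE.  */
bool
all_dims_one (const unsigned dims[SKETCH_DIM_MAX])
{
  assert (dims != NULL);
  for (unsigned ix = 0; ix != SKETCH_DIM_MAX; ix++)
    {
      unsigned val = dims[ix] ? dims[ix] : 1;
      if (val != 1)
        return false;
    }
  return true;
}
```

A region with dimensions {1, 0, 1} would be reported as sequential; one
with {1, 1, 32} would not.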


* Re: [PATCH] Add fopt-info-oacc
  2016-01-18 17:27 ` [PATCH] Add fopt-info-oacc Tom de Vries
@ 2016-01-18 18:28   ` Sandra Loosemore
  2016-01-18 20:30     ` Richard Sandiford
  2016-01-21 21:55   ` [gomp4] Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading" (was: [PATCH] Add fopt-info-oacc) Thomas Schwinge
  1 sibling, 1 reply; 25+ messages in thread
From: Sandra Loosemore @ 2016-01-18 18:28 UTC (permalink / raw)
  To: Tom de Vries, gcc-patches, Thomas Schwinge, Nathan Sidwell, Allen, Randy

On 01/18/2016 10:26 AM, Tom de Vries wrote:
> Hi,
>
> This patch introduces an option fopt-info-oacc.
>
> When using the option like this with a kernels region in kernels-loop.c
> that parloops does not manage to parallelize:
> ...
> $ gcc kernels-loop.c -S -O2 -fopenacc -fopt-info-oacc-all
> ...
>
> we get a message:
> ...
> kernels-loop.c:23:9: note: kernels region executed sequentially.
> Consider mapping it to host execution, to avoid data copy penalty.
> ...
>
> Any comments?

Needs documentation?

-Sandra


* Re: [PATCH] Add fopt-info-oacc
  2016-01-18 18:28   ` Sandra Loosemore
@ 2016-01-18 20:30     ` Richard Sandiford
  0 siblings, 0 replies; 25+ messages in thread
From: Richard Sandiford @ 2016-01-18 20:30 UTC (permalink / raw)
  To: Sandra Loosemore
  Cc: Tom de Vries, gcc-patches, Thomas Schwinge, Nathan Sidwell, Allen, Randy

Sandra Loosemore <sandra@codesourcery.com> writes:
> On 01/18/2016 10:26 AM, Tom de Vries wrote:
>> Hi,
>>
>> This patch introduces an option fopt-info-oacc.
>>
>> When using the option like this with a kernels region in kernels-loop.c
>> that parloops does not manage to parallelize:
>> ...
>> $ gcc kernels-loop.c -S -O2 -fopenacc -fopt-info-oacc-all
>> ...
>>
>> we get a message:
>> ...
>> kernels-loop.c:23:9: note: kernels region executed sequentially.
>> Consider mapping it to host execution, to avoid data copy penalty.
>> ...
>>
>> Any comments?
>
> Needs documentation?

Also, sorry for the drive-by comment, but: -fopt-info-openacc-all seems
more consistent with -fopenacc and is only three characters longer.

Thanks,
Richard


* [gomp4] Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading" (was: [PATCH] Add fopt-info-oacc)
  2016-01-18 17:27 ` [PATCH] Add fopt-info-oacc Tom de Vries
  2016-01-18 18:28   ` Sandra Loosemore
@ 2016-01-21 21:55   ` Thomas Schwinge
  2016-01-22  7:40     ` [gomp4] Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading" Thomas Schwinge
                       ` (3 more replies)
  1 sibling, 4 replies; 25+ messages in thread
From: Thomas Schwinge @ 2016-01-21 21:55 UTC (permalink / raw)
  To: Tom de Vries, gcc-patches; +Cc: Nathan Sidwell, Allen, Randy, Jakub Jelinek

Hi!

On Mon, 18 Jan 2016 18:26:49 +0100, Tom de Vries <Tom_deVries@mentor.com> wrote:
> This patch introduces an option fopt-info-oacc.
> 
> When using the option like this with a kernels region in kernels-loop.c 
> that parloops does not manage to parallelize:
> ...
> $ gcc kernels-loop.c -S -O2 -fopenacc -fopt-info-oacc-all
> ...
> 
> we get a message:
> ...
> kernels-loop.c:23:9: note: kernels region executed sequentially. 
> Consider mapping it to host execution, to avoid data copy penalty.
> ...

Yay for helping the user understand what the compiler is doing!

> Any comments?

Judging from real-world code that we've been having a look at, when the
above situation happens, we're -- in the vast majority of all cases -- in
a situation where we generally want to avoid offloading (unless
explicitly requested), "to avoid data copy penalty" as well as typically
much slower single-threaded execution on the GPU.  Obviously, that will
have to be revisited once parloops (or any other mechanism in GCC) is able
to better understand/use the parallelism in OpenACC kernels constructs.

So, building upon Tom's patch, I have implemented an "avoid offloading"
flag that is set in the presence of an un-parallelized OpenACC kernels
construct.  This is currently only enabled for OpenACC kernels constructs,
in combination with nvptx offloading, but I think the general scheme will
be useful also for other constructs as well as other (non-shared memory)
offloading targets.

Also, "avoid offloading" is just a default: if a user explicitly
requested the use of, for example, an Nvidia GPU (with an
acc_init(acc_device_nvidia) call, or by setting the
ACC_DEVICE_TYPE=nvidia environment variable), then we cannot
apply host-fallback execution, because in this case the user can
rightfully assume Nvidia GPU semantics (async clause works, and so on).
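
The intended precedence can be sketched roughly as follows.  This is only
an illustration of the policy described above, not the libgomp code; the
enum, the function name choose_device, and its parameters are all
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device kinds for this sketch; not libgomp's types.  */
enum sketch_device { SKETCH_HOST, SKETCH_NVIDIA };

/* Pick the device to run on.  EXPLICIT_REQUEST is true if the user asked
   for the GPU (think acc_init or ACC_DEVICE_TYPE); AVOID_OFFLOADING is
   the compiler's hint; GPU_AVAILABLE says whether a GPU can be used.  */
enum sketch_device
choose_device (bool explicit_request, bool avoid_offloading,
               bool gpu_available)
{
  /* An explicit request wins: the user may rely on GPU semantics.  A
     real runtime would report an error if no GPU is present; this
     sketch simply asserts.  */
  if (explicit_request)
    {
      assert (gpu_available);
      return SKETCH_NVIDIA;
    }
  /* Otherwise the hint only changes the default choice.  */
  if (gpu_available && !avoid_offloading)
    return SKETCH_NVIDIA;
  return SKETCH_HOST;
}
```

With the hint set and no explicit request, choose_device returns
SKETCH_HOST; with an explicit request it returns SKETCH_NVIDIA regardless
of the hint.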


The new warning (very similar to the one that Tom proposed) also
uncovered a bunch of OpenACC kernels test cases in libgomp that did not
have OpenACC kernels processing enabled (-ftree-parallelize-loops), but
which parloops can handle fine once that is enabled -- and also a bunch
of OpenACC kernels test cases that parloops doesn't handle, although it
looked as if they were meant to be handled.  (Maybe I'm wrong about that,
though.)  Anyway, Tom, would you please make a note to audit all uses of
-foffload-force in the libgomp testsuite?  (It is appropriate for all
test cases that parloops truly is not meant to handle, but for all
others, that flag should probably be removed and instead an XFAILed
dg-bogus directive added, so that we will notice (XPASS) once it does
handle them.)
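
As a rough sketch of what such a test case could look like: the function
name scale and the loop body are hypothetical stand-ins, and attaching the
directive to the pragma line is an assumption about where the diagnostic
gets reported.

```c
/* { dg-additional-options "-O2 -ftree-parallelize-loops=32" } */

#include <assert.h>

/* Hypothetical test body.  The dg-bogus directive says the warning
   should not appear; the xfail selector records that it currently does,
   so the test reports XPASS once the construct gets parallelized.  */
void
scale (int n, int *a)
{
  assert (n > 0);
#pragma acc kernels copy (a[0:n]) /* { dg-bogus "will be executed sequentially" "TODO" { xfail *-*-* } } */
  {
    for (int i = 0; i < n; i++)
      a[i] = 2 * a[i];
  }
}
```

Compared to setting -foffload-force, this keeps the expected-failure
state visible in the test results instead of silencing the warning.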


I've also added a new command-line option, -foffload-force, that restores
the current behavior, that is, it inhibits the "avoid offloading"
handling.  This is primarily meant for GCC (libgomp) testsuite usage, but
could occasionally also be useful for users.  Considering alternatives
(that can be applied in a more fine-grained way, case by case per OpenACC
kernels construct):

1) a new GCC-specific pragma, for example:

    #pragma GCC force offloading
    #pragma acc kernels
      [un-parallelizable stuff]

2) a new GCC-specific clause, for example in the implementation
namespace, starting with "_":

    #pragma acc kernels _force_offloading
      [un-parallelizable stuff]

..., the -foffload-force flag was the simplest solution.  (Because, if
you're going to alter the sources anyway, you might as well just remove
the one offending OpenACC kernels construct...)


Committed to gomp-4_0-branch in r232709:

commit 41a76d233e714fd7b79dc1f40823f607c38306ba
Author: tschwinge <tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Thu Jan 21 21:52:50 2016 +0000

    Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading"
    
    	gcc/
    	* common.opt: Add -foffload-force.
    	* lto-wrapper.c (merge_and_complain, append_compiler_options):
    	Handle it.
    	* doc/invoke.texi: Document it.
    	* config/nvptx/mkoffload.c (struct id_map): Add "flags" member.
    	(record_id): Parse, and set it.
    	(process): Use it.
    	* config/nvptx/nvptx.c (nvptx_attribute_table): Add "omp avoid
    	offloading".
    	(nvptx_record_offload_symbol): Use it.
    	(nvptx_goacc_validate_dims): Set it.
    	libgomp/
    	* target.c (GOMP_offload_register_ver)
    	(GOMP_offload_unregister_ver, gomp_init_device)
    	(gomp_unload_device, gomp_offload_target_available_p): Handle and
    	document "avoid offloading" ("host_table == NULL").
    	(resolve_device): Document "avoid offloading".
    	* oacc-init.c (resolve_device): Likewise.
    	* libgomp.texi (Enabling OpenACC): Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c: New
    	file.
    	* testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c:
    	Likewise.
    	* testsuite/libgomp.oacc-fortran/avoid-offloading-1.f: Likewise.
    	* testsuite/libgomp.oacc-fortran/avoid-offloading-2.f: Likewise.
    	* testsuite/libgomp.oacc-fortran/avoid-offloading-3.f: Likewise.
    	* testsuite/libgomp.oacc-c++/non-scalar-data.C: Set
    	"-foffload-force".
    	* testsuite/libgomp.oacc-c-c++-common/abort-3.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/abort-4.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/default-1.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/deviceptr-1.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-1.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-empty.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-1.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-2.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-3.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-4.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-5.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-1.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-2.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-3.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-4.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-5.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-6.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-vector-1.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-vector-2.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-1.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-2.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-3.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-4.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-5.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-6.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-7.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-reduction-1.c:
    	Likewise.
    	* testsuite/libgomp.oacc-fortran/default-1.f90: Likewise.
    	* testsuite/libgomp.oacc-fortran/if-1.f90: Likewise.
    	* testsuite/libgomp.oacc-fortran/kernels-acc-loop-reduction-2.f90:
    	Likewise.
    	* testsuite/libgomp.oacc-fortran/kernels-acc-loop-reduction.f90:
    	Likewise.
    	* testsuite/libgomp.oacc-fortran/kernels-collapse-3.f90: Likewise.
    	* testsuite/libgomp.oacc-fortran/kernels-collapse-4.f90: Likewise.
    	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-gang-2.f90:
    	Likewise.
    	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-gang-3.f90:
    	Likewise.
    	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-gang-6.f90:
    	Likewise.
    	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-vector-1.f90:
    	Likewise.
    	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-vector-2.f90:
    	Likewise.
    	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-1.f90:
    	Likewise.
    	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-2.f90:
    	Likewise.
    	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-3.f90:
    	Likewise.
    	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-4.f90:
    	Likewise.
    	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-5.f90:
    	Likewise.
    	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-6.f90:
    	Likewise.
    	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-7.f90:
    	Likewise.
    	* testsuite/libgomp.oacc-fortran/kernels-reduction-1.f90:
    	Likewise.
    
    	libgomp/
    	* testsuite/libgomp.oacc-c-c++-common/asyncwait-1.c: Set
    	"-ftree-parallelize-loops=32".
    	* testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/default-1.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/deviceptr-1.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/host_data-1.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/if-1.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-1.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/nested-2.c: Likewise.
    	* testsuite/libgomp.oacc-fortran/asyncwait-1.f90: Likewise.
    	* testsuite/libgomp.oacc-fortran/asyncwait-2.f90: Likewise.
    	* testsuite/libgomp.oacc-fortran/asyncwait-3.f90: Likewise.
    	* testsuite/libgomp.oacc-fortran/combined-directives-1.f90:
    	Likewise.
    	* testsuite/libgomp.oacc-fortran/default-1.f90: Likewise.
    	* testsuite/libgomp.oacc-fortran/deviceptr-1.f90: Likewise.
    	* testsuite/libgomp.oacc-fortran/if-1.f90: Likewise.
    	* testsuite/libgomp.oacc-fortran/kernels-acc-loop-reduction-2.f90:
    	Likewise.
    	* testsuite/libgomp.oacc-fortran/kernels-acc-loop-reduction.f90:
    	Likewise.
    	* testsuite/libgomp.oacc-fortran/kernels-map-1.f90: Likewise.
    	* testsuite/libgomp.oacc-fortran/non-scalar-data.f90: Likewise.
    
    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/branches/gomp-4_0-branch@232709 138bc75d-0d04-0410-961f-82ee72b054a4
---
 gcc/ChangeLog.gomp                                 |  14 ++
 gcc/common.opt                                     |   4 +
 gcc/config/nvptx/mkoffload.c                       |  73 +++++++++-
 gcc/config/nvptx/nvptx.c                           |  42 +++++-
 gcc/doc/invoke.texi                                |  11 +-
 gcc/lto-wrapper.c                                  |   2 +
 libgomp/ChangeLog.gomp                             | 150 +++++++++++++++++++++
 libgomp/libgomp.texi                               |   8 ++
 libgomp/oacc-init.c                                |   8 +-
 libgomp/target.c                                   |  86 ++++++++----
 .../testsuite/libgomp.oacc-c++/non-scalar-data.C   |   3 +-
 .../testsuite/libgomp.oacc-c-c++-common/abort-3.c  |   3 +-
 .../testsuite/libgomp.oacc-c-c++-common/abort-4.c  |   3 +-
 .../libgomp.oacc-c-c++-common/asyncwait-1.c        |   1 +
 .../libgomp.oacc-c-c++-common/avoid-offloading-1.c |  25 ++++
 .../libgomp.oacc-c-c++-common/avoid-offloading-2.c |  38 ++++++
 .../libgomp.oacc-c-c++-common/avoid-offloading-3.c |  29 ++++
 .../combined-directives-1.c                        |   2 +-
 .../libgomp.oacc-c-c++-common/default-1.c          |   4 +-
 .../libgomp.oacc-c-c++-common/deviceptr-1.c        |   4 +-
 .../libgomp.oacc-c-c++-common/host_data-1.c        |   1 +
 libgomp/testsuite/libgomp.oacc-c-c++-common/if-1.c |   2 +-
 .../libgomp.oacc-c-c++-common/kernels-1.c          |   4 +-
 .../kernels-alias-ipa-pta-2.c                      |   5 +-
 .../kernels-alias-ipa-pta-3.c                      |   5 +-
 .../kernels-alias-ipa-pta.c                        |   5 +-
 .../libgomp.oacc-c-c++-common/kernels-empty.c      |   3 +
 .../kernels-loop-and-seq-2.c                       |   3 +-
 .../kernels-loop-and-seq-5.c                       |   3 +-
 .../kernels-loop-and-seq-6.c                       |   3 +-
 .../kernels-loop-and-seq.c                         |   3 +-
 .../kernels-loop-collapse.c                        |   3 +-
 .../kernels-private-vars-local-worker-1.c          |   3 +-
 .../kernels-private-vars-local-worker-2.c          |   3 +-
 .../kernels-private-vars-local-worker-3.c          |   3 +-
 .../kernels-private-vars-local-worker-4.c          |   3 +-
 .../kernels-private-vars-local-worker-5.c          |   3 +-
 .../kernels-private-vars-loop-gang-1.c             |   3 +-
 .../kernels-private-vars-loop-gang-2.c             |   3 +-
 .../kernels-private-vars-loop-gang-3.c             |   3 +-
 .../kernels-private-vars-loop-gang-4.c             |   3 +-
 .../kernels-private-vars-loop-gang-5.c             |   3 +-
 .../kernels-private-vars-loop-gang-6.c             |   4 +
 .../kernels-private-vars-loop-vector-1.c           |   3 +-
 .../kernels-private-vars-loop-vector-2.c           |   3 +-
 .../kernels-private-vars-loop-worker-1.c           |   3 +-
 .../kernels-private-vars-loop-worker-2.c           |   3 +-
 .../kernels-private-vars-loop-worker-3.c           |   3 +-
 .../kernels-private-vars-loop-worker-4.c           |   3 +-
 .../kernels-private-vars-loop-worker-5.c           |   3 +-
 .../kernels-private-vars-loop-worker-6.c           |   3 +-
 .../kernels-private-vars-loop-worker-7.c           |   3 +-
 .../kernels-reduction-1.c                          |   3 +-
 .../testsuite/libgomp.oacc-c-c++-common/nested-2.c |   2 +-
 .../testsuite/libgomp.oacc-fortran/asyncwait-1.f90 |   1 +
 .../testsuite/libgomp.oacc-fortran/asyncwait-2.f90 |   1 +
 .../testsuite/libgomp.oacc-fortran/asyncwait-3.f90 |   1 +
 .../libgomp.oacc-fortran/avoid-offloading-1.f      |  29 ++++
 .../libgomp.oacc-fortran/avoid-offloading-2.f      |  40 ++++++
 .../libgomp.oacc-fortran/avoid-offloading-3.f      |  30 +++++
 .../libgomp.oacc-fortran/combined-directives-1.f90 |   1 +
 .../testsuite/libgomp.oacc-fortran/default-1.f90   |   3 +
 .../testsuite/libgomp.oacc-fortran/deviceptr-1.f90 |   5 +-
 libgomp/testsuite/libgomp.oacc-fortran/if-1.f90    |   5 +-
 .../kernels-acc-loop-reduction-2.f90               |   5 +
 .../kernels-acc-loop-reduction.f90                 |   5 +
 .../libgomp.oacc-fortran/kernels-collapse-3.f90    |   2 +
 .../libgomp.oacc-fortran/kernels-collapse-4.f90    |   2 +
 .../libgomp.oacc-fortran/kernels-independent.f90   |   2 +-
 .../libgomp.oacc-fortran/kernels-map-1.f90         |   3 +
 .../kernels-private-vars-loop-gang-2.f90           |   2 +
 .../kernels-private-vars-loop-gang-3.f90           |   2 +
 .../kernels-private-vars-loop-gang-6.f90           |   2 +
 .../kernels-private-vars-loop-vector-1.f90         |   2 +
 .../kernels-private-vars-loop-vector-2.f90         |   2 +
 .../kernels-private-vars-loop-worker-1.f90         |   2 +
 .../kernels-private-vars-loop-worker-2.f90         |   2 +
 .../kernels-private-vars-loop-worker-3.f90         |   2 +
 .../kernels-private-vars-loop-worker-4.f90         |   2 +
 .../kernels-private-vars-loop-worker-5.f90         |   2 +
 .../kernels-private-vars-loop-worker-6.f90         |   2 +
 .../kernels-private-vars-loop-worker-7.f90         |   2 +
 .../libgomp.oacc-fortran/kernels-reduction-1.f90   |   2 +
 .../libgomp.oacc-fortran/non-scalar-data.f90       |   1 +
 84 files changed, 700 insertions(+), 78 deletions(-)

diff --git gcc/ChangeLog.gomp gcc/ChangeLog.gomp
index cdd279b..f991b91 100644
--- gcc/ChangeLog.gomp
+++ gcc/ChangeLog.gomp
@@ -1,3 +1,17 @@
+2016-01-21  Thomas Schwinge  <thomas@codesourcery.com>
+
+	* common.opt: Add -foffload-force.
+	* lto-wrapper.c (merge_and_complain, append_compiler_options):
+	Handle it.
+	* doc/invoke.texi: Document it.
+	* config/nvptx/mkoffload.c (struct id_map): Add "flags" member.
+	(record_id): Parse, and set it.
+	(process): Use it.
+	* config/nvptx/nvptx.c (nvptx_attribute_table): Add "omp avoid
+	offloading".
+	(nvptx_record_offload_symbol): Use it.
+	(nvptx_goacc_validate_dims): Set it.
+
 2016-01-20  Cesar Philippidis  <cesar@codesourcery.com>
 
 	* gimplify.c (gimplify_scan_omp_clauses):  Consider OACC_{DATA,
diff --git gcc/common.opt gcc/common.opt
index 793a062..c905f71 100644
--- gcc/common.opt
+++ gcc/common.opt
@@ -1786,6 +1786,10 @@ Enum(offload_alias) String(pointer) Value(OFFLOAD_ALIAS_POINTER)
 EnumValue
 Enum(offload_alias) String(none) Value(OFFLOAD_ALIAS_NONE)
 
+foffload-force
+Common Var(flag_offload_force)
+Force offloading if the compiler wanted to avoid it.
+
 fomit-frame-pointer
 Common Report Var(flag_omit_frame_pointer) Optimization
 When possible do not generate stack frames.
diff --git gcc/config/nvptx/mkoffload.c gcc/config/nvptx/mkoffload.c
index cce562d..de6a8ad 100644
--- gcc/config/nvptx/mkoffload.c
+++ gcc/config/nvptx/mkoffload.c
@@ -41,9 +41,19 @@ const char tool_name[] = "nvptx mkoffload";
 
 #define COMMENT_PREFIX "#"
 
+enum id_map_flag
+  {
+    /* All clear.  */
+    ID_MAP_FLAG_NONE = 0,
+    /* Avoid offloading.  For example, because there is no sufficient
+       parallelism.  */
+    ID_MAP_FLAG_AVOID_OFFLOADING = 1
+  };
+
 struct id_map
 {
   id_map *next;
+  int flags;
   char *ptx_name;
 };
 
@@ -107,6 +117,38 @@ record_id (const char *p1, id_map ***where)
     fatal_error (input_location, "malformed ptx file");
 
   id_map *v = XNEW (id_map);
+
+  /* Do we have any flags?  */
+  v->flags = ID_MAP_FLAG_NONE;
+  if (p1[0] == '(')
+    {
+      /* Current flag.  */
+      const char *cur = p1 + 1;
+
+      /* Seek to the beginning of ") ".  */
+      p1 = strchr (cur, ')');
+      if (!p1 || p1 > end || p1[1] != ' ')
+	fatal_error (input_location, "malformed ptx file: "
+		     "expected \") \" at \"%s\"", cur);
+
+      while (cur < p1)
+	{
+	  const char *next = strchr (cur, ',');
+	  if (!next || next > p1)
+	    next = p1;
+
+	  if (strncmp (cur, "avoid offloading", next - cur - 1) == 0)
+	    v->flags |= ID_MAP_FLAG_AVOID_OFFLOADING;
+	  else
+	    fatal_error (input_location, "malformed ptx file: "
+			 "unknown flag at \"%s\"", cur);
+
+	  cur = next;
+	}
+
+      /* Skip past ") ".  */
+      p1 += 2;
+    }
   size_t len = end - p1;
   v->ptx_name = XNEWVEC (char, len + 1);
   memcpy (v->ptx_name, p1, len);
@@ -296,12 +338,17 @@ process (FILE *in, FILE *out)
   fprintf (out, "\n};\n\n");
 
   /* Dump out function idents.  */
+  bool avoid_offloading_p = false;
   fprintf (out, "static const struct nvptx_fn {\n"
 	   "  const char *name;\n"
 	   "  unsigned short dim[%d];\n"
 	   "} func_mappings[] = {\n", GOMP_DIM_MAX);
   for (comma = "", id = func_ids; id; comma = ",", id = id->next)
-    fprintf (out, "%s\n\t{%s}", comma, id->ptx_name);
+    {
+      if (id->flags & ID_MAP_FLAG_AVOID_OFFLOADING)
+	avoid_offloading_p = true;
+      fprintf (out, "%s\n\t{%s}", comma, id->ptx_name);
+    }
   fprintf (out, "\n};\n\n");
 
   fprintf (out,
@@ -318,7 +365,11 @@ process (FILE *in, FILE *out)
 	   "  sizeof (var_mappings) / sizeof (var_mappings[0]),\n"
 	   "  func_mappings,"
 	   "  sizeof (func_mappings) / sizeof (func_mappings[0])\n"
-	   "};\n\n");
+	   "};\n");
+  if (avoid_offloading_p)
+    /* Need a unique handle for target_data.  */
+    fprintf (out, "static int target_data_avoid_offloading;\n");
+  fprintf (out, "\n");
 
   fprintf (out, "#ifdef __cplusplus\n"
 	   "extern \"C\" {\n"
@@ -338,18 +389,28 @@ process (FILE *in, FILE *out)
   fprintf (out, "static __attribute__((constructor)) void init (void)\n"
 	   "{\n"
 	   "  GOMP_offload_register_ver (%#x, __OFFLOAD_TABLE__,"
-	   "%d/*NVIDIA_PTX*/, &target_data);\n"
-	   "};\n",
+	   "%d/*NVIDIA_PTX*/, &target_data);\n",
 	   GOMP_VERSION_PACK (GOMP_VERSION, GOMP_VERSION_NVIDIA_PTX),
 	   GOMP_DEVICE_NVIDIA_PTX);
+  if (avoid_offloading_p)
+    fprintf (out, "  GOMP_offload_register_ver (%#x, (void *) 0,"
+	     "%d/*NVIDIA_PTX*/, &target_data_avoid_offloading);\n",
+	     GOMP_VERSION_PACK (GOMP_VERSION, GOMP_VERSION_NVIDIA_PTX),
+	     GOMP_DEVICE_NVIDIA_PTX);
+  fprintf (out, "};\n");
 
   fprintf (out, "static __attribute__((destructor)) void fini (void)\n"
 	   "{\n"
 	   "  GOMP_offload_unregister_ver (%#x, __OFFLOAD_TABLE__,"
-	   "%d/*NVIDIA_PTX*/, &target_data);\n"
-	   "};\n",
+	   "%d/*NVIDIA_PTX*/, &target_data);\n",
 	   GOMP_VERSION_PACK (GOMP_VERSION, GOMP_VERSION_NVIDIA_PTX),
 	   GOMP_DEVICE_NVIDIA_PTX);
+  if (avoid_offloading_p)
+    fprintf (out, "  GOMP_offload_unregister_ver (%#x, (void *) 0,"
+	     "%d/*NVIDIA_PTX*/, &target_data_avoid_offloading);\n",
+	     GOMP_VERSION_PACK (GOMP_VERSION, GOMP_VERSION_NVIDIA_PTX),
+	     GOMP_DEVICE_NVIDIA_PTX);
+  fprintf (out, "};\n");
 }
 
 static void
diff --git gcc/config/nvptx/nvptx.c gcc/config/nvptx/nvptx.c
index dfbdcfb..3faacd5 100644
--- gcc/config/nvptx/nvptx.c
+++ gcc/config/nvptx/nvptx.c
@@ -3811,6 +3811,9 @@ static const struct attribute_spec nvptx_attribute_table[] =
   /* { name, min_len, max_len, decl_req, type_req, fn_type_req, handler,
        affects_type_identity } */
   { "kernel", 0, 0, true, false,  false, nvptx_handle_kernel_attribute, false },
+  /* Avoid offloading.  For example, because there is no sufficient
+     parallelism.  */
+  { "omp avoid offloading", 0, 0, true, false, false, NULL, false },
   { NULL, 0, 0, false, false, false, NULL, false }
 };
 \f
@@ -3875,7 +3878,10 @@ nvptx_record_offload_symbol (tree decl)
 	tree dims = TREE_VALUE (attr);
 	unsigned ix;
 
-	fprintf (asm_out_file, "//:FUNC_MAP \"%s\"",
+	fprintf (asm_out_file, "//:FUNC_MAP %s\"%s\"",
+		 (lookup_attribute ("omp avoid offloading",
+				    DECL_ATTRIBUTES (decl))
+		  ? "(avoid offloading) " : ""),
 		 IDENTIFIER_POINTER (DECL_ASSEMBLER_NAME (decl)));
 
 	for (ix = 0; ix != GOMP_DIM_MAX; ix++, dims = TREE_CHAIN (dims))
@@ -4135,6 +4141,40 @@ nvptx_expand_builtin (tree exp, rtx target, rtx ARG_UNUSED (subtarget),
 static bool
 nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
 {
+  /* Detect if a function is unsuitable for offloading.  */
+  if (!flag_offload_force && decl)
+    {
+      tree oacc_function_attr = get_oacc_fn_attrib (decl);
+      if (oacc_function_attr
+	  && oacc_fn_attrib_kernels_p (oacc_function_attr))
+	{
+	  bool avoid_offloading_p = true;
+	  for (unsigned ix = 0; ix != GOMP_DIM_MAX; ix++)
+	    {
+	      if (dims[ix] > 1)
+		{
+		  avoid_offloading_p = false;
+		  break;
+		}
+	    }
+	  if (avoid_offloading_p)
+	    {
+	      /* OpenACC kernels constructs will never be parallelized for
+		 optimization levels smaller than -O2; avoid the diagnostic in
+		 this case.  */
+	      if (optimize >= 2)
+		warning_at (DECL_SOURCE_LOCATION (decl), 0,
+			    "OpenACC kernels construct will be executed "
+			    "sequentially; will by default avoid offloading "
+			    "to prevent data copy penalty");
+	      DECL_ATTRIBUTES (decl)
+		= tree_cons (get_identifier ("omp avoid offloading"),
+			     NULL_TREE, DECL_ATTRIBUTES (decl));
+
+	    }
+	}
+    }
+
   bool changed = false;
 
   /* The vector size must be 32, unless this is a SEQ routine.  */
diff --git gcc/doc/invoke.texi gcc/doc/invoke.texi
index c608a36..c9c79fc 100644
--- gcc/doc/invoke.texi
+++ gcc/doc/invoke.texi
@@ -1153,7 +1153,7 @@ See S/390 and zSeries Options.
 -finstrument-functions-exclude-function-list=@var{sym},@var{sym},@dots{} @gol
 -finstrument-functions-exclude-file-list=@var{file},@var{file},@dots{} @gol
 -fno-common  -fno-ident @gol
--foffload-alias=@r{[}none@r{|}pointer@r{|}all@r{]} @gol
+-foffload-alias=@r{[}none@r{|}pointer@r{|}all@r{]}  -foffload-force @gol
 -fpcc-struct-return  -fpic  -fPIC -fpie -fPIE -fno-plt @gol
 -fno-jump-tables @gol
 -frecord-gcc-switches @gol
@@ -24230,6 +24230,15 @@ objects references in an offload region do not alias.  The option
 aliasing in offload regions.  The default value is
 @option{-foffload-alias=none}.
 
+@item -foffload-force
+@opindex foffload-force
+The option @option{-foffload-force} forces offloading even if the
+compiler would otherwise avoid it.  For example, when certain
+offloading constructs do not expose sufficient parallelism, the
+compiler may conclude that offloading incurs too much overhead (for
+data transfers, for example) and, unless overridden with this flag,
+suggests to the runtime (libgomp) that it avoid offloading.
+
 @item -fexceptions
 @opindex fexceptions
 Enable exception handling.  Generates extra code needed to propagate
diff --git gcc/lto-wrapper.c gcc/lto-wrapper.c
index 91bb1e8..5e03544 100644
--- gcc/lto-wrapper.c
+++ gcc/lto-wrapper.c
@@ -275,6 +275,7 @@ merge_and_complain (struct cl_decoded_option **decoded_options,
 	case OPT_fsigned_zeros:
 	case OPT_ftrapping_math:
 	case OPT_fwrapv:
+	case OPT_foffload_force:
 	case OPT_fopenmp:
 	case OPT_fopenacc:
 	case OPT_fcheck_pointer_bounds:
@@ -516,6 +517,7 @@ append_compiler_options (obstack *argv_obstack, struct cl_decoded_option *opts,
 	case OPT_fsigned_zeros:
 	case OPT_ftrapping_math:
 	case OPT_fwrapv:
+	case OPT_foffload_force:
 	case OPT_fopenmp:
 	case OPT_fopenacc:
 	case OPT_fopenacc_dim_:
diff --git libgomp/ChangeLog.gomp libgomp/ChangeLog.gomp
index 2003a8a..b089e27 100644
--- libgomp/ChangeLog.gomp
+++ libgomp/ChangeLog.gomp
@@ -1,3 +1,153 @@
+2016-01-21  Thomas Schwinge  <thomas@codesourcery.com>
+
+	* target.c (GOMP_offload_register_ver)
+	(GOMP_offload_unregister_ver, gomp_init_device)
+	(gomp_unload_device, gomp_offload_target_available_p): Handle and
+	document "avoid offloading" ("host_table == NULL").
+	(resolve_device): Document "avoid offloading".
+	* oacc-init.c (resolve_device): Likewise.
+	* libgomp.texi (Enabling OpenACC): Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c: New
+	file.
+	* testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c:
+	Likewise.
+	* testsuite/libgomp.oacc-fortran/avoid-offloading-1.f: Likewise.
+	* testsuite/libgomp.oacc-fortran/avoid-offloading-2.f: Likewise.
+	* testsuite/libgomp.oacc-fortran/avoid-offloading-3.f: Likewise.
+	* testsuite/libgomp.oacc-c++/non-scalar-data.C: Set
+	"-foffload-force".
+	* testsuite/libgomp.oacc-c-c++-common/abort-3.c: Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/abort-4.c: Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/default-1.c: Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/deviceptr-1.c: Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-1.c: Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-empty.c: Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-1.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-2.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-3.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-4.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-5.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-1.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-2.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-3.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-4.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-5.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-6.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-vector-1.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-vector-2.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-1.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-2.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-3.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-4.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-5.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-6.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-7.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-reduction-1.c:
+	Likewise.
+	* testsuite/libgomp.oacc-fortran/default-1.f90: Likewise.
+	* testsuite/libgomp.oacc-fortran/if-1.f90: Likewise.
+	* testsuite/libgomp.oacc-fortran/kernels-acc-loop-reduction-2.f90:
+	Likewise.
+	* testsuite/libgomp.oacc-fortran/kernels-acc-loop-reduction.f90:
+	Likewise.
+	* testsuite/libgomp.oacc-fortran/kernels-collapse-3.f90: Likewise.
+	* testsuite/libgomp.oacc-fortran/kernels-collapse-4.f90: Likewise.
+	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-gang-2.f90:
+	Likewise.
+	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-gang-3.f90:
+	Likewise.
+	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-gang-6.f90:
+	Likewise.
+	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-vector-1.f90:
+	Likewise.
+	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-vector-2.f90:
+	Likewise.
+	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-1.f90:
+	Likewise.
+	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-2.f90:
+	Likewise.
+	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-3.f90:
+	Likewise.
+	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-4.f90:
+	Likewise.
+	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-5.f90:
+	Likewise.
+	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-6.f90:
+	Likewise.
+	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-7.f90:
+	Likewise.
+	* testsuite/libgomp.oacc-fortran/kernels-reduction-1.f90:
+	Likewise.
+
+	* testsuite/libgomp.oacc-c-c++-common/asyncwait-1.c: Set
+	"-ftree-parallelize-loops=32".
+	* testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/default-1.c: Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/deviceptr-1.c: Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/host_data-1.c: Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/if-1.c: Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-1.c: Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/nested-2.c: Likewise.
+	* testsuite/libgomp.oacc-fortran/asyncwait-1.f90: Likewise.
+	* testsuite/libgomp.oacc-fortran/asyncwait-2.f90: Likewise.
+	* testsuite/libgomp.oacc-fortran/asyncwait-3.f90: Likewise.
+	* testsuite/libgomp.oacc-fortran/combined-directives-1.f90:
+	Likewise.
+	* testsuite/libgomp.oacc-fortran/default-1.f90: Likewise.
+	* testsuite/libgomp.oacc-fortran/deviceptr-1.f90: Likewise.
+	* testsuite/libgomp.oacc-fortran/if-1.f90: Likewise.
+	* testsuite/libgomp.oacc-fortran/kernels-acc-loop-reduction-2.f90:
+	Likewise.
+	* testsuite/libgomp.oacc-fortran/kernels-acc-loop-reduction.f90:
+	Likewise.
+	* testsuite/libgomp.oacc-fortran/kernels-map-1.f90: Likewise.
+	* testsuite/libgomp.oacc-fortran/non-scalar-data.f90: Likewise.
+
 2016-01-20  Cesar Philippidis  <cesar@codesourcery.com>
 
 	* testsuite/libgomp.oacc-c++/non-scalar-data.C: New test.
diff --git libgomp/libgomp.texi libgomp/libgomp.texi
index 8870084..2841b2e 100644
--- libgomp/libgomp.texi
+++ libgomp/libgomp.texi
@@ -1818,6 +1818,14 @@ flag @option{-fopenacc} must be specified.  This enables the OpenACC directive
 arranges for automatic linking of the OpenACC runtime library 
 (@ref{OpenACC Runtime Library Routines}).
 
+Offloading is enabled by default.  In some cases, the compiler may
+conclude that offloading incurs too much overhead, and suggest that
+the runtime avoid it.  To override that decision, use the option
+@option{-foffload-force} to force offloading in such cases.
+Alternatively, the decision is also overridden if a specific device
+type is requested, for example in a call to @code{acc_init} or by
+setting the @env{ACC_DEVICE_TYPE} environment variable.
+
 A complete description of all OpenACC directives accepted may be found in 
 the @uref{http://www.openacc.org/, OpenACC} Application Programming
 Interface manual, version 2.0.
diff --git libgomp/oacc-init.c libgomp/oacc-init.c
index a90732d..b3d13a8 100644
--- libgomp/oacc-init.c
+++ libgomp/oacc-init.c
@@ -123,8 +123,9 @@ resolve_device (acc_device_t d, bool fail_is_error)
 	if (goacc_device_type)
 	  {
 	    /* Lookup the device that has been explicitly named, so do not pay
-	       attention to gomp_offload_target_available_p.  (That is, hard
-	       error if not actually available.)  */
+	       attention to gomp_offload_target_available_p.  (That is,
+	       enforced usage even with an "avoid offloading" flag set, and
+	       hard error if not actually available.)  */
 	    while (++d != _ACC_device_hwm)
 	      if (dispatchers[d]
 		  && !strcasecmp (goacc_device_type,
@@ -154,7 +155,8 @@ resolve_device (acc_device_t d, bool fail_is_error)
 	    && dispatchers[d]->get_num_devices_func () > 0
 	    /* No device has been explicitly named, so pay attention to
 	       gomp_offload_target_available_p, to not decide on an offload
-	       target that we don't have offload data available for.  */
+	       target that we don't have offload data available for, or have an
+	       "avoid offloading" flag set for.  */
 	    && gomp_offload_target_available_p (dispatchers[d]->type))
 	  goto found;
       /* No non-host device found.  */
diff --git libgomp/target.c libgomp/target.c
index 7adc4d0..c60e52a 100644
--- libgomp/target.c
+++ libgomp/target.c
@@ -130,8 +130,9 @@ resolve_device (int device)
     }
   gomp_mutex_unlock (&devices[device_id].lock);
 
-  /* If the device-var ICV does not actually have offload data available, don't
-     try use it (which will fail), and use host fallback instead.  */
+  /* Use host fallback instead of the device-var ICV if the latter doesn't
+     actually have offload data available (offloading will fail), or has an
+     "avoid offloading" flag set.  */
   if (device == GOMP_DEVICE_ICV
       && !gomp_offload_target_available_p (devices[device_id].type))
     return NULL;
@@ -1139,12 +1140,19 @@ gomp_unload_image_from_device (struct gomp_device_descr *devicep,
 
 /* This function should be called from every offload image while loading.
    It gets the descriptor of the host func and var tables HOST_TABLE, TYPE of
-   the target, and TARGET_DATA needed by target plugin.  */
+   the target, and TARGET_DATA needed by target plugin.
+
+   If HOST_TABLE is NULL, this image (TARGET_DATA) is stored as an "avoid
+   offloading" flag, and the TARGET_TYPE will not be considered by default
+   until this image gets unregistered.  */
 
 void
 GOMP_offload_register_ver (unsigned version, const void *host_table,
 			   int target_type, const void *target_data)
 {
+  gomp_debug (0, "%s (%u, %p, %d, %p)\n", __FUNCTION__,
+	      version, host_table, target_type, target_data);
+
   int i;
 
   if (GOMP_VERSION_LIB (version) > GOMP_VERSION)
@@ -1153,16 +1161,19 @@ GOMP_offload_register_ver (unsigned version, const void *host_table,
   
   gomp_mutex_lock (&register_lock);
 
-  /* Load image to all initialized devices.  */
-  for (i = 0; i < num_devices; i++)
+  if (host_table != NULL)
     {
-      struct gomp_device_descr *devicep = &devices[i];
-      gomp_mutex_lock (&devicep->lock);
-      if (devicep->type == target_type
-	  && devicep->state == GOMP_DEVICE_INITIALIZED)
-	gomp_load_image_to_device (devicep, version,
-				   host_table, target_data, true);
-      gomp_mutex_unlock (&devicep->lock);
+      /* Load image to all initialized devices.  */
+      for (i = 0; i < num_devices; i++)
+	{
+	  struct gomp_device_descr *devicep = &devices[i];
+	  gomp_mutex_lock (&devicep->lock);
+	  if (devicep->type == target_type
+	      && devicep->state == GOMP_DEVICE_INITIALIZED)
+	    gomp_load_image_to_device (devicep, version,
+				       host_table, target_data, true);
+	  gomp_mutex_unlock (&devicep->lock);
+	}
     }
 
   /* Insert image to array of pending images.  */
@@ -1188,26 +1199,36 @@ GOMP_offload_register (const void *host_table, int target_type,
 
 /* This function should be called from every offload image while unloading.
    It gets the descriptor of the host func and var tables HOST_TABLE, TYPE of
-   the target, and TARGET_DATA needed by target plugin.  */
+   the target, and TARGET_DATA needed by target plugin.
+
+   If HOST_TABLE is NULL, the "avoid offloading" flag gets cleared for this
+   image (TARGET_DATA), and this TARGET_TYPE may again be considered by
+   default.  */
 
 void
 GOMP_offload_unregister_ver (unsigned version, const void *host_table,
 			     int target_type, const void *target_data)
 {
+  gomp_debug (0, "%s (%u, %p, %d, %p)\n", __FUNCTION__,
+	      version, host_table, target_type, target_data);
+
   int i;
 
   gomp_mutex_lock (&register_lock);
 
-  /* Unload image from all initialized devices.  */
-  for (i = 0; i < num_devices; i++)
+  if (host_table != NULL)
     {
-      struct gomp_device_descr *devicep = &devices[i];
-      gomp_mutex_lock (&devicep->lock);
-      if (devicep->type == target_type
-	  && devicep->state == GOMP_DEVICE_INITIALIZED)
-	gomp_unload_image_from_device (devicep, version,
-				       host_table, target_data);
-      gomp_mutex_unlock (&devicep->lock);
+      /* Unload image from all initialized devices.  */
+      for (i = 0; i < num_devices; i++)
+	{
+	  struct gomp_device_descr *devicep = &devices[i];
+	  gomp_mutex_lock (&devicep->lock);
+	  if (devicep->type == target_type
+	      && devicep->state == GOMP_DEVICE_INITIALIZED)
+	    gomp_unload_image_from_device (devicep, version,
+					   host_table, target_data);
+	  gomp_mutex_unlock (&devicep->lock);
+	}
     }
 
   /* Remove image from array of pending images.  */
@@ -1241,7 +1262,8 @@ gomp_init_device (struct gomp_device_descr *devicep)
   for (i = 0; i < num_offload_images; i++)
     {
       struct offload_image_descr *image = &offload_images[i];
-      if (image->type == devicep->type)
+      if (image->type == devicep->type
+	  && image->host_table != NULL)
 	gomp_load_image_to_device (devicep, image->version,
 				   image->host_table, image->target_data,
 				   false);
@@ -1261,7 +1283,8 @@ gomp_unload_device (struct gomp_device_descr *devicep)
       for (i = 0; i < num_offload_images; i++)
 	{
 	  struct offload_image_descr *image = &offload_images[i];
-	  if (image->type == devicep->type)
+	  if (image->type == devicep->type
+	      && image->host_table != NULL)
 	    gomp_unload_image_from_device (devicep, image->version,
 					   image->host_table,
 					   image->target_data);
@@ -1272,7 +1295,9 @@ gomp_unload_device (struct gomp_device_descr *devicep)
 /* Do we have offload data available for the given offload target type?
    Instead of verifying that *all* offload data is available that could
    possibly be required, we instead just look for *any*.  If we later find any
-   offload data missing, that's user error.  */
+   offload data missing, that's user error.  If any offload data of this target
+   type is tagged with an "avoid offloading" flag, do not consider this target
+   type available unless it has been initialized already.  */
 
 attribute_hidden bool
 gomp_offload_target_available_p (int type)
@@ -1290,6 +1315,9 @@ gomp_offload_target_available_p (int type)
       gomp_mutex_unlock (&devicep->lock);
     }
 
+  /* If the offload target has been initialized already, we ignore "avoid
+     offloading" flags.  This is important, because data/state that we must
+     continue to use may be present on the device.  */
   if (!available)
     {
       gomp_mutex_lock (&register_lock);
@@ -1303,8 +1331,14 @@ gomp_offload_target_available_p (int type)
 
       /* Can the offload target be initialized?  */
       for (int i = 0; !available && i < num_offload_images; i++)
-	if (offload_images[i].type == type)
+	if (offload_images[i].type == type
+	    && offload_images[i].host_table != NULL)
 	  available = true;
+      /* If yes, is an "avoid offloading" flag set?  */
+      for (int i = 0; available && i < num_offload_images; i++)
+	if (offload_images[i].type == type
+	    && offload_images[i].host_table == NULL)
+	  available = false;
 
       gomp_mutex_unlock (&register_lock);
     }
diff --git libgomp/testsuite/libgomp.oacc-c++/non-scalar-data.C libgomp/testsuite/libgomp.oacc-c++/non-scalar-data.C
index 180e86f..fe919c8 100644
--- libgomp/testsuite/libgomp.oacc-c++/non-scalar-data.C
+++ libgomp/testsuite/libgomp.oacc-c++/non-scalar-data.C
@@ -1,7 +1,8 @@
 // Ensure that a non-scalar dummy arguments which are implicitly used inside
 // offloaded regions are properly mapped using present_or_copy.
 
-// { dg-do run }
+// Override the compiler's "avoid offloading" decision.
+// { dg-additional-options "-foffload-force" }
 
 #include <cassert>
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/abort-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/abort-3.c
index bca425e..b0da8b7 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/abort-3.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/abort-3.c
@@ -1,4 +1,5 @@
-/* { dg-do run } */
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
 
 #include <stdio.h>
 #include <stdlib.h>
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/abort-4.c libgomp/testsuite/libgomp.oacc-c-c++-common/abort-4.c
index c29ca3f..3079b78 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/abort-4.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/abort-4.c
@@ -1,4 +1,5 @@
-/* { dg-do run } */
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
 
 #include <stdlib.h>
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-1.c
index f3b490a..02e43af 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-1.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-1.c
@@ -1,6 +1,7 @@
 /* { dg-do run { target openacc_nvidia_accel_selected } } */
 /* <http://news.gmane.org/find-root.php?message_id=%3C87pp0aaksc.fsf%40kepler.schwinge.homeip.net%3E>.
    { dg-xfail-run-if "TODO" { *-*-* } } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 /* { dg-additional-options "-lcuda" } */
 
 #include <openacc.h>
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c
new file mode 100644
index 0000000..e614785
--- /dev/null
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c
@@ -0,0 +1,25 @@
+/* Test that the compiler decides to "avoid offloading".  */
+
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <openacc.h>
+
+int main(void)
+{
+  int x, y;
+
+#pragma acc data copyout(x, y)
+#pragma acc kernels /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target openacc_nvidia_accel_selected } } */
+  *((volatile int *) &x) = 33, y = acc_on_device (acc_device_host);
+
+  if (x != 33)
+    __builtin_abort();
+#if defined ACC_DEVICE_TYPE_host || defined ACC_DEVICE_TYPE_nvidia
+  if (y != 1)
+    __builtin_abort();
+#else
+# error Not ported to this ACC_DEVICE_TYPE
+#endif
+
+  return 0;
+}
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c
new file mode 100644
index 0000000..c13436f
--- /dev/null
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c
@@ -0,0 +1,38 @@
+/* Test that a user can override the compiler's "avoid offloading"
+   decision by explicitly initializing a device with acc_init.  */
+
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <openacc.h>
+
+int main(void)
+{
+  /* Override the compiler's "avoid offloading" decision.  */
+  acc_device_t d;
+#if defined ACC_DEVICE_TYPE_nvidia
+  d = acc_device_nvidia;
+#elif defined ACC_DEVICE_TYPE_host
+  d = acc_device_host;
+#else
+# error Not ported to this ACC_DEVICE_TYPE
+#endif
+  acc_init (d);
+
+  int x, y;
+
+#pragma acc data copyout(x, y)
+#pragma acc kernels /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target openacc_nvidia_accel_selected } } */
+  *((volatile int *) &x) = 33, y = acc_on_device (acc_device_host);
+
+  if (x != 33)
+    __builtin_abort();
+#if defined ACC_DEVICE_TYPE_nvidia
+  if (y != 0)
+    __builtin_abort();
+#else
+  if (y != 1)
+    __builtin_abort();
+#endif
+
+  return 0;
+}
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c
new file mode 100644
index 0000000..e2301e6
--- /dev/null
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c
@@ -0,0 +1,29 @@
+/* Test that a user can override the compiler's "avoid offloading"
+   decision with -foffload-force.  */
+
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
+
+#include <openacc.h>
+
+int main(void)
+{
+  int x, y;
+
+#pragma acc data copyout(x, y)
+#pragma acc kernels
+  *((volatile int *) &x) = 33, y = acc_on_device (acc_device_host);
+
+  if (x != 33)
+    __builtin_abort();
+#if defined ACC_DEVICE_TYPE_nvidia
+  if (y != 0)
+    __builtin_abort();
+#else
+  if (y != 1)
+    __builtin_abort();
+#endif
+
+  return 0;
+}
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
index dad6d13..f8ebbb1 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
@@ -1,6 +1,6 @@
 /* This test exercises combined directives.  */
 
-/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 
 #include <stdlib.h>
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/default-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/default-1.c
index 1ac0b95..e512fcf 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/default-1.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/default-1.c
@@ -1,4 +1,6 @@
-/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
 
 #include  <openacc.h>
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/deviceptr-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/deviceptr-1.c
index e62c315..b5c29ab 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/deviceptr-1.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/deviceptr-1.c
@@ -1,4 +1,6 @@
-/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
 
 #include <stdlib.h>
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c
index 51745ba..3ef6f9b 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c
@@ -1,4 +1,5 @@
 /* { dg-do run { target openacc_nvidia_accel_selected } } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 /* { dg-additional-options "-lcuda -lcublas -lcudart" } */
 
 #include <stdlib.h>
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/if-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/if-1.c
index 2887f66f..7b09917 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/if-1.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/if-1.c
@@ -1,4 +1,4 @@
-/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 
 #include <openacc.h>
 #include <stdlib.h>
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-1.c
index aeb0142..a90c9466 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-1.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-1.c
@@ -1,4 +1,6 @@
-/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
 
 #include <stdlib.h>
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c
index 0f323c8..1dc0402 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c
@@ -1,4 +1,7 @@
-/* { dg-additional-options "-O2 -fipa-pta" } */
+/* { dg-additional-options "-fipa-pta" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
 
 #include <stdlib.h>
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c
index 17a0f3d..baf6662 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c
@@ -1,4 +1,7 @@
-/* { dg-additional-options "-O2 -foffload-alias=all -fipa-pta" } */
+/* { dg-additional-options "-foffload-alias=all -fipa-pta" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
 
 #include <stdlib.h>
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c
index 44d4fd2..efbe43a 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c
@@ -1,4 +1,7 @@
-/* { dg-additional-options "-O2 -fipa-pta" } */
+/* { dg-additional-options "-fipa-pta" } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
 
 #include <stdlib.h>
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-empty.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-empty.c
index a68a7cd..d527e14 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-empty.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-empty.c
@@ -1,3 +1,6 @@
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
+
 int
 main (void)
 {
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
index 2e4100f..6b561e4 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
@@ -1,5 +1,6 @@
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
 
 #include <stdlib.h>
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
index 83d4e7f..d965348 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
@@ -1,5 +1,6 @@
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
 
 #include <stdlib.h>
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
index 01d5e5e..9548cd6 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
@@ -1,5 +1,6 @@
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
 
 #include <stdlib.h>
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
index 61d1283..237d56c 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
@@ -1,5 +1,6 @@
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
 
 #include <stdlib.h>
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
index f7f04cb..67e75cd 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
@@ -1,5 +1,6 @@
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
 
 #include <stdlib.h>
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-1.c
index 2e920cd..195b2c5 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-1.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-1.c
@@ -1,5 +1,6 @@
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
 
 #include <assert.h>
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-2.c
index 72249cc..f182a2c 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-2.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-2.c
@@ -1,5 +1,6 @@
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
 
 #include <assert.h>
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-3.c
index 1b0a7cc..4da360c 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-3.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-3.c
@@ -1,5 +1,6 @@
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
 
 #include <assert.h>
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-4.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-4.c
index bbe6b3c..1a8fc9c 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-4.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-4.c
@@ -1,5 +1,6 @@
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
 
 #include <assert.h>
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-5.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-5.c
index 18e5676..a3f2fb9 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-5.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-5.c
@@ -1,5 +1,6 @@
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
 
 #include <assert.h>
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-1.c
index e424739..eac168c 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-1.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-1.c
@@ -1,5 +1,6 @@
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
 
 #include <assert.h>
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-2.c
index a12e36e..0c0f1e1 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-2.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-2.c
@@ -1,5 +1,6 @@
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
 
 #include <assert.h>
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-3.c
index f8ec543..0ee0a95 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-3.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-3.c
@@ -1,5 +1,6 @@
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
 
 #include <assert.h>
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-4.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-4.c
index 73561b3..e54873a 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-4.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-4.c
@@ -1,5 +1,6 @@
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
 
 #include <assert.h>
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-5.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-5.c
index 3334830..9660c14 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-5.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-5.c
@@ -1,5 +1,6 @@
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
 
 #include <assert.h>
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-6.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-6.c
index 88ab245..e4d1437 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-6.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-6.c
@@ -1,3 +1,7 @@
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
+
 #include <assert.h>
 
 /* Test of gang-private aggregate variable declared on loop directive, with
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-vector-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-vector-1.c
index 3f7062d..83f52de 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-vector-1.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-vector-1.c
@@ -1,5 +1,6 @@
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
 
 #include <assert.h>
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-vector-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-vector-2.c
index dada424..25ceab5 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-vector-2.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-vector-2.c
@@ -1,5 +1,6 @@
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
 
 #include <assert.h>
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-1.c
index 8d649d1..ac5f24a 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-1.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-1.c
@@ -1,5 +1,6 @@
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
 
 #include <assert.h>
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-2.c
index a67f90e..a3d18a1 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-2.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-2.c
@@ -1,5 +1,6 @@
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
 
 #include <assert.h>
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-3.c
index 465a800..3944399 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-3.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-3.c
@@ -1,5 +1,6 @@
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
 
 #include <assert.h>
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-4.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-4.c
index a08ba69..d6dd81b 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-4.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-4.c
@@ -1,5 +1,6 @@
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
 
 #include <assert.h>
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-5.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-5.c
index 1f76345..53293a3 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-5.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-5.c
@@ -1,5 +1,6 @@
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
 
 #include <assert.h>
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-6.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-6.c
index fe2e23a..63b5b51 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-6.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-6.c
@@ -1,5 +1,6 @@
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
 
 #include <assert.h>
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-7.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-7.c
index 12c17e4..65089de 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-7.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-7.c
@@ -1,5 +1,6 @@
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
 
 #include <assert.h>
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction-1.c
index 3a2a5b5..ab38f91 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction-1.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction-1.c
@@ -1,8 +1,9 @@
 /* Verify that a simple, explicit acc loop reduction works inside
  a kernels region.  */
 
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
 
 #include <stdlib.h>
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c
index c164598..94a5ae2 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c
@@ -1,4 +1,4 @@
-/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 
 #include <stdlib.h>
 
diff --git libgomp/testsuite/libgomp.oacc-fortran/asyncwait-1.f90 libgomp/testsuite/libgomp.oacc-fortran/asyncwait-1.f90
index 01728bd..bc1210e 100644
--- libgomp/testsuite/libgomp.oacc-fortran/asyncwait-1.f90
+++ libgomp/testsuite/libgomp.oacc-fortran/asyncwait-1.f90
@@ -1,4 +1,5 @@
 ! { dg-do run }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
 
 program asyncwait
   integer, parameter :: N = 64
diff --git libgomp/testsuite/libgomp.oacc-fortran/asyncwait-2.f90 libgomp/testsuite/libgomp.oacc-fortran/asyncwait-2.f90
index fe131b6..2dfed6a 100644
--- libgomp/testsuite/libgomp.oacc-fortran/asyncwait-2.f90
+++ libgomp/testsuite/libgomp.oacc-fortran/asyncwait-2.f90
@@ -1,4 +1,5 @@
 ! { dg-do run }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
 
 program asyncwait
   integer, parameter :: N = 64
diff --git libgomp/testsuite/libgomp.oacc-fortran/asyncwait-3.f90 libgomp/testsuite/libgomp.oacc-fortran/asyncwait-3.f90
index fa96a01..2c33c0f 100644
--- libgomp/testsuite/libgomp.oacc-fortran/asyncwait-3.f90
+++ libgomp/testsuite/libgomp.oacc-fortran/asyncwait-3.f90
@@ -1,4 +1,5 @@
 ! { dg-do run }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
 
 program asyncwait
   integer, parameter :: N = 64
diff --git libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f
new file mode 100644
index 0000000..0f4edb1
--- /dev/null
+++ libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f
@@ -0,0 +1,29 @@
+! Test that the compiler decides to "avoid offloading".
+
+! { dg-do run }
+! { dg-additional-options "-cpp" }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
+! The warning is only triggered for -O2 and higher.
+! { dg-xfail-if "n/a" { openacc_nvidia_accel_selected } { "-O0" "-O1" } { "" } }
+
+      IMPLICIT NONE
+      INCLUDE "openacc_lib.h"
+
+      INTEGER, VOLATILE :: X
+      LOGICAL :: Y
+
+!$ACC DATA COPYOUT(X, Y)
+!$ACC KERNELS /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target openacc_nvidia_accel_selected } } */
+      X = 33
+      Y = ACC_ON_DEVICE (ACC_DEVICE_HOST);
+!$ACC END KERNELS
+!$ACC END DATA
+
+      IF (X .NE. 33) CALL ABORT
+#if defined ACC_DEVICE_TYPE_host || defined ACC_DEVICE_TYPE_nvidia
+      IF (.NOT. Y) CALL ABORT
+#else
+# error Not ported to this ACC_DEVICE_TYPE
+#endif
+
+      END
diff --git libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-2.f libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-2.f
new file mode 100644
index 0000000..4c8ceac
--- /dev/null
+++ libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-2.f
@@ -0,0 +1,40 @@
+! Test that a user can override the compiler's "avoid offloading" decision.
+
+! { dg-do run }
+! { dg-additional-options "-cpp" }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
+! The warning is only triggered for -O2 and higher.
+! { dg-xfail-if "n/a" { openacc_nvidia_accel_selected } { "-O0" "-O1" } { "" } }
+
+      IMPLICIT NONE
+      INCLUDE "openacc_lib.h"
+
+      INTEGER :: D
+      INTEGER, VOLATILE :: X
+      LOGICAL :: Y
+
+!     Override the compiler's "avoid offloading" decision.
+#if defined ACC_DEVICE_TYPE_nvidia
+      D = ACC_DEVICE_NVIDIA
+#elif defined ACC_DEVICE_TYPE_host
+      D = ACC_DEVICE_HOST
+#else
+# error Not ported to this ACC_DEVICE_TYPE
+#endif
+      CALL ACC_INIT (D)
+
+!$ACC DATA COPYOUT(X, Y)
+!$ACC KERNELS /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target openacc_nvidia_accel_selected } } */
+      X = 33
+      Y = ACC_ON_DEVICE (ACC_DEVICE_HOST)
+!$ACC END KERNELS
+!$ACC END DATA
+
+      IF (X .NE. 33) CALL ABORT
+#if defined ACC_DEVICE_TYPE_nvidia
+      IF (Y) CALL ABORT
+#else
+      IF (.NOT. Y) CALL ABORT
+#endif
+
+      END
diff --git libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-3.f libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-3.f
new file mode 100644
index 0000000..5f669b7
--- /dev/null
+++ libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-3.f
@@ -0,0 +1,30 @@
+! Test that a user can override the compiler's "avoid offloading" decision.
+
+! { dg-do run }
+! { dg-additional-options "-cpp" }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
+!     Override the compiler's "avoid offloading" decision.
+! { dg-additional-options "-foffload-force" }
+
+      IMPLICIT NONE
+      INCLUDE "openacc_lib.h"
+
+      INTEGER :: D
+      INTEGER, VOLATILE :: X
+      LOGICAL :: Y
+
+!$ACC DATA COPYOUT(X, Y)
+!$ACC KERNELS
+      X = 33
+      Y = ACC_ON_DEVICE (ACC_DEVICE_HOST)
+!$ACC END KERNELS
+!$ACC END DATA
+
+      IF (X .NE. 33) CALL ABORT
+#if defined ACC_DEVICE_TYPE_nvidia
+      IF (Y) CALL ABORT
+#else
+      IF (.NOT. Y) CALL ABORT
+#endif
+
+      END
diff --git libgomp/testsuite/libgomp.oacc-fortran/combined-directives-1.f90 libgomp/testsuite/libgomp.oacc-fortran/combined-directives-1.f90
index 94100b2..3081e7a 100644
--- libgomp/testsuite/libgomp.oacc-fortran/combined-directives-1.f90
+++ libgomp/testsuite/libgomp.oacc-fortran/combined-directives-1.f90
@@ -1,6 +1,7 @@
 ! This test exercises combined directives.
 
 ! { dg-do run }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
 
 program main
   integer, parameter :: n = 32
diff --git libgomp/testsuite/libgomp.oacc-fortran/default-1.f90 libgomp/testsuite/libgomp.oacc-fortran/default-1.f90
index 1059089..07c1e74 100644
--- libgomp/testsuite/libgomp.oacc-fortran/default-1.f90
+++ libgomp/testsuite/libgomp.oacc-fortran/default-1.f90
@@ -1,4 +1,7 @@
 ! { dg-do run }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
+! Override the compiler's "avoid offloading" decision.
+! { dg-additional-options "-foffload-force" }
 
 program main
   implicit none
diff --git libgomp/testsuite/libgomp.oacc-fortran/deviceptr-1.f90 libgomp/testsuite/libgomp.oacc-fortran/deviceptr-1.f90
index 276a172..4646be9 100644
--- libgomp/testsuite/libgomp.oacc-fortran/deviceptr-1.f90
+++ libgomp/testsuite/libgomp.oacc-fortran/deviceptr-1.f90
@@ -1,9 +1,10 @@
-! { dg-do run }
-
 ! Test the deviceptr clause with various directives
 ! and in combination with other directives where
 ! the deviceptr variable is implied.
 
+! { dg-do run }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
+
 subroutine subr1 (a, b)
   implicit none
   integer, parameter :: N = 8
diff --git libgomp/testsuite/libgomp.oacc-fortran/if-1.f90 libgomp/testsuite/libgomp.oacc-fortran/if-1.f90
index e54c1b2..784f8a1 100644
--- libgomp/testsuite/libgomp.oacc-fortran/if-1.f90
+++ libgomp/testsuite/libgomp.oacc-fortran/if-1.f90
@@ -1,5 +1,8 @@
-! { dg-do run } */
+! { dg-do run }
 ! { dg-additional-options "-cpp" }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
+! Override the compiler's "avoid offloading" decision.
+! { dg-additional-options "-foffload-force" }
 
 program main
   use openacc
diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-acc-loop-reduction-2.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-acc-loop-reduction-2.f90
index fdf9409..854fe9c 100644
--- libgomp/testsuite/libgomp.oacc-fortran/kernels-acc-loop-reduction-2.f90
+++ libgomp/testsuite/libgomp.oacc-fortran/kernels-acc-loop-reduction-2.f90
@@ -1,3 +1,8 @@
+! { dg-do run }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
+! Override the compiler's "avoid offloading" decision.
+! { dg-additional-options "-foffload-force" }
+
 program foo
 
   IMPLICIT NONE
diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-acc-loop-reduction.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-acc-loop-reduction.f90
index 912a22b..b120b66 100644
--- libgomp/testsuite/libgomp.oacc-fortran/kernels-acc-loop-reduction.f90
+++ libgomp/testsuite/libgomp.oacc-fortran/kernels-acc-loop-reduction.f90
@@ -1,3 +1,8 @@
+! { dg-do run }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
+! Override the compiler's "avoid offloading" decision.
+! { dg-additional-options "-foffload-force" }
+
 program foo
   IMPLICIT NONE
   INTEGER :: vol = 0
diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-collapse-3.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-collapse-3.f90
index 9378b12..1aafefa 100644
--- libgomp/testsuite/libgomp.oacc-fortran/kernels-collapse-3.f90
+++ libgomp/testsuite/libgomp.oacc-fortran/kernels-collapse-3.f90
@@ -2,6 +2,8 @@
 
 ! { dg-do run }
 ! { dg-additional-options "-ftree-parallelize-loops=32" }
+! Override the compiler's "avoid offloading" decision.
+! { dg-additional-options "-foffload-force" }
 
 program collapse3
   integer :: a(3,3,3), k, kk, kkk, l, ll, lll
diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-collapse-4.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-collapse-4.f90
index dfd9cd2..1f2cf97 100644
--- libgomp/testsuite/libgomp.oacc-fortran/kernels-collapse-4.f90
+++ libgomp/testsuite/libgomp.oacc-fortran/kernels-collapse-4.f90
@@ -2,6 +2,8 @@
 
 ! { dg-do run }
 ! { dg-additional-options "-ftree-parallelize-loops=32" }
+! Override the compiler's "avoid offloading" decision.
+! { dg-additional-options "-foffload-force" }
 
 program collapse4
   integer :: i, j, k, a(1:7, -3:5, 12:19), b(1:7, -3:5, 12:19)
diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-independent.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-independent.f90
index 9f17308..f6b2255 100644
--- libgomp/testsuite/libgomp.oacc-fortran/kernels-independent.f90
+++ libgomp/testsuite/libgomp.oacc-fortran/kernels-independent.f90
@@ -1,4 +1,4 @@
-! { dg-do run } */
+! { dg-do run }
 ! { dg-additional-options "-cpp" }
 ! { dg-additional-options "-ftree-parallelize-loops=32" }
 
diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-map-1.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-map-1.f90
index 01d62f8..14e14ab 100644
--- libgomp/testsuite/libgomp.oacc-fortran/kernels-map-1.f90
+++ libgomp/testsuite/libgomp.oacc-fortran/kernels-map-1.f90
@@ -1,6 +1,9 @@
 ! Test the copy, copyin, copyout, pcopy, pcopyin, pcopyout, and pcreate
 ! clauses on kernels constructs.
 
+! { dg-do run }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
+
 program map
   integer, parameter     :: n = 20, c = 10
   integer                :: i, a(n), b(n), d(n)
diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-gang-2.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-gang-2.f90
index 43a1988..51a57b2 100644
--- libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-gang-2.f90
+++ libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-gang-2.f90
@@ -3,6 +3,8 @@
 
 ! { dg-do run }
 ! { dg-additional-options "-ftree-parallelize-loops=32" }
+! Override the compiler's "avoid offloading" decision.
+! { dg-additional-options "-foffload-force" }
 
 program main
   integer :: x, i, j, arr(0:32*32)
diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-gang-3.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-gang-3.f90
index e5806ee..948f811 100644
--- libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-gang-3.f90
+++ libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-gang-3.f90
@@ -3,6 +3,8 @@
 
 ! { dg-do run }
 ! { dg-additional-options "-ftree-parallelize-loops=32" }
+! Override the compiler's "avoid offloading" decision.
+! { dg-additional-options "-foffload-force" }
 
 program main
   integer :: x, i, j, arr(0:32*32)
diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-gang-6.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-gang-6.f90
index 7d19bba..6be2692 100644
--- libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-gang-6.f90
+++ libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-gang-6.f90
@@ -3,6 +3,8 @@
 
 ! { dg-do run }
 ! { dg-additional-options "-ftree-parallelize-loops=32" }
+! Override the compiler's "avoid offloading" decision.
+! { dg-additional-options "-foffload-force" }
 
 program main
   type vec3
diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-vector-1.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-vector-1.f90
index 379bb3a..0312ee7 100644
--- libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-vector-1.f90
+++ libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-vector-1.f90
@@ -2,6 +2,8 @@
 
 ! { dg-do run }
 ! { dg-additional-options "-ftree-parallelize-loops=32" }
+! Override the compiler's "avoid offloading" decision.
+! { dg-additional-options "-foffload-force" }
 
 program main
   integer :: x, i, j, k, idx, arr(0:32*32*32)
diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-vector-2.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-vector-2.f90
index 8873efe..7ce7f1b 100644
--- libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-vector-2.f90
+++ libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-vector-2.f90
@@ -2,6 +2,8 @@
 
 ! { dg-do run }
 ! { dg-additional-options "-ftree-parallelize-loops=32" }
+! Override the compiler's "avoid offloading" decision.
+! { dg-additional-options "-foffload-force" }
 
 program main
   integer :: i, j, k, idx, arr(0:32*32*32), pt(2)
diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-1.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-1.f90
index f513ec2..50d13e4 100644
--- libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-1.f90
+++ libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-1.f90
@@ -2,6 +2,8 @@
 
 ! { dg-do run }
 ! { dg-additional-options "-ftree-parallelize-loops=32" }
+! Override the compiler's "avoid offloading" decision.
+! { dg-additional-options "-foffload-force" }
 
 program main
   integer :: x, i, j, arr(0:32*32)
diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-2.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-2.f90
index e7652d9..328a6b4 100644
--- libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-2.f90
+++ libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-2.f90
@@ -3,6 +3,8 @@
 
 ! { dg-do run }
 ! { dg-additional-options "-ftree-parallelize-loops=32" }
+! Override the compiler's "avoid offloading" decision.
+! { dg-additional-options "-foffload-force" }
 
 program main
   integer :: x, i, j, k, idx, arr(0:32*32*32)
diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-3.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-3.f90
index c82ced7..a96221d 100644
--- libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-3.f90
+++ libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-3.f90
@@ -3,6 +3,8 @@
 
 ! { dg-do run }
 ! { dg-additional-options "-ftree-parallelize-loops=32" }
+! Override the compiler's "avoid offloading" decision.
+! { dg-additional-options "-foffload-force" }
 
 program main
   integer :: x, i, j, k, idx, arr(0:32*32*32)
diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-4.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-4.f90
index e30de70..d2b30dd 100644
--- libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-4.f90
+++ libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-4.f90
@@ -3,6 +3,8 @@
 
 ! { dg-do run }
 ! { dg-additional-options "-ftree-parallelize-loops=32" }
+! Override the compiler's "avoid offloading" decision.
+! { dg-additional-options "-foffload-force" }
 
 program main
   integer :: x, i, j, k, idx, arr(0:32*32*32)
diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-5.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-5.f90
index 20f8579..3cfcbb4 100644
--- libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-5.f90
+++ libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-5.f90
@@ -3,6 +3,8 @@
 
 ! { dg-do run }
 ! { dg-additional-options "-ftree-parallelize-loops=32" }
+! Override the compiler's "avoid offloading" decision.
+! { dg-additional-options "-foffload-force" }
 
 program main
   integer :: i, j, k, idx, arr(0:32*32*32)
diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-6.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-6.f90
index 48c3bfd..5f65926 100644
--- libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-6.f90
+++ libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-6.f90
@@ -3,6 +3,8 @@
 
 ! { dg-do run }
 ! { dg-additional-options "-ftree-parallelize-loops=32" }
+! Override the compiler's "avoid offloading" decision.
+! { dg-additional-options "-foffload-force" }
 
 program main
   type vec2
diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-7.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-7.f90
index ca63796..27d1b27 100644
--- libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-7.f90
+++ libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-7.f90
@@ -3,6 +3,8 @@
 
 ! { dg-do run }
 ! { dg-additional-options "-ftree-parallelize-loops=32" }
+! Override the compiler's "avoid offloading" decision.
+! { dg-additional-options "-foffload-force" }
 
 program main
   integer :: i, j, k, idx, arr(0:32*32*32), pt(2)
diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-reduction-1.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-reduction-1.f90
index e894b6d..dcabe02 100644
--- libgomp/testsuite/libgomp.oacc-fortran/kernels-reduction-1.f90
+++ libgomp/testsuite/libgomp.oacc-fortran/kernels-reduction-1.f90
@@ -2,6 +2,8 @@
 
 ! { dg-do run }
 ! { dg-additional-options "-ftree-parallelize-loops=32" }
+! Override the compiler's "avoid offloading" decision.
+! { dg-additional-options "-foffload-force" }
 
 program reduction
   integer, parameter     :: n = 20
diff --git libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90 libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90
index 4afb562..cae39ac 100644
--- libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90
+++ libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90
@@ -2,6 +2,7 @@
 ! offloaded regions are properly mapped using present_or_copy.
 
 ! { dg-do run }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
 
 program main
   implicit none


Regards
 Thomas

* Re: [gomp4] Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading"
  2016-01-21 21:55   ` [gomp4] Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading" (was: [PATCH] Add fopt-info-oacc) Thomas Schwinge
@ 2016-01-22  7:40     ` Thomas Schwinge
  2016-01-22  8:36       ` Jakub Jelinek
  2016-06-30 21:46     ` Thomas Schwinge
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 25+ messages in thread
From: Thomas Schwinge @ 2016-01-22  7:40 UTC (permalink / raw)
  To: gcc-patches, Jakub Jelinek

Hi Jakub!

On Thu, 21 Jan 2016 22:54:26 +0100, I wrote:
> On Mon, 18 Jan 2016 18:26:49 +0100, Tom de Vries <Tom_deVries@mentor.com> wrote:
> > [...] [OpenACC] kernels region [...]
> > that parloops does not manage to parallelize:

> Judging from the real-world code that we've been looking at, when the
> above situation happens, we're -- in the vast majority of all cases -- in
> a situation where we generally want to avoid offloading (unless
> explicitly requested), "to avoid data copy penalty" as well as typically
> much slower single-threaded execution on the GPU.  Obviously, that will
> have to be revisited once parloops (or any other mechanism in GCC) is able
> to better understand/use the parallelism in OpenACC kernels constructs.
> 
> So, building upon Tom's patch, I have implemented an "avoid offloading"
> flag given the presence of one un-parallelized OpenACC kernels construct.
> This is currently only enabled for OpenACC kernels constructs, in
> combination with nvptx offloading, but I think the general scheme will be
> useful also for other constructs as well as other (non-shared memory)
> offloading targets.

> Committed to gomp-4_0-branch in r232709:
> 
> commit 41a76d233e714fd7b79dc1f40823f607c38306ba
> Author: tschwinge <tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4>
> Date:   Thu Jan 21 21:52:50 2016 +0000
> 
>     Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading"

Thought I'd check before porting it over -- will such a patch also be
accepted for trunk?


Regards
 Thomas

* Re: [gomp4] Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading"
  2016-01-22  7:40     ` [gomp4] Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading" Thomas Schwinge
@ 2016-01-22  8:36       ` Jakub Jelinek
  2016-01-22  9:00         ` Thomas Schwinge
  2016-01-22 13:18         ` Bernd Schmidt
  0 siblings, 2 replies; 25+ messages in thread
From: Jakub Jelinek @ 2016-01-22  8:36 UTC (permalink / raw)
  To: Thomas Schwinge; +Cc: gcc-patches

On Fri, Jan 22, 2016 at 08:40:26AM +0100, Thomas Schwinge wrote:
> On Thu, 21 Jan 2016 22:54:26 +0100, I wrote:
> > On Mon, 18 Jan 2016 18:26:49 +0100, Tom de Vries <Tom_deVries@mentor.com> wrote:
> > > [...] [OpenACC] kernels region [...]
> > > that parloops does not manage to parallelize:
> 
> > Telling from real-world code that we've been having a look at, when the
> > above situation happens, we're -- in the vast majority of all cases -- in
> > a situation where we generally want to avoid offloading (unless
> > explicitly requested), "to avoid data copy penalty" as well as typically
> > much slower single-threaded execution on the GPU.  Obviously, that will
> > have to be revisited as parloops (or any other mechanism in GCC) is able
> > to better understand/use the parallelism in OpenACC kernels constructs.
> > 
> > So, building upon Tom's patch, I have implemented an "avoid offloading"
> > flag given the presence of one un-parallelized OpenACC kernels construct.
> > This is currently only enabled for OpenACC kernels constructs, in
> > combination with nvptx offloading, but I think the general scheme will be
> > useful also for other constructs as well as other (non-shared memory)
> > offloading targets.
> 
> > Committed to gomp-4_0-branch in r232709:
> > 
> > commit 41a76d233e714fd7b79dc1f40823f607c38306ba
> > Author: tschwinge <tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4>
> > Date:   Thu Jan 21 21:52:50 2016 +0000
> > 
> >     Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading"
> 
> Thought I'd check before porting it over -- will such a patch also be
> accepted for trunk?

I think it is a bad idea to go against what the user wrote.  Warning that
some code might not be efficient?  Perhaps (if properly guarded with some
warning option one can turn off, either on a per-source file or using
pragmas even more fine grained).  But by default not offloading?  That is
just wrong.

	Jakub

* Re: [gomp4] Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading"
  2016-01-22  8:36       ` Jakub Jelinek
@ 2016-01-22  9:00         ` Thomas Schwinge
  2016-01-22 13:18         ` Bernd Schmidt
  1 sibling, 0 replies; 25+ messages in thread
From: Thomas Schwinge @ 2016-01-22  9:00 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: gcc-patches

Hi Jakub!

On Fri, 22 Jan 2016 09:36:25 +0100, Jakub Jelinek <jakub@redhat.com> wrote:
> On Fri, Jan 22, 2016 at 08:40:26AM +0100, Thomas Schwinge wrote:
> > On Thu, 21 Jan 2016 22:54:26 +0100, I wrote:
> > > On Mon, 18 Jan 2016 18:26:49 +0100, Tom de Vries <Tom_deVries@mentor.com> wrote:
> > > > [...] [OpenACC] kernels region [...]
> > > > that parloops does not manage to parallelize:
> > 
> > > Telling from real-world code that we've been having a look at, when the
> > > above situation happens, we're -- in the vast majority of all cases -- in
> > > a situation where we generally want to avoid offloading (unless
> > > explicitly requested), "to avoid data copy penalty" as well as typically
> > > much slower single-threaded execution on the GPU.  Obviously, that will
> > > have to be revisited as parloops (or any other mechanism in GCC) is able
> > > to better understand/use the parallelism in OpenACC kernels constructs.
> > > 
> > > So, building upon Tom's patch, I have implemented an "avoid offloading"
> > > flag given the presence of one un-parallelized OpenACC kernels construct.
> > > This is currently only enabled for OpenACC kernels constructs, in
> > > combination with nvptx offloading, but I think the general scheme will be
> > > useful also for other constructs as well as other (non-shared memory)
> > > offloading targets.
> > 
> > > Committed to gomp-4_0-branch in r232709:
> > > 
> > > commit 41a76d233e714fd7b79dc1f40823f607c38306ba
> > > Author: tschwinge <tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4>
> > > Date:   Thu Jan 21 21:52:50 2016 +0000
> > > 
> > >     Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading"
> > 
> > Thought I'd check before porting it over -- will such a patch also be
> > accepted for trunk?
> 
> I think it is a bad idea to go against what the user wrote.  Warning that
> some code might not be efficient?  Perhaps (if properly guarded with some
> warning option one can turn off, either on a per-source file or using
> pragmas even more fine grained).  But by default not offloading?  That is
> just wrong.

Well, let's argue the opposite way round: a user annotated the source
code with directives to help the compiler identify
parallelization/offloading opportunities.  These directives are just
descriptive hints however; (obeying program semantics, of course) the
compiler is free to ignore them, or just pay attention to some of them.
Suppose the compiler didn't find any parallelization opportunities, but
it knows that compared to host-fallback execution, offloading will be
slower for single-threaded code (data copy penalty, slower GPU clock
speed), so it only makes sense to not offload the code in such cases.
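
The heuristic argued for here can be sketched in a few lines of C.  This
is only an illustration under stated assumptions, not the code that was
committed: the names SKETCH_DIM_MAX and sketch_avoid_offloading_p are
hypothetical, and the sketch assumes that "no parallelism found" shows up
as every launch dimension (gang, worker, vector) being at most 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the number of OpenACC launch dimensions
   (gang, worker, vector).  */
#define SKETCH_DIM_MAX 3

/* Return true if a region with launch dimensions DIMS should be kept on
   the host.  IS_KERNELS says whether the region came from a kernels
   construct; FORCE models an explicit user request to offload anyway.
   Both parameters are assumptions made for this sketch.  */
bool
sketch_avoid_offloading_p (const int dims[SKETCH_DIM_MAX], bool is_kernels,
                           bool force)
{
  assert (dims != NULL);

  /* An explicit request wins, and only kernels constructs are treated
     as hints in this sketch.  */
  if (force || !is_kernels)
    return false;

  /* Any dimension larger than 1 means some parallelism was found.  */
  for (int ix = 0; ix < SKETCH_DIM_MAX; ix++)
    if (dims[ix] > 1)
      return false;

  /* Everything would run single-threaded on the device.  */
  return true;
}
```

With dimensions { 1, 1, 1 } and no explicit request, the function returns
true for a kernels region; with { 1, 1, 32 }, or with the force argument
set, it returns false.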

This is, quite possibly, semantically different from OpenMP, where the
compiler typically does exactly what the user prescribes with
directives.  (But even there, the compiler can automatically apply SIMD
parallelism, for example.  It just has to make sure that this doesn't
interfere with the program semantics, basically that the user "won't
notice".)

Does that clarify?


Regards
 Thomas

* Re: [gomp4] Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading"
  2016-01-22  8:36       ` Jakub Jelinek
  2016-01-22  9:00         ` Thomas Schwinge
@ 2016-01-22 13:18         ` Bernd Schmidt
  2016-01-22 13:25           ` Jakub Jelinek
  2016-01-26 22:30           ` [gomp4] " Martin Jambor
  1 sibling, 2 replies; 25+ messages in thread
From: Bernd Schmidt @ 2016-01-22 13:18 UTC (permalink / raw)
  To: Jakub Jelinek, Thomas Schwinge; +Cc: gcc-patches

On 01/22/2016 09:36 AM, Jakub Jelinek wrote:
>
> I think it is a bad idea to go against what the user wrote.  Warning that
> some code might not be efficient?  Perhaps (if properly guarded with some
> warning option one can turn off, either on a per-source file or using
> pragmas even more fine grained).  But by default not offloading?  That is
> just wrong.

I'm leaning more towards Thomas' side of the argument. The kernels 
construct is a hint, a "do your best" request to the compiler. If the 
compiler sees that it can't parallelize a loop inside a kernels region, 
it's probably best not to offload it.


Bernd

* Re: [gomp4] Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading"
  2016-01-22 13:18         ` Bernd Schmidt
@ 2016-01-22 13:25           ` Jakub Jelinek
  2016-01-22 13:31             ` Bernd Schmidt
  2016-01-26 22:30           ` [gomp4] " Martin Jambor
  1 sibling, 1 reply; 25+ messages in thread
From: Jakub Jelinek @ 2016-01-22 13:25 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Thomas Schwinge, gcc-patches

On Fri, Jan 22, 2016 at 02:18:38PM +0100, Bernd Schmidt wrote:
> On 01/22/2016 09:36 AM, Jakub Jelinek wrote:
> >
> >I think it is a bad idea to go against what the user wrote.  Warning that
> >some code might not be efficient?  Perhaps (if properly guarded with some
> >warning option one can turn off, either on a per-source file or using
> >pragmas even more fine grained).  But by default not offloading?  That is
> >just wrong.
> 
> I'm leaning more towards Thomas' side of the argument. The kernels construct
> is a hint, a "do your best" request to the compiler. If the compiler sees
> that it can't parallelize a loop inside a kernels region, it's probably best
> not to offload it.

What about #pragma oacc parallel?  That would never do that?

	Jakub

* Re: [gomp4] Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading"
  2016-01-22 13:25           ` Jakub Jelinek
@ 2016-01-22 13:31             ` Bernd Schmidt
  2016-02-04 14:47               ` Thomas Schwinge
  0 siblings, 1 reply; 25+ messages in thread
From: Bernd Schmidt @ 2016-01-22 13:31 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Thomas Schwinge, gcc-patches

On 01/22/2016 02:25 PM, Jakub Jelinek wrote:

> What about #pragma oacc parallel?  That would never do that?

It shouldn't, no (IMO).


Bernd

* Re: [gomp4] Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading"
  2016-01-22 13:18         ` Bernd Schmidt
  2016-01-22 13:25           ` Jakub Jelinek
@ 2016-01-26 22:30           ` Martin Jambor
  1 sibling, 0 replies; 25+ messages in thread
From: Martin Jambor @ 2016-01-26 22:30 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Jakub Jelinek, Thomas Schwinge, gcc-patches

On Fri, Jan 22, 2016 at 02:18:38PM +0100, Bernd Schmidt wrote:
> On 01/22/2016 09:36 AM, Jakub Jelinek wrote:
> >
> >I think it is a bad idea to go against what the user wrote.  Warning that
> >some code might not be efficient?  Perhaps (if properly guarded with some
> >warning option one can turn off, either on a per-source file or using
> >pragmas even more fine grained).  But by default not offloading?  That is
> >just wrong.
> 
> I'm leaning more towards Thomas' side of the argument. The kernels construct
> is a hint, a "do your best" request to the compiler. If the compiler sees
> that it can't parallelize a loop inside a kernels region, it's probably best
> not to offload it.
> 

Shouldn't such optimization feedback be output in MSG_NOTE dumps?
Vectorizer uses it to inform the user what it is doing, supposedly
with the intention to help the programmer find out why specific loops
are not vectorized (and run slowly).  I have also decided to use it to
inform the user whether a combination of OpenMP constructs is
gridified or not.

Unfortunately, notes seem to appear only in "detailed" dumps, which
often are not the best place for users to look into because of too
much information on gcc internals.  So the user interface aspect of
notes could perhaps be re-thought a bit.

In any event, I think that at least in the near term, good compiler
feedback could ease the efficient use of accelerators quite a lot,
like (they say) it did with early auto-vectorizing compilers.

Martin

* Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading"
  2016-01-22 13:31             ` Bernd Schmidt
@ 2016-02-04 14:47               ` Thomas Schwinge
  2016-02-10 11:51                 ` Thomas Schwinge
  0 siblings, 1 reply; 25+ messages in thread
From: Thomas Schwinge @ 2016-02-04 14:47 UTC (permalink / raw)
  To: Bernd Schmidt, Jakub Jelinek; +Cc: gcc-patches, Tom de Vries

Hi!

On Fri, 22 Jan 2016 14:31:35 +0100, Bernd Schmidt <bschmidt@redhat.com> wrote:
> On 01/22/2016 02:25 PM, Jakub Jelinek wrote:
> 
> > What about #pragma oacc parallel?  That would never do that?
> 
> It shouldn't, no (IMO).

Correct.
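
For readers less familiar with the two constructs, here is a minimal
sketch of the distinction being drawn.  The function names
sketch_prefix_sum and sketch_scale are hypothetical.  The first loop
carries a dependency between iterations, so it is the kind of kernels
region an auto-parallelizer may well leave sequential; the second is an
explicit parallel construct, which, as agreed above, should not be
subject to "avoid offloading".  Without -fopenacc the pragmas are ignored
and both functions simply run on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Kernels construct: a "do your best" request.  Each iteration reads the
   result of the previous one, so the compiler may not find any
   parallelism here.  */
void
sketch_prefix_sum (int *a, size_t n)
{
  assert (a != NULL || n == 0);
#pragma acc kernels copy (a[0:n])
  {
    for (size_t i = 1; i < n; i++)
      a[i] += a[i - 1];
  }
}

/* Parallel construct: the user states that the iterations are
   independent and asks for parallel execution explicitly.  */
void
sketch_scale (int *a, size_t n, int factor)
{
  assert (a != NULL || n == 0);
#pragma acc parallel loop copy (a[0:n])
  for (size_t i = 0; i < n; i++)
    a[i] *= factor;
}
```

Whether a given compiler version actually parallelizes or offloads either
region depends on its analysis and options; the sketch only shows which
construct is a hint and which is an explicit request.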


Here is the patch re-worked for trunk.  Instead of passing
-foffload-force in the affected libgomp test cases, I chose to have
them expect the warning.  This way, we're testing more in line with
what users will be doing, and we'll notice how the OpenACC kernels
handling improves once parloops becomes able to parallelize more
offloaded code (and the "avoid offloading" handling no longer
triggers).  OK to
commit?
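
One piece of the patch that may be easier to follow outside of diff form
is the optional "(avoid offloading) " prefix that the nvptx back end
emits in front of the quoted function name on a //:FUNC_MAP line, and
that mkoffload then parses.  The following is a standalone sketch of such
a parser, not the record_id code itself: the function name
sketch_parse_func_map is hypothetical, errors are reported by returning
NULL instead of calling fatal_error, and flag names are matched by exact
length.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper.  P points at the text following "//:FUNC_MAP ".
   Set *AVOID_OFFLOADING_P if an "avoid offloading" flag is present, and
   return a pointer to the quoted function name.  Return NULL if the flag
   list is malformed or contains an unknown flag.  */
const char *
sketch_parse_func_map (const char *p, bool *avoid_offloading_p)
{
  static const char flag_name[] = "avoid offloading";

  assert (p != NULL && avoid_offloading_p != NULL);
  *avoid_offloading_p = false;

  /* No parenthesized flag list: the name starts right away.  */
  if (p[0] != '(')
    return p;

  const char *cur = p + 1;
  const char *close = strchr (cur, ')');
  if (close == NULL || close[1] != ' ')
    return NULL;

  /* Walk the comma-separated flags between '(' and ')'.  */
  while (cur < close)
    {
      const char *next = memchr (cur, ',', (size_t) (close - cur));
      if (next == NULL)
        next = close;

      size_t len = (size_t) (next - cur);
      if (len == strlen (flag_name) && memcmp (cur, flag_name, len) == 0)
        *avoid_offloading_p = true;
      else
        return NULL;

      /* Step over the separating comma, if any.  */
      cur = (next < close) ? next + 1 : next;
    }

  /* Skip past ") ".  */
  return close + 2;
}
```

Given the input (avoid offloading) "foo", the sketch sets the flag and
returns a pointer to "foo" including its quotes; given "bar" with no
prefix, it leaves the flag clear and returns the input unchanged.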

commit acd66946777671486a0f69706b25a3ec5f877306
Author: Thomas Schwinge <thomas@codesourcery.com>
Date:   Tue Feb 2 20:41:42 2016 +0100

    Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading"
    
    	gcc/
    	* common.opt: Add -foffload-force.
    	* lto-wrapper.c (merge_and_complain, append_compiler_options):
    	Handle it.
    	* doc/invoke.texi: Document it.
    	* config/nvptx/mkoffload.c (struct id_map): Add "flags" member.
    	(record_id): Parse, and set it.
    	(process): Use it.
    	* config/nvptx/nvptx.c (nvptx_attribute_table): Add "omp avoid
    	offloading".
    	(nvptx_record_offload_symbol): Use it.
    	(nvptx_goacc_validate_dims): Set it.
    	libgomp/
    	* libgomp.h (gomp_offload_target_available_p): New function
    	declaration.
    	* target.c (gomp_offload_target_available_p): New function
    	definition.
    	(GOMP_offload_register_ver, GOMP_offload_unregister_ver)
    	(gomp_init_device, gomp_unload_device): Handle and document "avoid
    	offloading" flag ("host_table == NULL").
    	(resolve_device): Document "avoid offloading".
    	* oacc-init.c (resolve_device): Likewise.
    	* libgomp.texi (Enabling OpenACC): Likewise.
    	* testsuite/lib/libgomp.exp
    	(check_effective_target_nvptx_offloading_configured): New proc
    	definition.
    	* testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c: New
    	file.
    	* testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c:
    	Likewise.
    	* testsuite/libgomp.oacc-fortran/avoid-offloading-1.f: Likewise.
    	* testsuite/libgomp.oacc-fortran/avoid-offloading-2.f: Likewise.
    	* testsuite/libgomp.oacc-fortran/avoid-offloading-3.f: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/abort-3.c: Expect warning.
    	* testsuite/libgomp.oacc-c-c++-common/abort-4.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/default-1.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/deviceptr-1.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-1.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-empty.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c:
    	Likewise.
    	* testsuite/libgomp.oacc-fortran/combined-directives-1.f90:
    	Likewise.
    	* testsuite/libgomp.oacc-fortran/non-scalar-data.f90: Likewise.
    
    	libgomp/
    	* testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c: Set
    	"-ftree-parallelize-loops=32".
    	* testsuite/libgomp.oacc-c-c++-common/default-1.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/host_data-1.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-1.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/nested-2.c: Likewise.
---
 gcc/common.opt                                     |    4 +
 gcc/config/nvptx/mkoffload.c                       |   73 +++++++++++-
 gcc/config/nvptx/nvptx.c                           |   42 ++++++-
 gcc/doc/invoke.texi                                |   12 +-
 gcc/lto-wrapper.c                                  |    2 +
 libgomp/libgomp.h                                  |    1 +
 libgomp/libgomp.texi                               |    8 ++
 libgomp/oacc-init.c                                |   19 ++-
 libgomp/target.c                                   |  122 ++++++++++++++++----
 libgomp/testsuite/lib/libgomp.exp                  |   10 ++
 .../testsuite/libgomp.oacc-c-c++-common/abort-3.c  |    4 +-
 .../testsuite/libgomp.oacc-c-c++-common/abort-4.c  |    4 +-
 .../libgomp.oacc-c-c++-common/avoid-offloading-1.c |   28 +++++
 .../libgomp.oacc-c-c++-common/avoid-offloading-2.c |   38 ++++++
 .../libgomp.oacc-c-c++-common/avoid-offloading-3.c |   29 +++++
 .../combined-directives-1.c                        |    4 +-
 .../libgomp.oacc-c-c++-common/default-1.c          |    4 +-
 .../libgomp.oacc-c-c++-common/deviceptr-1.c        |    4 +-
 .../libgomp.oacc-c-c++-common/host_data-1.c        |    1 +
 .../libgomp.oacc-c-c++-common/kernels-1.c          |   10 +-
 .../kernels-alias-ipa-pta-2.c                      |    4 +-
 .../kernels-alias-ipa-pta-3.c                      |    4 +-
 .../kernels-alias-ipa-pta.c                        |    4 +-
 .../libgomp.oacc-c-c++-common/kernels-empty.c      |    2 +-
 .../kernels-loop-and-seq-2.c                       |    3 +-
 .../kernels-loop-and-seq-3.c                       |    4 +-
 .../kernels-loop-and-seq-4.c                       |    3 +-
 .../kernels-loop-and-seq-5.c                       |    3 +-
 .../kernels-loop-and-seq-6.c                       |    3 +-
 .../kernels-loop-and-seq.c                         |    4 +-
 .../kernels-loop-collapse.c                        |    3 +-
 .../testsuite/libgomp.oacc-c-c++-common/nested-2.c |    2 +-
 .../libgomp.oacc-fortran/avoid-offloading-1.f      |   32 +++++
 .../libgomp.oacc-fortran/avoid-offloading-2.f      |   41 +++++++
 .../libgomp.oacc-fortran/avoid-offloading-3.f      |   31 +++++
 .../libgomp.oacc-fortran/combined-directives-1.f90 |    5 +-
 .../libgomp.oacc-fortran/non-scalar-data.f90       |    5 +-
 37 files changed, 494 insertions(+), 78 deletions(-)

diff --git gcc/common.opt gcc/common.opt
index 520fa9c..2cf798d 100644
--- gcc/common.opt
+++ gcc/common.opt
@@ -1779,6 +1779,10 @@ Enum(offload_abi) String(ilp32) Value(OFFLOAD_ABI_ILP32)
 EnumValue
 Enum(offload_abi) String(lp64) Value(OFFLOAD_ABI_LP64)
 
+foffload-force
+Common Var(flag_offload_force)
+Force offloading if the compiler wanted to avoid it.
+
 fomit-frame-pointer
 Common Report Var(flag_omit_frame_pointer) Optimization
 When possible do not generate stack frames.
diff --git gcc/config/nvptx/mkoffload.c gcc/config/nvptx/mkoffload.c
index c8eed45..586ee8b 100644
--- gcc/config/nvptx/mkoffload.c
+++ gcc/config/nvptx/mkoffload.c
@@ -41,9 +41,19 @@ const char tool_name[] = "nvptx mkoffload";
 
 #define COMMENT_PREFIX "#"
 
+enum id_map_flag
+  {
+    /* All clear.  */
+    ID_MAP_FLAG_NONE = 0,
+    /* Avoid offloading.  For example, because there is insufficient
+       parallelism.  */
+    ID_MAP_FLAG_AVOID_OFFLOADING = 1
+  };
+
 struct id_map
 {
   id_map *next;
+  int flags;
   char *ptx_name;
 };
 
@@ -107,6 +117,38 @@ record_id (const char *p1, id_map ***where)
     fatal_error (input_location, "malformed ptx file");
 
   id_map *v = XNEW (id_map);
+
+  /* Do we have any flags?  */
+  v->flags = ID_MAP_FLAG_NONE;
+  if (p1[0] == '(')
+    {
+      /* Current flag.  */
+      const char *cur = p1 + 1;
+
+      /* Seek to the beginning of ") ".  */
+      p1 = strchr (cur, ')');
+      if (!p1 || p1 > end || p1[1] != ' ')
+	fatal_error (input_location, "malformed ptx file: "
+		     "expected \") \" at \"%s\"", cur);
+
+      while (cur < p1)
+	{
+	  const char *next = strchr (cur, ',');
+	  if (!next || next > p1)
+	    next = p1;
+
+	  if (strncmp (cur, "avoid offloading", next - cur - 1) == 0)
+	    v->flags |= ID_MAP_FLAG_AVOID_OFFLOADING;
+	  else
+	    fatal_error (input_location, "malformed ptx file: "
+			 "unknown flag at \"%s\"", cur);
+
+	  cur = next;
+	}
+
+      /* Skip past ") ".  */
+      p1 += 2;
+    }
   size_t len = end - p1;
   v->ptx_name = XNEWVEC (char, len + 1);
   memcpy (v->ptx_name, p1, len);
@@ -296,12 +338,17 @@ process (FILE *in, FILE *out)
   fprintf (out, "\n};\n\n");
 
   /* Dump out function idents.  */
+  bool avoid_offloading_p = false;
   fprintf (out, "static const struct nvptx_fn {\n"
 	   "  const char *name;\n"
 	   "  unsigned short dim[%d];\n"
 	   "} func_mappings[] = {\n", GOMP_DIM_MAX);
   for (comma = "", id = func_ids; id; comma = ",", id = id->next)
-    fprintf (out, "%s\n\t{%s}", comma, id->ptx_name);
+    {
+      if (id->flags & ID_MAP_FLAG_AVOID_OFFLOADING)
+	avoid_offloading_p = true;
+      fprintf (out, "%s\n\t{%s}", comma, id->ptx_name);
+    }
   fprintf (out, "\n};\n\n");
 
   fprintf (out,
@@ -318,7 +365,11 @@ process (FILE *in, FILE *out)
 	   "  sizeof (var_mappings) / sizeof (var_mappings[0]),\n"
 	   "  func_mappings,"
 	   "  sizeof (func_mappings) / sizeof (func_mappings[0])\n"
-	   "};\n\n");
+	   "};\n");
+  if (avoid_offloading_p)
+    /* Need a unique handle for target_data.  */
+    fprintf (out, "static int target_data_avoid_offloading;\n");
+  fprintf (out, "\n");
 
   fprintf (out, "#ifdef __cplusplus\n"
 	   "extern \"C\" {\n"
@@ -338,18 +389,28 @@ process (FILE *in, FILE *out)
   fprintf (out, "static __attribute__((constructor)) void init (void)\n"
 	   "{\n"
 	   "  GOMP_offload_register_ver (%#x, __OFFLOAD_TABLE__,"
-	   "%d/*NVIDIA_PTX*/, &target_data);\n"
-	   "};\n",
+	   "%d/*NVIDIA_PTX*/, &target_data);\n",
 	   GOMP_VERSION_PACK (GOMP_VERSION, GOMP_VERSION_NVIDIA_PTX),
 	   GOMP_DEVICE_NVIDIA_PTX);
+  if (avoid_offloading_p)
+    fprintf (out, "  GOMP_offload_register_ver (%#x, (void *) 0,"
+	     "%d/*NVIDIA_PTX*/, &target_data_avoid_offloading);\n",
+	     GOMP_VERSION_PACK (GOMP_VERSION, GOMP_VERSION_NVIDIA_PTX),
+	     GOMP_DEVICE_NVIDIA_PTX);
+  fprintf (out, "};\n");
 
   fprintf (out, "static __attribute__((destructor)) void fini (void)\n"
 	   "{\n"
 	   "  GOMP_offload_unregister_ver (%#x, __OFFLOAD_TABLE__,"
-	   "%d/*NVIDIA_PTX*/, &target_data);\n"
-	   "};\n",
+	   "%d/*NVIDIA_PTX*/, &target_data);\n",
 	   GOMP_VERSION_PACK (GOMP_VERSION, GOMP_VERSION_NVIDIA_PTX),
 	   GOMP_DEVICE_NVIDIA_PTX);
+  if (avoid_offloading_p)
+    fprintf (out, "  GOMP_offload_unregister_ver (%#x, (void *) 0,"
+	     "%d/*NVIDIA_PTX*/, &target_data_avoid_offloading);\n",
+	     GOMP_VERSION_PACK (GOMP_VERSION, GOMP_VERSION_NVIDIA_PTX),
+	     GOMP_DEVICE_NVIDIA_PTX);
+  fprintf (out, "};\n");
 }
 
 static void
diff --git gcc/config/nvptx/nvptx.c gcc/config/nvptx/nvptx.c
index 78614f8..fe28154 100644
--- gcc/config/nvptx/nvptx.c
+++ gcc/config/nvptx/nvptx.c
@@ -3803,6 +3803,9 @@ static const struct attribute_spec nvptx_attribute_table[] =
   /* { name, min_len, max_len, decl_req, type_req, fn_type_req, handler,
        affects_type_identity } */
   { "kernel", 0, 0, true, false,  false, nvptx_handle_kernel_attribute, false },
+  /* Avoid offloading.  For example, because there is insufficient
+     parallelism.  */
+  { "omp avoid offloading", 0, 0, true, false, false, NULL, false },
   { NULL, 0, 0, false, false, false, NULL, false }
 };
 \f
@@ -3867,7 +3870,10 @@ nvptx_record_offload_symbol (tree decl)
 	tree dims = TREE_VALUE (attr);
 	unsigned ix;
 
-	fprintf (asm_out_file, "//:FUNC_MAP \"%s\"",
+	fprintf (asm_out_file, "//:FUNC_MAP %s\"%s\"",
+		 (lookup_attribute ("omp avoid offloading",
+				    DECL_ATTRIBUTES (decl))
+		  ? "(avoid offloading) " : ""),
 		 IDENTIFIER_POINTER (DECL_ASSEMBLER_NAME (decl)));
 
 	for (ix = 0; ix != GOMP_DIM_MAX; ix++, dims = TREE_CHAIN (dims))
@@ -4124,6 +4130,40 @@ nvptx_expand_builtin (tree exp, rtx target, rtx ARG_UNUSED (subtarget),
 static bool
 nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
 {
+  /* Detect if a function is unsuitable for offloading.  */
+  if (!flag_offload_force && decl)
+    {
+      tree oacc_function_attr = get_oacc_fn_attrib (decl);
+      if (oacc_function_attr
+	  && oacc_fn_attrib_kernels_p (oacc_function_attr))
+	{
+	  bool avoid_offloading_p = true;
+	  for (unsigned ix = 0; ix != GOMP_DIM_MAX; ix++)
+	    {
+	      if (dims[ix] > 1)
+		{
+		  avoid_offloading_p = false;
+		  break;
+		}
+	    }
+	  if (avoid_offloading_p)
+	    {
+	      /* OpenACC kernels constructs will never be parallelized for
+		 optimization levels smaller than -O2; avoid the diagnostic in
+		 this case.  */
+	      if (optimize >= 2)
+		warning_at (DECL_SOURCE_LOCATION (decl), 0,
+			    "OpenACC kernels construct will be executed "
+			    "sequentially; will by default avoid offloading "
+			    "to prevent data copy penalty");
+	      DECL_ATTRIBUTES (decl)
+		= tree_cons (get_identifier ("omp avoid offloading"),
+			     NULL_TREE, DECL_ATTRIBUTES (decl));
+
+	    }
+	}
+    }
+
   bool changed = false;
 
   /* The vector size must be 32, unless this is a SEQ routine.  */
diff --git gcc/doc/invoke.texi gcc/doc/invoke.texi
index fcc404e..c09fbc5 100644
--- gcc/doc/invoke.texi
+++ gcc/doc/invoke.texi
@@ -180,7 +180,8 @@ in the following sections.
 @gccoptlist{-ansi  -std=@var{standard}  -fgnu89-inline @gol
 -aux-info @var{filename} -fallow-parameterless-variadic-functions @gol
 -fno-asm  -fno-builtin  -fno-builtin-@var{function} @gol
--fhosted  -ffreestanding -fopenacc -fopenmp -fopenmp-simd @gol
+-fhosted  -ffreestanding @gol
+-foffload-force -fopenacc -fopenacc-dim=@var{geom} -fopenmp -fopenmp-simd @gol
 -fms-extensions -fplan9-extensions -fsso-struct=@var{endianness}
 -fallow-single-precision  -fcond-mismatch -flax-vector-conversions @gol
 -fsigned-bitfields  -fsigned-char @gol
@@ -1953,6 +1954,15 @@ This is equivalent to @option{-fno-hosted}.
 @xref{Standards,,Language Standards Supported by GCC}, for details of
 freestanding and hosted environments.
 
+@item -foffload-force
+@opindex foffload-force
+The option @option{-foffload-force} forces offloading if the compiler
+wanted to avoid it.  For example, when there isn't sufficient
+parallelism in certain offloading constructs, the compiler may come to
+the conclusion that offloading incurs too much overhead (for data
+transfers, for example), and unless overridden with this flag, it then
+suggests to the runtime (libgomp) to avoid offloading.
+
 @item -fopenacc
 @opindex fopenacc
 @cindex OpenACC accelerator programming
diff --git gcc/lto-wrapper.c gcc/lto-wrapper.c
index ced6f2f..702ae47 100644
--- gcc/lto-wrapper.c
+++ gcc/lto-wrapper.c
@@ -275,6 +275,7 @@ merge_and_complain (struct cl_decoded_option **decoded_options,
 	case OPT_fsigned_zeros:
 	case OPT_ftrapping_math:
 	case OPT_fwrapv:
+	case OPT_foffload_force:
 	case OPT_fopenmp:
 	case OPT_fopenacc:
 	case OPT_fcilkplus:
@@ -517,6 +518,7 @@ append_compiler_options (obstack *argv_obstack, struct cl_decoded_option *opts,
 	case OPT_fsigned_zeros:
 	case OPT_ftrapping_math:
 	case OPT_fwrapv:
+	case OPT_foffload_force:
 	case OPT_fopenmp:
 	case OPT_fopenacc:
 	case OPT_fopenacc_dim_:
diff --git libgomp/libgomp.h libgomp/libgomp.h
index 7108a6d..8747b72 100644
--- libgomp/libgomp.h
+++ libgomp/libgomp.h
@@ -984,6 +984,7 @@ extern void gomp_unmap_vars (struct target_mem_desc *, bool);
 extern void gomp_init_device (struct gomp_device_descr *);
 extern void gomp_free_memmap (struct splay_tree_s *);
 extern void gomp_unload_device (struct gomp_device_descr *);
+extern bool gomp_offload_target_available_p (int);
 
 /* work.c */
 
diff --git libgomp/libgomp.texi libgomp/libgomp.texi
index 987ee5f..5795c00 100644
--- libgomp/libgomp.texi
+++ libgomp/libgomp.texi
@@ -1815,6 +1815,14 @@ flag @option{-fopenacc} must be specified.  This enables the OpenACC directive
 arranges for automatic linking of the OpenACC runtime library 
 (@ref{OpenACC Runtime Library Routines}).
 
+Offloading is enabled by default.  In some cases, the compiler may
+come to the conclusion that offloading incurs too much overhead, and
+suggest to the runtime to avoid it.  To counteract that, you can use
+the option @option{-foffload-force} to force offloading in such cases.
+Alternatively, offloading is also enabled if a specific device type is
+requested, in a call to @code{acc_init} or by setting the
+@env{ACC_DEVICE_TYPE} environment variable, for example.
+
 A complete description of all OpenACC directives accepted may be found in 
 the @uref{http://www.openacc.org/, OpenACC} Application Programming
 Interface manual, version 2.0.
diff --git libgomp/oacc-init.c libgomp/oacc-init.c
index 42d005d..2f053f3 100644
--- libgomp/oacc-init.c
+++ libgomp/oacc-init.c
@@ -122,7 +122,10 @@ resolve_device (acc_device_t d, bool fail_is_error)
       {
 	if (goacc_device_type)
 	  {
-	    /* Lookup the named device.  */
+	    /* Lookup the device that has been explicitly named, so do not pay
+	       attention to gomp_offload_target_available_p.  (That is,
+	       enforced usage even with an "avoid offloading" flag set, and
+	       hard error if not actually available.)  */
 	    while (++d != _ACC_device_hwm)
 	      if (dispatchers[d]
 		  && !strcasecmp (goacc_device_type,
@@ -148,8 +151,15 @@ resolve_device (acc_device_t d, bool fail_is_error)
     case acc_device_not_host:
       /* Find the first available device after acc_device_not_host.  */
       while (++d != _ACC_device_hwm)
-	if (dispatchers[d] && dispatchers[d]->get_num_devices_func () > 0)
+	if (dispatchers[d]
+	    && dispatchers[d]->get_num_devices_func () > 0
+	    /* No device has been explicitly named, so pay attention to
+	       gomp_offload_target_available_p, to not decide on an offload
+	       target that we don't have offload data available for, or have an
+	       "avoid offloading" flag set for.  */
+	    && gomp_offload_target_available_p (dispatchers[d]->type))
 	  goto found;
+      /* No non-host device found.  */
       if (d_arg == acc_device_default)
 	{
 	  d = acc_device_host;
@@ -168,7 +178,7 @@ resolve_device (acc_device_t d, bool fail_is_error)
       break;
 
     default:
-      if (d > _ACC_device_hwm)
+      if (d >= _ACC_device_hwm)
 	{
 	  if (fail_is_error)
 	    goto unsupported_device;
@@ -181,7 +191,8 @@ resolve_device (acc_device_t d, bool fail_is_error)
 
   assert (d != acc_device_none
 	  && d != acc_device_default
-	  && d != acc_device_not_host);
+	  && d != acc_device_not_host
+	  && d < _ACC_device_hwm);
 
   if (dispatchers[d] == NULL && fail_is_error)
     {
diff --git libgomp/target.c libgomp/target.c
index 96fe3d5..afcbedb 100644
--- libgomp/target.c
+++ libgomp/target.c
@@ -1165,12 +1165,19 @@ gomp_unload_image_from_device (struct gomp_device_descr *devicep,
 
 /* This function should be called from every offload image while loading.
    It gets the descriptor of the host func and var tables HOST_TABLE, TYPE of
-   the target, and TARGET_DATA needed by target plugin.  */
+   the target, and TARGET_DATA needed by target plugin.
+
+   If HOST_TABLE is NULL, this image (TARGET_DATA) is stored as an "avoid
+   offloading" flag, and the TARGET_TYPE will not be considered by default
+   until this image gets unregistered.  */
 
 void
 GOMP_offload_register_ver (unsigned version, const void *host_table,
 			   int target_type, const void *target_data)
 {
+  gomp_debug (0, "%s (%u, %p, %d, %p)\n", __FUNCTION__,
+	      version, host_table, target_type, target_data);
+
   int i;
 
   if (GOMP_VERSION_LIB (version) > GOMP_VERSION)
@@ -1179,16 +1186,19 @@ GOMP_offload_register_ver (unsigned version, const void *host_table,
   
   gomp_mutex_lock (&register_lock);
 
-  /* Load image to all initialized devices.  */
-  for (i = 0; i < num_devices; i++)
+  if (host_table != NULL)
     {
-      struct gomp_device_descr *devicep = &devices[i];
-      gomp_mutex_lock (&devicep->lock);
-      if (devicep->type == target_type
-	  && devicep->state == GOMP_DEVICE_INITIALIZED)
-	gomp_load_image_to_device (devicep, version,
-				   host_table, target_data, true);
-      gomp_mutex_unlock (&devicep->lock);
+      /* Load image to all initialized devices.  */
+      for (i = 0; i < num_devices; i++)
+	{
+	  struct gomp_device_descr *devicep = &devices[i];
+	  gomp_mutex_lock (&devicep->lock);
+	  if (devicep->type == target_type
+	      && devicep->state == GOMP_DEVICE_INITIALIZED)
+	    gomp_load_image_to_device (devicep, version,
+				       host_table, target_data, true);
+	  gomp_mutex_unlock (&devicep->lock);
+	}
     }
 
   /* Insert image to array of pending images.  */
@@ -1214,26 +1224,36 @@ GOMP_offload_register (const void *host_table, int target_type,
 
 /* This function should be called from every offload image while unloading.
    It gets the descriptor of the host func and var tables HOST_TABLE, TYPE of
-   the target, and TARGET_DATA needed by target plugin.  */
+   the target, and TARGET_DATA needed by target plugin.
+
+   If HOST_TABLE is NULL, the "avoid offloading" flag gets cleared for this
+   image (TARGET_DATA), and this TARGET_TYPE may again be considered by
+   default.  */
 
 void
 GOMP_offload_unregister_ver (unsigned version, const void *host_table,
 			     int target_type, const void *target_data)
 {
+  gomp_debug (0, "%s (%u, %p, %d, %p)\n", __FUNCTION__,
+	      version, host_table, target_type, target_data);
+
   int i;
 
   gomp_mutex_lock (&register_lock);
 
-  /* Unload image from all initialized devices.  */
-  for (i = 0; i < num_devices; i++)
+  if (host_table != NULL)
     {
-      struct gomp_device_descr *devicep = &devices[i];
-      gomp_mutex_lock (&devicep->lock);
-      if (devicep->type == target_type
-	  && devicep->state == GOMP_DEVICE_INITIALIZED)
-	gomp_unload_image_from_device (devicep, version,
-				       host_table, target_data);
-      gomp_mutex_unlock (&devicep->lock);
+      /* Unload image from all initialized devices.  */
+      for (i = 0; i < num_devices; i++)
+	{
+	  struct gomp_device_descr *devicep = &devices[i];
+	  gomp_mutex_lock (&devicep->lock);
+	  if (devicep->type == target_type
+	      && devicep->state == GOMP_DEVICE_INITIALIZED)
+	    gomp_unload_image_from_device (devicep, version,
+					   host_table, target_data);
+	  gomp_mutex_unlock (&devicep->lock);
+	}
     }
 
   /* Remove image from array of pending images.  */
@@ -1267,7 +1287,8 @@ gomp_init_device (struct gomp_device_descr *devicep)
   for (i = 0; i < num_offload_images; i++)
     {
       struct offload_image_descr *image = &offload_images[i];
-      if (image->type == devicep->type)
+      if (image->type == devicep->type
+	  && image->host_table != NULL)
 	gomp_load_image_to_device (devicep, image->version,
 				   image->host_table, image->target_data,
 				   false);
@@ -1287,7 +1308,8 @@ gomp_unload_device (struct gomp_device_descr *devicep)
       for (i = 0; i < num_offload_images; i++)
 	{
 	  struct offload_image_descr *image = &offload_images[i];
-	  if (image->type == devicep->type)
+	  if (image->type == devicep->type
+	      && image->host_table != NULL)
 	    gomp_unload_image_from_device (devicep, image->version,
 					   image->host_table,
 					   image->target_data);
@@ -1311,6 +1333,62 @@ gomp_free_memmap (struct splay_tree_s *mem_map)
     }
 }
 
+/* Do we have offload data available for the given offload target type?
+   Instead of verifying that *all* offload data is available that could
+   possibly be required, we instead just look for *any*.  If we later find any
+   offload data missing, that's user error.  If any offload data of this target
+   type is tagged with an "avoid offloading" flag, do not consider this target
+   type available unless it has been initialized already.  */
+
+attribute_hidden bool
+gomp_offload_target_available_p (int type)
+{
+  bool available = false;
+
+  /* Has the offload target type already been initialized?  */
+  for (int i = 0; !available && i < num_devices; i++)
+    {
+      struct gomp_device_descr *devicep = &devices[i];
+      gomp_mutex_lock (&devicep->lock);
+      if (devicep->type == type
+	  && devicep->state == GOMP_DEVICE_INITIALIZED)
+	available = true;
+      gomp_mutex_unlock (&devicep->lock);
+    }
+
+  /* If the offload target type has been initialized already, we ignore "avoid
+     offloading" flags.  This is important, because data/state may be present
+     on the device, that we must continue to use.  */
+  if (!available)
+    {
+      gomp_mutex_lock (&register_lock);
+      if (num_offload_images == 0)
+	/* If there is no offload data available at all, there is no way to
+	   later fail to find any of it for a specific offload target type.
+	   This is the case where there are no offloaded code regions in user
+	   code, but the target type can be initialized successfully, and
+	   executable directives be used, or runtime library calls be
+	   made.  */
+	available = true;
+      else
+	{
+	  /* Can the offload target be initialized?  */
+	  for (int i = 0; !available && i < num_offload_images; i++)
+	    if (offload_images[i].type == type
+		&& offload_images[i].host_table != NULL)
+	      available = true;
+	  /* If yes, is an "avoid offloading" flag set?  */
+	  for (int i = 0; available && i < num_offload_images; i++)
+	    if (offload_images[i].type == type
+		&& offload_images[i].host_table == NULL)
+	      available = false;
+	}
+      gomp_mutex_unlock (&register_lock);
+    }
+
+  return available;
+}
+
 /* Host fallback for GOMP_target{,_ext} routines.  */
 
 static void
diff --git libgomp/testsuite/lib/libgomp.exp libgomp/testsuite/lib/libgomp.exp
index a4c9d83..8d2be80 100644
--- libgomp/testsuite/lib/libgomp.exp
+++ libgomp/testsuite/lib/libgomp.exp
@@ -344,6 +344,16 @@ proc check_effective_target_offload_device_nonshared_as { } {
     } ]
 }
 
+# Return 1 if the compiler has been configured for nvptx offloading.
+
+proc check_effective_target_nvptx_offloading_configured { } {
+    # PR libgomp/65099: Currently, we only support offloading in 64-bit
+    # configurations.
+    global offload_targets
+    return [expr [string match "*,nvptx,*" ",$offload_targets,"] \
+		&& [is-effective-target lp64] ]
+}
+
 # Return 1 if at least one nvidia board is present.
 
 proc check_effective_target_openacc_nvidia_accel_present { } {
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/abort-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/abort-3.c
index bca425e..23156d8 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/abort-3.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/abort-3.c
@@ -1,5 +1,3 @@
-/* { dg-do run } */
-
 #include <stdio.h>
 #include <stdlib.h>
 
@@ -7,7 +5,7 @@ int
 main (void)
 {
   fprintf (stderr, "CheCKpOInT\n");
-#pragma acc kernels
+#pragma acc kernels /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   {
     abort ();
   }
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/abort-4.c libgomp/testsuite/libgomp.oacc-c-c++-common/abort-4.c
index c29ca3f..f4d6a07 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/abort-4.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/abort-4.c
@@ -1,12 +1,10 @@
-/* { dg-do run } */
-
 #include <stdlib.h>
 
 int
 main (int argc, char **argv)
 {
 
-#pragma acc kernels
+#pragma acc kernels /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   {
     if (argc != 1)
       abort ();
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c
new file mode 100644
index 0000000..08745fc
--- /dev/null
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c
@@ -0,0 +1,28 @@
+/* Test that the compiler decides to "avoid offloading".  */
+
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* The ACC_DEVICE_TYPE environment variable gets set in the testing
+   framework, and that overrides the "avoid offloading" flag at run time.
+   { dg-xfail-run-if "TODO" { openacc_nvidia_accel_selected } } */
+
+#include <openacc.h>
+
+int main(void)
+{
+  int x, y;
+
+#pragma acc data copyout(x, y)
+#pragma acc kernels /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
+  *((volatile int *) &x) = 33, y = acc_on_device (acc_device_host);
+
+  if (x != 33)
+    __builtin_abort();
+#if defined ACC_DEVICE_TYPE_host || defined ACC_DEVICE_TYPE_nvidia
+  if (y != 1)
+    __builtin_abort();
+#else
+# error Not ported to this ACC_DEVICE_TYPE
+#endif
+
+  return 0;
+}
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c
new file mode 100644
index 0000000..724228a
--- /dev/null
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c
@@ -0,0 +1,38 @@
+/* Test that a user can override the compiler's "avoid offloading"
+   decision at run time.  */
+
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <openacc.h>
+
+int main(void)
+{
+  /* Override the compiler's "avoid offloading" decision.  */
+  acc_device_t d;
+#if defined ACC_DEVICE_TYPE_nvidia
+  d = acc_device_nvidia;
+#elif defined ACC_DEVICE_TYPE_host
+  d = acc_device_host;
+#else
+# error Not ported to this ACC_DEVICE_TYPE
+#endif
+  acc_init (d);
+
+  int x, y;
+
+#pragma acc data copyout(x, y)
+#pragma acc kernels /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
+  *((volatile int *) &x) = 33, y = acc_on_device (acc_device_host);
+
+  if (x != 33)
+    __builtin_abort();
+#if defined ACC_DEVICE_TYPE_nvidia
+  if (y != 0)
+    __builtin_abort();
+#else
+  if (y != 1)
+    __builtin_abort();
+#endif
+
+  return 0;
+}
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c
new file mode 100644
index 0000000..2fb5196
--- /dev/null
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c
@@ -0,0 +1,29 @@
+/* Test that a user can override the compiler's "avoid offloading"
+   decision at compile time.  */
+
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
+
+#include <openacc.h>
+
+int main(void)
+{
+  int x, y;
+
+#pragma acc data copyout(x, y)
+#pragma acc kernels
+  *((volatile int *) &x) = 33, y = acc_on_device (acc_device_host);
+
+  if (x != 33)
+    __builtin_abort();
+#if defined ACC_DEVICE_TYPE_nvidia
+  if (y != 0)
+    __builtin_abort();
+#else
+  if (y != 1)
+    __builtin_abort();
+#endif
+
+  return 0;
+}
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
index dad6d13..87ca378 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
@@ -1,6 +1,6 @@
 /* This test exercises combined directives.  */
 
-/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 
 #include <stdlib.h>
 
@@ -33,7 +33,7 @@ main (int argc, char **argv)
 	abort ();
     }
 
-#pragma acc kernels loop copy (a[0:N]) copy (b[0:N])
+#pragma acc kernels loop copy (a[0:N]) copy (b[0:N]) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   for (i = 0; i < N; i++)
     {
       b[i] = 3.0;
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/default-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/default-1.c
index 1ac0b95..8f0144c 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/default-1.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/default-1.c
@@ -1,4 +1,4 @@
-/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 
 #include  <openacc.h>
 
@@ -51,7 +51,7 @@ int test_kernels ()
     ary[i] = ~0;
 
   /* val defaults to copy, ary defaults to copy.  */
-#pragma acc kernels copy(ondev)
+#pragma acc kernels copy(ondev) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   {
     ondev = acc_on_device (acc_device_not_host);
 #pragma acc loop 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/deviceptr-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/deviceptr-1.c
index e271a37..9a5f7b1 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/deviceptr-1.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/deviceptr-1.c
@@ -1,5 +1,3 @@
-/* { dg-do run } */
-
 #include <stdlib.h>
 
 int main (void)
@@ -10,7 +8,7 @@ int main (void)
   a = A;
 
 #pragma acc data copyout (a_1, a_2)
-#pragma acc kernels deviceptr (a)
+#pragma acc kernels deviceptr (a) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   {
     a_1 = a;
     a_2 = &a;
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c
index 51745ba..3ef6f9b 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c
@@ -1,4 +1,5 @@
 /* { dg-do run { target openacc_nvidia_accel_selected } } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 /* { dg-additional-options "-lcuda -lcublas -lcudart" } */
 
 #include <stdlib.h>
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-1.c
index 3acfdf5..614ad33 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-1.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-1.c
@@ -1,4 +1,4 @@
-/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 
 #include <stdlib.h>
 
@@ -73,7 +73,7 @@ int main (void)
   i = -1;
   j = -2;
   v = 0;
-#pragma acc kernels /* copyout */ present_or_copyout (v) present_or_copyin (i, j)
+#pragma acc kernels /* copyout */ present_or_copyout (v) present_or_copyin (i, j) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   {
     if (i != -1 || j != -2)
       abort ();
@@ -96,7 +96,7 @@ int main (void)
   i = -1;
   j = -2;
   v = 0;
-#pragma acc kernels /* copyout */ present_or_copyout (v) present_or_copyout (i, j)
+#pragma acc kernels /* copyout */ present_or_copyout (v) present_or_copyout (i, j) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   {
     i = 2;
     j = 1;
@@ -110,7 +110,7 @@ int main (void)
   i = -1;
   j = -2;
   v = 0;
-#pragma acc kernels /* copyout */ present_or_copyout (v) present_or_copy (i, j)
+#pragma acc kernels /* copyout */ present_or_copyout (v) present_or_copy (i, j) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   {
     if (i != -1 || j != -2)
       abort ();
@@ -126,7 +126,7 @@ int main (void)
   i = -1;
   j = -2;
   v = 0;
-#pragma acc kernels /* copyout */ present_or_copyout (v) present_or_create (i, j)
+#pragma acc kernels /* copyout */ present_or_copyout (v) present_or_create (i, j) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   {
     i = 2;
     j = 1;
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c
index 0f323c8..8d5101d 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c
@@ -1,4 +1,4 @@
-/* { dg-additional-options "-O2 -fipa-pta" } */
+/* { dg-additional-options "-fipa-pta" } */
 
 #include <stdlib.h>
 
@@ -11,7 +11,7 @@ main (void)
   unsigned int *b = (unsigned int *)malloc (N * sizeof (unsigned int));
   unsigned int *c = (unsigned int *)malloc (N * sizeof (unsigned int));
 
-#pragma acc kernels pcopyout (a[0:N], b[0:N], c[0:N])
+#pragma acc kernels pcopyout (a[0:N], b[0:N], c[0:N]) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   {
     a[0] = 0;
     b[0] = 1;
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c
index 654e750..3726b0c 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c
@@ -1,4 +1,4 @@
-/* { dg-additional-options "-O2 -fipa-pta" } */
+/* { dg-additional-options "-fipa-pta" } */
 
 #include <stdlib.h>
 
@@ -11,7 +11,7 @@ main (void)
   unsigned int *b = a;
   unsigned int *c = (unsigned int *)malloc (N * sizeof (unsigned int));
 
-#pragma acc kernels pcopyout (a[0:N], b[0:N], c[0:N])
+#pragma acc kernels pcopyout (a[0:N], b[0:N], c[0:N]) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   {
     a[0] = 0;
     b[0] = 1;
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c
index 44d4fd2..eea4f76 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c
@@ -1,4 +1,4 @@
-/* { dg-additional-options "-O2 -fipa-pta" } */
+/* { dg-additional-options "-fipa-pta" } */
 
 #include <stdlib.h>
 
@@ -11,7 +11,7 @@ main (void)
   unsigned int b[N];
   unsigned int c[N];
 
-#pragma acc kernels pcopyout (a, b, c)
+#pragma acc kernels pcopyout (a, b, c) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   {
     a[0] = 0;
     b[0] = 1;
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-empty.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-empty.c
index a68a7cd..860b6da 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-empty.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-empty.c
@@ -1,6 +1,6 @@
 int
 main (void)
 {
-#pragma acc kernels
+#pragma acc kernels /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   ;
 }
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
index 2e4100f..5cdc200 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
@@ -1,4 +1,3 @@
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
 
 #include <stdlib.h>
@@ -8,7 +7,7 @@
 unsigned int
 foo (int n, unsigned int *a)
 {
-#pragma acc kernels copy (a[0:N])
+#pragma acc kernels copy (a[0:N]) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   {
     a[0] = a[0] + 1;
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c
index b3e736b..2e4d4d2 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c
@@ -1,4 +1,3 @@
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
 
 #include <stdlib.h>
@@ -8,8 +7,7 @@
 unsigned int
 foo (int n, unsigned int *a)
 {
-
-#pragma acc kernels copy (a[0:N])
+#pragma acc kernels copy (a[0:N]) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   {
     for (int i = 0; i < n; i++)
       a[i] = 1;
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c
index 8b9affa..5bf00db 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c
@@ -1,4 +1,3 @@
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
 
 #include <stdlib.h>
@@ -8,7 +7,7 @@
 unsigned int
 foo (int n, unsigned int *a)
 {
-#pragma acc kernels copy (a[0:N])
+#pragma acc kernels copy (a[0:N]) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   {
     a[0] = 2;
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
index 83d4e7f..d39b667 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
@@ -1,4 +1,3 @@
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
 
 #include <stdlib.h>
@@ -9,7 +8,7 @@ unsigned int
 foo (int n, unsigned int *a)
 {
   int r;
-#pragma acc kernels copyout(r) copy (a[0:N])
+#pragma acc kernels copyout(r) copy (a[0:N]) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   {
     r = a[0];
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
index 01d5e5e..bb2e85b 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
@@ -1,4 +1,3 @@
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
 
 #include <stdlib.h>
@@ -8,7 +7,7 @@
 unsigned int
 foo (int n, unsigned int *a)
 {
-#pragma acc kernels copy (a[0:N])
+#pragma acc kernels copy (a[0:N]) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   {
     int r = a[0];
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
index 61d1283..e513827 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
@@ -1,4 +1,3 @@
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
 
 #include <stdlib.h>
@@ -8,8 +7,7 @@
 unsigned int
 foo (int n, unsigned int *a)
 {
-
-#pragma acc kernels copy (a[0:N])
+#pragma acc kernels copy (a[0:N]) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   {
     for (int i = 0; i < n; i++)
       a[i] = 1;
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
index f7f04cb..c4791a4 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
@@ -1,4 +1,3 @@
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
 
 #include <stdlib.h>
@@ -11,7 +10,7 @@ void __attribute__((noinline, noclone))
 foo (int m, int n)
 {
   int i, j;
-  #pragma acc kernels
+  #pragma acc kernels /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   {
 #pragma acc loop collapse(2)
     for (i = 0; i < m; i++)
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c
index c164598..94a5ae2 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c
@@ -1,4 +1,4 @@
-/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 
 #include <stdlib.h>
 
diff --git libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f
new file mode 100644
index 0000000..5f18b94
--- /dev/null
+++ libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f
@@ -0,0 +1,32 @@
+! Test that the compiler decides to "avoid offloading".
+
+! { dg-do run }
+! { dg-additional-options "-cpp" }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
+! The "avoid offloading" warning is only triggered for -O2 and higher.
+! { dg-xfail-if "n/a" { nvptx_offloading_configured } { "-O0" "-O1" } { "" } }
+! The ACC_DEVICE_TYPE environment variable gets set in the testing
+! framework, and that overrides the "avoid offloading" flag at run time.
+! { dg-xfail-run-if "TODO" { openacc_nvidia_accel_selected } }
+
+      IMPLICIT NONE
+      INCLUDE "openacc_lib.h"
+
+      INTEGER, VOLATILE :: X
+      LOGICAL :: Y
+
+!$ACC DATA COPYOUT(X, Y)
+!$ACC KERNELS ! { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } }
+      X = 33
+      Y = ACC_ON_DEVICE (ACC_DEVICE_HOST);
+!$ACC END KERNELS
+!$ACC END DATA
+
+      IF (X .NE. 33) CALL ABORT
+#if defined ACC_DEVICE_TYPE_host || defined ACC_DEVICE_TYPE_nvidia
+      IF (.NOT. Y) CALL ABORT
+#else
+# error Not ported to this ACC_DEVICE_TYPE
+#endif
+
+      END
diff --git libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-2.f libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-2.f
new file mode 100644
index 0000000..51801ad
--- /dev/null
+++ libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-2.f
@@ -0,0 +1,41 @@
+! Test that a user can override the compiler's "avoid offloading"
+! decision at run time.
+
+! { dg-do run }
+! { dg-additional-options "-cpp" }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
+! The "avoid offloading" warning is only triggered for -O2 and higher.
+! { dg-xfail-if "n/a" { nvptx_offloading_configured } { "-O0" "-O1" } { "" } }
+
+      IMPLICIT NONE
+      INCLUDE "openacc_lib.h"
+
+      INTEGER :: D
+      INTEGER, VOLATILE :: X
+      LOGICAL :: Y
+
+!     Override the compiler's "avoid offloading" decision.
+#if defined ACC_DEVICE_TYPE_nvidia
+      D = ACC_DEVICE_NVIDIA
+#elif defined ACC_DEVICE_TYPE_host
+      D = ACC_DEVICE_HOST
+#else
+# error Not ported to this ACC_DEVICE_TYPE
+#endif
+      CALL ACC_INIT (D)
+
+!$ACC DATA COPYOUT(X, Y)
+!$ACC KERNELS ! { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } }
+      X = 33
+      Y = ACC_ON_DEVICE (ACC_DEVICE_HOST)
+!$ACC END KERNELS
+!$ACC END DATA
+
+      IF (X .NE. 33) CALL ABORT
+#if defined ACC_DEVICE_TYPE_nvidia
+      IF (Y) CALL ABORT
+#else
+      IF (.NOT. Y) CALL ABORT
+#endif
+
+      END
diff --git libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-3.f libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-3.f
new file mode 100644
index 0000000..bea6ab8
--- /dev/null
+++ libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-3.f
@@ -0,0 +1,31 @@
+! Test that a user can override the compiler's "avoid offloading"
+! decision at compile time.
+
+! { dg-do run }
+! { dg-additional-options "-cpp" }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
+! Override the compiler's "avoid offloading" decision.
+! { dg-additional-options "-foffload-force" }
+
+      IMPLICIT NONE
+      INCLUDE "openacc_lib.h"
+
+      INTEGER :: D
+      INTEGER, VOLATILE :: X
+      LOGICAL :: Y
+
+!$ACC DATA COPYOUT(X, Y)
+!$ACC KERNELS
+      X = 33
+      Y = ACC_ON_DEVICE (ACC_DEVICE_HOST)
+!$ACC END KERNELS
+!$ACC END DATA
+
+      IF (X .NE. 33) CALL ABORT
+#if defined ACC_DEVICE_TYPE_nvidia
+      IF (Y) CALL ABORT
+#else
+      IF (.NOT. Y) CALL ABORT
+#endif
+
+      END
diff --git libgomp/testsuite/libgomp.oacc-fortran/combined-directives-1.f90 libgomp/testsuite/libgomp.oacc-fortran/combined-directives-1.f90
index 94100b2..4b52579 100644
--- libgomp/testsuite/libgomp.oacc-fortran/combined-directives-1.f90
+++ libgomp/testsuite/libgomp.oacc-fortran/combined-directives-1.f90
@@ -1,6 +1,9 @@
 ! This test exercises combined directives.
 
 ! { dg-do run }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
+! The "avoid offloading" warning is only triggered for -O2 and higher.
+! { dg-xfail-if "n/a" { nvptx_offloading_configured } { "-O0" "-O1" } { "" } }
 
 program main
   integer, parameter :: n = 32
@@ -27,7 +30,7 @@ program main
   !$acc kernels loop copy (a(1:n)) copy (b(1:n))
   do i = 1, n
     b(i) = 3.0;
-    a(i) = a(i) + b(i)
+    a(i) = a(i) + b(i) ! { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } }
   end do
 
   do i = 1, n
diff --git libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90 libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90
index 4afb562..b9298c7 100644
--- libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90
+++ libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90
@@ -2,6 +2,9 @@
 ! offloaded regions are properly mapped using present_or_copy.
 
 ! { dg-do run }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
+! The "avoid offloading" warning is only triggered for -O2 and higher.
+! { dg-xfail-if "n/a" { nvptx_offloading_configured } { "-O0" "-O1" } { "" } }
 
 program main
   implicit none
@@ -30,7 +33,7 @@ subroutine kernels (array, n)
   integer, dimension (n) :: array
   integer :: n, i
 
-  !$acc kernels
+  !$acc kernels ! { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } }
   do i = 1, n
      array(i) = i
   end do


Regards
 Thomas


* Re: Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading"
  2016-02-04 14:47               ` Thomas Schwinge
@ 2016-02-10 11:51                 ` Thomas Schwinge
  2016-02-10 13:25                   ` Bernd Schmidt
  0 siblings, 1 reply; 25+ messages in thread
From: Thomas Schwinge @ 2016-02-10 11:51 UTC (permalink / raw)
  To: Bernd Schmidt, Jakub Jelinek; +Cc: gcc-patches, Tom de Vries

Hi!

Ping.

On Thu, 04 Feb 2016 15:47:25 +0100, I wrote:
> Here is the patch re-worked for trunk.  Instead of passing
> -foffload-force in the affected libgomp test cases, I chose to
> have them expect the warning.  This way, we're testing more in line with
> what users will be doing, and we'll notice how the OpenACC kernels
> handling improves once parloops becomes able to parallelize more offloaded
> code (and the "avoid offloading" handling no longer triggers).  OK to
> commit?
> 
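For readers skimming the patch quoted below, the run-time decision it adds to libgomp can be summarized by the following stand-alone sketch.  This is not code from the patch: "struct image" and "target_available_p" are hypothetical simplifications of libgomp's offload image array and of gomp_offload_target_available_p, the "already_initialized" parameter stands in for the scan over initialized devices, and all locking is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a registered offload image.  */
struct image
{
  int type;               /* Offload target type.  */
  const void *host_table; /* NULL marks an "avoid offloading" flag.  */
};

/* Should offload target TYPE be considered by default?  Yes if it has
   already been initialized, or if no images are registered at all, or
   if it has at least one real image and no "avoid offloading" marker.  */
static bool
target_available_p (const struct image *images, size_t n, int type,
                    bool already_initialized)
{
  assert (images != NULL || n == 0);

  /* Data/state may already live on the device; keep using it.  */
  if (already_initialized)
    return true;

  /* No offloaded code regions in the program at all.  */
  if (n == 0)
    return true;

  bool have_image = false;
  bool avoid = false;
  for (size_t i = 0; i < n; i++)
    if (images[i].type == type)
      {
        if (images[i].host_table != NULL)
          have_image = true;
        else
          avoid = true;
      }

  return have_image && !avoid;
}
```

With one real image of a given type the function returns true; adding a second entry of the same type with a NULL host_table makes it return false, unless the target type has been initialized already.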
> commit acd66946777671486a0f69706b25a3ec5f877306
> Author: Thomas Schwinge <thomas@codesourcery.com>
> Date:   Tue Feb 2 20:41:42 2016 +0100
> 
>     Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading"
>     
>     	gcc/
>     	* common.opt: Add -foffload-force.
>     	* lto-wrapper.c (merge_and_complain, append_compiler_options):
>     	Handle it.
>     	* doc/invoke.texi: Document it.
>     	* config/nvptx/mkoffload.c (struct id_map): Add "flags" member.
>     	(record_id): Parse, and set it.
>     	(process): Use it.
>     	* config/nvptx/nvptx.c (nvptx_attribute_table): Add "omp avoid
>     	offloading".
>     	(nvptx_record_offload_symbol): Use it.
>     	(nvptx_goacc_validate_dims): Set it.
>     	libgomp/
>     	* libgomp.h (gomp_offload_target_available_p): New function
>     	declaration.
>     	* target.c (gomp_offload_target_available_p): New function
>     	definition.
>     	(GOMP_offload_register_ver, GOMP_offload_unregister_ver)
>     	(gomp_init_device, gomp_unload_device): Handle and document "avoid
>     	offloading" flag ("host_table == NULL").
>     	(resolve_device): Document "avoid offloading".
>     	* oacc-init.c (resolve_device): Likewise.
>     	* libgomp.texi (Enabling OpenACC): Likewise.
>     	* testsuite/lib/libgomp.exp
>     	(check_effective_target_nvptx_offloading_configured): New proc
>     	definition.
>     	* testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c: New
>     	file.
>     	* testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-fortran/avoid-offloading-1.f: Likewise.
>     	* testsuite/libgomp.oacc-fortran/avoid-offloading-2.f: Likewise.
>     	* testsuite/libgomp.oacc-fortran/avoid-offloading-3.f: Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/abort-3.c: Expect warning.
>     	* testsuite/libgomp.oacc-c-c++-common/abort-4.c: Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/default-1.c: Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/deviceptr-1.c: Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-1.c: Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-empty.c: Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-fortran/combined-directives-1.f90:
>     	Likewise.
>     	* testsuite/libgomp.oacc-fortran/non-scalar-data.f90: Likewise.
>     
>     	libgomp/
>     	* testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c: Set
>     	"-ftree-parallelize-loops=32".
>     	* testsuite/libgomp.oacc-c-c++-common/default-1.c: Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/host_data-1.c: Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-1.c: Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/nested-2.c: Likewise.
> ---
>  gcc/common.opt                                     |    4 +
>  gcc/config/nvptx/mkoffload.c                       |   73 +++++++++++-
>  gcc/config/nvptx/nvptx.c                           |   42 ++++++-
>  gcc/doc/invoke.texi                                |   12 +-
>  gcc/lto-wrapper.c                                  |    2 +
>  libgomp/libgomp.h                                  |    1 +
>  libgomp/libgomp.texi                               |    8 ++
>  libgomp/oacc-init.c                                |   19 ++-
>  libgomp/target.c                                   |  122 ++++++++++++++++----
>  libgomp/testsuite/lib/libgomp.exp                  |   10 ++
>  .../testsuite/libgomp.oacc-c-c++-common/abort-3.c  |    4 +-
>  .../testsuite/libgomp.oacc-c-c++-common/abort-4.c  |    4 +-
>  .../libgomp.oacc-c-c++-common/avoid-offloading-1.c |   28 +++++
>  .../libgomp.oacc-c-c++-common/avoid-offloading-2.c |   38 ++++++
>  .../libgomp.oacc-c-c++-common/avoid-offloading-3.c |   29 +++++
>  .../combined-directives-1.c                        |    4 +-
>  .../libgomp.oacc-c-c++-common/default-1.c          |    4 +-
>  .../libgomp.oacc-c-c++-common/deviceptr-1.c        |    4 +-
>  .../libgomp.oacc-c-c++-common/host_data-1.c        |    1 +
>  .../libgomp.oacc-c-c++-common/kernels-1.c          |   10 +-
>  .../kernels-alias-ipa-pta-2.c                      |    4 +-
>  .../kernels-alias-ipa-pta-3.c                      |    4 +-
>  .../kernels-alias-ipa-pta.c                        |    4 +-
>  .../libgomp.oacc-c-c++-common/kernels-empty.c      |    2 +-
>  .../kernels-loop-and-seq-2.c                       |    3 +-
>  .../kernels-loop-and-seq-3.c                       |    4 +-
>  .../kernels-loop-and-seq-4.c                       |    3 +-
>  .../kernels-loop-and-seq-5.c                       |    3 +-
>  .../kernels-loop-and-seq-6.c                       |    3 +-
>  .../kernels-loop-and-seq.c                         |    4 +-
>  .../kernels-loop-collapse.c                        |    3 +-
>  .../testsuite/libgomp.oacc-c-c++-common/nested-2.c |    2 +-
>  .../libgomp.oacc-fortran/avoid-offloading-1.f      |   32 +++++
>  .../libgomp.oacc-fortran/avoid-offloading-2.f      |   41 +++++++
>  .../libgomp.oacc-fortran/avoid-offloading-3.f      |   31 +++++
>  .../libgomp.oacc-fortran/combined-directives-1.f90 |    5 +-
>  .../libgomp.oacc-fortran/non-scalar-data.f90       |    5 +-
>  37 files changed, 494 insertions(+), 78 deletions(-)
> 
> diff --git gcc/common.opt gcc/common.opt
> index 520fa9c..2cf798d 100644
> --- gcc/common.opt
> +++ gcc/common.opt
> @@ -1779,6 +1779,10 @@ Enum(offload_abi) String(ilp32) Value(OFFLOAD_ABI_ILP32)
>  EnumValue
>  Enum(offload_abi) String(lp64) Value(OFFLOAD_ABI_LP64)
>  
> +foffload-force
> +Common Var(flag_offload_force)
> +Force offloading if the compiler wanted to avoid it.
> +
>  fomit-frame-pointer
>  Common Report Var(flag_omit_frame_pointer) Optimization
>  When possible do not generate stack frames.
> diff --git gcc/config/nvptx/mkoffload.c gcc/config/nvptx/mkoffload.c
> index c8eed45..586ee8b 100644
> --- gcc/config/nvptx/mkoffload.c
> +++ gcc/config/nvptx/mkoffload.c
> @@ -41,9 +41,19 @@ const char tool_name[] = "nvptx mkoffload";
>  
>  #define COMMENT_PREFIX "#"
>  
> +enum id_map_flag
> +  {
> +    /* All clear.  */
> +    ID_MAP_FLAG_NONE = 0,
> +    /* Avoid offloading.  For example, because there is insufficient
> +       parallelism.  */
> +    ID_MAP_FLAG_AVOID_OFFLOADING = 1
> +  };
> +
>  struct id_map
>  {
>    id_map *next;
> +  int flags;
>    char *ptx_name;
>  };
>  
> @@ -107,6 +117,38 @@ record_id (const char *p1, id_map ***where)
>      fatal_error (input_location, "malformed ptx file");
>  
>    id_map *v = XNEW (id_map);
> +
> +  /* Do we have any flags?  */
> +  v->flags = ID_MAP_FLAG_NONE;
> +  if (p1[0] == '(')
> +    {
> +      /* Current flag.  */
> +      const char *cur = p1 + 1;
> +
> +      /* Seek to the beginning of ") ".  */
> +      p1 = strchr (cur, ')');
> +      if (!p1 || p1 > end || p1[1] != ' ')
> +	fatal_error (input_location, "malformed ptx file: "
> +		     "expected \") \" at \"%s\"", cur);
> +
> +      while (cur < p1)
> +	{
> +	  const char *next = strchr (cur, ',');
> +	  if (!next || next > p1)
> +	    next = p1;
> +
> +	  if (strncmp (cur, "avoid offloading", next - cur - 1) == 0)
> +	    v->flags |= ID_MAP_FLAG_AVOID_OFFLOADING;
> +	  else
> +	    fatal_error (input_location, "malformed ptx file: "
> +			 "unknown flag at \"%s\"", cur);
> +
> +	  cur = next;
> +	}
> +
> +      /* Skip past ") ".  */
> +      p1 += 2;
> +    }
>    size_t len = end - p1;
>    v->ptx_name = XNEWVEC (char, len + 1);
>    memcpy (v->ptx_name, p1, len);
> @@ -296,12 +338,17 @@ process (FILE *in, FILE *out)
>    fprintf (out, "\n};\n\n");
>  
>    /* Dump out function idents.  */
> +  bool avoid_offloading_p = false;
>    fprintf (out, "static const struct nvptx_fn {\n"
>  	   "  const char *name;\n"
>  	   "  unsigned short dim[%d];\n"
>  	   "} func_mappings[] = {\n", GOMP_DIM_MAX);
>    for (comma = "", id = func_ids; id; comma = ",", id = id->next)
> -    fprintf (out, "%s\n\t{%s}", comma, id->ptx_name);
> +    {
> +      if (id->flags & ID_MAP_FLAG_AVOID_OFFLOADING)
> +	avoid_offloading_p = true;
> +      fprintf (out, "%s\n\t{%s}", comma, id->ptx_name);
> +    }
>    fprintf (out, "\n};\n\n");
>  
>    fprintf (out,
> @@ -318,7 +365,11 @@ process (FILE *in, FILE *out)
>  	   "  sizeof (var_mappings) / sizeof (var_mappings[0]),\n"
>  	   "  func_mappings,"
>  	   "  sizeof (func_mappings) / sizeof (func_mappings[0])\n"
> -	   "};\n\n");
> +	   "};\n");
> +  if (avoid_offloading_p)
> +    /* Need a unique handle for target_data.  */
> +    fprintf (out, "static int target_data_avoid_offloading;\n");
> +  fprintf (out, "\n");
>  
>    fprintf (out, "#ifdef __cplusplus\n"
>  	   "extern \"C\" {\n"
> @@ -338,18 +389,28 @@ process (FILE *in, FILE *out)
>    fprintf (out, "static __attribute__((constructor)) void init (void)\n"
>  	   "{\n"
>  	   "  GOMP_offload_register_ver (%#x, __OFFLOAD_TABLE__,"
> -	   "%d/*NVIDIA_PTX*/, &target_data);\n"
> -	   "};\n",
> +	   "%d/*NVIDIA_PTX*/, &target_data);\n",
>  	   GOMP_VERSION_PACK (GOMP_VERSION, GOMP_VERSION_NVIDIA_PTX),
>  	   GOMP_DEVICE_NVIDIA_PTX);
> +  if (avoid_offloading_p)
> +    fprintf (out, "  GOMP_offload_register_ver (%#x, (void *) 0,"
> +	     "%d/*NVIDIA_PTX*/, &target_data_avoid_offloading);\n",
> +	     GOMP_VERSION_PACK (GOMP_VERSION, GOMP_VERSION_NVIDIA_PTX),
> +	     GOMP_DEVICE_NVIDIA_PTX);
> +  fprintf (out, "};\n");
>  
>    fprintf (out, "static __attribute__((destructor)) void fini (void)\n"
>  	   "{\n"
>  	   "  GOMP_offload_unregister_ver (%#x, __OFFLOAD_TABLE__,"
> -	   "%d/*NVIDIA_PTX*/, &target_data);\n"
> -	   "};\n",
> +	   "%d/*NVIDIA_PTX*/, &target_data);\n",
>  	   GOMP_VERSION_PACK (GOMP_VERSION, GOMP_VERSION_NVIDIA_PTX),
>  	   GOMP_DEVICE_NVIDIA_PTX);
> +  if (avoid_offloading_p)
> +    fprintf (out, "  GOMP_offload_unregister_ver (%#x, (void *) 0,"
> +	     "%d/*NVIDIA_PTX*/, &target_data_avoid_offloading);\n",
> +	     GOMP_VERSION_PACK (GOMP_VERSION, GOMP_VERSION_NVIDIA_PTX),
> +	     GOMP_DEVICE_NVIDIA_PTX);
> +  fprintf (out, "};\n");
>  }
>  
>  static void
> diff --git gcc/config/nvptx/nvptx.c gcc/config/nvptx/nvptx.c
> index 78614f8..fe28154 100644
> --- gcc/config/nvptx/nvptx.c
> +++ gcc/config/nvptx/nvptx.c
> @@ -3803,6 +3803,9 @@ static const struct attribute_spec nvptx_attribute_table[] =
>    /* { name, min_len, max_len, decl_req, type_req, fn_type_req, handler,
>         affects_type_identity } */
>    { "kernel", 0, 0, true, false,  false, nvptx_handle_kernel_attribute, false },
> +  /* Avoid offloading.  For example, because there is insufficient
> +     parallelism.  */
> +  { "omp avoid offloading", 0, 0, true, false, false, NULL, false },
>    { NULL, 0, 0, false, false, false, NULL, false }
>  };
>  \f
> @@ -3867,7 +3870,10 @@ nvptx_record_offload_symbol (tree decl)
>  	tree dims = TREE_VALUE (attr);
>  	unsigned ix;
>  
> -	fprintf (asm_out_file, "//:FUNC_MAP \"%s\"",
> +	fprintf (asm_out_file, "//:FUNC_MAP %s\"%s\"",
> +		 (lookup_attribute ("omp avoid offloading",
> +				    DECL_ATTRIBUTES (decl))
> +		  ? "(avoid offloading) " : ""),
>  		 IDENTIFIER_POINTER (DECL_ASSEMBLER_NAME (decl)));
>  
>  	for (ix = 0; ix != GOMP_DIM_MAX; ix++, dims = TREE_CHAIN (dims))
> @@ -4124,6 +4130,40 @@ nvptx_expand_builtin (tree exp, rtx target, rtx ARG_UNUSED (subtarget),
>  static bool
>  nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
>  {
> +  /* Detect if a function is unsuitable for offloading.  */
> +  if (!flag_offload_force && decl)
> +    {
> +      tree oacc_function_attr = get_oacc_fn_attrib (decl);
> +      if (oacc_function_attr
> +	  && oacc_fn_attrib_kernels_p (oacc_function_attr))
> +	{
> +	  bool avoid_offloading_p = true;
> +	  for (unsigned ix = 0; ix != GOMP_DIM_MAX; ix++)
> +	    {
> +	      if (dims[ix] > 1)
> +		{
> +		  avoid_offloading_p = false;
> +		  break;
> +		}
> +	    }
> +	  if (avoid_offloading_p)
> +	    {
> +	      /* OpenACC kernels constructs will never be parallelized for
> +		 optimization levels smaller than -O2; avoid the diagnostic in
> +		 this case.  */
> +	      if (optimize >= 2)
> +		warning_at (DECL_SOURCE_LOCATION (decl), 0,
> +			    "OpenACC kernels construct will be executed "
> +			    "sequentially; will by default avoid offloading "
> +			    "to prevent data copy penalty");
> +	      DECL_ATTRIBUTES (decl)
> +		= tree_cons (get_identifier ("omp avoid offloading"),
> +			     NULL_TREE, DECL_ATTRIBUTES (decl));
> +
> +	    }
> +	}
> +    }
> +
>    bool changed = false;
>  
>    /* The vector size must be 32, unless this is a SEQ routine.  */
> diff --git gcc/doc/invoke.texi gcc/doc/invoke.texi
> index fcc404e..c09fbc5 100644
> --- gcc/doc/invoke.texi
> +++ gcc/doc/invoke.texi
> @@ -180,7 +180,8 @@ in the following sections.
>  @gccoptlist{-ansi  -std=@var{standard}  -fgnu89-inline @gol
>  -aux-info @var{filename} -fallow-parameterless-variadic-functions @gol
>  -fno-asm  -fno-builtin  -fno-builtin-@var{function} @gol
> --fhosted  -ffreestanding -fopenacc -fopenmp -fopenmp-simd @gol
> +-fhosted  -ffreestanding @gol
> +-foffload-force -fopenacc -fopenacc-dim=@var{geom} -fopenmp -fopenmp-simd @gol
>  -fms-extensions -fplan9-extensions -fsso-struct=@var{endianness}
>  -fallow-single-precision  -fcond-mismatch -flax-vector-conversions @gol
>  -fsigned-bitfields  -fsigned-char @gol
> @@ -1953,6 +1954,15 @@ This is equivalent to @option{-fno-hosted}.
>  @xref{Standards,,Language Standards Supported by GCC}, for details of
>  freestanding and hosted environments.
>  
> +@item -foffload-force
> +@opindex foffload-force
> +The option @option{-foffload-force} forces offloading if the compiler
> +wanted to avoid it.  For example, when there isn't sufficient
> +parallelism in certain offloading constructs, the compiler may come to
> +the conclusion that offloading incurs too much overhead (for data
> +transfers, for example), and unless overridden with this flag, it then
> +suggests to the runtime (libgomp) to avoid offloading.
> +
>  @item -fopenacc
>  @opindex fopenacc
>  @cindex OpenACC accelerator programming
> diff --git gcc/lto-wrapper.c gcc/lto-wrapper.c
> index ced6f2f..702ae47 100644
> --- gcc/lto-wrapper.c
> +++ gcc/lto-wrapper.c
> @@ -275,6 +275,7 @@ merge_and_complain (struct cl_decoded_option **decoded_options,
>  	case OPT_fsigned_zeros:
>  	case OPT_ftrapping_math:
>  	case OPT_fwrapv:
> +	case OPT_foffload_force:
>  	case OPT_fopenmp:
>  	case OPT_fopenacc:
>  	case OPT_fcilkplus:
> @@ -517,6 +518,7 @@ append_compiler_options (obstack *argv_obstack, struct cl_decoded_option *opts,
>  	case OPT_fsigned_zeros:
>  	case OPT_ftrapping_math:
>  	case OPT_fwrapv:
> +	case OPT_foffload_force:
>  	case OPT_fopenmp:
>  	case OPT_fopenacc:
>  	case OPT_fopenacc_dim_:
> diff --git libgomp/libgomp.h libgomp/libgomp.h
> index 7108a6d..8747b72 100644
> --- libgomp/libgomp.h
> +++ libgomp/libgomp.h
> @@ -984,6 +984,7 @@ extern void gomp_unmap_vars (struct target_mem_desc *, bool);
>  extern void gomp_init_device (struct gomp_device_descr *);
>  extern void gomp_free_memmap (struct splay_tree_s *);
>  extern void gomp_unload_device (struct gomp_device_descr *);
> +extern bool gomp_offload_target_available_p (int);
>  
>  /* work.c */
>  
> diff --git libgomp/libgomp.texi libgomp/libgomp.texi
> index 987ee5f..5795c00 100644
> --- libgomp/libgomp.texi
> +++ libgomp/libgomp.texi
> @@ -1815,6 +1815,14 @@ flag @option{-fopenacc} must be specified.  This enables the OpenACC directive
>  arranges for automatic linking of the OpenACC runtime library 
>  (@ref{OpenACC Runtime Library Routines}).
>  
> +Offloading is enabled by default.  In some cases, the compiler may
> +come to the conclusion that offloading incurs too much overhead, and
> +suggest to the runtime to avoid it.  To counteract that, you can use
> +the option @option{-foffload-force} to force offloading in such cases.
> +Alternatively, offloading is also enabled if a specific device type is
> +requested, in a call to @code{acc_init} or by setting the
> +@env{ACC_DEVICE_TYPE} environment variable, for example.
> +
>  A complete description of all OpenACC directives accepted may be found in 
>  the @uref{http://www.openacc.org/, OpenACC} Application Programming
>  Interface manual, version 2.0.
> diff --git libgomp/oacc-init.c libgomp/oacc-init.c
> index 42d005d..2f053f3 100644
> --- libgomp/oacc-init.c
> +++ libgomp/oacc-init.c
> @@ -122,7 +122,10 @@ resolve_device (acc_device_t d, bool fail_is_error)
>        {
>  	if (goacc_device_type)
>  	  {
> -	    /* Lookup the named device.  */
> +	    /* Look up the device that has been explicitly named, so do not pay
> +	       attention to gomp_offload_target_available_p.  (That is,
> +	       enforced usage even with an "avoid offloading" flag set, and
> +	       hard error if not actually available.)  */
>  	    while (++d != _ACC_device_hwm)
>  	      if (dispatchers[d]
>  		  && !strcasecmp (goacc_device_type,
> @@ -148,8 +151,15 @@ resolve_device (acc_device_t d, bool fail_is_error)
>      case acc_device_not_host:
>        /* Find the first available device after acc_device_not_host.  */
>        while (++d != _ACC_device_hwm)
> -	if (dispatchers[d] && dispatchers[d]->get_num_devices_func () > 0)
> +	if (dispatchers[d]
> +	    && dispatchers[d]->get_num_devices_func () > 0
> +	    /* No device has been explicitly named, so pay attention to
> +	       gomp_offload_target_available_p, to not decide on an offload
> +	       target that we don't have offload data available for, or have an
> +	       "avoid offloading" flag set for.  */
> +	    && gomp_offload_target_available_p (dispatchers[d]->type))
>  	  goto found;
> +      /* No non-host device found.  */
>        if (d_arg == acc_device_default)
>  	{
>  	  d = acc_device_host;
> @@ -168,7 +178,7 @@ resolve_device (acc_device_t d, bool fail_is_error)
>        break;
>  
>      default:
> -      if (d > _ACC_device_hwm)
> +      if (d >= _ACC_device_hwm)
>  	{
>  	  if (fail_is_error)
>  	    goto unsupported_device;
> @@ -181,7 +191,8 @@ resolve_device (acc_device_t d, bool fail_is_error)
>  
>    assert (d != acc_device_none
>  	  && d != acc_device_default
> -	  && d != acc_device_not_host);
> +	  && d != acc_device_not_host
> +	  && d < _ACC_device_hwm);
>  
>    if (dispatchers[d] == NULL && fail_is_error)
>      {
> diff --git libgomp/target.c libgomp/target.c
> index 96fe3d5..afcbedb 100644
> --- libgomp/target.c
> +++ libgomp/target.c
> @@ -1165,12 +1165,19 @@ gomp_unload_image_from_device (struct gomp_device_descr *devicep,
>  
>  /* This function should be called from every offload image while loading.
>     It gets the descriptor of the host func and var tables HOST_TABLE, TYPE of
> -   the target, and TARGET_DATA needed by target plugin.  */
> +   the target, and TARGET_DATA needed by target plugin.
> +
> +   If HOST_TABLE is NULL, this image (TARGET_DATA) is stored as an "avoid
> +   offloading" flag, and the TARGET_TYPE will not be considered by default
> +   until this image gets unregistered.  */
>  
>  void
>  GOMP_offload_register_ver (unsigned version, const void *host_table,
>  			   int target_type, const void *target_data)
>  {
> +  gomp_debug (0, "%s (%u, %p, %d, %p)\n", __FUNCTION__,
> +	      version, host_table, target_type, target_data);
> +
>    int i;
>  
>    if (GOMP_VERSION_LIB (version) > GOMP_VERSION)
> @@ -1179,16 +1186,19 @@ GOMP_offload_register_ver (unsigned version, const void *host_table,
>    
>    gomp_mutex_lock (&register_lock);
>  
> -  /* Load image to all initialized devices.  */
> -  for (i = 0; i < num_devices; i++)
> +  if (host_table != NULL)
>      {
> -      struct gomp_device_descr *devicep = &devices[i];
> -      gomp_mutex_lock (&devicep->lock);
> -      if (devicep->type == target_type
> -	  && devicep->state == GOMP_DEVICE_INITIALIZED)
> -	gomp_load_image_to_device (devicep, version,
> -				   host_table, target_data, true);
> -      gomp_mutex_unlock (&devicep->lock);
> +      /* Load image to all initialized devices.  */
> +      for (i = 0; i < num_devices; i++)
> +	{
> +	  struct gomp_device_descr *devicep = &devices[i];
> +	  gomp_mutex_lock (&devicep->lock);
> +	  if (devicep->type == target_type
> +	      && devicep->state == GOMP_DEVICE_INITIALIZED)
> +	    gomp_load_image_to_device (devicep, version,
> +				       host_table, target_data, true);
> +	  gomp_mutex_unlock (&devicep->lock);
> +	}
>      }
>  
>    /* Insert image to array of pending images.  */
> @@ -1214,26 +1224,36 @@ GOMP_offload_register (const void *host_table, int target_type,
>  
>  /* This function should be called from every offload image while unloading.
>     It gets the descriptor of the host func and var tables HOST_TABLE, TYPE of
> -   the target, and TARGET_DATA needed by target plugin.  */
> +   the target, and TARGET_DATA needed by target plugin.
> +
> +   If HOST_TABLE is NULL, the "avoid offloading" flag gets cleared for this
> +   image (TARGET_DATA), and this TARGET_TYPE may again be considered by
> +   default.  */
>  
>  void
>  GOMP_offload_unregister_ver (unsigned version, const void *host_table,
>  			     int target_type, const void *target_data)
>  {
> +  gomp_debug (0, "%s (%u, %p, %d, %p)\n", __FUNCTION__,
> +	      version, host_table, target_type, target_data);
> +
>    int i;
>  
>    gomp_mutex_lock (&register_lock);
>  
> -  /* Unload image from all initialized devices.  */
> -  for (i = 0; i < num_devices; i++)
> +  if (host_table != NULL)
>      {
> -      struct gomp_device_descr *devicep = &devices[i];
> -      gomp_mutex_lock (&devicep->lock);
> -      if (devicep->type == target_type
> -	  && devicep->state == GOMP_DEVICE_INITIALIZED)
> -	gomp_unload_image_from_device (devicep, version,
> -				       host_table, target_data);
> -      gomp_mutex_unlock (&devicep->lock);
> +      /* Unload image from all initialized devices.  */
> +      for (i = 0; i < num_devices; i++)
> +	{
> +	  struct gomp_device_descr *devicep = &devices[i];
> +	  gomp_mutex_lock (&devicep->lock);
> +	  if (devicep->type == target_type
> +	      && devicep->state == GOMP_DEVICE_INITIALIZED)
> +	    gomp_unload_image_from_device (devicep, version,
> +					   host_table, target_data);
> +	  gomp_mutex_unlock (&devicep->lock);
> +	}
>      }
>  
>    /* Remove image from array of pending images.  */
> @@ -1267,7 +1287,8 @@ gomp_init_device (struct gomp_device_descr *devicep)
>    for (i = 0; i < num_offload_images; i++)
>      {
>        struct offload_image_descr *image = &offload_images[i];
> -      if (image->type == devicep->type)
> +      if (image->type == devicep->type
> +	  && image->host_table != NULL)
>  	gomp_load_image_to_device (devicep, image->version,
>  				   image->host_table, image->target_data,
>  				   false);
> @@ -1287,7 +1308,8 @@ gomp_unload_device (struct gomp_device_descr *devicep)
>        for (i = 0; i < num_offload_images; i++)
>  	{
>  	  struct offload_image_descr *image = &offload_images[i];
> -	  if (image->type == devicep->type)
> +	  if (image->type == devicep->type
> +	      && image->host_table != NULL)
>  	    gomp_unload_image_from_device (devicep, image->version,
>  					   image->host_table,
>  					   image->target_data);
> @@ -1311,6 +1333,62 @@ gomp_free_memmap (struct splay_tree_s *mem_map)
>      }
>  }
>  
> +/* Do we have offload data available for the given offload target type?
> +   Instead of verifying that *all* offload data is available that could
> +   possibly be required, we instead just look for *any*.  If we later find any
> +   offload data missing, that's user error.  If any offload data of this target
> +   type is tagged with an "avoid offloading" flag, do not consider this target
> +   type available unless it has been initialized already.  */
> +
> +attribute_hidden bool
> +gomp_offload_target_available_p (int type)
> +{
> +  bool available = false;
> +
> +  /* Has the offload target type already been initialized?  */
> +  for (int i = 0; !available && i < num_devices; i++)
> +    {
> +      struct gomp_device_descr *devicep = &devices[i];
> +      gomp_mutex_lock (&devicep->lock);
> +      if (devicep->type == type
> +	  && devicep->state == GOMP_DEVICE_INITIALIZED)
> +	available = true;
> +      gomp_mutex_unlock (&devicep->lock);
> +    }
> +
> +  /* If the offload target type has been initialized already, we ignore "avoid
> +     offloading" flags.  This is important, because data/state may be present
> +     on the device, that we must continue to use.  */
> +  if (!available)
> +    {
> +      gomp_mutex_lock (&register_lock);
> +      if (num_offload_images == 0)
> +	/* If there is no offload data available at all, there is no way to
> +	   later fail to find any of it for a specific offload target type.
> +	   This is the case where there are no offloaded code regions in user
> +	   code, but the target type can be initialized successfully, and
> +	   executable directives be used, or runtime library calls be
> +	   made.  */
> +	available = true;
> +      else
> +	{
> +	  /* Can the offload target be initialized?  */
> +	  for (int i = 0; !available && i < num_offload_images; i++)
> +	    if (offload_images[i].type == type
> +		&& offload_images[i].host_table != NULL)
> +	      available = true;
> +	  /* If yes, is an "avoid offloading" flag set?  */
> +	  for (int i = 0; available && i < num_offload_images; i++)
> +	    if (offload_images[i].type == type
> +		&& offload_images[i].host_table == NULL)
> +	      available = false;
> +	}
> +      gomp_mutex_unlock (&register_lock);
> +    }
> +
> +  return available;
> +}
> +
>  /* Host fallback for GOMP_target{,_ext} routines.  */
>  
>  static void
> diff --git libgomp/testsuite/lib/libgomp.exp libgomp/testsuite/lib/libgomp.exp
> index a4c9d83..8d2be80 100644
> --- libgomp/testsuite/lib/libgomp.exp
> +++ libgomp/testsuite/lib/libgomp.exp
> @@ -344,6 +344,16 @@ proc check_effective_target_offload_device_nonshared_as { } {
>      } ]
>  }
>  
> +# Return 1 if the compiler has been configured for nvptx offloading.
> +
> +proc check_effective_target_nvptx_offloading_configured { } {
> +    # PR libgomp/65099: Currently, we only support offloading in 64-bit
> +    # configurations.
> +    global offload_targets
> +    return [expr [string match "*,nvptx,*" ",$offload_targets,"] \
> +		&& [is-effective-target lp64] ]
> +}
> +
>  # Return 1 if at least one nvidia board is present.
>  
>  proc check_effective_target_openacc_nvidia_accel_present { } {
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/abort-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/abort-3.c
> index bca425e..23156d8 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/abort-3.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/abort-3.c
> @@ -1,5 +1,3 @@
> -/* { dg-do run } */
> -
>  #include <stdio.h>
>  #include <stdlib.h>
>  
> @@ -7,7 +5,7 @@ int
>  main (void)
>  {
>    fprintf (stderr, "CheCKpOInT\n");
> -#pragma acc kernels
> +#pragma acc kernels /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    {
>      abort ();
>    }
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/abort-4.c libgomp/testsuite/libgomp.oacc-c-c++-common/abort-4.c
> index c29ca3f..f4d6a07 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/abort-4.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/abort-4.c
> @@ -1,12 +1,10 @@
> -/* { dg-do run } */
> -
>  #include <stdlib.h>
>  
>  int
>  main (int argc, char **argv)
>  {
>  
> -#pragma acc kernels
> +#pragma acc kernels /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    {
>      if (argc != 1)
>        abort ();
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c
> new file mode 100644
> index 0000000..08745fc
> --- /dev/null
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c
> @@ -0,0 +1,28 @@
> +/* Test that the compiler decides to "avoid offloading".  */
> +
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* The ACC_DEVICE_TYPE environment variable gets set in the testing
> +   framework, and that overrides the "avoid offloading" flag at run time.
> +   { dg-xfail-run-if "TODO" { openacc_nvidia_accel_selected } } */
> +
> +#include <openacc.h>
> +
> +int main(void)
> +{
> +  int x, y;
> +
> +#pragma acc data copyout(x, y)
> +#pragma acc kernels /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
> +  *((volatile int *) &x) = 33, y = acc_on_device (acc_device_host);
> +
> +  if (x != 33)
> +    __builtin_abort();
> +#if defined ACC_DEVICE_TYPE_host || defined ACC_DEVICE_TYPE_nvidia
> +  if (y != 1)
> +    __builtin_abort();
> +#else
> +# error Not ported to this ACC_DEVICE_TYPE
> +#endif
> +
> +  return 0;
> +}
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c
> new file mode 100644
> index 0000000..724228a
> --- /dev/null
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c
> @@ -0,0 +1,38 @@
> +/* Test that a user can override the compiler's "avoid offloading"
> +   decision at run time.  */
> +
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +
> +#include <openacc.h>
> +
> +int main(void)
> +{
> +  /* Override the compiler's "avoid offloading" decision.  */
> +  acc_device_t d;
> +#if defined ACC_DEVICE_TYPE_nvidia
> +  d = acc_device_nvidia;
> +#elif defined ACC_DEVICE_TYPE_host
> +  d = acc_device_host;
> +#else
> +# error Not ported to this ACC_DEVICE_TYPE
> +#endif
> +  acc_init (d);
> +
> +  int x, y;
> +
> +#pragma acc data copyout(x, y)
> +#pragma acc kernels /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
> +  *((volatile int *) &x) = 33, y = acc_on_device (acc_device_host);
> +
> +  if (x != 33)
> +    __builtin_abort();
> +#if defined ACC_DEVICE_TYPE_nvidia
> +  if (y != 0)
> +    __builtin_abort();
> +#else
> +  if (y != 1)
> +    __builtin_abort();
> +#endif
> +
> +  return 0;
> +}
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c
> new file mode 100644
> index 0000000..2fb5196
> --- /dev/null
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c
> @@ -0,0 +1,29 @@
> +/* Test that a user can override the compiler's "avoid offloading"
> +   decision at compile time.  */
> +
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
> +
> +#include <openacc.h>
> +
> +int main(void)
> +{
> +  int x, y;
> +
> +#pragma acc data copyout(x, y)
> +#pragma acc kernels
> +  *((volatile int *) &x) = 33, y = acc_on_device (acc_device_host);
> +
> +  if (x != 33)
> +    __builtin_abort();
> +#if defined ACC_DEVICE_TYPE_nvidia
> +  if (y != 0)
> +    __builtin_abort();
> +#else
> +  if (y != 1)
> +    __builtin_abort();
> +#endif
> +
> +  return 0;
> +}
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
> index dad6d13..87ca378 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
> @@ -1,6 +1,6 @@
>  /* This test exercises combined directives.  */
>  
> -/* { dg-do run } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  
>  #include <stdlib.h>
>  
> @@ -33,7 +33,7 @@ main (int argc, char **argv)
>  	abort ();
>      }
>  
> -#pragma acc kernels loop copy (a[0:N]) copy (b[0:N])
> +#pragma acc kernels loop copy (a[0:N]) copy (b[0:N]) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    for (i = 0; i < N; i++)
>      {
>        b[i] = 3.0;
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/default-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/default-1.c
> index 1ac0b95..8f0144c 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/default-1.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/default-1.c
> @@ -1,4 +1,4 @@
> -/* { dg-do run } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  
>  #include  <openacc.h>
>  
> @@ -51,7 +51,7 @@ int test_kernels ()
>      ary[i] = ~0;
>  
>    /* val defaults to copy, ary defaults to copy.  */
> -#pragma acc kernels copy(ondev)
> +#pragma acc kernels copy(ondev) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    {
>      ondev = acc_on_device (acc_device_not_host);
>  #pragma acc loop 
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/deviceptr-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/deviceptr-1.c
> index e271a37..9a5f7b1 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/deviceptr-1.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/deviceptr-1.c
> @@ -1,5 +1,3 @@
> -/* { dg-do run } */
> -
>  #include <stdlib.h>
>  
>  int main (void)
> @@ -10,7 +8,7 @@ int main (void)
>    a = A;
>  
>  #pragma acc data copyout (a_1, a_2)
> -#pragma acc kernels deviceptr (a)
> +#pragma acc kernels deviceptr (a) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    {
>      a_1 = a;
>      a_2 = &a;
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c
> index 51745ba..3ef6f9b 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c
> @@ -1,4 +1,5 @@
>  /* { dg-do run { target openacc_nvidia_accel_selected } } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  /* { dg-additional-options "-lcuda -lcublas -lcudart" } */
>  
>  #include <stdlib.h>
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-1.c
> index 3acfdf5..614ad33 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-1.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-1.c
> @@ -1,4 +1,4 @@
> -/* { dg-do run } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  
>  #include <stdlib.h>
>  
> @@ -73,7 +73,7 @@ int main (void)
>    i = -1;
>    j = -2;
>    v = 0;
> -#pragma acc kernels /* copyout */ present_or_copyout (v) present_or_copyin (i, j)
> +#pragma acc kernels /* copyout */ present_or_copyout (v) present_or_copyin (i, j) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    {
>      if (i != -1 || j != -2)
>        abort ();
> @@ -96,7 +96,7 @@ int main (void)
>    i = -1;
>    j = -2;
>    v = 0;
> -#pragma acc kernels /* copyout */ present_or_copyout (v) present_or_copyout (i, j)
> +#pragma acc kernels /* copyout */ present_or_copyout (v) present_or_copyout (i, j) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    {
>      i = 2;
>      j = 1;
> @@ -110,7 +110,7 @@ int main (void)
>    i = -1;
>    j = -2;
>    v = 0;
> -#pragma acc kernels /* copyout */ present_or_copyout (v) present_or_copy (i, j)
> +#pragma acc kernels /* copyout */ present_or_copyout (v) present_or_copy (i, j) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    {
>      if (i != -1 || j != -2)
>        abort ();
> @@ -126,7 +126,7 @@ int main (void)
>    i = -1;
>    j = -2;
>    v = 0;
> -#pragma acc kernels /* copyout */ present_or_copyout (v) present_or_create (i, j)
> +#pragma acc kernels /* copyout */ present_or_copyout (v) present_or_create (i, j) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    {
>      i = 2;
>      j = 1;
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c
> index 0f323c8..8d5101d 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c
> @@ -1,4 +1,4 @@
> -/* { dg-additional-options "-O2 -fipa-pta" } */
> +/* { dg-additional-options "-fipa-pta" } */
>  
>  #include <stdlib.h>
>  
> @@ -11,7 +11,7 @@ main (void)
>    unsigned int *b = (unsigned int *)malloc (N * sizeof (unsigned int));
>    unsigned int *c = (unsigned int *)malloc (N * sizeof (unsigned int));
>  
> -#pragma acc kernels pcopyout (a[0:N], b[0:N], c[0:N])
> +#pragma acc kernels pcopyout (a[0:N], b[0:N], c[0:N]) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    {
>      a[0] = 0;
>      b[0] = 1;
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c
> index 654e750..3726b0c 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c
> @@ -1,4 +1,4 @@
> -/* { dg-additional-options "-O2 -fipa-pta" } */
> +/* { dg-additional-options "-fipa-pta" } */
>  
>  #include <stdlib.h>
>  
> @@ -11,7 +11,7 @@ main (void)
>    unsigned int *b = a;
>    unsigned int *c = (unsigned int *)malloc (N * sizeof (unsigned int));
>  
> -#pragma acc kernels pcopyout (a[0:N], b[0:N], c[0:N])
> +#pragma acc kernels pcopyout (a[0:N], b[0:N], c[0:N]) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    {
>      a[0] = 0;
>      b[0] = 1;
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c
> index 44d4fd2..eea4f76 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c
> @@ -1,4 +1,4 @@
> -/* { dg-additional-options "-O2 -fipa-pta" } */
> +/* { dg-additional-options "-fipa-pta" } */
>  
>  #include <stdlib.h>
>  
> @@ -11,7 +11,7 @@ main (void)
>    unsigned int b[N];
>    unsigned int c[N];
>  
> -#pragma acc kernels pcopyout (a, b, c)
> +#pragma acc kernels pcopyout (a, b, c) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    {
>      a[0] = 0;
>      b[0] = 1;
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-empty.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-empty.c
> index a68a7cd..860b6da 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-empty.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-empty.c
> @@ -1,6 +1,6 @@
>  int
>  main (void)
>  {
> -#pragma acc kernels
> +#pragma acc kernels /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    ;
>  }
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
> index 2e4100f..5cdc200 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
> @@ -1,4 +1,3 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  
>  #include <stdlib.h>
> @@ -8,7 +7,7 @@
>  unsigned int
>  foo (int n, unsigned int *a)
>  {
> -#pragma acc kernels copy (a[0:N])
> +#pragma acc kernels copy (a[0:N]) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    {
>      a[0] = a[0] + 1;
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c
> index b3e736b..2e4d4d2 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c
> @@ -1,4 +1,3 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  
>  #include <stdlib.h>
> @@ -8,8 +7,7 @@
>  unsigned int
>  foo (int n, unsigned int *a)
>  {
> -
> -#pragma acc kernels copy (a[0:N])
> +#pragma acc kernels copy (a[0:N]) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    {
>      for (int i = 0; i < n; i++)
>        a[i] = 1;
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c
> index 8b9affa..5bf00db 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c
> @@ -1,4 +1,3 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  
>  #include <stdlib.h>
> @@ -8,7 +7,7 @@
>  unsigned int
>  foo (int n, unsigned int *a)
>  {
> -#pragma acc kernels copy (a[0:N])
> +#pragma acc kernels copy (a[0:N]) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    {
>      a[0] = 2;
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
> index 83d4e7f..d39b667 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
> @@ -1,4 +1,3 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  
>  #include <stdlib.h>
> @@ -9,7 +8,7 @@ unsigned int
>  foo (int n, unsigned int *a)
>  {
>    int r;
> -#pragma acc kernels copyout(r) copy (a[0:N])
> +#pragma acc kernels copyout(r) copy (a[0:N]) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    {
>      r = a[0];
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
> index 01d5e5e..bb2e85b 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
> @@ -1,4 +1,3 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  
>  #include <stdlib.h>
> @@ -8,7 +7,7 @@
>  unsigned int
>  foo (int n, unsigned int *a)
>  {
> -#pragma acc kernels copy (a[0:N])
> +#pragma acc kernels copy (a[0:N]) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    {
>      int r = a[0];
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
> index 61d1283..e513827 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
> @@ -1,4 +1,3 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  
>  #include <stdlib.h>
> @@ -8,8 +7,7 @@
>  unsigned int
>  foo (int n, unsigned int *a)
>  {
> -
> -#pragma acc kernels copy (a[0:N])
> +#pragma acc kernels copy (a[0:N]) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    {
>      for (int i = 0; i < n; i++)
>        a[i] = 1;
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
> index f7f04cb..c4791a4 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
> @@ -1,4 +1,3 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  
>  #include <stdlib.h>
> @@ -11,7 +10,7 @@ void __attribute__((noinline, noclone))
>  foo (int m, int n)
>  {
>    int i, j;
> -  #pragma acc kernels
> +  #pragma acc kernels /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    {
>  #pragma acc loop collapse(2)
>      for (i = 0; i < m; i++)
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c
> index c164598..94a5ae2 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c
> @@ -1,4 +1,4 @@
> -/* { dg-do run } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  
>  #include <stdlib.h>
>  
> diff --git libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f
> new file mode 100644
> index 0000000..5f18b94
> --- /dev/null
> +++ libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f
> @@ -0,0 +1,32 @@
> +! Test that the compiler decides to "avoid offloading".
> +
> +! { dg-do run }
> +! { dg-additional-options "-cpp" }
> +! { dg-additional-options "-ftree-parallelize-loops=32" }
> +! The "avoid offloading" warning is only triggered for -O2 and higher.
> +! { dg-xfail-if "n/a" { nvptx_offloading_configured } { "-O0" "-O1" } { "" } }
> +! The ACC_DEVICE_TYPE environment variable gets set in the testing
> +! framework, and that overrides the "avoid offloading" flag at run time.
> +! { dg-xfail-run-if "TODO" { openacc_nvidia_accel_selected } }
> +
> +      IMPLICIT NONE
> +      INCLUDE "openacc_lib.h"
> +
> +      INTEGER, VOLATILE :: X
> +      LOGICAL :: Y
> +
> +!$ACC DATA COPYOUT(X, Y)
> +!$ACC KERNELS ! { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } }
> +      X = 33
> +      Y = ACC_ON_DEVICE (ACC_DEVICE_HOST);
> +!$ACC END KERNELS
> +!$ACC END DATA
> +
> +      IF (X .NE. 33) CALL ABORT
> +#if defined ACC_DEVICE_TYPE_host || defined ACC_DEVICE_TYPE_nvidia
> +      IF (.NOT. Y) CALL ABORT
> +#else
> +# error Not ported to this ACC_DEVICE_TYPE
> +#endif
> +
> +      END
> diff --git libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-2.f libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-2.f
> new file mode 100644
> index 0000000..51801ad
> --- /dev/null
> +++ libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-2.f
> @@ -0,0 +1,41 @@
> +! Test that a user can override the compiler's "avoid offloading"
> +! decision at run time.
> +
> +! { dg-do run }
> +! { dg-additional-options "-cpp" }
> +! { dg-additional-options "-ftree-parallelize-loops=32" }
> +! The "avoid offloading" warning is only triggered for -O2 and higher.
> +! { dg-xfail-if "n/a" { nvptx_offloading_configured } { "-O0" "-O1" } { "" } }
> +
> +      IMPLICIT NONE
> +      INCLUDE "openacc_lib.h"
> +
> +      INTEGER :: D
> +      INTEGER, VOLATILE :: X
> +      LOGICAL :: Y
> +
> +!     Override the compiler's "avoid offloading" decision.
> +#if defined ACC_DEVICE_TYPE_nvidia
> +      D = ACC_DEVICE_NVIDIA
> +#elif defined ACC_DEVICE_TYPE_host
> +      D = ACC_DEVICE_HOST
> +#else
> +# error Not ported to this ACC_DEVICE_TYPE
> +#endif
> +      CALL ACC_INIT (D)
> +
> +!$ACC DATA COPYOUT(X, Y)
> +!$ACC KERNELS ! { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } }
> +      X = 33
> +      Y = ACC_ON_DEVICE (ACC_DEVICE_HOST)
> +!$ACC END KERNELS
> +!$ACC END DATA
> +
> +      IF (X .NE. 33) CALL ABORT
> +#if defined ACC_DEVICE_TYPE_nvidia
> +      IF (Y) CALL ABORT
> +#else
> +      IF (.NOT. Y) CALL ABORT
> +#endif
> +
> +      END
> diff --git libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-3.f libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-3.f
> new file mode 100644
> index 0000000..bea6ab8
> --- /dev/null
> +++ libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-3.f
> @@ -0,0 +1,31 @@
> +! Test that a user can override the compiler's "avoid offloading"
> +! decision at compile time.
> +
> +! { dg-do run }
> +! { dg-additional-options "-cpp" }
> +! { dg-additional-options "-ftree-parallelize-loops=32" }
> +! Override the compiler's "avoid offloading" decision.
> +! { dg-additional-options "-foffload-force" }
> +
> +      IMPLICIT NONE
> +      INCLUDE "openacc_lib.h"
> +
> +      INTEGER :: D
> +      INTEGER, VOLATILE :: X
> +      LOGICAL :: Y
> +
> +!$ACC DATA COPYOUT(X, Y)
> +!$ACC KERNELS
> +      X = 33
> +      Y = ACC_ON_DEVICE (ACC_DEVICE_HOST)
> +!$ACC END KERNELS
> +!$ACC END DATA
> +
> +      IF (X .NE. 33) CALL ABORT
> +#if defined ACC_DEVICE_TYPE_nvidia
> +      IF (Y) CALL ABORT
> +#else
> +      IF (.NOT. Y) CALL ABORT
> +#endif
> +
> +      END
> diff --git libgomp/testsuite/libgomp.oacc-fortran/combined-directives-1.f90 libgomp/testsuite/libgomp.oacc-fortran/combined-directives-1.f90
> index 94100b2..4b52579 100644
> --- libgomp/testsuite/libgomp.oacc-fortran/combined-directives-1.f90
> +++ libgomp/testsuite/libgomp.oacc-fortran/combined-directives-1.f90
> @@ -1,6 +1,9 @@
>  ! This test exercises combined directives.
>  
>  ! { dg-do run }
> +! { dg-additional-options "-ftree-parallelize-loops=32" }
> +! The "avoid offloading" warning is only triggered for -O2 and higher.
> +! { dg-xfail-if "n/a" { nvptx_offloading_configured } { "-O0" "-O1" } { "" } }
>  
>  program main
>    integer, parameter :: n = 32
> @@ -27,7 +30,7 @@ program main
>    !$acc kernels loop copy (a(1:n)) copy (b(1:n))
>    do i = 1, n
>      b(i) = 3.0;
> -    a(i) = a(i) + b(i)
> +    a(i) = a(i) + b(i) ! { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } }
>    end do
>  
>    do i = 1, n
> diff --git libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90 libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90
> index 4afb562..b9298c7 100644
> --- libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90
> +++ libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90
> @@ -2,6 +2,9 @@
>  ! offloaded regions are properly mapped using present_or_copy.
>  
>  ! { dg-do run }
> +! { dg-additional-options "-ftree-parallelize-loops=32" }
> +! The "avoid offloading" warning is only triggered for -O2 and higher.
> +! { dg-xfail-if "n/a" { nvptx_offloading_configured } { "-O0" "-O1" } { "" } }
>  
>  program main
>    implicit none
> @@ -30,7 +33,7 @@ subroutine kernels (array, n)
>    integer, dimension (n) :: array
>    integer :: n, i
>  
> -  !$acc kernels
> +  !$acc kernels ! { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } }
>    do i = 1, n
>       array(i) = i
>    end do


Grüße
 Thomas


* Re: Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading"
  2016-02-10 11:51                 ` Thomas Schwinge
@ 2016-02-10 13:25                   ` Bernd Schmidt
  2016-02-10 14:40                     ` Thomas Schwinge
  0 siblings, 1 reply; 25+ messages in thread
From: Bernd Schmidt @ 2016-02-10 13:25 UTC (permalink / raw)
  To: Thomas Schwinge, Jakub Jelinek; +Cc: gcc-patches, Tom de Vries

On 02/10/2016 12:49 PM, Thomas Schwinge wrote:
> Hi!
>
> Ping.

I think this has to be considered after gcc-6. In general, what's the 
state of OpenACC these days?

I'm slightly confused by the interface between offloaded code and 
libgomp. It looks like you're collecting avoid-offloading flags 
per-function, but then when things get registered, it seems like a 
per-image flag. Is that right? It seems like too large a hammer.

>> +	  bool avoid_offloading_p = true;
>> +	  for (unsigned ix = 0; ix != GOMP_DIM_MAX; ix++)
>> +	    {
>> +	      if (dims[ix] > 1)
>> +		{
>> +		  avoid_offloading_p = false;
>> +		  break;
>> +		}
>> +	    }

Avoid unnecessary braces.
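
For illustration, a standalone sketch of the same check with the
redundant braces dropped.  This is not the patch's code: the helper name
avoid_offloading_for_dims is hypothetical, and GOMP_DIM_MAX is assumed
to be 3 (gang, worker, vector):

```c
#include <stdbool.h>

/* Assumption: three partitioning dimensions (gang, worker, vector).  */
#define GOMP_DIM_MAX 3

/* Hypothetical helper: return true if no dimension is partitioned,
   that is, if every entry of DIMS is at most 1 and the region would
   therefore execute sequentially on the device.  */
static bool
avoid_offloading_for_dims (const int dims[GOMP_DIM_MAX])
{
  /* Single-statement bodies, so neither the loop nor the if needs
     braces.  */
  for (unsigned ix = 0; ix != GOMP_DIM_MAX; ix++)
    if (dims[ix] > 1)
      return false;
  return true;
}
```

Returning early replaces the flag variable plus break, which is one way
to keep the loop body down to a single statement.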

>> +	   executable directqives be used, or runtime library calls be

Typo.


Bernd


* Re: Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading"
  2016-02-10 13:25                   ` Bernd Schmidt
@ 2016-02-10 14:40                     ` Thomas Schwinge
  2016-02-10 15:27                       ` Bernd Schmidt
  0 siblings, 1 reply; 25+ messages in thread
From: Thomas Schwinge @ 2016-02-10 14:40 UTC (permalink / raw)
  To: Bernd Schmidt, Jakub Jelinek; +Cc: gcc-patches, Tom de Vries

Hi!

On Wed, 10 Feb 2016 14:25:50 +0100, Bernd Schmidt <bschmidt@redhat.com> wrote:
> On 02/10/2016 12:49 PM, Thomas Schwinge wrote:
> > [...]
> 
> I think this has to be considered after gcc-6.

Hmm, I see.


> In general, what's the 
> state of OpenACC these days?

Much improved compared to GCC 5.  :-) Anything specific you'd like me to
elaborate on?  <https://gcc.gnu.org/wiki/OpenACC> should be fairly
accurate.


> I'm slightly confused by the interface between offloaded code and 
> libgomp. It looks like you're collecting avoid-offloading flags 
> per-function, but then when things get registered, it seems like a 
> per-image flag.

(Per-image flag that affects all offloading for a given offloading type,
even.)

> Is that right? It seems like too large a hammer.

Yes, we need a hammer that big: we have to ensure consistency between
data regions on the device and code offloading to the device, as
otherwise we'll very easily run into inconsistencies, because of the
non-shared memory.  In the general case, it's "all or nothing": you
either have to offload all kernels or none of them.
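
To make the "all or nothing" point concrete, here is a minimal sketch.
It is not the actual interface between the offloaded code and libgomp:
the struct and function names are hypothetical.  It only assumes what is
described above, namely that each offloaded function carries its own
avoid-offloading flag and that these are folded into one per-image flag
at registration time:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-function record; not a libgomp data structure.  */
struct offload_fn
{
  const char *name;
  bool avoid_offloading;	/* Set if the kernel would run sequentially.  */
};

/* Hypothetical: fold the per-function flags into one per-image flag.
   A single flagged function is enough to avoid offloading for the
   whole image, so that data regions and code stay on the same side.  */
static bool
image_avoid_offloading (const struct offload_fn *fns, size_t n)
{
  for (size_t i = 0; i < n; i++)
    if (fns[i].avoid_offloading)
      return true;
  return false;
}
```

With two kernels of which only one failed to parallelize, the per-image
flag still ends up set, which is exactly the "big hammer" behavior in
question.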


> >> [...]
> 
> Avoid unnecessary braces.
> 
> >> [...]
> 
> Typo.

Thanks for the review; fixed.


Grüße
 Thomas


* Re: Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading"
  2016-02-10 14:40                     ` Thomas Schwinge
@ 2016-02-10 15:27                       ` Bernd Schmidt
  2016-02-10 16:23                         ` Thomas Schwinge
  0 siblings, 1 reply; 25+ messages in thread
From: Bernd Schmidt @ 2016-02-10 15:27 UTC (permalink / raw)
  To: Thomas Schwinge, Jakub Jelinek; +Cc: gcc-patches, Tom de Vries

On 02/10/2016 03:39 PM, Thomas Schwinge wrote:

> Yes, we need a hammer that big: we have to ensure consistency between
> data regions on the device and code offloading to the device, as
> otherwise we'll very easily run into inconsistencies, because of the
> non-shared memory.  In the general case, it's "all or nothing": you
> either have to offload all kernels or none of them.

That's unfortunately not the impression I got from the earlier 
discussion, and this seems to imply that one unprofitable kernel would 
disable all the others - IMO this is not acceptable. There need to be 
more compiler smarts to figure out whether a kernel is a valid candidate 
for skipping the offloading.


Bernd


* Re: Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading"
  2016-02-10 15:27                       ` Bernd Schmidt
@ 2016-02-10 16:23                         ` Thomas Schwinge
  2016-02-10 16:37                           ` Bernd Schmidt
  0 siblings, 1 reply; 25+ messages in thread
From: Thomas Schwinge @ 2016-02-10 16:23 UTC (permalink / raw)
  To: Bernd Schmidt, Jakub Jelinek; +Cc: gcc-patches, Tom de Vries

Hi!

On Wed, 10 Feb 2016 16:27:40 +0100, Bernd Schmidt <bschmidt@redhat.com> wrote:
> On 02/10/2016 03:39 PM, Thomas Schwinge wrote:
> 
> > Yes, we need a hammer that big: we have to ensure consistency between
> > data regions on the device and code offloading to the device, as
> > otherwise we'll very easily run into inconsistencies, because of the
> > non-shared memory.  In the general case, it's "all or nothing": you
> > either have to offload all kernels or none of them.
> 
> That's unfortunately not the impression I got from the earlier 
> discussion

:-(

> and this seems to imply that one unprofitable kernel would 
> disable all the others

Correct.

> - IMO this is not acceptable.

Why?  A user of GCC has no intrinsic interest in getting OpenACC kernels
constructs' code offloaded; the user wants his code to execute as fast as
possible.

If you consider the whole of OpenACC kernels code offloading as a
compiler optimization, then it's fine for GCC to abort this
"optimization" if it's reasonably clear that this transformation (code
offloading) will not be profitable -- just like what GCC does with other
possible code optimizations/transformations.  As I've said before,
profiling the execution times of several real-world codes has shown that
under the assumption that parloops fails to parallelize one kernel (one
out of possibly many), this one kernel has always been a "hot spot", and
avoiding offloading in this case has always helped prevent performance
degradation below host-fallback performance.

It's of course unfortunate that we have to disable our offloading
machinery for a lot of codes using OpenACC kernels, but given the current
state of OpenACC kernels parallelization analysis (parloops), doing so is
still profitable for a user, compared to regressed performance with
single-threaded offloaded execution.

Of course...

> There need to be 
> more compiler smarts to figure out whether a kernel is a valid candidate 
> for skipping the offloading.

... that would be better, obviously.  But, I suggest we work on that
incrementally, after fixing the performance regression with my "avoid
offloading" patch.

I have difficulties coming up with an algorithm/parametrization to have
the compiler/runtime decide whether offloading will be profitable given
input parameters such as a ratio of parallelized/single-threaded kernels.
So I'm all ears for suggestions in that regard.  Consider: if we encounter
a single-threaded kernel, the compiler (parloops) has just given up
"understanding" the user's code.  And again, implementing such heuristics
to me sounds like incremental follow-up projects, quite possibly in
combination with generally improving OpenACC kernels handling/parloops.


Grüße
 Thomas


* Re: Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading"
  2016-02-10 16:23                         ` Thomas Schwinge
@ 2016-02-10 16:37                           ` Bernd Schmidt
  2016-02-10 17:39                             ` Thomas Schwinge
  0 siblings, 1 reply; 25+ messages in thread
From: Bernd Schmidt @ 2016-02-10 16:37 UTC (permalink / raw)
  To: Thomas Schwinge, Jakub Jelinek; +Cc: gcc-patches, Tom de Vries

On 02/10/2016 05:23 PM, Thomas Schwinge wrote:
> Why?  A user of GCC has no intrinsic interest in getting OpenACC kernels
> constructs' code offloaded; the user wants his code to execute as fast as
> possible.
>
> If you consider the whole of OpenACC kernels code offloading as a
> compiler optimization, then it's fine for GCC to abort this
> "optimization" if it's reasonably clear that this transformation (code
> offloading) will not be profitable -- just like what GCC does with other
> possible code optimizations/transformations.

Yes, but if a single kernel (which might not even get executed at 
run-time) can inhibit offloading for the whole program, then we're not 
making an intelligent decision, and IMO violating user expectations. 
IIUC it's also disabling offloading for parallels rather than just 
kernels, which we previously said shouldn't happen.

> As I've said before,
> profiling the execution times of several real-world codes has shown that
> under the assumption that parloops fails to parallelize one kernel (one
> out of possibly many), this one kernel has always been a "hot spot", and
> avoiding offloading in this case has always helped prevent performance
> degradation below host-fallback performance.

IMO a warning for the specific kernel that's problematic would be better 
so that users can selectively apply -fopenacc to files where it is 
profitable.

> It's of course unfortunate that we have to disable our offloading
> machinery for a lot of codes using OpenACC kernels, but given the current
> state of OpenACC kernels parallelization analysis (parloops), doing so is
> still profitable for a user, compared to regressed performance with
> single-threaded offloaded execution.

How often does this occur on real-world code? Will we end up supporting 
OpenACC by not doing offloading at all in the usual case? The way you 
describe it, it sounds like we should recommend that -fopenacc not be 
used in gcc-6 and restore the previous invoke.texi language that marks 
it as experimental.


Bernd


* Re: Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading"
  2016-02-10 16:37                           ` Bernd Schmidt
@ 2016-02-10 17:39                             ` Thomas Schwinge
  2016-02-10 20:07                               ` Bernd Schmidt
  0 siblings, 1 reply; 25+ messages in thread
From: Thomas Schwinge @ 2016-02-10 17:39 UTC (permalink / raw)
  To: Bernd Schmidt, Jakub Jelinek; +Cc: gcc-patches, Tom de Vries

Hi!

On Wed, 10 Feb 2016 17:37:30 +0100, Bernd Schmidt <bschmidt@redhat.com> wrote:
> On 02/10/2016 05:23 PM, Thomas Schwinge wrote:
> > Why?  A user of GCC has no intrinsic interest in getting OpenACC kernels
> > constructs' code offloaded; the user wants his code to execute as fast as
> > possible.
> >
> > If you consider the whole of OpenACC kernels code offloading as a
> > compiler optimization, then it's fine for GCC to abort this
> > "optimization" if it's reasonably clear that this transformation (code
> > offloading) will not be profitable -- just like what GCC does with other
> > possible code optimizations/transformations.
> 
> Yes, but if a single kernel (which might not even get executed at 
> run-time) can inhibit offloading for the whole program, then we're not 
> making an intelligent decision, and IMO violating user expectations. 

Sure, I agree it's a pretty "rough-grained" decision.  (Owing to the
non-shared-memory offloading architecture -- shared-memory offloading
indeed can make such decisions case by case.)

> IIUC it's also disabling offloading for parallels rather than just 
> kernels, which we previously said shouldn't happen.

Ah, you're talking about mixed OpenACC parallel/kernels codes -- I
understood the earlier discussion to apply to parallel-only codes, where
the "avoid offloading" flag will never be set.  In mixed parallel/kernels
code with one un-parallelized kernels construct, offloading would also
(have to be) disabled for the parallel constructs (for the same data
consistency reasons explained before).  The majority of codes I've seen
use either parallel or kernels constructs, typically not both.

> > As I've said before,
> > profiling the execution times of several real-world codes has shown that
> > under the assumption that parloops fails to parallelize one kernel (one
> > out of possibly many), this one kernel has always been a "hot spot", and
> > avoiding offloading in this case has always helped prevent performance
> > degradation below host-fallback performance.
> 
> IMO a warning for the specific kernel that's problematic would be better 

That's something Tom suggested,
<http://news.gmane.org/find-root.php?message_id=%3C569D2059.4010105%40mentor.com%3E>,
and which motivated my patch, in going one step further:

> so that users can selectively apply -fopenacc to files where it is 
> profitable.

This puts it into the hands of the user to selectively mark kernels
constructs as suitable for GCC's current parloops processing (for
example, by disabling OpenACC/offloading on a per-file basis) -- which is
something we wanted to avoid, given the idea that in the future, GCC will
improve, and will be able to handle kernels constructs better, and the
user would then have to re-visit/un-do their earlier changes with each
GCC release, instead of just recompiling their code.

> > It's of course unfortunate that we have to disable our offloading
> > machinery for a lot of codes using OpenACC kernels, but given the current
> > state of OpenACC kernels parallelization analysis (parloops), doing so is
> > still profitable for a user, compared to regressed performance with
> > single-threaded offloaded execution.
> 
> How often does this occur on real-world code?

Quite a lot for code using the kernels construct, as discussed before,
given that parloops fails to handle a lot of constructs in real-world
code.

> Will we end up supporting 
> OpenACC by not doing offloading at all in the usual case?

This whole discussion does not at all apply to the body of OpenACC code
using the parallel instead of the kernels construct, which will be
parallelized/offloaded just fine.

> The way you 
> describe it, it sounds like we should recommend that -fopenacc not be 
> used in gcc-6 and restore the previous invoke.texi language that marks 
> it as experimental.

Huh?  Like, at random, discouraging users from using GCC's SIMD
vectorizer just because that one fails to vectorize some code that it
could/should vectorize?  (Of course, I'm well aware that GCC's SIMD
vectorizer is much more mature than the OpenACC kernels/parloops
handling; it's seen many more years of development.)

Certainly we should document that there is still a lot of room for
improvement in OpenACC kernels handling (just like it's the case for a
lot of other generic compiler optimizations) -- and we're doing exactly
that on <https://gcc.gnu.org/wiki/OpenACC>.  I don't follow how that
translates to discouraging use of -fopenacc however?


Grüße
 Thomas


* Re: Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading"
  2016-02-10 17:39                             ` Thomas Schwinge
@ 2016-02-10 20:07                               ` Bernd Schmidt
  2016-02-11 10:02                                 ` Thomas Schwinge
  0 siblings, 1 reply; 25+ messages in thread
From: Bernd Schmidt @ 2016-02-10 20:07 UTC (permalink / raw)
  To: Thomas Schwinge, Jakub Jelinek; +Cc: gcc-patches, Tom de Vries

On 02/10/2016 06:37 PM, Thomas Schwinge wrote:
> On Wed, 10 Feb 2016 17:37:30 +0100, Bernd Schmidt <bschmidt@redhat.com> wrote:
>> IIUC it's also disabling offloading for parallels rather than just
>> kernels, which we previously said shouldn't happen.
>
> Ah, you're talking about mixed OpenACC parallel/kernels codes -- I
> understood the earlier discussion to apply to parallel-only codes, where
> the "avoid offloading" flag will never be set.  In mixed parallel/kernels
> code with one un-parallelized kernels construct, offloading would also
> (have to be) disabled for the parallel constructs (for the same data
> consistency reasons explained before).  The majority of codes I've seen
> use either parallel or kernels constructs, typically not both.

That's not something I'd want to hard-code into the compiler however. 
Don't know how Jakub feels but to me this approach is way too 
coarse-grained.

> Huh?  Like, at random, discouraging users from using GCC's SIMD
> vectorizer just because that one fails to vectorize some code that it
> could/should vectorize?  (Of course, I'm well aware that GCC's SIMD
> vectorizer is much more mature than the OpenACC kernels/parloops
> handling; it's seen many more years of development.)

Your description sounded like it's not merely failing to optimize, but 
actively hurting performance for a large selection of real world codes. 
If I understood that correctly, we need to document this in the manual.


Bernd


* Re: Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading"
  2016-02-10 20:07                               ` Bernd Schmidt
@ 2016-02-11 10:02                                 ` Thomas Schwinge
  2016-02-11 15:58                                   ` Bernd Schmidt
  0 siblings, 1 reply; 25+ messages in thread
From: Thomas Schwinge @ 2016-02-11 10:02 UTC (permalink / raw)
  To: Bernd Schmidt, Jakub Jelinek; +Cc: gcc-patches, Tom de Vries

Hi!

There are two issues here: 1. "avoid offloading" mechanism, and 2. "avoid
offloading" policy.

On Wed, 10 Feb 2016 21:07:29 +0100, Bernd Schmidt <bschmidt@redhat.com> wrote:
> On 02/10/2016 06:37 PM, Thomas Schwinge wrote:
> > On Wed, 10 Feb 2016 17:37:30 +0100, Bernd Schmidt <bschmidt@redhat.com> wrote:
> >> IIUC it's also disabling offloading for parallels rather than just
> >> kernels, which we previously said shouldn't happen.
> >
> > Ah, you're talking about mixed OpenACC parallel/kernels codes -- I
> > understood the earlier discussion to apply to parallel-only codes, where
> > the "avoid offloading" flag will never be set.  In mixed parallel/kernels
> > code with one un-parallelized kernels construct, offloading would also
> > (have to be) disabled for the parallel constructs (for the same data
> > consistency reasons explained before).

The "avoid offloading" mechanism.  Owing to the non-shared-memory
offloading architecture, if the compiler/runtime decides to "avoid
offloading", then this has to apply to *all* code offloading, for data
consistency reasons.  Do we agree on that?
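
To make the data consistency argument concrete, here is a minimal sketch of
a mixed parallel/kernels code.  It is purely illustrative: the function name
and array size are made up and not taken from any test case in this thread.
Both constructs share the device copy of "a" set up by the enclosing data
region.  On a non-shared-memory target, running only the kernels construct
on the host while the parallel construct still runs on the device would
leave the host copy of "a" unwritten, because "create" never copies back.

```c
#include <assert.h>

#define N 16

/* Fill A in a parallel construct, then sum it in a kernels construct.
   Both constructs depend on the same data region, so they have to
   execute in the same memory space: either both on the device, or both
   as host fallback.  */
static int
fill_then_sum (void)
{
  int a[N];
  int sum = 0;

#pragma acc data create (a) copy (sum)
  {
#pragma acc parallel loop present (a)
    for (int i = 0; i < N; i++)
      a[i] = i;

#pragma acc kernels present (a)
    for (int i = 0; i < N; i++)
      sum += a[i];
  }

  /* 0 + 1 + ... + (N - 1).  */
  assert (sum == N * (N - 1) / 2);
  return sum;
}
```

Compiled without -fopenacc, the pragmas are ignored and everything runs on
the host, which is the same memory-space situation as a whole-program host
fallback.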

> > The majority of codes I've seen
> > use either parallel or kernels constructs, typically not both.
> 
> That's not something I'd want to hard-code into the compiler however. 
> Don't know how Jakub feels but to me this approach is way too 
> coarse-grained.

The "avoid offloading" policy.  I'm looking into improving that.


> > Huh?  Like, at random, discouraging users from using GCC's SIMD
> > vectorizer just because that one fails to vectorize some code that it
> > could/should vectorize?  (Of course, I'm well aware that GCC's SIMD
> > vectorizer is much more mature than the OpenACC kernels/parloops
> > handling; it's seen many more years of development.)
> 
> Your description sounded like it's not merely failing to optimize, but 
> actively hurting performance for a large selection of real world codes. 

Indeed single-threaded (that is, un-parallelized OpenACC kernels
construct) offloading execution is hurting performance (data copy
overhead; kernel launch overhead; compared to a single CPU core, a single
GPU core has higher memory access latencies and is slower) -- hence the
idea to resort to host-fallback execution in such a situation.
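
As a rough sketch of the trade-off just described, the comparison could be
modelled as below.  All names and cost fields are hypothetical; nothing like
this exists in GCC or libgomp, and obtaining realistic numbers is exactly
the hard part.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical cost inputs, all in the same time unit (for example,
   microseconds).  None of these names exist in GCC or libgomp.  */
struct offload_cost
{
  double copy_in;      /* host -> device data transfer */
  double copy_out;     /* device -> host data transfer */
  double launch;       /* kernel launch overhead */
  double device_exec;  /* execution time on the device */
  double host_exec;    /* execution time on one CPU core */
};

/* Return true if offloading is expected to beat host-fallback
   execution under this simple additive model.  */
static bool
offload_profitable_p (const struct offload_cost *c)
{
  assert (c != 0);
  double device_total
    = c->copy_in + c->launch + c->device_exec + c->copy_out;
  return device_total < c->host_exec;
}
```

For an un-parallelized kernels construct, device_exec alone already exceeds
host_exec (a single GPU core being slower than a single CPU core), so the
comparison fails before the copy and launch overheads are even counted.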

> If I understood that correctly, we need to document this in the manual.

OK; prototyping that on <https://gcc.gnu.org/wiki/OpenACC>.


Grüße
 Thomas


* Re: Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading"
  2016-02-11 10:02                                 ` Thomas Schwinge
@ 2016-02-11 15:58                                   ` Bernd Schmidt
  0 siblings, 0 replies; 25+ messages in thread
From: Bernd Schmidt @ 2016-02-11 15:58 UTC (permalink / raw)
  To: Thomas Schwinge, Jakub Jelinek; +Cc: gcc-patches, Tom de Vries

On 02/11/2016 11:01 AM, Thomas Schwinge wrote:
>
> The "avoid offloading" mechanism.  Owing to the non-shared-memory
> offloading architecture, if the compiler/runtime decides to "avoid
> offloading", then this has to apply to *all* code offloading, for data
> consistency reasons.  Do we agree on that?

Not necessarily, I think. It should be possible to determine whether 
some offloaded code blocks are independent from each other. (That 
doesn't mean we currently have any good way of making such decisions. 
libgomp or even the ptx compiler are probably too late and don't have 
the necessary information anymore).


Bernd


* Re: [gomp4] Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading"
  2016-01-21 21:55   ` [gomp4] Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading" (was: [PATCH] Add fopt-info-oacc) Thomas Schwinge
  2016-01-22  7:40     ` [gomp4] Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading" Thomas Schwinge
@ 2016-06-30 21:46     ` Thomas Schwinge
  2016-11-03 17:59     ` [gomp4] Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading" (was: [PATCH] Add fopt-info-oacc) Cesar Philippidis
  2019-01-31 17:16     ` [gomp4] Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading" Thomas Schwinge
  3 siblings, 0 replies; 25+ messages in thread
From: Thomas Schwinge @ 2016-06-30 21:46 UTC (permalink / raw)
  To: gcc-patches


Hi!

On Thu, 21 Jan 2016 22:54:26 +0100, I wrote:
> Committed to gomp-4_0-branch in r232709:
> 
> commit 41a76d233e714fd7b79dc1f40823f607c38306ba
> Author: tschwinge <tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4>
> Date:   Thu Jan 21 21:52:50 2016 +0000
> 
>     Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading"

In gomp-4_0-branch r237895, I committed the following patch to 'only
trigger the "avoid offloading" mechanism for -O2 and higher', resolving
the confusing case that for -O0 and -O1 we'd not emit the diagnostic but
still trigger the "avoid offloading" mechanism.
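
For illustration, the behavior change can be modelled as two small decision
functions.  This is a simplified sketch, not the nvptx.c code: the struct and
function names are made up, and the -foffload-force check is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Outcome for one OpenACC kernels function; a simplified model only.  */
struct outcome
{
  bool warn;           /* diagnostic emitted */
  bool avoid_offload;  /* "omp avoid offloading" attribute set */
};

/* Before r237895: the diagnostic is gated on -O2 and higher, but the
   attribute is set at every optimization level.  */
static struct outcome
before_r237895 (int optimize, bool unparallelized)
{
  struct outcome o = { false, false };
  if (unparallelized)
    {
      o.warn = optimize >= 2;
      o.avoid_offload = true;
    }
  return o;
}

/* After r237895: both the diagnostic and the attribute are gated on
   -O2 and higher.  */
static struct outcome
after_r237895 (int optimize, bool unparallelized)
{
  assert (optimize >= 0);
  struct outcome o = { false, false };
  if (optimize >= 2 && unparallelized)
    {
      o.warn = true;
      o.avoid_offload = true;
    }
  return o;
}
```

In this model, -O0 and -O1 with an un-parallelized kernels construct used to
give "no warning, but avoid offloading"; now they give "no warning, offload".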

commit 68ce05b476b68b50c2ed341ae6a77279850edbb1
Author: tschwinge <tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Thu Jun 30 20:46:07 2016 +0000

    Only trigger the "avoid offloading" mechanism for -O2 and higher
    
    	gcc/
    	* config/nvptx/nvptx.c (nvptx_goacc_validate_dims): Only trigger
    	the "avoid offloading" mechanism for -O2 and higher.
    	libgomp/
    	* testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c:
    	Update.
    	* testsuite/libgomp.oacc-fortran/avoid-offloading-1.f: Update.
    
    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/branches/gomp-4_0-branch@237895 138bc75d-0d04-0410-961f-82ee72b054a4
---
 gcc/ChangeLog.gomp                                    |  5 +++++
 gcc/config/nvptx/nvptx.c                              | 19 ++++++++++---------
 libgomp/ChangeLog.gomp                                |  6 ++++++
 .../libgomp.oacc-c-c++-common/avoid-offloading-1.c    | 10 +++++++++-
 .../libgomp.oacc-fortran/avoid-offloading-1.f         | 12 +++++++++++-
 5 files changed, 41 insertions(+), 11 deletions(-)

diff --git gcc/ChangeLog.gomp gcc/ChangeLog.gomp
index 9bc1fbe..8c88119 100644
--- gcc/ChangeLog.gomp
+++ gcc/ChangeLog.gomp
@@ -1,3 +1,8 @@
+2016-06-30  Thomas Schwinge  <thomas@codesourcery.com>
+
+	* config/nvptx/nvptx.c (nvptx_goacc_validate_dims): Only trigger
+	the "avoid offloading" mechanism for -O2 and higher.
+
 2016-06-10  James Norris  <jnorris@codesourcery.com>
 
 	Backport from mainline r236098.
diff --git gcc/config/nvptx/nvptx.c gcc/config/nvptx/nvptx.c
index 6d787b0..09a5a62 100644
--- gcc/config/nvptx/nvptx.c
+++ gcc/config/nvptx/nvptx.c
@@ -4152,8 +4152,13 @@ nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
   /* Detect if a function is unsuitable for offloading.  */
   if (!flag_offload_force && decl)
     {
+      /* Trigger the "avoid offloading" mechanism if a OpenACC kernels
+	 construct could not be parallelized, but only do that for -O2 and
+	 higher, as otherwise we're not expecting any parallelization to
+	 happen.  */
       tree oacc_function_attr = get_oacc_fn_attrib (decl);
-      if (oacc_function_attr
+      if (optimize >= 2
+	  && oacc_function_attr
 	  && oacc_fn_attrib_kernels_p (oacc_function_attr))
 	{
 	  bool avoid_offloading_p = true;
@@ -4167,14 +4172,10 @@ nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
 	    }
 	  if (avoid_offloading_p)
 	    {
-	      /* OpenACC kernels constructs will never be parallelized for
-		 optimization levels smaller than -O2; avoid the diagnostic in
-		 this case.  */
-	      if (optimize >= 2)
-		warning_at (DECL_SOURCE_LOCATION (decl), 0,
-			    "OpenACC kernels construct will be executed "
-			    "sequentially; will by default avoid offloading "
-			    "to prevent data copy penalty");
+	      warning_at (DECL_SOURCE_LOCATION (decl), 0,
+			  "OpenACC kernels construct will be executed"
+			  " sequentially; will by default avoid offloading to"
+			  " prevent data copy penalty");
 	      DECL_ATTRIBUTES (decl)
 		= tree_cons (get_identifier ("omp avoid offloading"),
 			     NULL_TREE, DECL_ATTRIBUTES (decl));
diff --git libgomp/ChangeLog.gomp libgomp/ChangeLog.gomp
index af4e0d5..07fe8b7 100644
--- libgomp/ChangeLog.gomp
+++ libgomp/ChangeLog.gomp
@@ -1,3 +1,9 @@
+2016-06-30  Thomas Schwinge  <thomas@codesourcery.com>
+
+	* testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c:
+	Update.
+	* testsuite/libgomp.oacc-fortran/avoid-offloading-1.f: Update.
+
 2016-06-10  Thomas Schwinge  <thomas@codesourcery.com>
 
 	PR middle-end/71373
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c
index 8f50ba3..d5fff2d 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c
@@ -15,7 +15,15 @@ int main(void)
 
   if (x != 33)
     __builtin_abort();
-#if defined ACC_DEVICE_TYPE_host || defined ACC_DEVICE_TYPE_nvidia
+#if defined ACC_DEVICE_TYPE_nvidia
+# if !defined __OPTIMIZE__
+  if (y != 0)
+    __builtin_abort();
+# else
+  if (y != 1)
+    __builtin_abort();
+# endif
+#elif defined ACC_DEVICE_TYPE_host
   if (y != 1)
     __builtin_abort();
 #else
diff --git libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f
index 452afe1..da89b93 100644
--- libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f
+++ libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f
@@ -4,6 +4,10 @@
 ! { dg-additional-options "-cpp" }
 ! The warning is only triggered for -O2 and higher.
 ! { dg-xfail-if "n/a" { openacc_nvidia_accel_selected } { "-O0" "-O1" } { "" } }
+! As __OPTIMIZE__ is defined for -O1 and higher, we don't have an (easy) way to
+! distinguish -O1 (where we will offload) from -O2 (where we won't offload), so
+! for -O1 testing, we expect to abort.
+! { dg-xfail-run-if "" { openacc_nvidia_accel_selected } { "-O1" } { "" } }
 
       IMPLICIT NONE
       INCLUDE "openacc_lib.h"
@@ -19,7 +23,13 @@
 !$ACC END DATA
 
       IF (X .NE. 33) CALL ABORT
-#if defined ACC_DEVICE_TYPE_host || defined ACC_DEVICE_TYPE_nvidia
+#if defined ACC_DEVICE_TYPE_nvidia
+# if !defined __OPTIMIZE__
+      IF (Y) CALL ABORT
+# else
+      IF (.NOT. Y) CALL ABORT
+# endif
+#elif defined ACC_DEVICE_TYPE_host
       IF (.NOT. Y) CALL ABORT
 #else
 # error Not ported to this ACC_DEVICE_TYPE


Grüße
 Thomas



* Re: [gomp4] Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading" (was: [PATCH] Add fopt-info-oacc)
  2016-01-21 21:55   ` [gomp4] Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading" (was: [PATCH] Add fopt-info-oacc) Thomas Schwinge
  2016-01-22  7:40     ` [gomp4] Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading" Thomas Schwinge
  2016-06-30 21:46     ` Thomas Schwinge
@ 2016-11-03 17:59     ` Cesar Philippidis
  2019-01-31 17:16     ` [gomp4] Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading" Thomas Schwinge
  3 siblings, 0 replies; 25+ messages in thread
From: Cesar Philippidis @ 2016-11-03 17:59 UTC (permalink / raw)
  To: Thomas Schwinge, gcc-patches, Jakub Jelinek

This patch has proven to be effective at warning users when the compiler
falls back to host execution due to insufficient parallelism (at least
from parloops' perspective) inside kernels regions. At the moment, acc
kernels are restricted to gang-level parallelism. Consequently, if
parloops fails to detect any parallelism, the kernels region will only
execute on a single GPU thread, which is several orders of magnitude
slower than a single CPU thread.

Before I rebase this patch, are these changes conceptually OK for GCC7?
Looking back at the old email thread, there were some disagreements on
these types of warnings. One significant example where parloops
fails to detect any parallelism is in SPEC_ACCEL. A test that should
take an hour or so, takes longer than a week with the host fallback. At
least this warning provides the user with some feedback that his/her
program might run slow.

Cesar

On 01/21/2016 01:54 PM, Thomas Schwinge wrote:
> Hi!
> 
> On Mon, 18 Jan 2016 18:26:49 +0100, Tom de Vries <Tom_deVries@mentor.com> wrote:
>> This patch introduces an option fopt-info-oacc.
>>
>> When using the option like this with a kernels region in kernels-loop.c 
>> that parloops does not manage to parallelize:
>> ...
>> $ gcc kernels-loop.c -S -O2 -fopenacc -fopt-info-oacc-all
>> ...
>>
>> we get a message:
>> ...
>> kernels-loop.c:23:9: note: kernels region executed sequentially. 
>> Consider mapping it to host execution, to avoid data copy penalty.
>> ...
> 
> Yay for helping the user understand what the compiler is doing!
> 
>> Any comments?
> 
> Telling from real-world code that we've been having a look at, when the
> above situation happens, we're -- in the vast majority of all cases -- in
> a situation where we generally want to avoid offloading (unless
> explicitly requested), "to avoid data copy penalty" as well as typically
> much slower single-threaded execution on the GPU.  Obviously, that will
> have to be revisited as parloops (or any other mechanism in GCC) is able
> to better understand/use the parallelism in OpenACC kernels constructs.
> 
> So, building upon Tom's patch, I have implemented an "avoid offloading"
> flag given the presence of one un-parallelized OpenACC kernels construct.
> This is currently only enabled for OpenACC kernels constructs, in
> combination with nvptx offloading, but I think the general scheme will be
> useful also for other constructs as well as other (non-shared memory)
> offloading targets.
> 
> Also, "avoid offloading" is just a default: if a user explicitly
> requested the use of, for example, a Nvidia GPU (with an
> acc_init(acc_device_nvidia) call, or by setting the
> ACC_DEVICE_TYPE=nvidia environment variable, for example), then we cannot
> apply host-fallback execution, because in this case the user can
> rightfully assume Nvidia GPU semantics (async clause works, and so on).
> 
> 
> The new warning (very similar to the one that Tom proposed) also
> uncovered a bunch of OpenACC kernels test cases in libgomp that did not
> have OpenACC kernels processing enabled (-ftree-parallelize-loops), but
> which parloops can handle fine once that is enabled -- and also a bunch
> of OpenACC kernels test cases that parloops doesn't handle but which
> looked as if they were meant to be handled.  (Maybe I'm wrong about that, though.)  Anyway,
> Tom, would you please make a note to audit all use of -foffload-force in
> the libgomp testsuite?  (It is appropriate for all test cases that
> parloops truly is not meant to handle, but for all others, that flag
> should probably be removed and instead an XFAILed dg-bogus directive
> added, so that we will notice (XPASS) once it does handle them.)
> 
> 
> I've also added a new command-line option, -foffload-force, that restores
> the current behavior and inhibits the "avoid offloading" handling.  This is
> primarily meant for GCC (libgomp) testsuite usage, but could occasionally
> also be useful for users.  Considering alternatives (that can be applied
> in a more fine-grained way, case by case per OpenACC kernels construct):
> 
> 1) a new GCC-specific pragma, for example:
> 
>     #pragma GCC force offloading
>     #pragma acc kernels
>       [un-parallelizable stuff]
> 
> 2) a new GCC-specific clause, for example in the implementation
> namespace, starting with "_":
> 
>     #pragma acc kernels _force_offloading
>       [un-parallelizable stuff]
> 
> ..., the -foffload-force flag was the simplest solution.  (Because, if
> you're going to alter the sources anyway, you might as well just remove
> the one offending OpenACC kernels construct...)
> 
> 
> Committed to gomp-4_0-branch in r232709:
> 
> commit 41a76d233e714fd7b79dc1f40823f607c38306ba
> Author: tschwinge <tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4>
> Date:   Thu Jan 21 21:52:50 2016 +0000
> 
>     Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading"
>     
>     	gcc/
>     	* common.opt: Add -foffload-force.
>     	* lto-wrapper.c (merge_and_complain, append_compiler_options):
>     	Handle it.
>     	* doc/invoke.texi: Document it.
>     	* config/nvptx/mkoffload.c (struct id_map): Add "flags" member.
>     	(record_id): Parse, and set it.
>     	(process): Use it.
>     	* config/nvptx/nvptx.c (nvptx_attribute_table): Add "omp avoid
>     	offloading".
>     	(nvptx_record_offload_symbol): Use it.
>     	(nvptx_goacc_validate_dims): Set it.
>     	libgomp/
>     	* target.c (GOMP_offload_register_ver)
>     	(GOMP_offload_unregister_ver, gomp_init_device)
>     	(gomp_unload_device, gomp_offload_target_available_p): Handle and
>     	document "avoid offloading" ("host_table =NULL").
>     	(resolve_device): Document "avoid offloading".
>     	* oacc-init.c (resolve_device): Likewise.
>     	* libgomp.texi (Enabling OpenACC): Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c: New
>     	file.
>     	* testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-fortran/avoid-offloading-1.f: Likewise.
>     	* testsuite/libgomp.oacc-fortran/avoid-offloading-2.f: Likewise.
>     	* testsuite/libgomp.oacc-fortran/avoid-offloading-3.f: Likewise.
>     	* testsuite/libgomp.oacc-c++/non-scalar-data.C: Set
>     	"-foffload-force".
>     	* testsuite/libgomp.oacc-c-c++-common/abort-3.c: Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/abort-4.c: Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/default-1.c: Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/deviceptr-1.c: Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-1.c: Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-empty.c: Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-1.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-2.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-3.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-4.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-5.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-1.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-2.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-3.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-4.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-5.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-6.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-vector-1.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-vector-2.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-1.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-2.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-3.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-4.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-5.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-6.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-7.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-reduction-1.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-fortran/default-1.f90: Likewise.
>     	* testsuite/libgomp.oacc-fortran/if-1.f90: Likewise.
>     	* testsuite/libgomp.oacc-fortran/kernels-acc-loop-reduction-2.f90:
>     	Likewise.
>     	* testsuite/libgomp.oacc-fortran/kernels-acc-loop-reduction.f90:
>     	Likewise.
>     	* testsuite/libgomp.oacc-fortran/kernels-collapse-3.f90: Likewise.
>     	* testsuite/libgomp.oacc-fortran/kernels-collapse-4.f90: Likewise.
>     	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-gang-2.f90:
>     	Likewise.
>     	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-gang-3.f90:
>     	Likewise.
>     	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-gang-6.f90:
>     	Likewise.
>     	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-vector-1.f90:
>     	Likewise.
>     	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-vector-2.f90:
>     	Likewise.
>     	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-1.f90:
>     	Likewise.
>     	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-2.f90:
>     	Likewise.
>     	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-3.f90:
>     	Likewise.
>     	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-4.f90:
>     	Likewise.
>     	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-5.f90:
>     	Likewise.
>     	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-6.f90:
>     	Likewise.
>     	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-7.f90:
>     	Likewise.
>     	* testsuite/libgomp.oacc-fortran/kernels-reduction-1.f90:
>     	Likewise.
>     
>     	libgomp/
>     	* testsuite/libgomp.oacc-c-c++-common/asyncwait-1.c: Set
>     	"-ftree-parallelize-loops2".
>     	* testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/default-1.c: Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/deviceptr-1.c: Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/host_data-1.c: Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/if-1.c: Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-1.c: Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/nested-2.c: Likewise.
>     	* testsuite/libgomp.oacc-fortran/asyncwait-1.f90: Likewise.
>     	* testsuite/libgomp.oacc-fortran/asyncwait-2.f90: Likewise.
>     	* testsuite/libgomp.oacc-fortran/asyncwait-3.f90: Likewise.
>     	* testsuite/libgomp.oacc-fortran/combined-directives-1.f90:
>     	Likewise.
>     	* testsuite/libgomp.oacc-fortran/default-1.f90: Likewise.
>     	* testsuite/libgomp.oacc-fortran/deviceptr-1.f90: Likewise.
>     	* testsuite/libgomp.oacc-fortran/if-1.f90: Likewise.
>     	* testsuite/libgomp.oacc-fortran/kernels-acc-loop-reduction-2.f90:
>     	Likewise.
>     	* testsuite/libgomp.oacc-fortran/kernels-acc-loop-reduction.f90:
>     	Likewise.
>     	* testsuite/libgomp.oacc-fortran/kernels-map-1.f90: Likewise.
>     	* testsuite/libgomp.oacc-fortran/non-scalar-data.f90: Likewise.
>     
>     git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/branches/gomp-4_0-branch@232709 138bc75d-0d04-0410-961f-82ee72b054a4
> ---
>  gcc/ChangeLog.gomp                                 |  14 ++
>  gcc/common.opt                                     |   4 +
>  gcc/config/nvptx/mkoffload.c                       |  73 +++++++++-
>  gcc/config/nvptx/nvptx.c                           |  42 +++++-
>  gcc/doc/invoke.texi                                |  11 +-
>  gcc/lto-wrapper.c                                  |   2 +
>  libgomp/ChangeLog.gomp                             | 150 +++++++++++++++++++++
>  libgomp/libgomp.texi                               |   8 ++
>  libgomp/oacc-init.c                                |   8 +-
>  libgomp/target.c                                   |  86 ++++++++----
>  .../testsuite/libgomp.oacc-c++/non-scalar-data.C   |   3 +-
>  .../testsuite/libgomp.oacc-c-c++-common/abort-3.c  |   3 +-
>  .../testsuite/libgomp.oacc-c-c++-common/abort-4.c  |   3 +-
>  .../libgomp.oacc-c-c++-common/asyncwait-1.c        |   1 +
>  .../libgomp.oacc-c-c++-common/avoid-offloading-1.c |  25 ++++
>  .../libgomp.oacc-c-c++-common/avoid-offloading-2.c |  38 ++++++
>  .../libgomp.oacc-c-c++-common/avoid-offloading-3.c |  29 ++++
>  .../combined-directives-1.c                        |   2 +-
>  .../libgomp.oacc-c-c++-common/default-1.c          |   4 +-
>  .../libgomp.oacc-c-c++-common/deviceptr-1.c        |   4 +-
>  .../libgomp.oacc-c-c++-common/host_data-1.c        |   1 +
>  libgomp/testsuite/libgomp.oacc-c-c++-common/if-1.c |   2 +-
>  .../libgomp.oacc-c-c++-common/kernels-1.c          |   4 +-
>  .../kernels-alias-ipa-pta-2.c                      |   5 +-
>  .../kernels-alias-ipa-pta-3.c                      |   5 +-
>  .../kernels-alias-ipa-pta.c                        |   5 +-
>  .../libgomp.oacc-c-c++-common/kernels-empty.c      |   3 +
>  .../kernels-loop-and-seq-2.c                       |   3 +-
>  .../kernels-loop-and-seq-5.c                       |   3 +-
>  .../kernels-loop-and-seq-6.c                       |   3 +-
>  .../kernels-loop-and-seq.c                         |   3 +-
>  .../kernels-loop-collapse.c                        |   3 +-
>  .../kernels-private-vars-local-worker-1.c          |   3 +-
>  .../kernels-private-vars-local-worker-2.c          |   3 +-
>  .../kernels-private-vars-local-worker-3.c          |   3 +-
>  .../kernels-private-vars-local-worker-4.c          |   3 +-
>  .../kernels-private-vars-local-worker-5.c          |   3 +-
>  .../kernels-private-vars-loop-gang-1.c             |   3 +-
>  .../kernels-private-vars-loop-gang-2.c             |   3 +-
>  .../kernels-private-vars-loop-gang-3.c             |   3 +-
>  .../kernels-private-vars-loop-gang-4.c             |   3 +-
>  .../kernels-private-vars-loop-gang-5.c             |   3 +-
>  .../kernels-private-vars-loop-gang-6.c             |   4 +
>  .../kernels-private-vars-loop-vector-1.c           |   3 +-
>  .../kernels-private-vars-loop-vector-2.c           |   3 +-
>  .../kernels-private-vars-loop-worker-1.c           |   3 +-
>  .../kernels-private-vars-loop-worker-2.c           |   3 +-
>  .../kernels-private-vars-loop-worker-3.c           |   3 +-
>  .../kernels-private-vars-loop-worker-4.c           |   3 +-
>  .../kernels-private-vars-loop-worker-5.c           |   3 +-
>  .../kernels-private-vars-loop-worker-6.c           |   3 +-
>  .../kernels-private-vars-loop-worker-7.c           |   3 +-
>  .../kernels-reduction-1.c                          |   3 +-
>  .../testsuite/libgomp.oacc-c-c++-common/nested-2.c |   2 +-
>  .../testsuite/libgomp.oacc-fortran/asyncwait-1.f90 |   1 +
>  .../testsuite/libgomp.oacc-fortran/asyncwait-2.f90 |   1 +
>  .../testsuite/libgomp.oacc-fortran/asyncwait-3.f90 |   1 +
>  .../libgomp.oacc-fortran/avoid-offloading-1.f      |  29 ++++
>  .../libgomp.oacc-fortran/avoid-offloading-2.f      |  40 ++++++
>  .../libgomp.oacc-fortran/avoid-offloading-3.f      |  30 +++++
>  .../libgomp.oacc-fortran/combined-directives-1.f90 |   1 +
>  .../testsuite/libgomp.oacc-fortran/default-1.f90   |   3 +
>  .../testsuite/libgomp.oacc-fortran/deviceptr-1.f90 |   5 +-
>  libgomp/testsuite/libgomp.oacc-fortran/if-1.f90    |   5 +-
>  .../kernels-acc-loop-reduction-2.f90               |   5 +
>  .../kernels-acc-loop-reduction.f90                 |   5 +
>  .../libgomp.oacc-fortran/kernels-collapse-3.f90    |   2 +
>  .../libgomp.oacc-fortran/kernels-collapse-4.f90    |   2 +
>  .../libgomp.oacc-fortran/kernels-independent.f90   |   2 +-
>  .../libgomp.oacc-fortran/kernels-map-1.f90         |   3 +
>  .../kernels-private-vars-loop-gang-2.f90           |   2 +
>  .../kernels-private-vars-loop-gang-3.f90           |   2 +
>  .../kernels-private-vars-loop-gang-6.f90           |   2 +
>  .../kernels-private-vars-loop-vector-1.f90         |   2 +
>  .../kernels-private-vars-loop-vector-2.f90         |   2 +
>  .../kernels-private-vars-loop-worker-1.f90         |   2 +
>  .../kernels-private-vars-loop-worker-2.f90         |   2 +
>  .../kernels-private-vars-loop-worker-3.f90         |   2 +
>  .../kernels-private-vars-loop-worker-4.f90         |   2 +
>  .../kernels-private-vars-loop-worker-5.f90         |   2 +
>  .../kernels-private-vars-loop-worker-6.f90         |   2 +
>  .../kernels-private-vars-loop-worker-7.f90         |   2 +
>  .../libgomp.oacc-fortran/kernels-reduction-1.f90   |   2 +
>  .../libgomp.oacc-fortran/non-scalar-data.f90       |   1 +
>  84 files changed, 700 insertions(+), 78 deletions(-)
> 
> diff --git gcc/ChangeLog.gomp gcc/ChangeLog.gomp
> index cdd279b..f991b91 100644
> --- gcc/ChangeLog.gomp
> +++ gcc/ChangeLog.gomp
> @@ -1,3 +1,17 @@
> +2016-01-21  Thomas Schwinge  <thomas@codesourcery.com>
> +
> +	* common.opt: Add -foffload-force.
> +	* lto-wrapper.c (merge_and_complain, append_compiler_options):
> +	Handle it.
> +	* doc/invoke.texi: Document it.
> +	* config/nvptx/mkoffload.c (struct id_map): Add "flags" member.
> +	(record_id): Parse, and set it.
> +	(process): Use it.
> +	* config/nvptx/nvptx.c (nvptx_attribute_table): Add "omp avoid
> +	offloading".
> +	(nvptx_record_offload_symbol): Use it.
> +	(nvptx_goacc_validate_dims): Set it.
> +
>  2016-01-20  Cesar Philippidis  <cesar@codesourcery.com>
>  
>  	* gimplify.c (gimplify_scan_omp_clauses):  Consider OACC_{DATA,
> diff --git gcc/common.opt gcc/common.opt
> index 793a062..c905f71 100644
> --- gcc/common.opt
> +++ gcc/common.opt
> @@ -1786,6 +1786,10 @@ Enum(offload_alias) String(pointer) Value(OFFLOAD_ALIAS_POINTER)
>  EnumValue
>  Enum(offload_alias) String(none) Value(OFFLOAD_ALIAS_NONE)
>  
> +foffload-force
> +Common Var(flag_offload_force)
> +Force offloading if the compiler wanted to avoid it.
> +
>  fomit-frame-pointer
>  Common Report Var(flag_omit_frame_pointer) Optimization
>  When possible do not generate stack frames.
> diff --git gcc/config/nvptx/mkoffload.c gcc/config/nvptx/mkoffload.c
> index cce562d..de6a8ad 100644
> --- gcc/config/nvptx/mkoffload.c
> +++ gcc/config/nvptx/mkoffload.c
> @@ -41,9 +41,19 @@ const char tool_name[] = "nvptx mkoffload";
>  
>  #define COMMENT_PREFIX "#"
>  
> +enum id_map_flag
> +  {
> +    /* All clear.  */
> +    ID_MAP_FLAG_NONE = 0,
> +    /* Avoid offloading.  For example, because there is no sufficient
> +       parallelism.  */
> +    ID_MAP_FLAG_AVOID_OFFLOADING = 1
> +  };
> +
>  struct id_map
>  {
>    id_map *next;
> +  int flags;
>    char *ptx_name;
>  };
>  
> @@ -107,6 +117,38 @@ record_id (const char *p1, id_map ***where)
>      fatal_error (input_location, "malformed ptx file");
>  
>    id_map *v = XNEW (id_map);
> +
> +  /* Do we have any flags?  */
> +  v->flags = ID_MAP_FLAG_NONE;
> +  if (p1[0] == '(')
> +    {
> +      /* Current flag.  */
> +      const char *cur = p1 + 1;
> +
> +      /* Seek to the beginning of ") ".  */
> +      p1 = strchr (cur, ')');
> +      if (!p1 || p1 > end || p1[1] != ' ')
> +	fatal_error (input_location, "malformed ptx file: "
> +		     "expected \") \" at \"%s\"", cur);
> +
> +      while (cur < p1)
> +	{
> +	  const char *next = strchr (cur, ',');
> +	  if (!next || next > p1)
> +	    next = p1;
> +
> +	  if (strncmp (cur, "avoid offloading", next - cur - 1) == 0)
> +	    v->flags |= ID_MAP_FLAG_AVOID_OFFLOADING;
> +	  else
> +	    fatal_error (input_location, "malformed ptx file: "
> +			 "unknown flag at \"%s\"", cur);
> +
> +	  cur = next;
> +	}
> +
> +      /* Skip past ") ".  */
> +      p1 += 2;
> +    }
>    size_t len = end - p1;
>    v->ptx_name = XNEWVEC (char, len + 1);
>    memcpy (v->ptx_name, p1, len);
> @@ -296,12 +338,17 @@ process (FILE *in, FILE *out)
>    fprintf (out, "\n};\n\n");
>  
>    /* Dump out function idents.  */
> +  bool avoid_offloading_p = false;
>    fprintf (out, "static const struct nvptx_fn {\n"
>  	   "  const char *name;\n"
>  	   "  unsigned short dim[%d];\n"
>  	   "} func_mappings[] = {\n", GOMP_DIM_MAX);
>    for (comma = "", id = func_ids; id; comma = ",", id = id->next)
> -    fprintf (out, "%s\n\t{%s}", comma, id->ptx_name);
> +    {
> +      if (id->flags & ID_MAP_FLAG_AVOID_OFFLOADING)
> +	avoid_offloading_p = true;
> +      fprintf (out, "%s\n\t{%s}", comma, id->ptx_name);
> +    }
>    fprintf (out, "\n};\n\n");
>  
>    fprintf (out,
> @@ -318,7 +365,11 @@ process (FILE *in, FILE *out)
>  	   "  sizeof (var_mappings) / sizeof (var_mappings[0]),\n"
>  	   "  func_mappings,"
>  	   "  sizeof (func_mappings) / sizeof (func_mappings[0])\n"
> -	   "};\n\n");
> +	   "};\n");
> +  if (avoid_offloading_p)
> +    /* Need a unique handle for target_data.  */
> +    fprintf (out, "static int target_data_avoid_offloading;\n");
> +  fprintf (out, "\n");
>  
>    fprintf (out, "#ifdef __cplusplus\n"
>  	   "extern \"C\" {\n"
> @@ -338,18 +389,28 @@ process (FILE *in, FILE *out)
>    fprintf (out, "static __attribute__((constructor)) void init (void)\n"
>  	   "{\n"
>  	   "  GOMP_offload_register_ver (%#x, __OFFLOAD_TABLE__,"
> -	   "%d/*NVIDIA_PTX*/, &target_data);\n"
> -	   "};\n",
> +	   "%d/*NVIDIA_PTX*/, &target_data);\n",
>  	   GOMP_VERSION_PACK (GOMP_VERSION, GOMP_VERSION_NVIDIA_PTX),
>  	   GOMP_DEVICE_NVIDIA_PTX);
> +  if (avoid_offloading_p)
> +    fprintf (out, "  GOMP_offload_register_ver (%#x, (void *) 0,"
> +	     "%d/*NVIDIA_PTX*/, &target_data_avoid_offloading);\n",
> +	     GOMP_VERSION_PACK (GOMP_VERSION, GOMP_VERSION_NVIDIA_PTX),
> +	     GOMP_DEVICE_NVIDIA_PTX);
> +  fprintf (out, "};\n");
>  
>    fprintf (out, "static __attribute__((destructor)) void fini (void)\n"
>  	   "{\n"
>  	   "  GOMP_offload_unregister_ver (%#x, __OFFLOAD_TABLE__,"
> -	   "%d/*NVIDIA_PTX*/, &target_data);\n"
> -	   "};\n",
> +	   "%d/*NVIDIA_PTX*/, &target_data);\n",
>  	   GOMP_VERSION_PACK (GOMP_VERSION, GOMP_VERSION_NVIDIA_PTX),
>  	   GOMP_DEVICE_NVIDIA_PTX);
> +  if (avoid_offloading_p)
> +    fprintf (out, "  GOMP_offload_unregister_ver (%#x, (void *) 0,"
> +	     "%d/*NVIDIA_PTX*/, &target_data_avoid_offloading);\n",
> +	     GOMP_VERSION_PACK (GOMP_VERSION, GOMP_VERSION_NVIDIA_PTX),
> +	     GOMP_DEVICE_NVIDIA_PTX);
> +  fprintf (out, "};\n");
>  }
>  
>  static void
> diff --git gcc/config/nvptx/nvptx.c gcc/config/nvptx/nvptx.c
> index dfbdcfb..3faacd5 100644
> --- gcc/config/nvptx/nvptx.c
> +++ gcc/config/nvptx/nvptx.c
> @@ -3811,6 +3811,9 @@ static const struct attribute_spec nvptx_attribute_table[] =
>    /* { name, min_len, max_len, decl_req, type_req, fn_type_req, handler,
>         affects_type_identity } */
>    { "kernel", 0, 0, true, false,  false, nvptx_handle_kernel_attribute, false },
> +  /* Avoid offloading.  For example, because there is no sufficient
> +     parallelism.  */
> +  { "omp avoid offloading", 0, 0, true, false, false, NULL, false },
>    { NULL, 0, 0, false, false, false, NULL, false }
>  };
>  \f
> @@ -3875,7 +3878,10 @@ nvptx_record_offload_symbol (tree decl)
>  	tree dims = TREE_VALUE (attr);
>  	unsigned ix;
>  
> -	fprintf (asm_out_file, "//:FUNC_MAP \"%s\"",
> +	fprintf (asm_out_file, "//:FUNC_MAP %s\"%s\"",
> +		 (lookup_attribute ("omp avoid offloading",
> +				    DECL_ATTRIBUTES (decl))
> +		  ? "(avoid offloading) " : ""),
>  		 IDENTIFIER_POINTER (DECL_ASSEMBLER_NAME (decl)));
>  
>  	for (ix = 0; ix != GOMP_DIM_MAX; ix++, dims = TREE_CHAIN (dims))
> @@ -4135,6 +4141,40 @@ nvptx_expand_builtin (tree exp, rtx target, rtx ARG_UNUSED (subtarget),
>  static bool
>  nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
>  {
> +  /* Detect if a function is unsuitable for offloading.  */
> +  if (!flag_offload_force && decl)
> +    {
> +      tree oacc_function_attr = get_oacc_fn_attrib (decl);
> +      if (oacc_function_attr
> +	  && oacc_fn_attrib_kernels_p (oacc_function_attr))
> +	{
> +	  bool avoid_offloading_p = true;
> +	  for (unsigned ix = 0; ix != GOMP_DIM_MAX; ix++)
> +	    {
> +	      if (dims[ix] > 1)
> +		{
> +		  avoid_offloading_p = false;
> +		  break;
> +		}
> +	    }
> +	  if (avoid_offloading_p)
> +	    {
> +	      /* OpenACC kernels constructs will never be parallelized for
> +		 optimization levels smaller than -O2; avoid the diagnostic in
> +		 this case.  */
> +	      if (optimize >= 2)
> +		warning_at (DECL_SOURCE_LOCATION (decl), 0,
> +			    "OpenACC kernels construct will be executed "
> +			    "sequentially; will by default avoid offloading "
> +			    "to prevent data copy penalty");
> +	      DECL_ATTRIBUTES (decl)
> +		= tree_cons (get_identifier ("omp avoid offloading"),
> +			     NULL_TREE, DECL_ATTRIBUTES (decl));
> +
> +	    }
> +	}
> +    }
> +
>    bool changed = false;
>  
>    /* The vector size must be 32, unless this is a SEQ routine.  */
> diff --git gcc/doc/invoke.texi gcc/doc/invoke.texi
> index c608a36..c9c79fc 100644
> --- gcc/doc/invoke.texi
> +++ gcc/doc/invoke.texi
> @@ -1153,7 +1153,7 @@ See S/390 and zSeries Options.
>  -finstrument-functions-exclude-function-list=@var{sym},@var{sym},@dots{} @gol
>  -finstrument-functions-exclude-file-list=@var{file},@var{file},@dots{} @gol
>  -fno-common  -fno-ident @gol
> --foffload-alias={[}none@r{|}pointer@r{|}all@r{]} @gol
> +-foffload-alias={[}none@r{|}pointer@r{|}all@r{]}  -foffload-force @gol
>  -fpcc-struct-return  -fpic  -fPIC -fpie -fPIE -fno-plt @gol
>  -fno-jump-tables @gol
>  -frecord-gcc-switches @gol
> @@ -24230,6 +24230,15 @@ objects references in an offload region do not alias.  The option
>  aliasing in offload regions.  The default value is
>  @option{-foffload-alias=none}.
>  
> +@item -foffload-force
> +@opindex -foffload-force
> +The option @option{-foffload-force} forces offloading if the compiler
> +wanted to avoid it.  For example, when there isn't sufficient
> +parallelism in certain offloading constructs, the compiler may come to
> +the conclusion that offloading incurs too much overhead (for data
> +transfers, for example), and unless overridden with this flag, it then
> +suggests to the runtime (libgomp) to avoid offloading.
> +
>  @item -fexceptions
>  @opindex fexceptions
>  Enable exception handling.  Generates extra code needed to propagate
> diff --git gcc/lto-wrapper.c gcc/lto-wrapper.c
> index 91bb1e8..5e03544 100644
> --- gcc/lto-wrapper.c
> +++ gcc/lto-wrapper.c
> @@ -275,6 +275,7 @@ merge_and_complain (struct cl_decoded_option **decoded_options,
>  	case OPT_fsigned_zeros:
>  	case OPT_ftrapping_math:
>  	case OPT_fwrapv:
> +	case OPT_foffload_force:
>  	case OPT_fopenmp:
>  	case OPT_fopenacc:
>  	case OPT_fcheck_pointer_bounds:
> @@ -516,6 +517,7 @@ append_compiler_options (obstack *argv_obstack, struct cl_decoded_option *opts,
>  	case OPT_fsigned_zeros:
>  	case OPT_ftrapping_math:
>  	case OPT_fwrapv:
> +	case OPT_foffload_force:
>  	case OPT_fopenmp:
>  	case OPT_fopenacc:
>  	case OPT_fopenacc_dim_:
> diff --git libgomp/ChangeLog.gomp libgomp/ChangeLog.gomp
> index 2003a8a..b089e27 100644
> --- libgomp/ChangeLog.gomp
> +++ libgomp/ChangeLog.gomp
> @@ -1,3 +1,153 @@
> +2016-01-21  Thomas Schwinge  <thomas@codesourcery.com>
> +
> +	* target.c (GOMP_offload_register_ver)
> +	(GOMP_offload_unregister_ver, gomp_init_device)
> +	(gomp_unload_device, gomp_offload_target_available_p): Handle and
> +	document "avoid offloading" ("host_table == NULL").
> +	(resolve_device): Document "avoid offloading".
> +	* oacc-init.c (resolve_device): Likewise.
> +	* libgomp.texi (Enabling OpenACC): Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c: New
> +	file.
> +	* testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c:
> +	Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c:
> +	Likewise.
> +	* testsuite/libgomp.oacc-fortran/avoid-offloading-1.f: Likewise.
> +	* testsuite/libgomp.oacc-fortran/avoid-offloading-2.f: Likewise.
> +	* testsuite/libgomp.oacc-fortran/avoid-offloading-3.f: Likewise.
> +	* testsuite/libgomp.oacc-c++/non-scalar-data.C: Set
> +	"-foffload-force".
> +	* testsuite/libgomp.oacc-c-c++-common/abort-3.c: Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/abort-4.c: Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/default-1.c: Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/deviceptr-1.c: Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/kernels-1.c: Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c:
> +	Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c:
> +	Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c:
> +	Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/kernels-empty.c: Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c:
> +	Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c:
> +	Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c:
> +	Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c:
> +	Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c:
> +	Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-1.c:
> +	Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-2.c:
> +	Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-3.c:
> +	Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-4.c:
> +	Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-5.c:
> +	Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-1.c:
> +	Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-2.c:
> +	Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-3.c:
> +	Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-4.c:
> +	Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-5.c:
> +	Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-6.c:
> +	Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-vector-1.c:
> +	Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-vector-2.c:
> +	Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-1.c:
> +	Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-2.c:
> +	Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-3.c:
> +	Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-4.c:
> +	Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-5.c:
> +	Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-6.c:
> +	Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-7.c:
> +	Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/kernels-reduction-1.c:
> +	Likewise.
> +	* testsuite/libgomp.oacc-fortran/default-1.f90: Likewise.
> +	* testsuite/libgomp.oacc-fortran/if-1.f90: Likewise.
> +	* testsuite/libgomp.oacc-fortran/kernels-acc-loop-reduction-2.f90:
> +	Likewise.
> +	* testsuite/libgomp.oacc-fortran/kernels-acc-loop-reduction.f90:
> +	Likewise.
> +	* testsuite/libgomp.oacc-fortran/kernels-collapse-3.f90: Likewise.
> +	* testsuite/libgomp.oacc-fortran/kernels-collapse-4.f90: Likewise.
> +	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-gang-2.f90:
> +	Likewise.
> +	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-gang-3.f90:
> +	Likewise.
> +	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-gang-6.f90:
> +	Likewise.
> +	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-vector-1.f90:
> +	Likewise.
> +	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-vector-2.f90:
> +	Likewise.
> +	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-1.f90:
> +	Likewise.
> +	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-2.f90:
> +	Likewise.
> +	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-3.f90:
> +	Likewise.
> +	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-4.f90:
> +	Likewise.
> +	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-5.f90:
> +	Likewise.
> +	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-6.f90:
> +	Likewise.
> +	* testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-7.f90:
> +	Likewise.
> +	* testsuite/libgomp.oacc-fortran/kernels-reduction-1.f90:
> +	Likewise.
> +
> +	* testsuite/libgomp.oacc-c-c++-common/asyncwait-1.c: Set
> +	"-ftree-parallelize-loops=32".
> +	* testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c:
> +	Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/default-1.c: Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/deviceptr-1.c: Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/host_data-1.c: Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/if-1.c: Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/kernels-1.c: Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c:
> +	Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c:
> +	Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c:
> +	Likewise.
> +	* testsuite/libgomp.oacc-c-c++-common/nested-2.c: Likewise.
> +	* testsuite/libgomp.oacc-fortran/asyncwait-1.f90: Likewise.
> +	* testsuite/libgomp.oacc-fortran/asyncwait-2.f90: Likewise.
> +	* testsuite/libgomp.oacc-fortran/asyncwait-3.f90: Likewise.
> +	* testsuite/libgomp.oacc-fortran/combined-directives-1.f90:
> +	Likewise.
> +	* testsuite/libgomp.oacc-fortran/default-1.f90: Likewise.
> +	* testsuite/libgomp.oacc-fortran/deviceptr-1.f90: Likewise.
> +	* testsuite/libgomp.oacc-fortran/if-1.f90: Likewise.
> +	* testsuite/libgomp.oacc-fortran/kernels-acc-loop-reduction-2.f90:
> +	Likewise.
> +	* testsuite/libgomp.oacc-fortran/kernels-acc-loop-reduction.f90:
> +	Likewise.
> +	* testsuite/libgomp.oacc-fortran/kernels-map-1.f90: Likewise.
> +	* testsuite/libgomp.oacc-fortran/non-scalar-data.f90: Likewise.
> +
>  2016-01-20  Cesar Philippidis  <cesar@codesourcery.com>
>  
>  	* testsuite/libgomp.oacc-c++/non-scalar-data.C: New test.
> diff --git libgomp/libgomp.texi libgomp/libgomp.texi
> index 8870084..2841b2e 100644
> --- libgomp/libgomp.texi
> +++ libgomp/libgomp.texi
> @@ -1818,6 +1818,14 @@ flag @option{-fopenacc} must be specified.  This enables the OpenACC directive
>  arranges for automatic linking of the OpenACC runtime library 
>  (@ref{OpenACC Runtime Library Routines}).
>  
> +Offloading is enabled by default.  In some cases, the compiler may
> +come to the conclusion that offloading incurs too much overhead, and
> +suggest to the runtime to avoid it.  To counteract that, you can use
> +the option @option{-foffload-force} to force offloading in such cases.
> +Alternatively, offloading is also enabled if a specific device type is
> +requested, in a call to @code{acc_init} or by setting the
> +@env{ACC_DEVICE_TYPE} environment variable, for example.
> +
>  A complete description of all OpenACC directives accepted may be found in 
>  the @uref{http://www.openacc.org/, OpenACC} Application Programming
>  Interface manual, version 2.0.
> diff --git libgomp/oacc-init.c libgomp/oacc-init.c
> index a90732d..b3d13a8 100644
> --- libgomp/oacc-init.c
> +++ libgomp/oacc-init.c
> @@ -123,8 +123,9 @@ resolve_device (acc_device_t d, bool fail_is_error)
>  	if (goacc_device_type)
>  	  {
>  	    /* Lookup the device that has been explicitly named, so do not pay
> -	       attention to gomp_offload_target_available_p.  (That is, hard
> -	       error if not actually available.)  */
> +	       attention to gomp_offload_target_available_p.  (That is,
> +	       enforced usage even with an "avoid offloading" flag set, and
> +	       hard error if not actually available.)  */
>  	    while (++d != _ACC_device_hwm)
>  	      if (dispatchers[d]
>  		  && !strcasecmp (goacc_device_type,
> @@ -154,7 +155,8 @@ resolve_device (acc_device_t d, bool fail_is_error)
>  	    && dispatchers[d]->get_num_devices_func () > 0
>  	    /* No device has been explicitly named, so pay attention to
>  	       gomp_offload_target_available_p, to not decide on an offload
> -	       target that we don't have offload data available for.  */
> +	       target that we don't have offload data available for, or have an
> +	       "avoid offloading" flag set for.  */
>  	    && gomp_offload_target_available_p (dispatchers[d]->type))
>  	  goto found;
>        /* No non-host device found.  */
> diff --git libgomp/target.c libgomp/target.c
> index 7adc4d0..c60e52a 100644
> --- libgomp/target.c
> +++ libgomp/target.c
> @@ -130,8 +130,9 @@ resolve_device (int device)
>      }
>    gomp_mutex_unlock (&devices[device_id].lock);
>  
> -  /* If the device-var ICV does not actually have offload data available, don't
> -     try use it (which will fail), and use host fallback instead.  */
> +  /* Use host fallback instead of the device-var ICV if the latter doesn't
> +     actually have offload data available (offloading will fail), or has an
> +     "avoid offloading" flag set.  */
>    if (device == GOMP_DEVICE_ICV
>        && !gomp_offload_target_available_p (devices[device_id].type))
>      return NULL;
> @@ -1139,12 +1140,19 @@ gomp_unload_image_from_device (struct gomp_device_descr *devicep,
>  
>  /* This function should be called from every offload image while loading.
>     It gets the descriptor of the host func and var tables HOST_TABLE, TYPE of
> -   the target, and TARGET_DATA needed by target plugin.  */
> +   the target, and TARGET_DATA needed by target plugin.
> +
> +   If HOST_TABLE is NULL, this image (TARGET_DATA) is stored as an "avoid
> +   offloading" flag, and the TARGET_TYPE will not be considered by default
> +   until this image gets unregistered.  */
>  
>  void
>  GOMP_offload_register_ver (unsigned version, const void *host_table,
>  			   int target_type, const void *target_data)
>  {
> +  gomp_debug (0, "%s (%u, %p, %d, %p)\n", __FUNCTION__,
> +	      version, host_table, target_type, target_data);
> +
>    int i;
>  
>    if (GOMP_VERSION_LIB (version) > GOMP_VERSION)
> @@ -1153,16 +1161,19 @@ GOMP_offload_register_ver (unsigned version, const void *host_table,
>    
>    gomp_mutex_lock (&register_lock);
>  
> -  /* Load image to all initialized devices.  */
> -  for (i = 0; i < num_devices; i++)
> +  if (host_table != NULL)
>      {
> -      struct gomp_device_descr *devicep = &devices[i];
> -      gomp_mutex_lock (&devicep->lock);
> -      if (devicep->type == target_type
> -	  && devicep->state == GOMP_DEVICE_INITIALIZED)
> -	gomp_load_image_to_device (devicep, version,
> -				   host_table, target_data, true);
> -      gomp_mutex_unlock (&devicep->lock);
> +      /* Load image to all initialized devices.  */
> +      for (i = 0; i < num_devices; i++)
> +	{
> +	  struct gomp_device_descr *devicep = &devices[i];
> +	  gomp_mutex_lock (&devicep->lock);
> +	  if (devicep->type == target_type
> +	      && devicep->state == GOMP_DEVICE_INITIALIZED)
> +	    gomp_load_image_to_device (devicep, version,
> +				       host_table, target_data, true);
> +	  gomp_mutex_unlock (&devicep->lock);
> +	}
>      }
>  
>    /* Insert image to array of pending images.  */
> @@ -1188,26 +1199,36 @@ GOMP_offload_register (const void *host_table, int target_type,
>  
>  /* This function should be called from every offload image while unloading.
>     It gets the descriptor of the host func and var tables HOST_TABLE, TYPE of
> -   the target, and TARGET_DATA needed by target plugin.  */
> +   the target, and TARGET_DATA needed by target plugin.
> +
> +   If HOST_TABLE is NULL, the "avoid offloading" flag gets cleared for this
> +   image (TARGET_DATA), and this TARGET_TYPE may again be considered by
> +   default.  */
>  
>  void
>  GOMP_offload_unregister_ver (unsigned version, const void *host_table,
>  			     int target_type, const void *target_data)
>  {
> +  gomp_debug (0, "%s (%u, %p, %d, %p)\n", __FUNCTION__,
> +	      version, host_table, target_type, target_data);
> +
>    int i;
>  
>    gomp_mutex_lock (&register_lock);
>  
> -  /* Unload image from all initialized devices.  */
> -  for (i = 0; i < num_devices; i++)
> +  if (host_table != NULL)
>      {
> -      struct gomp_device_descr *devicep = &devices[i];
> -      gomp_mutex_lock (&devicep->lock);
> -      if (devicep->type == target_type
> -	  && devicep->state == GOMP_DEVICE_INITIALIZED)
> -	gomp_unload_image_from_device (devicep, version,
> -				       host_table, target_data);
> -      gomp_mutex_unlock (&devicep->lock);
> +      /* Unload image from all initialized devices.  */
> +      for (i = 0; i < num_devices; i++)
> +	{
> +	  struct gomp_device_descr *devicep = &devices[i];
> +	  gomp_mutex_lock (&devicep->lock);
> +	  if (devicep->type == target_type
> +	      && devicep->state == GOMP_DEVICE_INITIALIZED)
> +	    gomp_unload_image_from_device (devicep, version,
> +					   host_table, target_data);
> +	  gomp_mutex_unlock (&devicep->lock);
> +	}
>      }
>  
>    /* Remove image from array of pending images.  */
> @@ -1241,7 +1262,8 @@ gomp_init_device (struct gomp_device_descr *devicep)
>    for (i = 0; i < num_offload_images; i++)
>      {
>        struct offload_image_descr *image = &offload_images[i];
> -      if (image->type == devicep->type)
> +      if (image->type == devicep->type
> +	  && image->host_table != NULL)
>  	gomp_load_image_to_device (devicep, image->version,
>  				   image->host_table, image->target_data,
>  				   false);
> @@ -1261,7 +1283,8 @@ gomp_unload_device (struct gomp_device_descr *devicep)
>        for (i = 0; i < num_offload_images; i++)
>  	{
>  	  struct offload_image_descr *image = &offload_images[i];
> -	  if (image->type == devicep->type)
> +	  if (image->type == devicep->type
> +	      && image->host_table != NULL)
>  	    gomp_unload_image_from_device (devicep, image->version,
>  					   image->host_table,
>  					   image->target_data);
> @@ -1272,7 +1295,9 @@ gomp_unload_device (struct gomp_device_descr *devicep)
>  /* Do we have offload data available for the given offload target type?
>     Instead of verifying that *all* offload data is available that could
>     possibly be required, we instead just look for *any*.  If we later find any
> -   offload data missing, that's user error.  */
> +   offload data missing, that's user error.  If any offload data of this target
> +   type is tagged with an "avoid offloading" flag, do not consider this target
> +   type available unless it has been initialized already.  */
>  
>  attribute_hidden bool
>  gomp_offload_target_available_p (int type)
> @@ -1290,6 +1315,9 @@ gomp_offload_target_available_p (int type)
>        gomp_mutex_unlock (&devicep->lock);
>      }
>  
> +  /* If the offload target has been initialized already, we ignore "avoid
> +     offloading" flags.  This is important, because data/state may be present
> +     on the device, that we must continue to use.  */
>    if (!available)
>      {
>        gomp_mutex_lock (&register_lock);
> @@ -1303,8 +1331,14 @@ gomp_offload_target_available_p (int type)
>  
>        /* Can the offload target be initialized?  */
>        for (int i = 0; !available && i < num_offload_images; i++)
> -	if (offload_images[i].type == type)
> +	if (offload_images[i].type == type
> +	    && offload_images[i].host_table != NULL)
>  	  available = true;
> +      /* If yes, is an "avoid offloading" flag set?  */
> +      for (int i = 0; available && i < num_offload_images; i++)
> +	if (offload_images[i].type == type
> +	    && offload_images[i].host_table == NULL)
> +	  available = false;
>  
>        gomp_mutex_unlock (&register_lock);
>      }
> diff --git libgomp/testsuite/libgomp.oacc-c++/non-scalar-data.C libgomp/testsuite/libgomp.oacc-c++/non-scalar-data.C
> index 180e86f..fe919c8 100644
> --- libgomp/testsuite/libgomp.oacc-c++/non-scalar-data.C
> +++ libgomp/testsuite/libgomp.oacc-c++/non-scalar-data.C
> @@ -1,7 +1,8 @@
>  // Ensure that a non-scalar dummy arguments which are implicitly used inside
>  // offloaded regions are properly mapped using present_or_copy.
>  
> -// { dg-do run }
> +// Override the compiler's "avoid offloading" decision.
> +// { dg-additional-options "-foffload-force" }
>  
>  #include <cassert>
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/abort-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/abort-3.c
> index bca425e..b0da8b7 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/abort-3.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/abort-3.c
> @@ -1,4 +1,5 @@
> -/* { dg-do run } */
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
>  
>  #include <stdio.h>
>  #include <stdlib.h>
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/abort-4.c libgomp/testsuite/libgomp.oacc-c-c++-common/abort-4.c
> index c29ca3f..3079b78 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/abort-4.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/abort-4.c
> @@ -1,4 +1,5 @@
> -/* { dg-do run } */
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
>  
>  #include <stdlib.h>
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-1.c
> index f3b490a..02e43af 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-1.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-1.c
> @@ -1,6 +1,7 @@
>  /* { dg-do run { target openacc_nvidia_accel_selected } } */
>  /* <http://news.gmane.org/find-root.php?message_id=%3C87pp0aaksc.fsf%40kepler.schwinge.homeip.net%3E>.
>     { dg-xfail-run-if "TODO" { *-*-* } } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  /* { dg-additional-options "-lcuda" } */
>  
>  #include <openacc.h>
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c
> new file mode 100644
> index 0000000..e614785
> --- /dev/null
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c
> @@ -0,0 +1,25 @@
> +/* Test that the compiler decides to "avoid offloading".  */
> +
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +
> +#include <openacc.h>
> +
> +int main(void)
> +{
> +  int x, y;
> +
> +#pragma acc data copyout(x, y)
> +#pragma acc kernels /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target openacc_nvidia_accel_selected } } */
> +  *((volatile int *) &x) = 33, y = acc_on_device (acc_device_host);
> +
> +  if (x != 33)
> +    __builtin_abort();
> +#if defined ACC_DEVICE_TYPE_host || defined ACC_DEVICE_TYPE_nvidia
> +  if (y != 1)
> +    __builtin_abort();
> +#else
> +# error Not ported to this ACC_DEVICE_TYPE
> +#endif
> +
> +  return 0;
> +}
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c
> new file mode 100644
> index 0000000..c13436f
> --- /dev/null
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c
> @@ -0,0 +1,38 @@
> +/* Test that a user can override the compiler's "avoid offloading"
> +   decision.  */
> +
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +
> +#include <openacc.h>
> +
> +int main(void)
> +{
> +  /* Override the compiler's "avoid offloading" decision.  */
> +  acc_device_t d;
> +#if defined ACC_DEVICE_TYPE_nvidia
> +  d = acc_device_nvidia;
> +#elif defined ACC_DEVICE_TYPE_host
> +  d = acc_device_host;
> +#else
> +# error Not ported to this ACC_DEVICE_TYPE
> +#endif
> +  acc_init (d);
> +
> +  int x, y;
> +
> +#pragma acc data copyout(x, y)
> +#pragma acc kernels /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target openacc_nvidia_accel_selected } } */
> +  *((volatile int *) &x) = 33, y = acc_on_device (acc_device_host);
> +
> +  if (x != 33)
> +    __builtin_abort();
> +#if defined ACC_DEVICE_TYPE_nvidia
> +  if (y != 0)
> +    __builtin_abort();
> +#else
> +  if (y != 1)
> +    __builtin_abort();
> +#endif
> +
> +  return 0;
> +}
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c
> new file mode 100644
> index 0000000..e2301e6
> --- /dev/null
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c
> @@ -0,0 +1,29 @@
> +/* Test that a user can override the compiler's "avoid offloading"
> +   decision.  */
> +
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
> +
> +#include <openacc.h>
> +
> +int main(void)
> +{
> +  int x, y;
> +
> +#pragma acc data copyout(x, y)
> +#pragma acc kernels
> +  *((volatile int *) &x) = 33, y = acc_on_device (acc_device_host);
> +
> +  if (x != 33)
> +    __builtin_abort();
> +#if defined ACC_DEVICE_TYPE_nvidia
> +  if (y != 0)
> +    __builtin_abort();
> +#else
> +  if (y != 1)
> +    __builtin_abort();
> +#endif
> +
> +  return 0;
> +}
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
> index dad6d13..f8ebbb1 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
> @@ -1,6 +1,6 @@
>  /* This test exercises combined directives.  */
>  
> -/* { dg-do run } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  
>  #include <stdlib.h>
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/default-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/default-1.c
> index 1ac0b95..e512fcf 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/default-1.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/default-1.c
> @@ -1,4 +1,6 @@
> -/* { dg-do run } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
>  
>  #include  <openacc.h>
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/deviceptr-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/deviceptr-1.c
> index e62c315..b5c29ab 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/deviceptr-1.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/deviceptr-1.c
> @@ -1,4 +1,6 @@
> -/* { dg-do run } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
>  
>  #include <stdlib.h>
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c
> index 51745ba..3ef6f9b 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c
> @@ -1,4 +1,5 @@
>  /* { dg-do run { target openacc_nvidia_accel_selected } } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  /* { dg-additional-options "-lcuda -lcublas -lcudart" } */
>  
>  #include <stdlib.h>
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/if-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/if-1.c
> index 2887f66f..7b09917 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/if-1.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/if-1.c
> @@ -1,4 +1,4 @@
> -/* { dg-do run } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  
>  #include <openacc.h>
>  #include <stdlib.h>
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-1.c
> index aeb0142..a90c9466 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-1.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-1.c
> @@ -1,4 +1,6 @@
> -/* { dg-do run } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
>  
>  #include <stdlib.h>
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c
> index 0f323c8..1dc0402 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c
> @@ -1,4 +1,7 @@
> -/* { dg-additional-options "-O2 -fipa-pta" } */
> +/* { dg-additional-options "-fipa-pta" } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
>  
>  #include <stdlib.h>
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c
> index 17a0f3d..baf6662 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c
> @@ -1,4 +1,7 @@
> -/* { dg-additional-options "-O2 -foffload-alias=all -fipa-pta" } */
> +/* { dg-additional-options "-foffload-alias=all -fipa-pta" } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
>  
>  #include <stdlib.h>
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c
> index 44d4fd2..efbe43a 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c
> @@ -1,4 +1,7 @@
> -/* { dg-additional-options "-O2 -fipa-pta" } */
> +/* { dg-additional-options "-fipa-pta" } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
>  
>  #include <stdlib.h>
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-empty.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-empty.c
> index a68a7cd..d527e14 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-empty.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-empty.c
> @@ -1,3 +1,6 @@
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
> +
>  int
>  main (void)
>  {
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
> index 2e4100f..6b561e4 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
> @@ -1,5 +1,6 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
>  
>  #include <stdlib.h>
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
> index 83d4e7f..d965348 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
> @@ -1,5 +1,6 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
>  
>  #include <stdlib.h>
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
> index 01d5e5e..9548cd6 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
> @@ -1,5 +1,6 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
>  
>  #include <stdlib.h>
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
> index 61d1283..237d56c 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
> @@ -1,5 +1,6 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
>  
>  #include <stdlib.h>
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
> index f7f04cb..67e75cd 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
> @@ -1,5 +1,6 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
>  
>  #include <stdlib.h>
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-1.c
> index 2e920cd..195b2c5 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-1.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-1.c
> @@ -1,5 +1,6 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
>  
>  #include <assert.h>
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-2.c
> index 72249cc..f182a2c 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-2.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-2.c
> @@ -1,5 +1,6 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
>  
>  #include <assert.h>
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-3.c
> index 1b0a7cc..4da360c 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-3.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-3.c
> @@ -1,5 +1,6 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
>  
>  #include <assert.h>
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-4.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-4.c
> index bbe6b3c..1a8fc9c 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-4.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-4.c
> @@ -1,5 +1,6 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
>  
>  #include <assert.h>
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-5.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-5.c
> index 18e5676..a3f2fb9 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-5.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-local-worker-5.c
> @@ -1,5 +1,6 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
>  
>  #include <assert.h>
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-1.c
> index e424739..eac168c 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-1.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-1.c
> @@ -1,5 +1,6 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
>  
>  #include <assert.h>
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-2.c
> index a12e36e..0c0f1e1 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-2.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-2.c
> @@ -1,5 +1,6 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
>  
>  #include <assert.h>
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-3.c
> index f8ec543..0ee0a95 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-3.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-3.c
> @@ -1,5 +1,6 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
>  
>  #include <assert.h>
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-4.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-4.c
> index 73561b3..e54873a 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-4.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-4.c
> @@ -1,5 +1,6 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
>  
>  #include <assert.h>
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-5.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-5.c
> index 3334830..9660c14 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-5.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-5.c
> @@ -1,5 +1,6 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
>  
>  #include <assert.h>
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-6.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-6.c
> index 88ab245..e4d1437 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-6.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-gang-6.c
> @@ -1,3 +1,7 @@
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
> +
>  #include <assert.h>
>  
>  /* Test of gang-private aggregate variable declared on loop directive, with
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-vector-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-vector-1.c
> index 3f7062d..83f52de 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-vector-1.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-vector-1.c
> @@ -1,5 +1,6 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
>  
>  #include <assert.h>
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-vector-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-vector-2.c
> index dada424..25ceab5 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-vector-2.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-vector-2.c
> @@ -1,5 +1,6 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
>  
>  #include <assert.h>
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-1.c
> index 8d649d1..ac5f24a 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-1.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-1.c
> @@ -1,5 +1,6 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
>  
>  #include <assert.h>
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-2.c
> index a67f90e..a3d18a1 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-2.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-2.c
> @@ -1,5 +1,6 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
>  
>  #include <assert.h>
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-3.c
> index 465a800..3944399 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-3.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-3.c
> @@ -1,5 +1,6 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
>  
>  #include <assert.h>
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-4.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-4.c
> index a08ba69..d6dd81b 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-4.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-4.c
> @@ -1,5 +1,6 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
>  
>  #include <assert.h>
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-5.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-5.c
> index 1f76345..53293a3 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-5.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-5.c
> @@ -1,5 +1,6 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
>  
>  #include <assert.h>
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-6.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-6.c
> index fe2e23a..63b5b51 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-6.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-6.c
> @@ -1,5 +1,6 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
>  
>  #include <assert.h>
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-7.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-7.c
> index 12c17e4..65089de 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-7.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-private-vars-loop-worker-7.c
> @@ -1,5 +1,6 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
>  
>  #include <assert.h>
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction-1.c
> index 3a2a5b5..ab38f91 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction-1.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-reduction-1.c
> @@ -1,8 +1,9 @@
>  /* Verify that a simple, explicit acc loop reduction works inside
>   a kernels region.  */
>  
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
>  
>  #include <stdlib.h>
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c
> index c164598..94a5ae2 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c
> @@ -1,4 +1,4 @@
> -/* { dg-do run } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  
>  #include <stdlib.h>
>  
> diff --git libgomp/testsuite/libgomp.oacc-fortran/asyncwait-1.f90 libgomp/testsuite/libgomp.oacc-fortran/asyncwait-1.f90
> index 01728bd..bc1210e 100644
> --- libgomp/testsuite/libgomp.oacc-fortran/asyncwait-1.f90
> +++ libgomp/testsuite/libgomp.oacc-fortran/asyncwait-1.f90
> @@ -1,4 +1,5 @@
>  ! { dg-do run }
> +! { dg-additional-options "-ftree-parallelize-loops=32" }
>  
>  program asyncwait
>    integer, parameter :: N = 64
> diff --git libgomp/testsuite/libgomp.oacc-fortran/asyncwait-2.f90 libgomp/testsuite/libgomp.oacc-fortran/asyncwait-2.f90
> index fe131b6..2dfed6a 100644
> --- libgomp/testsuite/libgomp.oacc-fortran/asyncwait-2.f90
> +++ libgomp/testsuite/libgomp.oacc-fortran/asyncwait-2.f90
> @@ -1,4 +1,5 @@
>  ! { dg-do run }
> +! { dg-additional-options "-ftree-parallelize-loops=32" }
>  
>  program asyncwait
>    integer, parameter :: N = 64
> diff --git libgomp/testsuite/libgomp.oacc-fortran/asyncwait-3.f90 libgomp/testsuite/libgomp.oacc-fortran/asyncwait-3.f90
> index fa96a01..2c33c0f 100644
> --- libgomp/testsuite/libgomp.oacc-fortran/asyncwait-3.f90
> +++ libgomp/testsuite/libgomp.oacc-fortran/asyncwait-3.f90
> @@ -1,4 +1,5 @@
>  ! { dg-do run }
> +! { dg-additional-options "-ftree-parallelize-loops=32" }
>  
>  program asyncwait
>    integer, parameter :: N = 64
> diff --git libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f
> new file mode 100644
> index 0000000..0f4edb1
> --- /dev/null
> +++ libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f
> @@ -0,0 +1,29 @@
> +! Test that the compiler decides to "avoid offloading".
> +
> +! { dg-do run }
> +! { dg-additional-options "-cpp" }
> +! { dg-additional-options "-ftree-parallelize-loops=32" }
> +! The warning is only triggered for -O2 and higher.
> +! { dg-xfail-if "n/a" { openacc_nvidia_accel_selected } { "-O0" "-O1" } { "" } }
> +
> +      IMPLICIT NONE
> +      INCLUDE "openacc_lib.h"
> +
> +      INTEGER, VOLATILE :: X
> +      LOGICAL :: Y
> +
> +!$ACC DATA COPYOUT(X, Y)
> +!$ACC KERNELS /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target openacc_nvidia_accel_selected } } */
> +      X = 33
> +      Y = ACC_ON_DEVICE (ACC_DEVICE_HOST);
> +!$ACC END KERNELS
> +!$ACC END DATA
> +
> +      IF (X .NE. 33) CALL ABORT
> +#if defined ACC_DEVICE_TYPE_host || defined ACC_DEVICE_TYPE_nvidia
> +      IF (.NOT. Y) CALL ABORT
> +#else
> +# error Not ported to this ACC_DEVICE_TYPE
> +#endif
> +
> +      END
> diff --git libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-2.f libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-2.f
> new file mode 100644
> index 0000000..4c8ceac
> --- /dev/null
> +++ libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-2.f
> @@ -0,0 +1,40 @@
> +! Test that a user can override the compiler's "avoid offloading" decision.
> +
> +! { dg-do run }
> +! { dg-additional-options "-cpp" }
> +! { dg-additional-options "-ftree-parallelize-loops=32" }
> +! The warning is only triggered for -O2 and higher.
> +! { dg-xfail-if "n/a" { openacc_nvidia_accel_selected } { "-O0" "-O1" } { "" } }
> +
> +      IMPLICIT NONE
> +      INCLUDE "openacc_lib.h"
> +
> +      INTEGER :: D
> +      INTEGER, VOLATILE :: X
> +      LOGICAL :: Y
> +
> +!     Override the compiler's "avoid offloading" decision.
> +#if defined ACC_DEVICE_TYPE_nvidia
> +      D = ACC_DEVICE_NVIDIA
> +#elif defined ACC_DEVICE_TYPE_host
> +      D = ACC_DEVICE_HOST
> +#else
> +# error Not ported to this ACC_DEVICE_TYPE
> +#endif
> +      CALL ACC_INIT (D)
> +
> +!$ACC DATA COPYOUT(X, Y)
> +!$ACC KERNELS /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target openacc_nvidia_accel_selected } } */
> +      X = 33
> +      Y = ACC_ON_DEVICE (ACC_DEVICE_HOST)
> +!$ACC END KERNELS
> +!$ACC END DATA
> +
> +      IF (X .NE. 33) CALL ABORT
> +#if defined ACC_DEVICE_TYPE_nvidia
> +      IF (Y) CALL ABORT
> +#else
> +      IF (.NOT. Y) CALL ABORT
> +#endif
> +
> +      END
> diff --git libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-3.f libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-3.f
> new file mode 100644
> index 0000000..5f669b7
> --- /dev/null
> +++ libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-3.f
> @@ -0,0 +1,30 @@
> +! Test that a user can override the compiler's "avoid offloading" decision.
> +
> +! { dg-do run }
> +! { dg-additional-options "-cpp" }
> +! { dg-additional-options "-ftree-parallelize-loops=32" }
> +!     Override the compiler's "avoid offloading" decision.
> +! { dg-additional-options "-foffload-force" }
> +
> +      IMPLICIT NONE
> +      INCLUDE "openacc_lib.h"
> +
> +      INTEGER :: D
> +      INTEGER, VOLATILE :: X
> +      LOGICAL :: Y
> +
> +!$ACC DATA COPYOUT(X, Y)
> +!$ACC KERNELS
> +      X = 33
> +      Y = ACC_ON_DEVICE (ACC_DEVICE_HOST)
> +!$ACC END KERNELS
> +!$ACC END DATA
> +
> +      IF (X .NE. 33) CALL ABORT
> +#if defined ACC_DEVICE_TYPE_nvidia
> +      IF (Y) CALL ABORT
> +#else
> +      IF (.NOT. Y) CALL ABORT
> +#endif
> +
> +      END
> diff --git libgomp/testsuite/libgomp.oacc-fortran/combined-directives-1.f90 libgomp/testsuite/libgomp.oacc-fortran/combined-directives-1.f90
> index 94100b2..3081e7a 100644
> --- libgomp/testsuite/libgomp.oacc-fortran/combined-directives-1.f90
> +++ libgomp/testsuite/libgomp.oacc-fortran/combined-directives-1.f90
> @@ -1,6 +1,7 @@
>  ! This test exercises combined directives.
>  
>  ! { dg-do run }
> +! { dg-additional-options "-ftree-parallelize-loops=32" }
>  
>  program main
>    integer, parameter :: n =2
> diff --git libgomp/testsuite/libgomp.oacc-fortran/default-1.f90 libgomp/testsuite/libgomp.oacc-fortran/default-1.f90
> index 1059089..07c1e74 100644
> --- libgomp/testsuite/libgomp.oacc-fortran/default-1.f90
> +++ libgomp/testsuite/libgomp.oacc-fortran/default-1.f90
> @@ -1,4 +1,7 @@
>  ! { dg-do run }
> +! { dg-additional-options "-ftree-parallelize-loops=32" }
> +! Override the compiler's "avoid offloading" decision.
> +! { dg-additional-options "-foffload-force" }
>  
>  program main
>    implicit none
> diff --git libgomp/testsuite/libgomp.oacc-fortran/deviceptr-1.f90 libgomp/testsuite/libgomp.oacc-fortran/deviceptr-1.f90
> index 276a172..4646be9 100644
> --- libgomp/testsuite/libgomp.oacc-fortran/deviceptr-1.f90
> +++ libgomp/testsuite/libgomp.oacc-fortran/deviceptr-1.f90
> @@ -1,9 +1,10 @@
> -! { dg-do run }
> -
>  ! Test the deviceptr clause with various directives
>  ! and in combination with other directives where
>  ! the deviceptr variable is implied.
>  
> +! { dg-do run }
> +! { dg-additional-options "-ftree-parallelize-loops=32" }
> +
>  subroutine subr1 (a, b)
>    implicit none
>    integer, parameter :: N =
> diff --git libgomp/testsuite/libgomp.oacc-fortran/if-1.f90 libgomp/testsuite/libgomp.oacc-fortran/if-1.f90
> index e54c1b2..784f8a1 100644
> --- libgomp/testsuite/libgomp.oacc-fortran/if-1.f90
> +++ libgomp/testsuite/libgomp.oacc-fortran/if-1.f90
> @@ -1,5 +1,8 @@
> -! { dg-do run } */
> +! { dg-do run }
>  ! { dg-additional-options "-cpp" }
> +! { dg-additional-options "-ftree-parallelize-loops=32" }
> +! Override the compiler's "avoid offloading" decision.
> +! { dg-additional-options "-foffload-force" }
>  
>  program main
>    use openacc
> diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-acc-loop-reduction-2.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-acc-loop-reduction-2.f90
> index fdf9409..854fe9c 100644
> --- libgomp/testsuite/libgomp.oacc-fortran/kernels-acc-loop-reduction-2.f90
> +++ libgomp/testsuite/libgomp.oacc-fortran/kernels-acc-loop-reduction-2.f90
> @@ -1,3 +1,8 @@
> +! { dg-do run }
> +! { dg-additional-options "-ftree-parallelize-loops=32" }
> +! Override the compiler's "avoid offloading" decision.
> +! { dg-additional-options "-foffload-force" }
> +
>  program foo
>  
>    IMPLICIT NONE
> diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-acc-loop-reduction.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-acc-loop-reduction.f90
> index 912a22b..b120b66 100644
> --- libgomp/testsuite/libgomp.oacc-fortran/kernels-acc-loop-reduction.f90
> +++ libgomp/testsuite/libgomp.oacc-fortran/kernels-acc-loop-reduction.f90
> @@ -1,3 +1,8 @@
> +! { dg-do run }
> +! { dg-additional-options "-ftree-parallelize-loops=32" }
> +! Override the compiler's "avoid offloading" decision.
> +! { dg-additional-options "-foffload-force" }
> +
>  program foo
>    IMPLICIT NONE
>    INTEGER :: vol =
> diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-collapse-3.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-collapse-3.f90
> index 9378b12..1aafefa 100644
> --- libgomp/testsuite/libgomp.oacc-fortran/kernels-collapse-3.f90
> +++ libgomp/testsuite/libgomp.oacc-fortran/kernels-collapse-3.f90
> @@ -2,6 +2,8 @@
>  
>  ! { dg-do run }
>  ! { dg-additional-options "-ftree-parallelize-loops=32" }
> +! Override the compiler's "avoid offloading" decision.
> +! { dg-additional-options "-foffload-force" }
>  
>  program collapse3
>    integer :: a(3,3,3), k, kk, kkk, l, ll, lll
> diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-collapse-4.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-collapse-4.f90
> index dfd9cd2..1f2cf97 100644
> --- libgomp/testsuite/libgomp.oacc-fortran/kernels-collapse-4.f90
> +++ libgomp/testsuite/libgomp.oacc-fortran/kernels-collapse-4.f90
> @@ -2,6 +2,8 @@
>  
>  ! { dg-do run }
>  ! { dg-additional-options "-ftree-parallelize-loops=32" }
> +! Override the compiler's "avoid offloading" decision.
> +! { dg-additional-options "-foffload-force" }
>  
>  program collapse4
>    integer :: i, j, k, a(1:7, -3:5, 12:19), b(1:7, -3:5, 12:19)
> diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-independent.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-independent.f90
> index 9f17308..f6b2255 100644
> --- libgomp/testsuite/libgomp.oacc-fortran/kernels-independent.f90
> +++ libgomp/testsuite/libgomp.oacc-fortran/kernels-independent.f90
> @@ -1,4 +1,4 @@
> -! { dg-do run } */
> +! { dg-do run }
>  ! { dg-additional-options "-cpp" }
>  ! { dg-additional-options "-ftree-parallelize-loops=32" }
>  
> diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-map-1.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-map-1.f90
> index 01d62f8..14e14ab 100644
> --- libgomp/testsuite/libgomp.oacc-fortran/kernels-map-1.f90
> +++ libgomp/testsuite/libgomp.oacc-fortran/kernels-map-1.f90
> @@ -1,6 +1,9 @@
>  ! Test the copy, copyin, copyout, pcopy, pcopyin, pcopyout, and pcreate
>  ! clauses on kernels constructs.
>  
> +! { dg-do run }
> +! { dg-additional-options "-ftree-parallelize-loops=32" }
> +
>  program map
>    integer, parameter     :: n =0, c = 10
>    integer                :: i, a(n), b(n), d(n)
> diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-gang-2.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-gang-2.f90
> index 43a1988..51a57b2 100644
> --- libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-gang-2.f90
> +++ libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-gang-2.f90
> @@ -3,6 +3,8 @@
>  
>  ! { dg-do run }
>  ! { dg-additional-options "-ftree-parallelize-loops=32" }
> +! Override the compiler's "avoid offloading" decision.
> +! { dg-additional-options "-foffload-force" }
>  
>  program main
>    integer :: x, i, j, arr(0:32*32)
> diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-gang-3.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-gang-3.f90
> index e5806ee..948f811 100644
> --- libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-gang-3.f90
> +++ libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-gang-3.f90
> @@ -3,6 +3,8 @@
>  
>  ! { dg-do run }
>  ! { dg-additional-options "-ftree-parallelize-loops=32" }
> +! Override the compiler's "avoid offloading" decision.
> +! { dg-additional-options "-foffload-force" }
>  
>  program main
>    integer :: x, i, j, arr(0:32*32)
> diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-gang-6.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-gang-6.f90
> index 7d19bba..6be2692 100644
> --- libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-gang-6.f90
> +++ libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-gang-6.f90
> @@ -3,6 +3,8 @@
>  
>  ! { dg-do run }
>  ! { dg-additional-options "-ftree-parallelize-loops=32" }
> +! Override the compiler's "avoid offloading" decision.
> +! { dg-additional-options "-foffload-force" }
>  
>  program main
>    type vec3
> diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-vector-1.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-vector-1.f90
> index 379bb3a..0312ee7 100644
> --- libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-vector-1.f90
> +++ libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-vector-1.f90
> @@ -2,6 +2,8 @@
>  
>  ! { dg-do run }
>  ! { dg-additional-options "-ftree-parallelize-loops=32" }
> +! Override the compiler's "avoid offloading" decision.
> +! { dg-additional-options "-foffload-force" }
>  
>  program main
>    integer :: x, i, j, k, idx, arr(0:32*32*32)
> diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-vector-2.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-vector-2.f90
> index 8873efe..7ce7f1b 100644
> --- libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-vector-2.f90
> +++ libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-vector-2.f90
> @@ -2,6 +2,8 @@
>  
>  ! { dg-do run }
>  ! { dg-additional-options "-ftree-parallelize-loops=32" }
> +! Override the compiler's "avoid offloading" decision.
> +! { dg-additional-options "-foffload-force" }
>  
>  program main
>    integer :: i, j, k, idx, arr(0:32*32*32), pt(2)
> diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-1.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-1.f90
> index f513ec2..50d13e4 100644
> --- libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-1.f90
> +++ libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-1.f90
> @@ -2,6 +2,8 @@
>  
>  ! { dg-do run }
>  ! { dg-additional-options "-ftree-parallelize-loops=32" }
> +! Override the compiler's "avoid offloading" decision.
> +! { dg-additional-options "-foffload-force" }
>  
>  program main
>    integer :: x, i, j, arr(0:32*32)
> diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-2.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-2.f90
> index e7652d9..328a6b4 100644
> --- libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-2.f90
> +++ libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-2.f90
> @@ -3,6 +3,8 @@
>  
>  ! { dg-do run }
>  ! { dg-additional-options "-ftree-parallelize-loops=32" }
> +! Override the compiler's "avoid offloading" decision.
> +! { dg-additional-options "-foffload-force" }
>  
>  program main
>    integer :: x, i, j, k, idx, arr(0:32*32*32)
> diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-3.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-3.f90
> index c82ced7..a96221d 100644
> --- libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-3.f90
> +++ libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-3.f90
> @@ -3,6 +3,8 @@
>  
>  ! { dg-do run }
>  ! { dg-additional-options "-ftree-parallelize-loops=32" }
> +! Override the compiler's "avoid offloading" decision.
> +! { dg-additional-options "-foffload-force" }
>  
>  program main
>    integer :: x, i, j, k, idx, arr(0:32*32*32)
> diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-4.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-4.f90
> index e30de70..d2b30dd 100644
> --- libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-4.f90
> +++ libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-4.f90
> @@ -3,6 +3,8 @@
>  
>  ! { dg-do run }
>  ! { dg-additional-options "-ftree-parallelize-loops=32" }
> +! Override the compiler's "avoid offloading" decision.
> +! { dg-additional-options "-foffload-force" }
>  
>  program main
>    integer :: x, i, j, k, idx, arr(0:32*32*32)
> diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-5.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-5.f90
> index 20f8579..3cfcbb4 100644
> --- libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-5.f90
> +++ libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-5.f90
> @@ -3,6 +3,8 @@
>  
>  ! { dg-do run }
>  ! { dg-additional-options "-ftree-parallelize-loops=32" }
> +! Override the compiler's "avoid offloading" decision.
> +! { dg-additional-options "-foffload-force" }
>  
>  program main
>    integer :: i, j, k, idx, arr(0:32*32*32)
> diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-6.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-6.f90
> index 48c3bfd..5f65926 100644
> --- libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-6.f90
> +++ libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-6.f90
> @@ -3,6 +3,8 @@
>  
>  ! { dg-do run }
>  ! { dg-additional-options "-ftree-parallelize-loops=32" }
> +! Override the compiler's "avoid offloading" decision.
> +! { dg-additional-options "-foffload-force" }
>  
>  program main
>    type vec2
> diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-7.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-7.f90
> index ca63796..27d1b27 100644
> --- libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-7.f90
> +++ libgomp/testsuite/libgomp.oacc-fortran/kernels-private-vars-loop-worker-7.f90
> @@ -3,6 +3,8 @@
>  
>  ! { dg-do run }
>  ! { dg-additional-options "-ftree-parallelize-loops=32" }
> +! Override the compiler's "avoid offloading" decision.
> +! { dg-additional-options "-foffload-force" }
>  
>  program main
>    integer :: i, j, k, idx, arr(0:32*32*32), pt(2)
> diff --git libgomp/testsuite/libgomp.oacc-fortran/kernels-reduction-1.f90 libgomp/testsuite/libgomp.oacc-fortran/kernels-reduction-1.f90
> index e894b6d..dcabe02 100644
> --- libgomp/testsuite/libgomp.oacc-fortran/kernels-reduction-1.f90
> +++ libgomp/testsuite/libgomp.oacc-fortran/kernels-reduction-1.f90
> @@ -2,6 +2,8 @@
>  
>  ! { dg-do run }
>  ! { dg-additional-options "-ftree-parallelize-loops=32" }
> +! Override the compiler's "avoid offloading" decision.
> +! { dg-additional-options "-foffload-force" }
>  
>  program reduction
>    integer, parameter     :: n =0
> diff --git libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90 libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90
> index 4afb562..cae39ac 100644
> --- libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90
> +++ libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90
> @@ -2,6 +2,7 @@
>  ! offloaded regions are properly mapped using present_or_copy.
>  
>  ! { dg-do run }
> +! { dg-additional-options "-ftree-parallelize-loops=32" }
>  
>  program main
>    implicit none
> 
> 
> Grüße
>  Thomas
> 

* Re: [gomp4] Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading"
  2016-01-21 21:55   ` [gomp4] Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading" (was: [PATCH] Add fopt-info-oacc) Thomas Schwinge
                       ` (2 preceding siblings ...)
  2016-11-03 17:59     ` [gomp4] Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading" (was: [PATCH] Add fopt-info-oacc) Cesar Philippidis
@ 2019-01-31 17:16     ` Thomas Schwinge
  3 siblings, 0 replies; 25+ messages in thread
From: Thomas Schwinge @ 2019-01-31 17:16 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1094 bytes --]

Hi!

On Thu, 21 Jan 2016 22:54:26 +0100, I wrote:
> Committed to gomp-4_0-branch in r232709:
> 
> commit 41a76d233e714fd7b79dc1f40823f607c38306ba
> Author: tschwinge <tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4>
> Date:   Thu Jan 21 21:52:50 2016 +0000
> 
>     Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading"

> +! The warning is only triggered for -O2 and higher.
> +! { dg-xfail-if "n/a" { openacc_nvidia_accel_selected } { "-O0" "-O1" } { "" } }

> +!$ACC KERNELS /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target openacc_nvidia_accel_selected } } */

That can actually be done in a better way, one that matches exactly
when and where the diagnostic appears and where it does not.  I pushed
the attached to openacc-gcc-8-branch in commit
dfae8e0dab1edfc8d8207eafd1b694c4e1fcd680 'Un-parallelized OpenACC kernels
constructs with nvptx offloading: "avoid offloading": better method for
XFAILing specific cases'.
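
For readers less familiar with check-flags, here is a rough C model of
what the two new effective-target checks decide.  It is only an
illustration and not part of the patch.  The function names mirror the
Tcl procs, but the any_match helper and the use of POSIX fnmatch are
assumptions standing in for DejaGnu's option matching: a check holds if
at least one "include" pattern matches some compiler flag and no
"exclude" pattern does.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if any of the NFLAGS strings in FLAGS
   matches any of the NPATS glob patterns in PATS, else 0.  */
static int
any_match (const char *const *pats, size_t npats,
	   const char *const *flags, size_t nflags)
{
  for (size_t i = 0; i < npats; i++)
    for (size_t j = 0; j < nflags; j++)
      if (fnmatch (pats[i], flags[j], 0) == 0)
	return 1;
  return 0;
}

/* Model of check_effective_target_opt_levels_2_plus: some -O* flag is
   present, and none of -O0, -Og, -O, -O1 is.  */
int
opt_levels_2_plus (const char *const *flags, size_t nflags)
{
  static const char *const include[] = { "-O*" };
  static const char *const exclude[] = { "-O0", "-Og", "-O", "-O1" };

  assert (flags != NULL || nflags == 0);
  return (any_match (include, 1, flags, nflags)
	  && !any_match (exclude, 4, flags, nflags));
}

/* Model of check_effective_target_opt_levels_size: -Os is present
   (the exclude list is empty).  */
int
opt_levels_size (const char *const *flags, size_t nflags)
{
  static const char *const include[] = { "-Os" };

  assert (flags != NULL || nflags == 0);
  return any_match (include, 1, flags, nflags);
}
```

Under this model, { "-fopenacc", "-O2" } satisfies opt_levels_2_plus
while { "-fopenacc", "-O1" } does not, and "-Os" satisfies both checks.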


Grüße
 Thomas



[-- Attachment #2: 0001-Un-parallelized-OpenACC-kernels-constructs-with-nvpt.patch --]
[-- Type: text/x-diff, Size: 32562 bytes --]

From dfae8e0dab1edfc8d8207eafd1b694c4e1fcd680 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Tue, 29 Jan 2019 20:42:13 +0100
Subject: [PATCH] Un-parallelized OpenACC kernels constructs with nvptx
 offloading: "avoid offloading": better method for XFAILing specific cases

	gcc/testsuite/
	* lib/target-supports.exp
	(check_effective_target_opt_levels_2_plus)
	(check_effective_target_opt_levels_size): New.
	libgomp/
	* testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c:
	Update.
	* testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c:
	Likewise.
	* testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c:
	Likewise.
	* testsuite/libgomp.oacc-c-c++-common/kernels-parallel-loop-data-enter-exit.c:
	Likewise.
	* testsuite/libgomp.oacc-fortran/asyncwait-1.f90: Likewise.
	* testsuite/libgomp.oacc-fortran/avoid-offloading-1.f: Likewise.
	* testsuite/libgomp.oacc-fortran/avoid-offloading-2.f: Likewise.
	* testsuite/libgomp.oacc-fortran/deviceptr-1.f90: Likewise.
	* testsuite/libgomp.oacc-fortran/initialize_kernels_loops.f90:
	Likewise.
	* testsuite/libgomp.oacc-fortran/kernels-loop-2.f95: Likewise.
	* testsuite/libgomp.oacc-fortran/kernels-loop-data-2.f95:
	Likewise.
	* testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit-2.f95:
	Likewise.
	* testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit.f95:
	Likewise.
	* testsuite/libgomp.oacc-fortran/kernels-loop-data-update.f95:
	Likewise.
	* testsuite/libgomp.oacc-fortran/kernels-loop-data.f95: Likewise.
	* testsuite/libgomp.oacc-fortran/kernels-loop.f95: Likewise.
	* testsuite/libgomp.oacc-fortran/kernels-parallel-loop-data-enter-exit.f95:
	Likewise.
	* testsuite/libgomp.oacc-fortran/non-scalar-data.f90: Likewise.
---
 gcc/testsuite/ChangeLog.openacc               |  6 ++++
 gcc/testsuite/lib/target-supports.exp         | 10 ++++++
 libgomp/ChangeLog.openacc                     | 31 +++++++++++++++++++
 .../avoid-offloading-1.c                      |  5 +--
 .../avoid-offloading-2.c                      |  5 +--
 .../combined-directives-1.c                   | 10 +++---
 .../kernels-parallel-loop-data-enter-exit.c   |  8 ++---
 .../libgomp.oacc-fortran/asyncwait-1.f90      |  9 ++----
 .../libgomp.oacc-fortran/avoid-offloading-1.f |  4 +--
 .../libgomp.oacc-fortran/avoid-offloading-2.f |  4 +--
 .../libgomp.oacc-fortran/deviceptr-1.f90      |  7 ++---
 .../initialize_kernels_loops.f90              |  5 +--
 .../libgomp.oacc-fortran/kernels-loop-2.f95   |  9 ++----
 .../kernels-loop-data-2.f95                   |  9 ++----
 .../kernels-loop-data-enter-exit-2.f95        |  9 ++----
 .../kernels-loop-data-enter-exit.f95          |  9 ++----
 .../kernels-loop-data-update.f95              |  7 ++---
 .../kernels-loop-data.f95                     |  9 ++----
 .../libgomp.oacc-fortran/kernels-loop.f95     |  5 +--
 .../kernels-parallel-loop-data-enter-exit.f95 |  7 ++---
 .../libgomp.oacc-fortran/non-scalar-data.f90  |  7 ++---
 21 files changed, 86 insertions(+), 89 deletions(-)

diff --git a/gcc/testsuite/ChangeLog.openacc b/gcc/testsuite/ChangeLog.openacc
index 41913f7fa02..2479367dce4 100644
--- a/gcc/testsuite/ChangeLog.openacc
+++ b/gcc/testsuite/ChangeLog.openacc
@@ -1,3 +1,9 @@
+2019-01-31  Thomas Schwinge  <thomas@codesourcery.com>
+
+	* lib/target-supports.exp
+	(check_effective_target_opt_levels_2_plus)
+	(check_effective_target_opt_levels_size): New.
+
 2019-01-31  Julian Brown  <julian@codesourcery.com>
 
 	* c-c++-common/goacc/deep-copy-arrayofstruct.c: New test.
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index cfc22a22975..206bb2a2b11 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -9326,3 +9326,13 @@ proc check_effective_target_cet { } {
 	}
     } "-O2" ]
 }
+
+# Return 1 if we have optimization level -O2 or higher.
+proc check_effective_target_opt_levels_2_plus { } {
+    return [check-flags [list "" { *-*-* } { "-O*" } { "-O0" "-Og" "-O" "-O1" }]];
+}
+
+# Return 1 if we have optimization level -Os.
+proc check_effective_target_opt_levels_size { } {
+    return [check-flags [list "" { *-*-* } { "-Os" } { "" }]];
+}
diff --git a/libgomp/ChangeLog.openacc b/libgomp/ChangeLog.openacc
index e638c259cba..d558f0df97b 100644
--- a/libgomp/ChangeLog.openacc
+++ b/libgomp/ChangeLog.openacc
@@ -1,3 +1,34 @@
+2019-01-31  Thomas Schwinge  <thomas@codesourcery.com>
+
+	* testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c:
+	Update.
+	* testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/kernels-parallel-loop-data-enter-exit.c:
+	Likewise.
+	* testsuite/libgomp.oacc-fortran/asyncwait-1.f90: Likewise.
+	* testsuite/libgomp.oacc-fortran/avoid-offloading-1.f: Likewise.
+	* testsuite/libgomp.oacc-fortran/avoid-offloading-2.f: Likewise.
+	* testsuite/libgomp.oacc-fortran/deviceptr-1.f90: Likewise.
+	* testsuite/libgomp.oacc-fortran/initialize_kernels_loops.f90:
+	Likewise.
+	* testsuite/libgomp.oacc-fortran/kernels-loop-2.f95: Likewise.
+	* testsuite/libgomp.oacc-fortran/kernels-loop-data-2.f95:
+	Likewise.
+	* testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit-2.f95:
+	Likewise.
+	* testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit.f95:
+	Likewise.
+	* testsuite/libgomp.oacc-fortran/kernels-loop-data-update.f95:
+	Likewise.
+	* testsuite/libgomp.oacc-fortran/kernels-loop-data.f95: Likewise.
+	* testsuite/libgomp.oacc-fortran/kernels-loop.f95: Likewise.
+	* testsuite/libgomp.oacc-fortran/kernels-parallel-loop-data-enter-exit.f95:
+	Likewise.
+	* testsuite/libgomp.oacc-fortran/non-scalar-data.f90: Likewise.
+
 2019-01-31  Julian Brown  <julian@codesourcery.com>
 
 	* testsuite/libgomp.oacc-c++/deep-copy-12.C: New test.
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c
index d5fff2d7d8a..72b9ce0ce02 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c
@@ -1,8 +1,5 @@
 /* Test that the compiler decides to "avoid offloading".  */
 
-/* The warning is only triggered for -O2 and higher.
-   { dg-xfail-if "n/a" { openacc_nvidia_accel_selected } { "-O0" "-O1" } { "" } } */
-
 #include <openacc.h>
 
 int main(void)
@@ -10,7 +7,7 @@ int main(void)
   int x, y;
 
 #pragma acc data copyout(x, y)
-#pragma acc kernels /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target openacc_nvidia_accel_selected } } */
+#pragma acc kernels /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target { openacc_nvidia_accel_selected && opt_levels_2_plus } } } */
   *((volatile int *) &x) = 33, y = acc_on_device (acc_device_host);
 
   if (x != 33)
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c
index 41bd6d55f26..9e05d84d792 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c
@@ -1,9 +1,6 @@
 /* Test that a user can override the compiler's "avoid offloading"
    decision.  */
 
-/* The warning is only triggered for -O2 and higher.
-   { dg-xfail-if "n/a" { openacc_nvidia_accel_selected } { "-O0" "-O1" } { "" } } */
-
 #include <openacc.h>
 
 int main(void)
@@ -22,7 +19,7 @@ int main(void)
   int x, y;
 
 #pragma acc data copyout(x, y)
-#pragma acc kernels /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target openacc_nvidia_accel_selected } } */
+#pragma acc kernels /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target { openacc_nvidia_accel_selected && opt_levels_2_plus } } } */
   *((volatile int *) &x) = 33, y = acc_on_device (acc_device_host);
 
   if (x != 33)
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
index c6abc1d724a..a54ade363a6 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
@@ -1,11 +1,6 @@
 /* This test exercises combined directives.  */
 
-/* This test falls back to host execution because struct alias
-   analysis is deactivated on OpenACC parallel regions.  Consequently,
-   parloops can no longer disambiguate arrays a and b.  */
-
 /* { dg-do run } */
-/* { dg-xfail-if "n/a" { openacc_nvidia_accel_selected } { "-O2" } { "" } } */
 
 #include <stdlib.h>
 
@@ -38,7 +33,10 @@ main (int argc, char **argv)
 	abort ();
     }
 
-#pragma acc kernels loop copy (a[0:N]) copy (b[0:N])
+#pragma acc kernels loop copy (a[0:N]) copy (b[0:N]) /* { dg-bogus "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "TODO" { xfail { openacc_nvidia_accel_selected && opt_levels_2_plus } } }
+    This runs into "avoid offloading" because struct alias analysis is
+    deactivated on OpenACC parallel regions.  Consequently, parloops can no
+    longer disambiguate arrays a and b.  */
   for (i = 0; i < N; i++)
     {
       b[i] = 3.0;
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-parallel-loop-data-enter-exit.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-parallel-loop-data-enter-exit.c
index 8cafbc974c9..374014a1e86 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-parallel-loop-data-enter-exit.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-parallel-loop-data-enter-exit.c
@@ -1,7 +1,3 @@
-/* FIXME: OpenACC kernels stopped working with the firstprivate subarray
-   changes.  */
-/* { dg-prune-output "OpenACC kernels construct will be executed sequentially" } */
-
 #include <stdlib.h>
 
 #define N (1024 * 512)
@@ -33,7 +29,9 @@ main (void)
       b[i] = i * 4;
   }
 
-#pragma acc kernels present (a[0:N], b[0:N], c[0:N])
+#pragma acc kernels present (a[0:N], b[0:N], c[0:N]) /* { dg-bogus "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "TODO" { xfail { openacc_nvidia_accel_selected && opt_levels_2_plus } } }
+    FIXME: OpenACC kernels stopped working with the firstprivate subarray
+    changes.  */
   {
     for (COUNTERTYPE ii = 0; ii < N; ii++)
       c[ii] = a[ii] + b[ii];
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/asyncwait-1.f90 b/libgomp/testsuite/libgomp.oacc-fortran/asyncwait-1.f90
index f024c1cbe51..ba6809dba7d 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/asyncwait-1.f90
+++ b/libgomp/testsuite/libgomp.oacc-fortran/asyncwait-1.f90
@@ -1,7 +1,4 @@
 ! { dg-do run }
-! TODO, <https://gcc.gnu.org/PR80995>.
-! warning: OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty
-! { dg-xfail-if "TODO" { openacc_nvidia_accel_selected } { "-Os" } { "" } }
 
 program asyncwait
   integer, parameter :: N = 64
@@ -183,13 +180,13 @@ program asyncwait
 
   !$acc data copy (a(1:N)) copy (b(1:N)) copy (c(1:N)) copy (d(1:N))
 
-  !$acc kernels async (1)
+  !$acc kernels async (1) ! { dg-bogus "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "PR80995" { xfail { openacc_nvidia_accel_selected && opt_levels_size } } }
   do i = 1, N
      b(i) = (a(i) * a(i) * a(i)) / a(i)
   end do
   !$acc end kernels
 
-  !$acc kernels async (1)
+  !$acc kernels async (1) ! { dg-bogus "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "PR80995" { xfail { openacc_nvidia_accel_selected && opt_levels_size } } }
   do i = 1, N
      c(i) = (a(i) * 4) / a(i)
   end do
@@ -220,7 +217,7 @@ program asyncwait
 
   !$acc data copy (a(1:N), b(1:N), c(1:N), d(1:N), e(1:N))
 
-  !$acc kernels async (1)
+  !$acc kernels async (1) ! { dg-bogus "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "PR80995" { xfail { openacc_nvidia_accel_selected && opt_levels_size } } }
   do i = 1, N
      b(i) = (a(i) * a(i) * a(i)) / a(i)
   end do
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f b/libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f
index da89b93fd54..fb14be19d8d 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f
+++ b/libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f
@@ -2,8 +2,6 @@
 
 ! { dg-do run }
 ! { dg-additional-options "-cpp" }
-! The warning is only triggered for -O2 and higher.
-! { dg-xfail-if "n/a" { openacc_nvidia_accel_selected } { "-O0" "-O1" } { "" } }
 ! As __OPTIMIZE__ is defined for -O1 and higher, we don't have an (easy) way to
 ! distinguish -O1 (where we will offload) from -O2 (where we won't offload), so
 ! for -O1 testing, we expect to abort.
@@ -16,7 +14,7 @@
       LOGICAL :: Y
 
 !$ACC DATA COPYOUT(X, Y)
-!$ACC KERNELS /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target openacc_nvidia_accel_selected } } */
+!$ACC KERNELS ! { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target { openacc_nvidia_accel_selected && opt_levels_2_plus } } }
       X = 33
       Y = ACC_ON_DEVICE (ACC_DEVICE_HOST);
 !$ACC END KERNELS
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-2.f b/libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-2.f
index db72602fb1e..5a064618e51 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-2.f
+++ b/libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-2.f
@@ -2,8 +2,6 @@
 
 ! { dg-do run }
 ! { dg-additional-options "-cpp" }
-! The warning is only triggered for -O2 and higher.
-! { dg-xfail-if "n/a" { openacc_nvidia_accel_selected } { "-O0" "-O1" } { "" } }
 
       IMPLICIT NONE
       INCLUDE "openacc_lib.h"
@@ -23,7 +21,7 @@
       CALL ACC_INIT (D)
 
 !$ACC DATA COPYOUT(X, Y)
-!$ACC KERNELS /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target openacc_nvidia_accel_selected } } */
+!$ACC KERNELS ! { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target { openacc_nvidia_accel_selected && opt_levels_2_plus } } }
       X = 33
       Y = ACC_ON_DEVICE (ACC_DEVICE_HOST)
 !$ACC END KERNELS
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/deviceptr-1.f90 b/libgomp/testsuite/libgomp.oacc-fortran/deviceptr-1.f90
index 73838510ea4..418bceac3b9 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/deviceptr-1.f90
+++ b/libgomp/testsuite/libgomp.oacc-fortran/deviceptr-1.f90
@@ -3,9 +3,6 @@
 ! the deviceptr variable is implied.
 
 ! { dg-do run }
-! TODO, <https://gcc.gnu.org/PR80995>.
-! warning: OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty
-! { dg-xfail-if "TODO" { openacc_nvidia_accel_selected } { "-Os" } { "" } }
 
 subroutine subr1 (a, b)
   implicit none
@@ -52,7 +49,7 @@ subroutine subr3 (a, b)
   integer :: b(N)
   integer :: i = 0
 
-  !$acc kernels copy (b)
+  !$acc kernels copy (b) ! { dg-bogus "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "PR80995" { xfail { openacc_nvidia_accel_selected && opt_levels_size } } }
     do i = 1, N
       a(i) = i * 8
       b(i) = a(i)
@@ -84,7 +81,7 @@ subroutine subr5 (a, b)
   integer :: b(N)
   integer :: i = 0
 
-  !$acc kernels deviceptr (a) copy (b)
+  !$acc kernels deviceptr (a) copy (b) ! { dg-bogus "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "PR80995" { xfail { openacc_nvidia_accel_selected && opt_levels_size } } }
     do i = 1, N
       a(i) = i * 32
       b(i) = a(i)
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/initialize_kernels_loops.f90 b/libgomp/testsuite/libgomp.oacc-fortran/initialize_kernels_loops.f90
index 6d1713157b7..8eb02b88d25 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/initialize_kernels_loops.f90
+++ b/libgomp/testsuite/libgomp.oacc-fortran/initialize_kernels_loops.f90
@@ -1,14 +1,11 @@
 ! { dg-do run }
-!TODO
-! warning: OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty
-! { dg-xfail-if "TODO" { openacc_nvidia_accel_selected } { "*" } { "-O0" "-O1" } }
 
 subroutine kernel(lo, hi, a, b, c)
     implicit none
     integer :: lo, hi, i
     real, dimension(lo:hi) :: a, b, c
 
-!$acc kernels
+!$acc kernels ! { dg-bogus "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "TODO" { xfail { openacc_nvidia_accel_selected && opt_levels_2_plus } } }
 !$acc loop independent
     do i = lo, hi
       b(i) = a(i)
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-2.f95 b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-2.f95
index af491d053e6..8becc159dd1 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-2.f95
+++ b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-2.f95
@@ -1,7 +1,4 @@
 ! { dg-do run }
-! TODO, <https://gcc.gnu.org/PR80995>.
-! warning: OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty
-! { dg-xfail-if "TODO" { openacc_nvidia_accel_selected } { "-Os" } { "" } }
 
 program main
   implicit none
@@ -11,7 +8,7 @@ program main
 
   ! Parallelism dimensions: compiler/runtime decides.
   !$acc kernels copyout (a(0:n-1))
-  do i = 0, n - 1
+  do i = 0, n - 1 ! { dg-bogus "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "PR80995" { xfail { openacc_nvidia_accel_selected && opt_levels_size } } }
      a(i) = i * 2
   end do
   !$acc end kernels
@@ -20,7 +17,7 @@ program main
   !$acc kernels copyout (b(0:n-1)) &
   !$acc num_gangs (3 + a(3)) num_workers (5 + a(5)) vector_length (7 + a(7))
   ! { dg-prune-output "using vector_length \\(32\\), ignoring runtime setting" }
-  do i = 0, n -1
+  do i = 0, n -1 ! { dg-bogus "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "PR80995" { xfail { openacc_nvidia_accel_selected && opt_levels_size } } }
      b(i) = i * 4
   end do
   !$acc end kernels
@@ -29,7 +26,7 @@ program main
   !$acc kernels copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1)) &
   !$acc num_gangs (3) num_workers (5) vector_length (7)
   ! { dg-prune-output "using vector_length \\(32\\), ignoring 7" }
-  do ii = 0, n - 1
+  do ii = 0, n - 1 ! { dg-bogus "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "PR80995" { xfail { openacc_nvidia_accel_selected && opt_levels_size } } }
      c(ii) = a(ii) + b(ii)
   end do
   !$acc end kernels
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-2.f95 b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-2.f95
index ca1ac70cd9e..2191ebedee3 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-2.f95
+++ b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-2.f95
@@ -1,7 +1,4 @@
 ! { dg-do run }
-! TODO, <https://gcc.gnu.org/PR80995>.
-! warning: OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty
-! { dg-xfail-if "TODO" { openacc_nvidia_accel_selected } { "-Os" } { "" } }
 
 program main
   implicit none
@@ -11,7 +8,7 @@ program main
 
   !$acc data copyout (a(0:n-1))
   !$acc kernels present (a(0:n-1))
-  do i = 0, n - 1
+  do i = 0, n - 1 ! { dg-bogus "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "PR80995" { xfail { openacc_nvidia_accel_selected && opt_levels_size } } }
      a(i) = i * 2
   end do
   !$acc end kernels
@@ -19,7 +16,7 @@ program main
 
   !$acc data copyout (b(0:n-1))
   !$acc kernels present (b(0:n-1))
-  do i = 0, n -1
+  do i = 0, n -1 ! { dg-bogus "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "PR80995" { xfail { openacc_nvidia_accel_selected && opt_levels_size } } }
      b(i) = i * 4
   end do
   !$acc end kernels
@@ -27,7 +24,7 @@ program main
 
   !$acc data copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1))
   !$acc kernels present (a(0:n-1), b(0:n-1), c(0:n-1))
-  do ii = 0, n - 1
+  do ii = 0, n - 1 ! { dg-bogus "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "PR80995" { xfail { openacc_nvidia_accel_selected && opt_levels_size } } }
      c(ii) = a(ii) + b(ii)
   end do
   !$acc end kernels
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit-2.f95 b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit-2.f95
index 5103f3ede04..75fb8a32beb 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit-2.f95
+++ b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit-2.f95
@@ -1,7 +1,4 @@
 ! { dg-do run }
-! TODO, <https://gcc.gnu.org/PR80995>.
-! warning: OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty
-! { dg-xfail-if "TODO" { openacc_nvidia_accel_selected } { "-Os" } { "" } }
 
 program main
   implicit none
@@ -11,7 +8,7 @@ program main
 
   !$acc enter data create (a(0:n-1))
   !$acc kernels present (a(0:n-1))
-  do i = 0, n - 1
+  do i = 0, n - 1 ! { dg-bogus "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "PR80995" { xfail { openacc_nvidia_accel_selected && opt_levels_size } } }
      a(i) = i * 2
   end do
   !$acc end kernels
@@ -19,7 +16,7 @@ program main
 
   !$acc enter data create (b(0:n-1))
   !$acc kernels present (b(0:n-1))
-  do i = 0, n -1
+  do i = 0, n -1 ! { dg-bogus "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "PR80995" { xfail { openacc_nvidia_accel_selected && opt_levels_size } } }
      b(i) = i * 4
   end do
   !$acc end kernels
@@ -27,7 +24,7 @@ program main
 
   !$acc enter data copyin (a(0:n-1), b(0:n-1)) create (c(0:n-1))
   !$acc kernels present (a(0:n-1), b(0:n-1), c(0:n-1))
-  do ii = 0, n - 1
+  do ii = 0, n - 1 ! { dg-bogus "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "PR80995" { xfail { openacc_nvidia_accel_selected && opt_levels_size } } }
      c(ii) = a(ii) + b(ii)
   end do
   !$acc end kernels
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit.f95 b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit.f95
index 5c1fd52c956..8ea34bf7bf8 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit.f95
+++ b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-enter-exit.f95
@@ -1,7 +1,4 @@
 ! { dg-do run }
-! TODO, <https://gcc.gnu.org/PR80995>.
-! warning: OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty
-! { dg-xfail-if "TODO" { openacc_nvidia_accel_selected } { "-Os" } { "" } }
 
 program main
   implicit none
@@ -12,19 +9,19 @@ program main
   !$acc enter data create (a(0:n-1), b(0:n-1), c(0:n-1))
 
   !$acc kernels present (a(0:n-1))
-  do i = 0, n - 1
+  do i = 0, n - 1 ! { dg-bogus "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "PR80995" { xfail { openacc_nvidia_accel_selected && opt_levels_size } } }
      a(i) = i * 2
   end do
   !$acc end kernels
 
   !$acc kernels present (b(0:n-1))
-  do i = 0, n -1
+  do i = 0, n -1 ! { dg-bogus "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "PR80995" { xfail { openacc_nvidia_accel_selected && opt_levels_size } } }
      b(i) = i * 4
   end do
   !$acc end kernels
 
   !$acc kernels present (a(0:n-1), b(0:n-1), c(0:n-1))
-  do ii = 0, n - 1
+  do ii = 0, n - 1 ! { dg-bogus "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "PR80995" { xfail { openacc_nvidia_accel_selected && opt_levels_size } } }
      c(ii) = a(ii) + b(ii)
   end do
   !$acc end kernels
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-update.f95 b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-update.f95
index 3f889a41a0f..710068a707a 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-update.f95
+++ b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data-update.f95
@@ -1,7 +1,4 @@
 ! { dg-do run }
-! TODO, <https://gcc.gnu.org/PR80995>.
-! warning: OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty
-! { dg-xfail-if "TODO" { openacc_nvidia_accel_selected } { "-Os" } { "" } }
 
 program main
   implicit none
@@ -12,7 +9,7 @@ program main
   !$acc enter data create (a(0:n-1), b(0:n-1), c(0:n-1))
 
   !$acc kernels present (a(0:n-1))
-  do i = 0, n - 1
+  do i = 0, n - 1 ! { dg-bogus "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "PR80995" { xfail { openacc_nvidia_accel_selected && opt_levels_size } } }
      a(i) = i * 2
   end do
   !$acc end kernels
@@ -24,7 +21,7 @@ program main
   !$acc update device (b(0:n-1))
 
   !$acc kernels present (a(0:n-1), b(0:n-1), c(0:n-1))
-  do ii = 0, n - 1
+  do ii = 0, n - 1 ! { dg-bogus "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "PR80995" { xfail { openacc_nvidia_accel_selected && opt_levels_size } } }
      c(ii) = a(ii) + b(ii)
   end do
   !$acc end kernels
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data.f95 b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data.f95
index e495663a72e..c1dec2c89c5 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data.f95
+++ b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop-data.f95
@@ -1,7 +1,4 @@
 ! { dg-do run }
-! TODO, <https://gcc.gnu.org/PR80995>.
-! warning: OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty
-! { dg-xfail-if "TODO" { openacc_nvidia_accel_selected } { "-Os" } { "" } }
 
 program main
   implicit none
@@ -12,19 +9,19 @@ program main
   !$acc data copyout (a(0:n-1), b(0:n-1), c(0:n-1))
 
   !$acc kernels present (a(0:n-1))
-  do i = 0, n - 1
+  do i = 0, n - 1 ! { dg-bogus "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "PR80995" { xfail { openacc_nvidia_accel_selected && opt_levels_size } } }
      a(i) = i * 2
   end do
   !$acc end kernels
 
   !$acc kernels present (b(0:n-1))
-  do i = 0, n -1
+  do i = 0, n -1 ! { dg-bogus "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "PR80995" { xfail { openacc_nvidia_accel_selected && opt_levels_size } } }
      b(i) = i * 4
   end do
   !$acc end kernels
 
   !$acc kernels present (a(0:n-1), b(0:n-1), c(0:n-1))
-  do ii = 0, n - 1
+  do ii = 0, n - 1 ! { dg-bogus "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "PR80995" { xfail { openacc_nvidia_accel_selected && opt_levels_size } } }
      c(ii) = a(ii) + b(ii)
   end do
   !$acc end kernels
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop.f95 b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop.f95
index 7377ed8780a..c9d3c4adc95 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop.f95
+++ b/libgomp/testsuite/libgomp.oacc-fortran/kernels-loop.f95
@@ -1,7 +1,4 @@
 ! { dg-do run }
-! TODO, <https://gcc.gnu.org/PR80995>.
-! warning: OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty
-! { dg-xfail-if "TODO" { openacc_nvidia_accel_selected } { "-Os" } { "" } }
 
 program main
   implicit none
@@ -18,7 +15,7 @@ program main
   end do
 
   !$acc kernels copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1))
-  do ii = 0, n - 1
+  do ii = 0, n - 1 ! { dg-bogus "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "PR80995" { xfail { openacc_nvidia_accel_selected && opt_levels_size } } }
      c(ii) = a(ii) + b(ii)
   end do
   !$acc end kernels
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/kernels-parallel-loop-data-enter-exit.f95 b/libgomp/testsuite/libgomp.oacc-fortran/kernels-parallel-loop-data-enter-exit.f95
index 8685275046b..99300ec88b1 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/kernels-parallel-loop-data-enter-exit.f95
+++ b/libgomp/testsuite/libgomp.oacc-fortran/kernels-parallel-loop-data-enter-exit.f95
@@ -1,7 +1,4 @@
 ! { dg-do run }
-! TODO, <https://gcc.gnu.org/PR80995>.
-! warning: OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty
-! { dg-xfail-if "TODO" { openacc_nvidia_accel_selected } { "-Os" } { "" } }
 
 program main
   implicit none
@@ -12,7 +9,7 @@ program main
   !$acc enter data create (a(0:n-1), b(0:n-1), c(0:n-1))
 
   !$acc kernels present (a(0:n-1))
-  do i = 0, n - 1
+  do i = 0, n - 1 ! { dg-bogus "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "PR80995" { xfail { openacc_nvidia_accel_selected && opt_levels_size } } }
      a(i) = i * 2
   end do
   !$acc end kernels
@@ -25,7 +22,7 @@ program main
   !$acc end parallel
 
   !$acc kernels present (a(0:n-1), b(0:n-1), c(0:n-1))
-  do ii = 0, n - 1
+  do ii = 0, n - 1 ! { dg-bogus "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "PR80995" { xfail { openacc_nvidia_accel_selected && opt_levels_size } } }
      c(ii) = a(ii) + b(ii)
   end do
   !$acc end kernels
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90 b/libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90
index 99bd69207b6..f037b9ac9fd 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90
+++ b/libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90
@@ -3,9 +3,6 @@
 ! present.
 
 ! { dg-do run }
-! TODO, <https://gcc.gnu.org/PR80995>.
-! warning: OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty
-! { dg-xfail-if "TODO" { openacc_nvidia_accel_selected } { "-Os" } { "" } }
 
 program main
   implicit none
@@ -54,7 +51,7 @@ subroutine kernels (array, n)
   integer, dimension (n) :: array
   integer :: n, i
 
-  !$acc kernels
+  !$acc kernels ! { dg-bogus "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "PR80995" { xfail { openacc_nvidia_accel_selected && opt_levels_size } } }
   do i = 1, n
      array(i) = i
   end do
@@ -65,7 +62,7 @@ subroutine kernels_default_present (array, n)
   integer, dimension (n) :: array
   integer :: n, i
 
-  !$acc kernels default(present)
+  !$acc kernels default(present) ! { dg-bogus "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "PR80995" { xfail { openacc_nvidia_accel_selected && opt_levels_size } } }
   do i = 1, n
      array(i) = i+1
   end do
-- 
2.17.1



end of thread (~2019-01-31 16:51 UTC)

Thread overview: 25+ messages
     [not found] <87r3hac1w9.fsf@hertz.schwinge.homeip.net>
2016-01-18 17:27 ` [PATCH] Add fopt-info-oacc Tom de Vries
2016-01-18 18:28   ` Sandra Loosemore
2016-01-18 20:30     ` Richard Sandiford
2016-01-21 21:55   ` [gomp4] Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading" (was: [PATCH] Add fopt-info-oacc) Thomas Schwinge
2016-01-22  7:40     ` [gomp4] Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading" Thomas Schwinge
2016-01-22  8:36       ` Jakub Jelinek
2016-01-22  9:00         ` Thomas Schwinge
2016-01-22 13:18         ` Bernd Schmidt
2016-01-22 13:25           ` Jakub Jelinek
2016-01-22 13:31             ` Bernd Schmidt
2016-02-04 14:47               ` Thomas Schwinge
2016-02-10 11:51                 ` Thomas Schwinge
2016-02-10 13:25                   ` Bernd Schmidt
2016-02-10 14:40                     ` Thomas Schwinge
2016-02-10 15:27                       ` Bernd Schmidt
2016-02-10 16:23                         ` Thomas Schwinge
2016-02-10 16:37                           ` Bernd Schmidt
2016-02-10 17:39                             ` Thomas Schwinge
2016-02-10 20:07                               ` Bernd Schmidt
2016-02-11 10:02                                 ` Thomas Schwinge
2016-02-11 15:58                                   ` Bernd Schmidt
2016-01-26 22:30           ` [gomp4] " Martin Jambor
2016-06-30 21:46     ` Thomas Schwinge
2016-11-03 17:59     ` [gomp4] Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading" (was: [PATCH] Add fopt-info-oacc) Cesar Philippidis
2019-01-31 17:16     ` [gomp4] Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading" Thomas Schwinge
