From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-446126-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 77537 invoked by alias); 13 Jan 2017 18:11:54 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 77486 invoked by uid 89); 13 Jan 2017 18:11:45 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-3.3 required=5.0 tests=BAYES_00,KAM_ASCII_DIVIDERS,KAM_LAZY_DOMAIN_SECURITY,RP_MATCHES_RCVD,SPF_HELO_PASS autolearn=ham version=3.3.2 spammy=nouveau, crosscompiler, cross-compiler, IFS
X-HELO: mx1.redhat.com
Received: from mx1.redhat.com (HELO mx1.redhat.com) (209.132.183.28) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Fri, 13 Jan 2017 18:11:34 +0000
Received: from int-mx13.intmail.prod.int.phx2.redhat.com (int-mx13.intmail.prod.int.phx2.redhat.com [10.5.11.26])	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))	(No client certificate requested)	by mx1.redhat.com (Postfix) with ESMTPS id 5666D3D944;	Fri, 13 Jan 2017 18:11:34 +0000 (UTC)
Received: from tucnak.zalov.cz (ovpn-116-250.ams2.redhat.com [10.36.116.250])	by int-mx13.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id v0DIBVG2026977	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO);	Fri, 13 Jan 2017 13:11:33 -0500
Received: from tucnak.zalov.cz (localhost [127.0.0.1])	by tucnak.zalov.cz (8.15.2/8.15.2) with ESMTP id v0DIBRHs031969;	Fri, 13 Jan 2017 19:11:28 +0100
Received: (from jakub@localhost)	by tucnak.zalov.cz (8.15.2/8.15.2/Submit) id v0DIBNee031968;	Fri, 13 Jan 2017 19:11:23 +0100
Date: Fri, 13 Jan 2017 18:11:00 -0000
From: Jakub Jelinek <jakub@redhat.com>
To: Alexander Monakov <amonakov@ispras.ru>,        Thomas Schwinge <thomas@codesourcery.com>,        Cesar Philippidis <cesar@codesourcery.com>,        Chung-Lin Tang <chunglin_tang@mentor.com>, Jeff Law <law@redhat.com>
Cc: gcc-patches@gcc.gnu.org
Subject: [PATCH] Allow building GCC with PTX offloading even without CUDA being installed (gcc and nvptx-tools patches)
Message-ID: <20170113181123.GA1867@tucnak>
Reply-To: Jakub Jelinek <jakub@redhat.com>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="TRYliJ5NKNqkz5bu"
Content-Disposition: inline
User-Agent: Mutt/1.7.1 (2016-10-04)
X-IsSubscribed: yes
X-SW-Source: 2017-01/txt/msg00980.txt.bz2


--TRYliJ5NKNqkz5bu
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-length: 30985

Hi!

This was already discussed at the last Cauldron.  For distributions in
particular, it is undesirable to require the proprietary CUDA libraries
and headers to be installed when building GCC.

These two patches allow building GCC without CUDA around, in such a way
that the result can later offload to PTX if libcuda.so.1 is installed
(along with the NVidia kernel driver; I haven't tried whether it works with
nouveau, nor tried any free CUDA replacements).  This is important because
the former step can be done when building the distribution packages, while
the latter is the user's decision.  If the nvptx libgomp plugin is
installed but libcuda.so.1 can't be found, the plugin behaves as if no PTX
devices were available.  To configure gcc to load libcuda.so.1 dynamically,
either configure it with --without-cuda-driver, or omit the
--with-cuda-driver=/--with-cuda-driver-lib=/--with-cuda-driver-include=
options when cuda.h and -lcuda aren't found in the default locations.
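
As an illustration only (this is not code from the patch), the "no
libcuda.so.1 means no devices" behaviour could be sketched in C roughly as
below.  count_ptx_devices and its two typedefs are hypothetical names; the
only things assumed about the driver are the cuInit and cuDeviceGetCount
entry points the plugin already calls, both returning 0 (CUDA_SUCCESS) on
success.

```c
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical typedefs for the two driver entry points used here.  */
typedef int (*cu_init_fn) (unsigned);
typedef int (*cu_device_get_count_fn) (int *);

/* Hypothetical helper, not part of the patch: return the number of
   devices reported by the driver library LIBNAME, or 0 if the library
   or one of the needed symbols cannot be found.  */
int
count_ptx_devices (const char *libname)
{
  void *h = dlopen (libname, RTLD_LAZY);
  if (h == NULL)
    return 0;	/* No driver library: behave as if there are no devices.  */

  cu_init_fn init = (cu_init_fn) dlsym (h, "cuInit");
  cu_device_get_count_fn get_count
    = (cu_device_get_count_fn) dlsym (h, "cuDeviceGetCount");
  if (init == NULL || get_count == NULL)
    {
      dlclose (h);
      return 0;
    }

  /* The handle is deliberately left open once the driver is in use.  */
  int n = 0;
  if (init (0) != 0 || get_count (&n) != 0 || n < 0)
    return 0;
  return n;
}
```

A caller would pass "libcuda.so.1".  Linking may need -ldl on systems where
dlopen is not part of libc, which is what the configure change arranges
through PLUGIN_NVPTX_LIBS.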

I've talked to our lawyers, and they said that the cuda.h header included
in this patch doesn't infringe anyone's copyright or is otherwise fair use.
It was created by gathering all the cu*/CU* symbols from the current and
older nvptx plugin and some oacc tests, then stubbing the pointer-ish
typedefs, grabbing most enum values and function prototypes from
https://raw.githubusercontent.com/shinpei0208/gdev/master/cuda/driver/cuda.h
and verifying that the assembly generated with that header matches the
assembly generated when compiling against NVidia's cuda.h.

The nvptx-tools change to the nvptx-none-as binary is an important part of
this, although it is not a change to gcc itself.  The problem is that by
default nvptx-none-as called the ptxas program to verify that the assembly
is correct, which of course doesn't work when the proprietary ptxas is not
available.  With the patch, ptxas is invoked unconditionally only if the
new --verify option is used, and, as before, it is not invoked if
--no-verify is used.  Without either of these options, ptxas is invoked if
it is found in $PATH; otherwise nvptx-none-as does only minimal
verification, good enough for gcc/configure purposes (it turned out to be
sufficient to error out if the .version directive is not the first
non-comment token; ptxas errors on that too).
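
To make the minimal check concrete, here is a rough C sketch of such a
test.  It is not the code from the nvptx-tools patch;
ptx_starts_with_version is a hypothetical name, and the sketch assumes
that only whitespace, // line comments and /* */ block comments may
precede the directive.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper, not taken from nvptx-tools: return true if the
   first token of PTX that is neither whitespace nor a comment is the
   ".version" directive.  */
bool
ptx_starts_with_version (const char *ptx)
{
  const char *p = ptx;
  for (;;)
    {
      while (isspace ((unsigned char) *p))
	p++;
      if (p[0] == '/' && p[1] == '/')
	{
	  /* Line comment: skip to the end of the line.  */
	  while (*p != '\0' && *p != '\n')
	    p++;
	}
      else if (p[0] == '/' && p[1] == '*')
	{
	  /* Block comment: skip past the closing marker, if there is one.  */
	  const char *end = strstr (p + 2, "*/");
	  if (end == NULL)
	    return false;
	  p = end + 2;
	}
      else
	break;
    }

  static const char directive[] = ".version";
  size_t len = sizeof directive - 1;
  /* Require a token boundary so that e.g. ".versionx" is rejected.  */
  return (strncmp (p, directive, len) == 0
	  && (p[len] == '\0' || isspace ((unsigned char) p[len])));
}
```

An assembler front end would report an error when this returns false, and
leave all further checking to ptxas when that is available.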

Tested on x86_64-linux, with CUDA around
(--with-cuda-driver=/usr/local/cuda-8.0) as well as without, and tested
in that case also both with libcuda.so.1 available and without.

Can the OpenACC hackers as well as Alex (or his colleagues) please also
test it?  Do you have any problems with the GCC patch (if not, I'd commit
it next week before stage3 closes)?  Is the nvptx-tools patch ok (and if
so, can you commit it)?  I guess I could create a github pull request for
this if needed.

P.S.: I'm not sure what the cuInit call in nvptx_init is good for; doesn't
libgomp always call nvptx_get_num_devices first and thus call cuInit there
already?  (I've kept it in the patch, though.)

2017-01-13  Jakub Jelinek  <jakub@redhat.com>

	* plugin/configfrag.ac: For --without-cuda-driver don't initialize
	CUDA_DRIVER_INCLUDE nor CUDA_DRIVER_LIB.  If both
	CUDA_DRIVER_INCLUDE and CUDA_DRIVER_LIB are empty and linking a small
	cuda program fails, define PLUGIN_NVPTX_DYNAMIC to 1 and use
	plugin/cuda as include dir and -ldl instead of -lcuda as
	library to link ptx plugin against.
	* plugin/plugin-nvptx.c: Include dlfcn.h if PLUGIN_NVPTX_DYNAMIC.
	(CUDA_CALLS): Define.
	(cuda_lib, cuda_lib_inited): New variables.
	(init_cuda_lib): New function.
	(CUDA_CALL_PREFIX): Define.
	(CUDA_CALL_ERET, CUDA_CALL_ASSERT): Use CUDA_CALL_PREFIX.
	(CUDA_CALL): Use FN instead of (FN).
	(CUDA_CALL_NOCHECK): Define.
	(cuda_error, fini_streams_for_device, select_stream_for_async,
	nvptx_attach_host_thread_to_device, nvptx_open_device, link_ptx,
	event_gc, nvptx_exec, nvptx_async_test, nvptx_async_test_all,
	nvptx_wait_all, nvptx_set_clocktick, GOMP_OFFLOAD_unload_image,
	nvptx_stacks_alloc, nvptx_stacks_free, GOMP_OFFLOAD_run): Use
	CUDA_CALL_NOCHECK.
	(nvptx_init): Call init_cuda_lib, if it fails, return false.  Use
	CUDA_CALL_NOCHECK.
	(nvptx_get_num_devices): Call init_cuda_lib, if it fails, return 0.
	Use CUDA_CALL_NOCHECK.
	* plugin/cuda/cuda.h: New file.
	* config.h.in: Regenerated.
	* configure: Regenerated.
	* Makefile.in: Regenerated.
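
For readers unfamiliar with the CUDA_CALLS/CUDA_ONE_CALL idiom named
above, here is a rough standalone sketch of the same X-macro technique.  It
is not code from the patch: LIBC_CALLS, libc_fns, init_libc_fns and
LIBC_CALL are hypothetical names, atoi and labs are stand-ins for the cu*
entry points, and dlopen (NULL, ...) is used so that the sketch resolves
symbols from the already loaded C library instead of libcuda.so.1.

```c
#include <dlfcn.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* One list of entry points, expanded twice below.  */
#define LIBC_CALLS \
LIBC_ONE_CALL (atoi) \
LIBC_ONE_CALL (labs)

/* First expansion: one function pointer member per entry, with the
   type taken from the declaration in the header.  */
#define LIBC_ONE_CALL(call) __typeof (call) *call;
static struct { LIBC_CALLS } libc_fns;
#undef LIBC_ONE_CALL

/* Second expansion: look every entry up by name and fail on the first
   one that is missing.  */
bool
init_libc_fns (void)
{
  void *h = dlopen (NULL, RTLD_LAZY);
  if (h == NULL)
    return false;
#define LIBC_ONE_CALL(call) \
  libc_fns.call = (__typeof (call) *) dlsym (h, #call); \
  if (libc_fns.call == NULL) \
    return false;
  LIBC_CALLS
#undef LIBC_ONE_CALL
  return true;
}

/* Call through the table, in the spirit of CUDA_CALL_NOCHECK with
   CUDA_CALL_PREFIX defined to "cuda_lib.".  */
#define LIBC_CALL(FN, ...) libc_fns.FN (__VA_ARGS__)
```

After a successful init_libc_fns (), LIBC_CALL (atoi, "42") calls atoi
through the table.  The patch additionally routes the second expansion
through CUDA_ONE_CALL_1 so that names such as cuCtxCreate, which the
bundled cuda.h defines as cuCtxCreate_v2, are macro-expanded before being
stringized; the sketch skips that step because its stand-ins are assumed
not to be macros.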

--- libgomp/plugin/configfrag.ac.jj	2017-01-13 12:07:56.000000000 +0100
+++ libgomp/plugin/configfrag.ac	2017-01-13 17:33:26.608240936 +0100
@@ -58,10 +58,12 @@ AC_ARG_WITH(cuda-driver-include,
 AC_ARG_WITH(cuda-driver-lib,
 	[AS_HELP_STRING([--with-cuda-driver-lib=PATH],
 		[specify directory for the installed CUDA driver library])])
-if test "x$with_cuda_driver" != x; then
-  CUDA_DRIVER_INCLUDE=$with_cuda_driver/include
-  CUDA_DRIVER_LIB=$with_cuda_driver/lib
-fi
+case "x$with_cuda_driver" in
+  x | xno) ;;
+  *) CUDA_DRIVER_INCLUDE=$with_cuda_driver/include
+     CUDA_DRIVER_LIB=$with_cuda_driver/lib
+     ;;
+esac
 if test "x$with_cuda_driver_include" != x; then
   CUDA_DRIVER_INCLUDE=$with_cuda_driver_include
 fi
@@ -79,6 +81,7 @@ PLUGIN_NVPTX=0
 PLUGIN_NVPTX_CPPFLAGS=
 PLUGIN_NVPTX_LDFLAGS=
 PLUGIN_NVPTX_LIBS=
+PLUGIN_NVPTX_DYNAMIC=0
 AC_SUBST(PLUGIN_NVPTX)
 AC_SUBST(PLUGIN_NVPTX_CPPFLAGS)
 AC_SUBST(PLUGIN_NVPTX_LDFLAGS)
@@ -167,9 +170,17 @@ if test x"$enable_offload_targets" != x;
 	LIBS=$PLUGIN_NVPTX_save_LIBS
 	case $PLUGIN_NVPTX in
 	  nvptx*)
-	    PLUGIN_NVPTX=0
-	    AC_MSG_ERROR([CUDA driver package required for nvptx support])
-	    ;;
+	    if test "x$CUDA_DRIVER_INCLUDE" = x \
+	       && test "x$CUDA_DRIVER_LIB" = x; then
+	      PLUGIN_NVPTX=1
+	      PLUGIN_NVPTX_CPPFLAGS='-I$(srcdir)/plugin/cuda'
+	      PLUGIN_NVPTX_LIBS='-ldl'
+	      PLUGIN_NVPTX_DYNAMIC=1
+	    else
+	      PLUGIN_NVPTX=0
+	      AC_MSG_ERROR([CUDA driver package required for nvptx support])
+	    fi
+	  ;;
 	esac
 	;;
       hsa*)
@@ -241,6 +252,8 @@ AC_DEFINE_UNQUOTED(OFFLOAD_TARGETS, "$of
 AM_CONDITIONAL([PLUGIN_NVPTX], [test $PLUGIN_NVPTX = 1])
 AC_DEFINE_UNQUOTED([PLUGIN_NVPTX], [$PLUGIN_NVPTX],
   [Define to 1 if the NVIDIA plugin is built, 0 if not.])
+AC_DEFINE_UNQUOTED([PLUGIN_NVPTX_DYNAMIC], [$PLUGIN_NVPTX_DYNAMIC],
+  [Define to 1 if the NVIDIA plugin should dlopen libcuda.so.1, 0 if it should be linked against it.])
 AM_CONDITIONAL([PLUGIN_HSA], [test $PLUGIN_HSA = 1])
 AC_DEFINE_UNQUOTED([PLUGIN_HSA], [$PLUGIN_HSA],
   [Define to 1 if the HSA plugin is built, 0 if not.])
--- libgomp/plugin/plugin-nvptx.c.jj	2017-01-13 12:07:56.000000000 +0100
+++ libgomp/plugin/plugin-nvptx.c	2017-01-13 18:00:39.693284346 +0100
@@ -48,30 +48,104 @@
 #include <assert.h>
 #include <errno.h>
 
-static const char *
-cuda_error (CUresult r)
-{
-#if CUDA_VERSION < 7000
-  /* Specified in documentation and present in library from at least
-     5.5.  Not declared in header file prior to 7.0.  */
-  extern CUresult cuGetErrorString (CUresult, const char **);
-#endif
-  const char *desc;
+#if PLUGIN_NVPTX_DYNAMIC
+# include <dlfcn.h>
 
-  r = cuGetErrorString (r, &desc);
-  if (r != CUDA_SUCCESS)
-    desc = "unknown cuda error";
+# define CUDA_CALLS \
+CUDA_ONE_CALL (cuCtxCreate)		\
+CUDA_ONE_CALL (cuCtxDestroy)		\
+CUDA_ONE_CALL (cuCtxGetCurrent)		\
+CUDA_ONE_CALL (cuCtxGetDevice)		\
+CUDA_ONE_CALL (cuCtxPopCurrent)		\
+CUDA_ONE_CALL (cuCtxPushCurrent)	\
+CUDA_ONE_CALL (cuCtxSynchronize)	\
+CUDA_ONE_CALL (cuDeviceGet)		\
+CUDA_ONE_CALL (cuDeviceGetAttribute)	\
+CUDA_ONE_CALL (cuDeviceGetCount)	\
+CUDA_ONE_CALL (cuEventCreate)		\
+CUDA_ONE_CALL (cuEventDestroy)		\
+CUDA_ONE_CALL (cuEventElapsedTime)	\
+CUDA_ONE_CALL (cuEventQuery)		\
+CUDA_ONE_CALL (cuEventRecord)		\
+CUDA_ONE_CALL (cuEventSynchronize)	\
+CUDA_ONE_CALL (cuFuncGetAttribute)	\
+CUDA_ONE_CALL (cuGetErrorString)	\
+CUDA_ONE_CALL (cuInit)			\
+CUDA_ONE_CALL (cuLaunchKernel)		\
+CUDA_ONE_CALL (cuLinkAddData)		\
+CUDA_ONE_CALL (cuLinkComplete)		\
+CUDA_ONE_CALL (cuLinkCreate)		\
+CUDA_ONE_CALL (cuLinkDestroy)		\
+CUDA_ONE_CALL (cuMemAlloc)		\
+CUDA_ONE_CALL (cuMemAllocHost)		\
+CUDA_ONE_CALL (cuMemcpy)		\
+CUDA_ONE_CALL (cuMemcpyDtoDAsync)	\
+CUDA_ONE_CALL (cuMemcpyDtoH)		\
+CUDA_ONE_CALL (cuMemcpyDtoHAsync)	\
+CUDA_ONE_CALL (cuMemcpyHtoD)		\
+CUDA_ONE_CALL (cuMemcpyHtoDAsync)	\
+CUDA_ONE_CALL (cuMemFree)		\
+CUDA_ONE_CALL (cuMemFreeHost)		\
+CUDA_ONE_CALL (cuMemGetAddressRange)	\
+CUDA_ONE_CALL (cuMemHostGetDevicePointer)\
+CUDA_ONE_CALL (cuModuleGetFunction)	\
+CUDA_ONE_CALL (cuModuleGetGlobal)	\
+CUDA_ONE_CALL (cuModuleLoad)		\
+CUDA_ONE_CALL (cuModuleLoadData)	\
+CUDA_ONE_CALL (cuModuleUnload)		\
+CUDA_ONE_CALL (cuStreamCreate)		\
+CUDA_ONE_CALL (cuStreamDestroy)		\
+CUDA_ONE_CALL (cuStreamQuery)		\
+CUDA_ONE_CALL (cuStreamSynchronize)	\
+CUDA_ONE_CALL (cuStreamWaitEvent)
+# define CUDA_ONE_CALL(call) \
+  __typeof (call) *call;
+struct cuda_lib_s {
+  CUDA_CALLS
+} cuda_lib;
+
+/* -1 if init_cuda_lib has not been called yet, false
+   if it has been and failed, true if it has been and succeeded.  */
+static signed char cuda_lib_inited = -1;
 
-  return desc;
+/* Dynamically load the CUDA runtime library and initialize function
+   pointers, return false if unsuccessful, true if successful.  */
+static bool
+init_cuda_lib (void)
+{
+  if (cuda_lib_inited != -1)
+    return cuda_lib_inited;
+  const char *cuda_runtime_lib = "libcuda.so.1";
+  void *h = dlopen (cuda_runtime_lib, RTLD_LAZY);
+  cuda_lib_inited = false;
+  if (h == NULL)
+    return false;
+# undef CUDA_ONE_CALL
+# define CUDA_ONE_CALL(call) CUDA_ONE_CALL_1 (call)
+# define CUDA_ONE_CALL_1(call) \
+  cuda_lib.call = dlsym (h, #call);	\
+  if (cuda_lib.call == NULL)		\
+    return false;
+  CUDA_CALLS
+  cuda_lib_inited = true;
+  return true;
 }
+# undef CUDA_ONE_CALL
+# undef CUDA_ONE_CALL_1
+# define CUDA_CALL_PREFIX cuda_lib.
+#else
+# define CUDA_CALL_PREFIX
+# define init_cuda_lib() true
+#endif
 
 /* Convenience macros for the frequently used CUDA library call and
-   error handling sequence.  This does not capture all the cases we
-   use in this file, but is common enough.  */
+   error handling sequence as well as CUDA library calls that
+   do the error checking themselves or don't do it at all.  */
 
 #define CUDA_CALL_ERET(ERET, FN, ...)		\
   do {						\
-    unsigned __r = FN (__VA_ARGS__);		\
+    unsigned __r				\
+      = CUDA_CALL_PREFIX FN (__VA_ARGS__);	\
     if (__r != CUDA_SUCCESS)			\
       {						\
 	GOMP_PLUGIN_error (#FN " error: %s",	\
@@ -81,11 +155,12 @@ cuda_error (CUresult r)
   } while (0)
 
 #define CUDA_CALL(FN, ...)			\
-  CUDA_CALL_ERET (false, (FN), __VA_ARGS__)
+  CUDA_CALL_ERET (false, FN, __VA_ARGS__)
 
 #define CUDA_CALL_ASSERT(FN, ...)		\
   do {						\
-    unsigned __r = FN (__VA_ARGS__);		\
+    unsigned __r				\
+      = CUDA_CALL_PREFIX FN (__VA_ARGS__);	\
     if (__r != CUDA_SUCCESS)			\
       {						\
 	GOMP_PLUGIN_fatal (#FN " error: %s",	\
@@ -93,6 +168,26 @@ cuda_error (CUresult r)
       }						\
   } while (0)
 
+#define CUDA_CALL_NOCHECK(FN, ...)		\
+  CUDA_CALL_PREFIX FN (__VA_ARGS__)
+
+static const char *
+cuda_error (CUresult r)
+{
+#if CUDA_VERSION < 7000
+  /* Specified in documentation and present in library from at least
+     5.5.  Not declared in header file prior to 7.0.  */
+  extern CUresult cuGetErrorString (CUresult, const char **);
+#endif
+  const char *desc;
+
+  r = CUDA_CALL_NOCHECK (cuGetErrorString, r, &desc);
+  if (r != CUDA_SUCCESS)
+    desc = "unknown cuda error";
+
+  return desc;
+}
+
 static unsigned int instantiated_devices = 0;
 static pthread_mutex_t ptx_dev_lock = PTHREAD_MUTEX_INITIALIZER;
 
@@ -401,7 +496,7 @@ fini_streams_for_device (struct ptx_devi
 
       ret &= map_fini (s);
 
-      CUresult r = cuStreamDestroy (s->stream);
+      CUresult r = CUDA_CALL_NOCHECK (cuStreamDestroy, s->stream);
       if (r != CUDA_SUCCESS)
 	{
 	  GOMP_PLUGIN_error ("cuStreamDestroy error: %s", cuda_error (r));
@@ -484,7 +579,8 @@ select_stream_for_async (int async, pthr
 	    s->stream = existing;
 	  else
 	    {
-	      r = cuStreamCreate (&s->stream, CU_STREAM_DEFAULT);
+	      r = CUDA_CALL_NOCHECK (cuStreamCreate, &s->stream,
+				     CU_STREAM_DEFAULT);
 	      if (r != CUDA_SUCCESS)
 		{
 		  pthread_mutex_unlock (&ptx_dev->stream_lock);
@@ -554,10 +650,14 @@ nvptx_init (void)
   if (instantiated_devices != 0)
     return true;
 
-  CUDA_CALL (cuInit, 0);
   ptx_events = NULL;
   pthread_mutex_init (&ptx_event_lock, NULL);
 
+  if (!init_cuda_lib ())
+    return false;
+
+  CUDA_CALL (cuInit, 0);
+
   CUDA_CALL (cuDeviceGetCount, &ndevs);
   ptx_devices = GOMP_PLUGIN_malloc_cleared (sizeof (struct ptx_device *)
 					    * ndevs);
@@ -575,7 +675,7 @@ nvptx_attach_host_thread_to_device (int
   struct ptx_device *ptx_dev;
   CUcontext thd_ctx;
 
-  r = cuCtxGetDevice (&dev);
+  r = CUDA_CALL_NOCHECK (cuCtxGetDevice, &dev);
   if (r != CUDA_SUCCESS && r != CUDA_ERROR_INVALID_CONTEXT)
     {
       GOMP_PLUGIN_error ("cuCtxGetDevice error: %s", cuda_error (r));
@@ -623,7 +723,7 @@ nvptx_open_device (int n)
   ptx_dev->dev = dev;
   ptx_dev->ctx_shared = false;
 
-  r = cuCtxGetDevice (&ctx_dev);
+  r = CUDA_CALL_NOCHECK (cuCtxGetDevice, &ctx_dev);
   if (r != CUDA_SUCCESS && r != CUDA_ERROR_INVALID_CONTEXT)
     {
       GOMP_PLUGIN_error ("cuCtxGetDevice error: %s", cuda_error (r));
@@ -669,7 +769,7 @@ nvptx_open_device (int n)
 		  &pi, CU_DEVICE_ATTRIBUTE_CLOCK_RATE, dev);
   ptx_dev->clock_khz = pi;
 
-  CUDA_CALL_ERET (NULL,  cuDeviceGetAttribute,
+  CUDA_CALL_ERET (NULL, cuDeviceGetAttribute,
 		  &pi, CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, dev);
   ptx_dev->num_sms = pi;
 
@@ -679,7 +779,7 @@ nvptx_open_device (int n)
 
   /* CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR = 82 is defined only
      in CUDA 6.0 and newer.  */
-  r = cuDeviceGetAttribute (&pi, 82, dev);
+  r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &pi, 82, dev);
   /* Fallback: use limit of registers per block, which is usually equal.  */
   if (r == CUDA_ERROR_INVALID_VALUE)
     pi = ptx_dev->regs_per_block;
@@ -698,8 +798,8 @@ nvptx_open_device (int n)
       return NULL;
     }
 
-  r = cuDeviceGetAttribute (&async_engines,
-			    CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT, dev);
+  r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &async_engines,
+			 CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT, dev);
   if (r != CUDA_SUCCESS)
     async_engines = 1;
 
@@ -746,7 +846,9 @@ nvptx_get_num_devices (void)
      further initialization).  */
   if (instantiated_devices == 0)
     {
-      CUresult r = cuInit (0);
+      if (!init_cuda_lib ())
+	return 0;
+      CUresult r = CUDA_CALL_NOCHECK (cuInit, 0);
       /* This is not an error: e.g. we may have CUDA libraries installed but
          no devices available.  */
       if (r != CUDA_SUCCESS)
@@ -797,8 +899,9 @@ link_ptx (CUmodule *module, const struct
       /* cuLinkAddData's 'data' argument erroneously omits the const
 	 qualifier.  */
       GOMP_PLUGIN_debug (0, "Loading:\n---\n%s\n---\n", ptx_objs->code);
-      r = cuLinkAddData (linkstate, CU_JIT_INPUT_PTX, (char*)ptx_objs->code,
-			 ptx_objs->size, 0, 0, 0, 0);
+      r = CUDA_CALL_NOCHECK (cuLinkAddData, linkstate, CU_JIT_INPUT_PTX,
+			     (char *) ptx_objs->code, ptx_objs->size,
+			     0, 0, 0, 0);
       if (r != CUDA_SUCCESS)
 	{
 	  GOMP_PLUGIN_error ("Link error log %s\n", &elog[0]);
@@ -809,7 +912,7 @@ link_ptx (CUmodule *module, const struct
     }
 
   GOMP_PLUGIN_debug (0, "Linking\n");
-  r = cuLinkComplete (linkstate, &linkout, &linkoutsize);
+  r = CUDA_CALL_NOCHECK (cuLinkComplete, linkstate, &linkout, &linkoutsize);
 
   GOMP_PLUGIN_debug (0, "Link complete: %fms\n", elapsed);
   GOMP_PLUGIN_debug (0, "Link log %s\n", &ilog[0]);
@@ -844,7 +947,7 @@ event_gc (bool memmap_lockable)
       if (e->ord != nvthd->ptx_dev->ord)
 	continue;
 
-      r = cuEventQuery (*e->evt);
+      r = CUDA_CALL_NOCHECK (cuEventQuery, *e->evt);
       if (r == CUDA_SUCCESS)
 	{
 	  bool append_async = false;
@@ -877,7 +980,7 @@ event_gc (bool memmap_lockable)
 	      break;
 	    }
 
-	  cuEventDestroy (*te);
+	  CUDA_CALL_NOCHECK (cuEventDestroy, *te);
 	  free ((void *)te);
 
 	  /* Unlink 'e' from ptx_events list.  */
@@ -1015,10 +1118,14 @@ nvptx_exec (void (*fn), size_t mapnum, v
 	  cu_mpc = CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT;
 	  cu_tpm  = CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR;
 
-	  if (cuDeviceGetAttribute (&block_size, cu_tpb, dev) == CUDA_SUCCESS
-	      && cuDeviceGetAttribute (&warp_size, cu_ws, dev) == CUDA_SUCCESS
-	      && cuDeviceGetAttribute (&dev_size, cu_mpc, dev) == CUDA_SUCCESS
-	      && cuDeviceGetAttribute (&cpu_size, cu_tpm, dev)  == CUDA_SUCCESS)
+	  if (CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &block_size, cu_tpb,
+				 dev) == CUDA_SUCCESS
+	      && CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &warp_size, cu_ws,
+				    dev) == CUDA_SUCCESS
+	      && CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &dev_size, cu_mpc,
+				    dev) == CUDA_SUCCESS
+	      && CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &cpu_size, cu_tpm,
+				    dev) == CUDA_SUCCESS)
 	    {
 	      GOMP_PLUGIN_debug (0, " warp_size=%d, block_size=%d,"
 				 " dev_size=%d, cpu_size=%d\n",
@@ -1090,7 +1197,7 @@ nvptx_exec (void (*fn), size_t mapnum, v
 #ifndef DISABLE_ASYNC
   if (async < acc_async_noval)
     {
-      r = cuStreamSynchronize (dev_str->stream);
+      r = CUDA_CALL_NOCHECK (cuStreamSynchronize, dev_str->stream);
       if (r == CUDA_ERROR_LAUNCH_FAILED)
 	GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s %s\n", cuda_error (r),
 			   maybe_abort_msg);
@@ -1103,7 +1210,7 @@ nvptx_exec (void (*fn), size_t mapnum, v
 
       e = (CUevent *)GOMP_PLUGIN_malloc (sizeof (CUevent));
 
-      r = cuEventCreate (e, CU_EVENT_DISABLE_TIMING);
+      r = CUDA_CALL_NOCHECK (cuEventCreate, e, CU_EVENT_DISABLE_TIMING);
       if (r == CUDA_ERROR_LAUNCH_FAILED)
 	GOMP_PLUGIN_fatal ("cuEventCreate error: %s %s\n", cuda_error (r),
 			   maybe_abort_msg);
@@ -1117,7 +1224,7 @@ nvptx_exec (void (*fn), size_t mapnum, v
       event_add (PTX_EVT_KNL, e, (void *)dev_str, 0);
     }
 #else
-  r = cuCtxSynchronize ();
+  r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
   if (r == CUDA_ERROR_LAUNCH_FAILED)
     GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s %s\n", cuda_error (r),
 		       maybe_abort_msg);
@@ -1294,7 +1401,7 @@ nvptx_async_test (int async)
   if (!s)
     GOMP_PLUGIN_fatal ("unknown async %d", async);
 
-  r = cuStreamQuery (s->stream);
+  r = CUDA_CALL_NOCHECK (cuStreamQuery, s->stream);
   if (r == CUDA_SUCCESS)
     {
       /* The oacc-parallel.c:goacc_wait function calls this hook to determine
@@ -1325,7 +1432,8 @@ nvptx_async_test_all (void)
   for (s = nvthd->ptx_dev->active_streams; s != NULL; s = s->next)
     {
       if ((s->multithreaded || pthread_equal (s->host_thread, self))
-	  && cuStreamQuery (s->stream) == CUDA_ERROR_NOT_READY)
+	  && CUDA_CALL_NOCHECK (cuStreamQuery,
+				s->stream) == CUDA_ERROR_NOT_READY)
 	{
 	  pthread_mutex_unlock (&nvthd->ptx_dev->stream_lock);
 	  return 0;
@@ -1400,7 +1508,7 @@ nvptx_wait_all (void)
     {
       if (s->multithreaded || pthread_equal (s->host_thread, self))
 	{
-	  r = cuStreamQuery (s->stream);
+	  r = CUDA_CALL_NOCHECK (cuStreamQuery, s->stream);
 	  if (r == CUDA_SUCCESS)
 	    continue;
 	  else if (r != CUDA_ERROR_NOT_READY)
@@ -1632,13 +1740,15 @@ static void
 nvptx_set_clocktick (CUmodule module, struct ptx_device *dev)
 {
   CUdeviceptr dptr;
-  CUresult r = cuModuleGetGlobal (&dptr, NULL, module, "__nvptx_clocktick");
+  CUresult r = CUDA_CALL_NOCHECK (cuModuleGetGlobal, &dptr, NULL,
+				  module, "__nvptx_clocktick");
   if (r == CUDA_ERROR_NOT_FOUND)
     return;
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuModuleGetGlobal error: %s", cuda_error (r));
   double __nvptx_clocktick = 1e-3 / dev->clock_khz;
-  r = cuMemcpyHtoD (dptr, &__nvptx_clocktick, sizeof (__nvptx_clocktick));
+  r = CUDA_CALL_NOCHECK (cuMemcpyHtoD, dptr, &__nvptx_clocktick,
+			 sizeof (__nvptx_clocktick));
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuMemcpyHtoD error: %s", cuda_error (r));
 }
@@ -1761,7 +1871,7 @@ GOMP_OFFLOAD_unload_image (int ord, unsi
     if (image->target_data == target_data)
       {
 	*prev_p = image->next;
-	if (cuModuleUnload (image->module) != CUDA_SUCCESS)
+	if (CUDA_CALL_NOCHECK (cuModuleUnload, image->module) != CUDA_SUCCESS)
 	  ret = false;
 	free (image->fns);
 	free (image);
@@ -1974,7 +2084,7 @@ static void *
 nvptx_stacks_alloc (size_t size, int num)
 {
   CUdeviceptr stacks;
-  CUresult r = cuMemAlloc (&stacks, size * num);
+  CUresult r = CUDA_CALL_NOCHECK (cuMemAlloc, &stacks, size * num);
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuMemAlloc error: %s", cuda_error (r));
   return (void *) stacks;
@@ -1985,7 +2095,7 @@ nvptx_stacks_alloc (size_t size, int num
 static void
 nvptx_stacks_free (void *p, int num)
 {
-  CUresult r = cuMemFree ((CUdeviceptr) p);
+  CUresult r = CUDA_CALL_NOCHECK (cuMemFree, (CUdeviceptr) p);
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuMemFree error: %s", cuda_error (r));
 }
@@ -2028,14 +2138,13 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn,
     CU_LAUNCH_PARAM_BUFFER_SIZE, &fn_args_size,
     CU_LAUNCH_PARAM_END
   };
-  r = cuLaunchKernel (function,
-		      teams, 1, 1,
-		      32, threads, 1,
-		      0, ptx_dev->null_stream->stream, NULL, config);
+  r = CUDA_CALL_NOCHECK (cuLaunchKernel, function, teams, 1, 1,
+			 32, threads, 1, 0, ptx_dev->null_stream->stream,
+			 NULL, config);
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
 
-  r = cuCtxSynchronize ();
+  r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
   if (r == CUDA_ERROR_LAUNCH_FAILED)
     GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s %s\n", cuda_error (r),
 		       maybe_abort_msg);
--- libgomp/plugin/cuda/cuda.h.jj	2017-01-13 15:58:00.966544147 +0100
+++ libgomp/plugin/cuda/cuda.h	2017-01-13 17:02:47.355817896 +0100
@@ -0,0 +1,174 @@
+/* CUDA API description.
+   Copyright (C) 2017 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 3, or (at your option)
+any later version.
+
+GCC is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.
+
+This header provides the minimum amount of typedefs, enums and function
+declarations to be able to compile plugin-nvptx.c if cuda.h and
+libcuda.so.1 are not available.  */
+
+#ifndef GCC_CUDA_H
+#define GCC_CUDA_H
+
+#include <stdlib.h>
+
+#define CUDA_VERSION 8000
+
+typedef void *CUcontext;
+typedef int CUdevice;
+#ifdef __LP64__
+typedef unsigned long long CUdeviceptr;
+#else
+typedef unsigned CUdeviceptr;
+#endif
+typedef void *CUevent;
+typedef void *CUfunction;
+typedef void *CUlinkState;
+typedef void *CUmodule;
+typedef void *CUstream;
+
+typedef enum {
+  CUDA_SUCCESS = 0,
+  CUDA_ERROR_INVALID_VALUE = 1,
+  CUDA_ERROR_OUT_OF_MEMORY = 2,
+  CUDA_ERROR_INVALID_CONTEXT = 201,
+  CUDA_ERROR_NOT_FOUND = 500,
+  CUDA_ERROR_NOT_READY = 600,
+  CUDA_ERROR_LAUNCH_FAILED = 719
+} CUresult;
+
+typedef enum {
+  CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK = 1,
+  CU_DEVICE_ATTRIBUTE_WARP_SIZE = 10,
+  CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_BLOCK = 12,
+  CU_DEVICE_ATTRIBUTE_CLOCK_RATE = 13,
+  CU_DEVICE_ATTRIBUTE_GPU_OVERLAP = 15,
+  CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT = 16,
+  CU_DEVICE_ATTRIBUTE_INTEGRATED = 18,
+  CU_DEVICE_ATTRIBUTE_CAN_MAP_HOST_MEMORY = 19,
+  CU_DEVICE_ATTRIBUTE_COMPUTE_MODE = 20,
+  CU_DEVICE_ATTRIBUTE_CONCURRENT_KERNELS = 31,
+  CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR = 39,
+  CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT = 40,
+  CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR = 82
+} CUdevice_attribute;
+
+enum {
+  CU_EVENT_DEFAULT = 0,
+  CU_EVENT_DISABLE_TIMING = 2
+};
+
+typedef enum {
+  CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK = 0,
+  CU_FUNC_ATTRIBUTE_NUM_REGS = 4
+} CUfunction_attribute;
+
+typedef enum {
+  CU_JIT_WALL_TIME = 2,
+  CU_JIT_INFO_LOG_BUFFER = 3,
+  CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES = 4,
+  CU_JIT_ERROR_LOG_BUFFER = 5,
+  CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES = 6,
+  CU_JIT_LOG_VERBOSE = 12
+} CUjit_option;
+
+typedef enum {
+  CU_JIT_INPUT_PTX = 1
+} CUjitInputType;
+
+enum {
+  CU_CTX_SCHED_AUTO = 0
+};
+
+#define CU_LAUNCH_PARAM_END ((void *) 0)
+#define CU_LAUNCH_PARAM_BUFFER_POINTER ((void *) 1)
+#define CU_LAUNCH_PARAM_BUFFER_SIZE ((void *) 2)
+
+enum {
+  CU_STREAM_DEFAULT = 0,
+  CU_STREAM_NON_BLOCKING = 1
+};
+
+#define cuCtxCreate cuCtxCreate_v2
+CUresult cuCtxCreate (CUcontext *, unsigned, CUdevice);
+#define cuCtxDestroy cuCtxDestroy_v2
+CUresult cuCtxDestroy (CUcontext);
+CUresult cuCtxGetCurrent (CUcontext *);
+CUresult cuCtxGetDevice (CUdevice *);
+#define cuCtxPopCurrent cuCtxPopCurrent_v2
+CUresult cuCtxPopCurrent (CUcontext *);
+#define cuCtxPushCurrent cuCtxPushCurrent_v2
+CUresult cuCtxPushCurrent (CUcontext);
+CUresult cuCtxSynchronize (void);
+CUresult cuDeviceGet (CUdevice *, int);
+CUresult cuDeviceGetAttribute (int *, CUdevice_attribute, CUdevice);
+CUresult cuDeviceGetCount (int *);
+CUresult cuEventCreate (CUevent *, unsigned);
+#define cuEventDestroy cuEventDestroy_v2
+CUresult cuEventDestroy (CUevent);
+CUresult cuEventElapsedTime (float *, CUevent, CUevent);
+CUresult cuEventQuery (CUevent);
+CUresult cuEventRecord (CUevent, CUstream);
+CUresult cuEventSynchronize (CUevent);
+CUresult cuFuncGetAttribute (int *, CUfunction_attribute, CUfunction);
+CUresult cuGetErrorString (CUresult, const char **);
+CUresult cuInit (unsigned);
+CUresult cuLaunchKernel (CUfunction, unsigned, unsigned, unsigned, unsigned,
+			 unsigned, unsigned, unsigned, CUstream, void **, void **);
+#define cuLinkAddData cuLinkAddData_v2
+CUresult cuLinkAddData (CUlinkState, CUjitInputType, void *, size_t, const char *,
+			unsigned, CUjit_option *, void **);
+CUresult cuLinkComplete (CUlinkState, void **, size_t *);
+#define cuLinkCreate cuLinkCreate_v2
+CUresult cuLinkCreate (unsigned, CUjit_option *, void **, CUlinkState *);
+CUresult cuLinkDestroy (CUlinkState);
+#define cuMemAlloc cuMemAlloc_v2
+CUresult cuMemAlloc (CUdeviceptr *, size_t);
+#define cuMemAllocHost cuMemAllocHost_v2
+CUresult cuMemAllocHost (void **, size_t);
+CUresult cuMemcpy (CUdeviceptr, CUdeviceptr, size_t);
+#define cuMemcpyDtoDAsync cuMemcpyDtoDAsync_v2
+CUresult cuMemcpyDtoDAsync (CUdeviceptr, CUdeviceptr, size_t, CUstream);
+#define cuMemcpyDtoH cuMemcpyDtoH_v2
+CUresult cuMemcpyDtoH (void *, CUdeviceptr, size_t);
+#define cuMemcpyDtoHAsync cuMemcpyDtoHAsync_v2
+CUresult cuMemcpyDtoHAsync (void *, CUdeviceptr, size_t, CUstream);
+#define cuMemcpyHtoD cuMemcpyHtoD_v2
+CUresult cuMemcpyHtoD (CUdeviceptr, const void *, size_t);
+#define cuMemcpyHtoDAsync cuMemcpyHtoDAsync_v2
+CUresult cuMemcpyHtoDAsync (CUdeviceptr, const void *, size_t, CUstream);
+#define cuMemFree cuMemFree_v2
+CUresult cuMemFree (CUdeviceptr);
+CUresult cuMemFreeHost (void *);
+#define cuMemGetAddressRange cuMemGetAddressRange_v2
+CUresult cuMemGetAddressRange (CUdeviceptr *, size_t *, CUdeviceptr);
+#define cuMemHostGetDevicePointer cuMemHostGetDevicePointer_v2
+CUresult cuMemHostGetDevicePointer (CUdeviceptr *, void *, unsigned);
+CUresult cuModuleGetFunction (CUfunction *, CUmodule, const char *);
+#define cuModuleGetGlobal cuModuleGetGlobal_v2
+CUresult cuModuleGetGlobal (CUdeviceptr *, size_t *, CUmodule, const char *);
+CUresult cuModuleLoad (CUmodule *, const char *);
+CUresult cuModuleLoadData (CUmodule *, const void *);
+CUresult cuModuleUnload (CUmodule);
+CUresult cuStreamCreate (CUstream *, unsigned);
+#define cuStreamDestroy cuStreamDestroy_v2
+CUresult cuStreamDestroy (CUstream);
+CUresult cuStreamQuery (CUstream);
+CUresult cuStreamSynchronize (CUstream);
+CUresult cuStreamWaitEvent (CUstream, CUevent, unsigned);
+
+#endif /* GCC_CUDA_H */
--- libgomp/config.h.in.jj	2017-01-13 12:07:55.000000000 +0100
+++ libgomp/config.h.in	2017-01-13 16:46:37.000000000 +0100
@@ -155,6 +155,10 @@
 /* Define to 1 if the NVIDIA plugin is built, 0 if not. */
 #undef PLUGIN_NVPTX
 
+/* Define to 1 if the NVIDIA plugin should dlopen libcuda.so.1, 0 if it should
+   be linked against it. */
+#undef PLUGIN_NVPTX_DYNAMIC
+
 /* Define if all infrastructure, needed for plugins, is supported. */
 #undef PLUGIN_SUPPORT
 
--- libgomp/configure.jj	2017-01-13 12:07:56.000000000 +0100
+++ libgomp/configure	2017-01-13 17:34:02.384782324 +0100
@@ -15135,7 +15135,7 @@ fi
 
 # Plugins for offload execution, configure.ac fragment.  -*- mode: autoconf -*-
 #
-# Copyright (C) 2014-2016 Free Software Foundation, Inc.
+# Copyright (C) 2014-2017 Free Software Foundation, Inc.
 #
 # Contributed by Mentor Embedded.
 #
@@ -15295,10 +15295,12 @@ if test "${with_cuda_driver_lib+set}" =
   withval=$with_cuda_driver_lib;
 fi
 
-if test "x$with_cuda_driver" != x; then
-  CUDA_DRIVER_INCLUDE=$with_cuda_driver/include
-  CUDA_DRIVER_LIB=$with_cuda_driver/lib
-fi
+case "x$with_cuda_driver" in
+  x | xno) ;;
+  *) CUDA_DRIVER_INCLUDE=$with_cuda_driver/include
+     CUDA_DRIVER_LIB=$with_cuda_driver/lib
+     ;;
+esac
 if test "x$with_cuda_driver_include" != x; then
   CUDA_DRIVER_INCLUDE=$with_cuda_driver_include
 fi
@@ -15316,6 +15318,7 @@ PLUGIN_NVPTX=0
 PLUGIN_NVPTX_CPPFLAGS=
 PLUGIN_NVPTX_LDFLAGS=
 PLUGIN_NVPTX_LIBS=
+PLUGIN_NVPTX_DYNAMIC=0
 
 
@@ -15422,9 +15425,17 @@ rm -f core conftest.err conftest.$ac_obj
 	LIBS=$PLUGIN_NVPTX_save_LIBS
 	case $PLUGIN_NVPTX in
 	  nvptx*)
-	    PLUGIN_NVPTX=0
-	    as_fn_error "CUDA driver package required for nvptx support" "$LINENO" 5
-	    ;;
+	    if test "x$CUDA_DRIVER_INCLUDE" = x \
+	       && test "x$CUDA_DRIVER_LIB" = x; then
+	      PLUGIN_NVPTX=1
+	      PLUGIN_NVPTX_CPPFLAGS='-I$(srcdir)/plugin/cuda'
+	      PLUGIN_NVPTX_LIBS='-ldl'
+	      PLUGIN_NVPTX_DYNAMIC=1
+	    else
+	      PLUGIN_NVPTX=0
+	      as_fn_error "CUDA driver package required for nvptx support" "$LINENO" 5
+	    fi
+	  ;;
 	esac
 	;;
       hsa*)
@@ -15509,6 +15520,11 @@ cat >>confdefs.h <<_ACEOF
 #define PLUGIN_NVPTX $PLUGIN_NVPTX
 _ACEOF
 
+
+cat >>confdefs.h <<_ACEOF
+#define PLUGIN_NVPTX_DYNAMIC $PLUGIN_NVPTX_DYNAMIC
+_ACEOF
+
  if test $PLUGIN_HSA = 1; then
   PLUGIN_HSA_TRUE=
   PLUGIN_HSA_FALSE='#'
--- libgomp/Makefile.in.jj	2017-01-13 12:07:55.000000000 +0100
+++ libgomp/Makefile.in	2017-01-13 16:46:53.769033874 +0100
@@ -17,7 +17,7 @@
 
 # Plugins for offload execution, Makefile.am fragment.
 #
-# Copyright (C) 2014-2016 Free Software Foundation, Inc.
+# Copyright (C) 2014-2017 Free Software Foundation, Inc.
 #
 # Contributed by Mentor Embedded.
 #

	Jakub
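
For readers unfamiliar with the PLUGIN_NVPTX_DYNAMIC mode configured above,
here is a minimal sketch of the dlopen approach. It is not the plugin's actual
code: the names cuda_lib and load_cuda_lib are hypothetical, only cuInit is
looked up, and its type is an int-returning stand-in so that the sketch does
not need cuda.h. Depending on the C library it may have to be linked with
-ldl, which is why configure sets PLUGIN_NVPTX_LIBS='-ldl' in this mode.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical table of driver entry points.  A real plugin would need one
   member per CUDA driver function it calls; this sketch resolves only one.  */
struct cuda_lib
{
  void *handle;
  /* Stand-in for "CUresult (*) (unsigned)"; an assumption made so the
     sketch compiles without cuda.h.  */
  int (*cu_init) (unsigned);
};

/* Try to open SONAME (for example "libcuda.so.1") and resolve cuInit.
   Returns 1 on success, 0 if the library or the symbol is unavailable.  */
int
load_cuda_lib (struct cuda_lib *lib, const char *soname)
{
  assert (lib != NULL && soname != NULL);
  lib->cu_init = NULL;
  lib->handle = dlopen (soname, RTLD_LAZY);
  if (lib->handle == NULL)
    return 0;

  /* Store through a void ** to avoid a direct object-to-function
     pointer cast.  */
  *(void **) &lib->cu_init = dlsym (lib->handle, "cuInit");
  if (lib->cu_init == NULL)
    {
      dlclose (lib->handle);
      lib->handle = NULL;
      return 0;
    }
  return 1;
}
```

With this shape a missing driver library becomes a run-time "device not
available" condition rather than a link-time failure of the plugin itself.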

--TRYliJ5NKNqkz5bu
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="nvptx-tools.patch"
Content-length: 28878

diff --git a/configure.ac b/configure.ac
index ecc02c4..7bf8a3f 100644
--- a/configure.ac
+++ b/configure.ac
@@ -51,6 +51,7 @@ LIBS="$LIBS -lcuda"
 AC_CHECK_FUNCS([[cuGetErrorName] [cuGetErrorString]])
 AC_CHECK_DECLS([[cuGetErrorName], [cuGetErrorString]],
   [], [], [[#include <cuda.h>]])
+AC_CHECK_HEADERS(unistd.h sys/stat.h)
 
 AC_MSG_CHECKING([for extra programs to build requiring -lcuda])
 NVPTX_RUN=
diff --git a/include/libiberty.h b/include/libiberty.h
index cacde80..29ceafe 100644
--- a/include/libiberty.h
+++ b/include/libiberty.h
@@ -390,6 +390,17 @@ extern void hex_init (void);
 /* Save files used for communication between processes.  */
 #define PEX_SAVE_TEMPS		0x4
 
+/* Max number of alloca bytes per call before we must switch to malloc.
+
+   ?? Swiped from gnulib's regex_internal.h header.  Is this actually
+   the case?  This number seems arbitrary, though sane.
+
+   The OS usually guarantees only one guard page at the bottom of the stack,
+   and a page size can be as small as 4096 bytes.  So we cannot safely
+   allocate anything larger than 4096 bytes.  Also care for the possibility
+   of a few compiler-allocated temporary stack slots.  */
+#define MAX_ALLOCA_SIZE	4032
+
 /* Prepare to execute one or more programs, with standard output of
    each program fed to standard input of the next.
    FLAGS	As above.
diff --git a/nvptx-as.c b/nvptx-as.c
index 53331af..1ad6699 100644
--- a/nvptx-as.c
+++ b/nvptx-as.c
@@ -30,6 +30,9 @@
 #include <string.h>
 #include <wait.h>
 #include <unistd.h>
+#ifdef HAVE_SYS_STAT_H
+#include <sys/stat.h>
+#endif
 #include <errno.h>
 #define obstack_chunk_alloc malloc
 #define obstack_chunk_free free
@@ -42,6 +45,38 @@
 
 #include "version.h"
 
+#ifndef R_OK
+#define R_OK 4
+#define W_OK 2
+#define X_OK 1
+#endif
+
+#ifndef DIR_SEPARATOR
+#  define DIR_SEPARATOR '/'
+#endif
+
+#if defined (_WIN32) || defined (__MSDOS__) \
+    || defined (__DJGPP__) || defined (__OS2__)
+#  define HAVE_DOS_BASED_FILE_SYSTEM
+#  define HAVE_HOST_EXECUTABLE_SUFFIX
+#  define HOST_EXECUTABLE_SUFFIX ".exe"
+#  ifndef DIR_SEPARATOR_2
+#    define DIR_SEPARATOR_2 '\\'
+#  endif
+#  define PATH_SEPARATOR ';'
+#else
+#  define PATH_SEPARATOR ':'
+#endif
+
+#ifndef DIR_SEPARATOR_2
+#  define IS_DIR_SEPARATOR(ch) ((ch) == DIR_SEPARATOR)
+#else
+#  define IS_DIR_SEPARATOR(ch) \
+	(((ch) == DIR_SEPARATOR) || ((ch) == DIR_SEPARATOR_2))
+#endif
+
+#define DIR_UP ".."
+
 static const char *outname = NULL;
 
 static void __attribute__ ((format (printf, 1, 2)))
@@ -816,7 +851,7 @@ traverse (void **slot, void *data)
 }
 
 static void
-process (FILE *in, FILE *out)
+process (FILE *in, FILE *out, int verify, const char *outname)
 {
   symbol_table = htab_create (500, hash_string_hash, hash_string_eq,
                               NULL);
@@ -824,6 +859,18 @@ process (FILE *in, FILE *out)
   const char *input = read_file (in);
   Token *tok = tokenize (input);
 
+  /* By default, when ptxas is not in PATH, do minimalistic verification,
+     just require that the first non-comment directive is .version.  */
+  if (verify < 0)
+    {
+      size_t i;
+      for (i = 0; tok[i].kind == K_comment; i++)
+	;
+      if (tok[i].kind != K_dotted || !is_keyword (&tok[i], "version"))
+	fatal_error ("missing .version directive at start of file '%s'",
+		     outname);
+    }
+
   do
     tok = parse_file (tok);
   while (tok->kind);
@@ -897,9 +944,83 @@ fork_execute (const char *prog, char *const *argv)
   do_wait (prog, pex);
 }
 
+/* Determine if progname is available in PATH.  */
+static bool
+program_available (const char *progname)
+{
+  char *temp = getenv ("PATH");
+  if (temp)
+    {
+      char *startp, *endp, *nstore, *alloc_ptr = NULL;
+      size_t prefixlen = strlen (temp) + 1;
+      size_t len;
+      if (prefixlen < 2)
+	prefixlen = 2;
+
+      len = prefixlen + strlen (progname) + 1;
+#ifdef HAVE_HOST_EXECUTABLE_SUFFIX
+      len += strlen (HOST_EXECUTABLE_SUFFIX);
+#endif
+      if (len < MAX_ALLOCA_SIZE)
+	nstore = (char *) alloca (len);
+      else
+	alloc_ptr = nstore = (char *) xmalloc (len);
+
+      startp = endp = temp;
+      while (1)
+	{
+	  if (*endp == PATH_SEPARATOR || *endp == 0)
+	    {
+	      if (endp == startp)
+		{
+		  nstore[0] = '.';
+		  nstore[1] = DIR_SEPARATOR;
+		  nstore[2] = '\0';
+		}
+	      else
+		{
+		  memcpy (nstore, startp, endp - startp);
+		  if (! IS_DIR_SEPARATOR (endp[-1]))
+		    {
+		      nstore[endp - startp] = DIR_SEPARATOR;
+		      nstore[endp - startp + 1] = 0;
+		    }
+		  else
+		    nstore[endp - startp] = 0;
+		}
+	      strcat (nstore, progname);
+	      if (! access (nstore, X_OK)
+#ifdef HAVE_HOST_EXECUTABLE_SUFFIX
+		  || ! access (strcat (nstore, HOST_EXECUTABLE_SUFFIX), X_OK)
+#endif
+		 )
+		{
+#if defined (HAVE_SYS_STAT_H) && defined (S_ISREG)
+		  struct stat st;
+		  if (stat (nstore, &st) >= 0 && S_ISREG (st.st_mode))
+#endif
+		    {
+		      free (alloc_ptr);
+		      return true;
+		    }
+		}
+
+	      if (*endp == 0)
+		break;
+	      endp = startp = endp + 1;
+	    }
+	  else
+	    endp++;
+	}
+      free (alloc_ptr);
+    }
+  return false;
+}
+
 static struct option long_options[] = {
   {"traditional-format",     no_argument, 0,  0 },
   {"save-temps",  no_argument,       0,  0 },
+  {"verify",  no_argument,       0,  0 },
   {"no-verify",  no_argument,       0,  0 },
   {"help", no_argument, 0, 'h' },
   {"version", no_argument, 0, 'V' },
@@ -912,7 +1033,7 @@ main (int argc, char **argv)
   FILE *in = stdin;
   FILE *out = stdout;
   bool verbose __attribute__((unused)) = false;
-  bool verify = true;
+  int verify = -1;
   const char *smver = "sm_30";
 
   int o;
@@ -923,7 +1044,9 @@ main (int argc, char **argv)
 	{
 	case 0:
 	  if (option_index == 2)
-	    verify = false;
+	    verify = 1;
+	  else if (option_index == 3)
+	    verify = 0;
 	  break;
 	case 'v':
 	  verbose = true;
@@ -948,7 +1071,8 @@ Usage: nvptx-none-as [option...] [asmfile]\n\
 Options:\n\
   -o FILE               Write output to FILE\n\
   -v                    Be verbose\n\
+  --verify              Do verify output is acceptable to ptxas\n\
   --no-verify           Do not verify output is acceptable to ptxas\n\
   --help                Print this help and exit\n\
   --version             Print version number and exit\n\
 \n\
@@ -983,11 +1107,17 @@ This program has absolutely no warranty.\n",
   if (!in)
     fatal_error ("cannot open input ptx file");
 
-  process (in, out);
-  if  (outname)
+  if (outname == NULL)
+    verify = 0;
+  else if (verify == -1)
+    if (program_available ("ptxas"))
+      verify = 1;
+
+  process (in, out, verify, outname);
+  if (outname)
     fclose (out);
 
-  if (verify && outname)
+  if (verify > 0)
     {
       struct obstack argv_obstack;
       obstack_init (&argv_obstack);
diff --git a/configure b/configure
index 9a0794a..4289569 100755
--- a/configure
+++ b/configure
@@ -168,7 +168,8 @@ test x\$exitcode = x0 || exit 1"
   as_suggested="  as_lineno_1=";as_suggested=$as_suggested$LINENO;as_suggested=$as_suggested" as_lineno_1a=\$LINENO
   as_lineno_2=";as_suggested=$as_suggested$LINENO;as_suggested=$as_suggested" as_lineno_2a=\$LINENO
   eval 'test \"x\$as_lineno_1'\$as_run'\" != \"x\$as_lineno_2'\$as_run'\" &&
-  test \"x\`expr \$as_lineno_1'\$as_run' + 1\`\" = \"x\$as_lineno_2'\$as_run'\"' || exit 1"
+  test \"x\`expr \$as_lineno_1'\$as_run' + 1\`\" = \"x\$as_lineno_2'\$as_run'\"' || exit 1
+test \$(( 1 + 1 )) = 2 || exit 1"
   if (eval "$as_required") 2>/dev/null; then :
   as_have_required=yes
 else
@@ -552,11 +553,50 @@ PACKAGE_URL=
 
 ac_unique_file="nvptx-tools"
 ac_unique_file="nvptx-as.c"
+# Factoring default headers for most tests.
+ac_includes_default="\
+#include <stdio.h>
+#ifdef HAVE_SYS_TYPES_H
+# include <sys/types.h>
+#endif
+#ifdef HAVE_SYS_STAT_H
+# include <sys/stat.h>
+#endif
+#ifdef STDC_HEADERS
+# include <stdlib.h>
+# include <stddef.h>
+#else
+# ifdef HAVE_STDLIB_H
+#  include <stdlib.h>
+# endif
+#endif
+#ifdef HAVE_STRING_H
+# if !defined STDC_HEADERS && defined HAVE_MEMORY_H
+#  include <memory.h>
+# endif
+# include <string.h>
+#endif
+#ifdef HAVE_STRINGS_H
+# include <strings.h>
+#endif
+#ifdef HAVE_INTTYPES_H
+# include <inttypes.h>
+#endif
+#ifdef HAVE_STDINT_H
+# include <stdint.h>
+#endif
+#ifdef HAVE_UNISTD_H
+# include <unistd.h>
+#endif"
+
 enable_option_checking=no
 ac_subst_vars='LTLIBOBJS
 LIBOBJS
 subdirs
 NVPTX_RUN
+EGREP
+GREP
+CPP
 CUDA_DRIVER_LDFLAGS
 CUDA_DRIVER_CPPFLAGS
 AR
@@ -635,7 +675,8 @@ LIBS
 CPPFLAGS
 CXX
 CXXFLAGS
-CCC'
+CCC
+CPP'
 ac_subdirs_all='libiberty'
 
 # Initialize some variables set by options.
@@ -1267,6 +1308,7 @@ Some influential environment variables:
               you have headers in a nonstandard directory <include dir>
   CXX         C++ compiler command
   CXXFLAGS    C++ compiler flags
+  CPP         C preprocessor
 
 Use these variables to override the choices made by `configure' or to help
 it to find libraries and programs with nonstandard names/locations.
@@ -1575,6 +1617,203 @@ $as_echo "$ac_res" >&6; }
   eval $as_lineno_stack; test "x$as_lineno_stack" = x && { as_lineno=; unset as_lineno;}
 
 } # ac_fn_c_check_decl
+
+# ac_fn_c_try_cpp LINENO
+# ----------------------
+# Try to preprocess conftest.$ac_ext, and return whether this succeeded.
+ac_fn_c_try_cpp ()
+{
+  as_lineno=${as_lineno-"$1"} as_lineno_stack=as_lineno_stack=$as_lineno_stack
+  if { { ac_try="$ac_cpp conftest.$ac_ext"
+case "(($ac_try" in
+  *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;;
+  *) ac_try_echo=$ac_try;;
+esac
+eval ac_try_echo="\"\$as_me:${as_lineno-$LINENO}: $ac_try_echo\""
+$as_echo "$ac_try_echo"; } >&5
+  (eval "$ac_cpp conftest.$ac_ext") 2>conftest.err
+  ac_status=$?
+  if test -s conftest.err; then
+    grep -v '^ *+' conftest.err >conftest.er1
+    cat conftest.er1 >&5
+    mv -f conftest.er1 conftest.err
+  fi
+  $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+  test $ac_status = 0; } >/dev/null && {
+	 test -z "$ac_c_preproc_warn_flag$ac_c_werror_flag" ||
+	 test ! -s conftest.err
+       }; then :
+  ac_retval=0
+else
+  $as_echo "$as_me: failed program was:" >&5
+sed 's/^/| /' conftest.$ac_ext >&5
+
+    ac_retval=1
+fi
+  eval $as_lineno_stack; test "x$as_lineno_stack" = x && { as_lineno=; unset as_lineno;}
+  return $ac_retval
+
+} # ac_fn_c_try_cpp
+
+# ac_fn_c_check_header_mongrel LINENO HEADER VAR INCLUDES
+# -------------------------------------------------------
+# Tests whether HEADER exists, giving a warning if it cannot be compiled using
+# the include files in INCLUDES and setting the cache variable VAR
+# accordingly.
+ac_fn_c_check_header_mongrel ()
+{
+  as_lineno=${as_lineno-"$1"} as_lineno_stack=as_lineno_stack=$as_lineno_stack
+  if { as_var=$3; eval "test \"\${$as_var+set}\" = set"; }; then :
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for $2" >&5
+$as_echo_n "checking for $2... " >&6; }
+if { as_var=$3; eval "test \"\${$as_var+set}\" = set"; }; then :
+  $as_echo_n "(cached) " >&6
+fi
+eval ac_res=\$$3
+	       { $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_res" >&5
+$as_echo "$ac_res" >&6; }
+else
+  # Is the header compilable?
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking $2 usability" >&5
+$as_echo_n "checking $2 usability... " >&6; }
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+$4
+#include <$2>
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+  ac_header_compiler=yes
+else
+  ac_header_compiler=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_header_compiler" >&5
+$as_echo "$ac_header_compiler" >&6; }
+
+# Is the header present?
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking $2 presence" >&5
+$as_echo_n "checking $2 presence... " >&6; }
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <$2>
+_ACEOF
+if ac_fn_c_try_cpp "$LINENO"; then :
+  ac_header_preproc=yes
+else
+  ac_header_preproc=no
+fi
+rm -f conftest.err conftest.$ac_ext
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_header_preproc" >&5
+$as_echo "$ac_header_preproc" >&6; }
+
+# So?  What about this header?
+case $ac_header_compiler:$ac_header_preproc:$ac_c_preproc_warn_flag in #((
+  yes:no: )
+    { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING: $2: accepted by the compiler, rejected by the preprocessor!" >&5
+$as_echo "$as_me: WARNING: $2: accepted by the compiler, rejected by the preprocessor!" >&2;}
+    { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING: $2: proceeding with the compiler's result" >&5
+$as_echo "$as_me: WARNING: $2: proceeding with the compiler's result" >&2;}
+    ;;
+  no:yes:* )
+    { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING: $2: present but cannot be compiled" >&5
+$as_echo "$as_me: WARNING: $2: present but cannot be compiled" >&2;}
+    { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING: $2:     check for missing prerequisite headers?" >&5
+$as_echo "$as_me: WARNING: $2:     check for missing prerequisite headers?" >&2;}
+    { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING: $2: see the Autoconf documentation" >&5
+$as_echo "$as_me: WARNING: $2: see the Autoconf documentation" >&2;}
+    { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING: $2:     section \"Present But Cannot Be Compiled\"" >&5
+$as_echo "$as_me: WARNING: $2:     section \"Present But Cannot Be Compiled\"" >&2;}
+    { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING: $2: proceeding with the compiler's result" >&5
+$as_echo "$as_me: WARNING: $2: proceeding with the compiler's result" >&2;}
+    ;;
+esac
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for $2" >&5
+$as_echo_n "checking for $2... " >&6; }
+if { as_var=$3; eval "test \"\${$as_var+set}\" = set"; }; then :
+  $as_echo_n "(cached) " >&6
+else
+  eval "$3=\$ac_header_compiler"
+fi
+eval ac_res=\$$3
+	       { $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_res" >&5
+$as_echo "$ac_res" >&6; }
+fi
+  eval $as_lineno_stack; test "x$as_lineno_stack" = x && { as_lineno=; unset as_lineno;}
+
+} # ac_fn_c_check_header_mongrel
+
+# ac_fn_c_try_run LINENO
+# ----------------------
+# Try to link conftest.$ac_ext, and return whether this succeeded. Assumes
+# that executables *can* be run.
+ac_fn_c_try_run ()
+{
+  as_lineno=${as_lineno-"$1"} as_lineno_stack=as_lineno_stack=$as_lineno_stack
+  if { { ac_try="$ac_link"
+case "(($ac_try" in
+  *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;;
+  *) ac_try_echo=$ac_try;;
+esac
+eval ac_try_echo="\"\$as_me:${as_lineno-$LINENO}: $ac_try_echo\""
+$as_echo "$ac_try_echo"; } >&5
+  (eval "$ac_link") 2>&5
+  ac_status=$?
+  $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+  test $ac_status = 0; } && { ac_try='./conftest$ac_exeext'
+  { { case "(($ac_try" in
+  *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;;
+  *) ac_try_echo=$ac_try;;
+esac
+eval ac_try_echo="\"\$as_me:${as_lineno-$LINENO}: $ac_try_echo\""
+$as_echo "$ac_try_echo"; } >&5
+  (eval "$ac_try") 2>&5
+  ac_status=$?
+  $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+  test $ac_status = 0; }; }; then :
+  ac_retval=0
+else
+  $as_echo "$as_me: program exited with status $ac_status" >&5
+       $as_echo "$as_me: failed program was:" >&5
+sed 's/^/| /' conftest.$ac_ext >&5
+
+       ac_retval=$ac_status
+fi
+  rm -rf conftest.dSYM conftest_ipa8_conftest.oo
+  eval $as_lineno_stack; test "x$as_lineno_stack" = x && { as_lineno=; unset as_lineno;}
+  return $ac_retval
+
+} # ac_fn_c_try_run
+
+# ac_fn_c_check_header_compile LINENO HEADER VAR INCLUDES
+# -------------------------------------------------------
+# Tests whether HEADER exists and can be compiled using the include files in
+# INCLUDES, setting the cache variable VAR accordingly.
+ac_fn_c_check_header_compile ()
+{
+  as_lineno=${as_lineno-"$1"} as_lineno_stack=as_lineno_stack=$as_lineno_stack
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for $2" >&5
+$as_echo_n "checking for $2... " >&6; }
+if { as_var=$3; eval "test \"\${$as_var+set}\" = set"; }; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+$4
+#include <$2>
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+  eval "$3=yes"
+else
+  eval "$3=no"
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+fi
+eval ac_res=\$$3
+	       { $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_res" >&5
+$as_echo "$ac_res" >&6; }
+  eval $as_lineno_stack; test "x$as_lineno_stack" = x && { as_lineno=; unset as_lineno;}
+
+} # ac_fn_c_check_header_compile
 cat >config.log <<_ACEOF
 This file contains any messages produced by compilers while
 running configure, to aid debugging if configure makes a mistake.
@@ -3284,6 +3523,418 @@ cat >>confdefs.h <<_ACEOF
 #define HAVE_DECL_CUGETERRORSTRING $ac_have_decl
 _ACEOF
 
+ac_ext=c
+ac_cpp='$CPP $CPPFLAGS'
+ac_compile='$CC -c $CFLAGS $CPPFLAGS conftest.$ac_ext >&5'
+ac_link='$CC -o conftest$ac_exeext $CFLAGS $CPPFLAGS $LDFLAGS conftest.$ac_ext $LIBS >&5'
+ac_compiler_gnu=$ac_cv_c_compiler_gnu
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking how to run the C preprocessor" >&5
+$as_echo_n "checking how to run the C preprocessor... " >&6; }
+# On Suns, sometimes $CPP names a directory.
+if test -n "$CPP" && test -d "$CPP"; then
+  CPP=
+fi
+if test -z "$CPP"; then
+  if test "${ac_cv_prog_CPP+set}" = set; then :
+  $as_echo_n "(cached) " >&6
+else
+      # Double quotes because CPP needs to be expanded
+    for CPP in "$CC -E" "$CC -E -traditional-cpp" "/lib/cpp"
+    do
+      ac_preproc_ok=false
+for ac_c_preproc_warn_flag in '' yes
+do
+  # Use a header file that comes with gcc, so configuring glibc
+  # with a fresh cross-compiler works.
+  # Prefer <limits.h> to <assert.h> if __STDC__ is defined, since
+  # <limits.h> exists even on freestanding compilers.
+  # On the NeXT, cc -E runs the code through the compiler's parser,
+  # not just through cpp. "Syntax error" is here to catch this case.
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#ifdef __STDC__
+# include <limits.h>
+#else
+# include <assert.h>
+#endif
+		     Syntax error
+_ACEOF
+if ac_fn_c_try_cpp "$LINENO"; then :
+
+else
+  # Broken: fails on valid input.
+continue
+fi
+rm -f conftest.err conftest.$ac_ext
+
+  # OK, works on sane cases.  Now check whether nonexistent headers
+  # can be detected and how.
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <ac_nonexistent.h>
+_ACEOF
+if ac_fn_c_try_cpp "$LINENO"; then :
+  # Broken: success on invalid input.
+continue
+else
+  # Passes both tests.
+ac_preproc_ok=:
+break
+fi
+rm -f conftest.err conftest.$ac_ext
+
+done
+# Because of `break', _AC_PREPROC_IFELSE's cleaning code was skipped.
+rm -f conftest.err conftest.$ac_ext
+if $ac_preproc_ok; then :
+  break
+fi
+
+    done
+    ac_cv_prog_CPP=$CPP
+
+fi
+  CPP=$ac_cv_prog_CPP
+else
+  ac_cv_prog_CPP=$CPP
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $CPP" >&5
+$as_echo "$CPP" >&6; }
+ac_preproc_ok=false
+for ac_c_preproc_warn_flag in '' yes
+do
+  # Use a header file that comes with gcc, so configuring glibc
+  # with a fresh cross-compiler works.
+  # Prefer <limits.h> to <assert.h> if __STDC__ is defined, since
+  # <limits.h> exists even on freestanding compilers.
+  # On the NeXT, cc -E runs the code through the compiler's parser,
+  # not just through cpp. "Syntax error" is here to catch this case.
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#ifdef __STDC__
+# include <limits.h>
+#else
+# include <assert.h>
+#endif
+		     Syntax error
+_ACEOF
+if ac_fn_c_try_cpp "$LINENO"; then :
+
+else
+  # Broken: fails on valid input.
+continue
+fi
+rm -f conftest.err conftest.$ac_ext
+
+  # OK, works on sane cases.  Now check whether nonexistent headers
+  # can be detected and how.
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <ac_nonexistent.h>
+_ACEOF
+if ac_fn_c_try_cpp "$LINENO"; then :
+  # Broken: success on invalid input.
+continue
+else
+  # Passes both tests.
+ac_preproc_ok=:
+break
+fi
+rm -f conftest.err conftest.$ac_ext
+
+done
+# Because of `break', _AC_PREPROC_IFELSE's cleaning code was skipped.
+rm -f conftest.err conftest.$ac_ext
+if $ac_preproc_ok; then :
+
+else
+  { { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5
+$as_echo "$as_me: error: in \`$ac_pwd':" >&2;}
+as_fn_error "C preprocessor \"$CPP\" fails sanity check
+See \`config.log' for more details." "$LINENO" 5; }
+fi
+
+ac_ext=c
+ac_cpp='$CPP $CPPFLAGS'
+ac_compile='$CC -c $CFLAGS $CPPFLAGS conftest.$ac_ext >&5'
+ac_link='$CC -o conftest$ac_exeext $CFLAGS $CPPFLAGS $LDFLAGS conftest.$ac_ext $LIBS >&5'
+ac_compiler_gnu=$ac_cv_c_compiler_gnu
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for grep that handles long lines and -e" >&5
+$as_echo_n "checking for grep that handles long lines and -e... " >&6; }
+if test "${ac_cv_path_GREP+set}" = set; then :
+  $as_echo_n "(cached) " >&6
+else
+  if test -z "$GREP"; then
+  ac_path_GREP_found=false
+  # Loop through the user's path and test for each of PROGNAME-LIST
+  as_save_IFS=$IFS; IFS=$PATH_SEPARATOR
+for as_dir in $PATH$PATH_SEPARATOR/usr/xpg4/bin
+do
+  IFS=$as_save_IFS
+  test -z "$as_dir" && as_dir=.
+    for ac_prog in grep ggrep; do
+    for ac_exec_ext in '' $ac_executable_extensions; do
+      ac_path_GREP="$as_dir/$ac_prog$ac_exec_ext"
+      { test -f "$ac_path_GREP" && $as_test_x "$ac_path_GREP"; } || continue
+# Check for GNU ac_path_GREP and select it if it is found.
+  # Check for GNU $ac_path_GREP
+case `"$ac_path_GREP" --version 2>&1` in
+*GNU*)
+  ac_cv_path_GREP="$ac_path_GREP" ac_path_GREP_found=:;;
+*)
+  ac_count=0
+  $as_echo_n 0123456789 >"conftest.in"
+  while :
+  do
+    cat "conftest.in" "conftest.in" >"conftest.tmp"
+    mv "conftest.tmp" "conftest.in"
+    cp "conftest.in" "conftest.nl"
+    $as_echo 'GREP' >> "conftest.nl"
+    "$ac_path_GREP" -e 'GREP$' -e '-(cannot match)-' < "conftest.nl" >"conftest.out" 2>/dev/null || break
+    diff "conftest.out" "conftest.nl" >/dev/null 2>&1 || break
+    as_fn_arith $ac_count + 1 && ac_count=$as_val
+    if test $ac_count -gt ${ac_path_GREP_max-0}; then
+      # Best one so far, save it but keep looking for a better one
+      ac_cv_path_GREP="$ac_path_GREP"
+      ac_path_GREP_max=$ac_count
+    fi
+    # 10*(2^10) chars as input seems more than enough
+    test $ac_count -gt 10 && break
+  done
+  rm -f conftest.in conftest.tmp conftest.nl conftest.out;;
+esac
+
+      $ac_path_GREP_found && break 3
+    done
+  done
+  done
+IFS=$as_save_IFS
+  if test -z "$ac_cv_path_GREP"; then
+    as_fn_error "no acceptable grep could be found in $PATH$PATH_SEPARATOR/usr/xpg4/bin" "$LINENO" 5
+  fi
+else
+  ac_cv_path_GREP=$GREP
+fi
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_path_GREP" >&5
+$as_echo "$ac_cv_path_GREP" >&6; }
+ GREP="$ac_cv_path_GREP"
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for egrep" >&5
+$as_echo_n "checking for egrep... " >&6; }
+if test "${ac_cv_path_EGREP+set}" = set; then :
+  $as_echo_n "(cached) " >&6
+else
+  if echo a | $GREP -E '(a|b)' >/dev/null 2>&1
+   then ac_cv_path_EGREP="$GREP -E"
+   else
+     if test -z "$EGREP"; then
+  ac_path_EGREP_found=false
+  # Loop through the user's path and test for each of PROGNAME-LIST
+  as_save_IFS=$IFS; IFS=$PATH_SEPARATOR
+for as_dir in $PATH$PATH_SEPARATOR/usr/xpg4/bin
+do
+  IFS=$as_save_IFS
+  test -z "$as_dir" && as_dir=.
+    for ac_prog in egrep; do
+    for ac_exec_ext in '' $ac_executable_extensions; do
+      ac_path_EGREP="$as_dir/$ac_prog$ac_exec_ext"
+      { test -f "$ac_path_EGREP" && $as_test_x "$ac_path_EGREP"; } || continue
+# Check for GNU ac_path_EGREP and select it if it is found.
+  # Check for GNU $ac_path_EGREP
+case `"$ac_path_EGREP" --version 2>&1` in
+*GNU*)
+  ac_cv_path_EGREP="$ac_path_EGREP" ac_path_EGREP_found=:;;
+*)
+  ac_count=0
+  $as_echo_n 0123456789 >"conftest.in"
+  while :
+  do
+    cat "conftest.in" "conftest.in" >"conftest.tmp"
+    mv "conftest.tmp" "conftest.in"
+    cp "conftest.in" "conftest.nl"
+    $as_echo 'EGREP' >> "conftest.nl"
+    "$ac_path_EGREP" 'EGREP$' < "conftest.nl" >"conftest.out" 2>/dev/null || break
+    diff "conftest.out" "conftest.nl" >/dev/null 2>&1 || break
+    as_fn_arith $ac_count + 1 && ac_count=$as_val
+    if test $ac_count -gt ${ac_path_EGREP_max-0}; then
+      # Best one so far, save it but keep looking for a better one
+      ac_cv_path_EGREP="$ac_path_EGREP"
+      ac_path_EGREP_max=$ac_count
+    fi
+    # 10*(2^10) chars as input seems more than enough
+    test $ac_count -gt 10 && break
+  done
+  rm -f conftest.in conftest.tmp conftest.nl conftest.out;;
+esac
+
+      $ac_path_EGREP_found && break 3
+    done
+  done
+  done
+IFS=$as_save_IFS
+  if test -z "$ac_cv_path_EGREP"; then
+    as_fn_error "no acceptable egrep could be found in $PATH$PATH_SEPARATOR/usr/xpg4/bin" "$LINENO" 5
+  fi
+else
+  ac_cv_path_EGREP=$EGREP
+fi
+
+   fi
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_path_EGREP" >&5
+$as_echo "$ac_cv_path_EGREP" >&6; }
+ EGREP="$ac_cv_path_EGREP"
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for ANSI C header files" >&5
+$as_echo_n "checking for ANSI C header files... " >&6; }
+if test "${ac_cv_header_stdc+set}" = set; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <stdlib.h>
+#include <stdarg.h>
+#include <string.h>
+#include <float.h>
+
+int
+main ()
+{
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+  ac_cv_header_stdc=yes
+else
+  ac_cv_header_stdc=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+
+if test $ac_cv_header_stdc = yes; then
+  # SunOS 4.x string.h does not declare mem*, contrary to ANSI.
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <string.h>
+
+_ACEOF
+if (eval "$ac_cpp conftest.$ac_ext") 2>&5 |
+  $EGREP "memchr" >/dev/null 2>&1; then :
+
+else
+  ac_cv_header_stdc=no
+fi
+rm -f conftest*
+
+fi
+
+if test $ac_cv_header_stdc = yes; then
+  # ISC 2.0.2 stdlib.h does not declare free, contrary to ANSI.
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <stdlib.h>
+
+_ACEOF
+if (eval "$ac_cpp conftest.$ac_ext") 2>&5 |
+  $EGREP "free" >/dev/null 2>&1; then :
+
+else
+  ac_cv_header_stdc=no
+fi
+rm -f conftest*
+
+fi
+
+if test $ac_cv_header_stdc = yes; then
+  # /bin/cc in Irix-4.0.5 gets non-ANSI ctype macros unless using -ansi.
+  if test "$cross_compiling" = yes; then :
+  :
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <ctype.h>
+#include <stdlib.h>
+#if ((' ' & 0x0FF) == 0x020)
+# define ISLOWER(c) ('a' <= (c) && (c) <= 'z')
+# define TOUPPER(c) (ISLOWER(c) ? 'A' + ((c) - 'a') : (c))
+#else
+# define ISLOWER(c) \
+		   (('a' <= (c) && (c) <= 'i') \
+		     || ('j' <= (c) && (c) <= 'r') \
+		     || ('s' <= (c) && (c) <= 'z'))
+# define TOUPPER(c) (ISLOWER(c) ? ((c) | 0x40) : (c))
+#endif
+
+#define XOR(e, f) (((e) && !(f)) || (!(e) && (f)))
+int
+main ()
+{
+  int i;
+  for (i = 0; i < 256; i++)
+    if (XOR (islower (i), ISLOWER (i))
+	|| toupper (i) != TOUPPER (i))
+      return 2;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_run "$LINENO"; then :
+
+else
+  ac_cv_header_stdc=no
+fi
+rm -f core *.core core.conftest.* gmon.out bb.out conftest$ac_exeext \
+  conftest.$ac_objext conftest.beam conftest.$ac_ext
+fi
+
+fi
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_header_stdc" >&5
+$as_echo "$ac_cv_header_stdc" >&6; }
+if test $ac_cv_header_stdc = yes; then
+
+$as_echo "#define STDC_HEADERS 1" >>confdefs.h
+
+fi
+
+# On IRIX 5.3, sys/types and inttypes.h are conflicting.
+for ac_header in sys/types.h sys/stat.h stdlib.h string.h memory.h strings.h \
+		  inttypes.h stdint.h unistd.h
+do :
+  as_ac_Header=`$as_echo "ac_cv_header_$ac_header" | $as_tr_sh`
+ac_fn_c_check_header_compile "$LINENO" "$ac_header" "$as_ac_Header" "$ac_includes_default
+"
+eval as_val=\$$as_ac_Header
+   if test "x$as_val" = x""yes; then :
+  cat >>confdefs.h <<_ACEOF
+#define `$as_echo "HAVE_$ac_header" | $as_tr_cpp` 1
+_ACEOF
+
+fi
+
+done
+
+
+for ac_header in unistd.h sys/stat.h
+do :
+  as_ac_Header=`$as_echo "ac_cv_header_$ac_header" | $as_tr_sh`
+ac_fn_c_check_header_mongrel "$LINENO" "$ac_header" "$as_ac_Header" "$ac_includes_default"
+eval as_val=\$$as_ac_Header
+   if test "x$as_val" = x""yes; then :
+  cat >>confdefs.h <<_ACEOF
+#define `$as_echo "HAVE_$ac_header" | $as_tr_cpp` 1
+_ACEOF
+
+fi
+
+done
+
 
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for extra programs to build requiring -lcuda" >&5
 $as_echo_n "checking for extra programs to build requiring -lcuda... " >&6; }

--TRYliJ5NKNqkz5bu--
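
The nvptx-as.c change above turns verify into a tri-state: -1 (neither option
given), 1 (--verify) and 0 (--no-verify). The following is a minimal sketch of
how that value is resolved, together with a simplified PATH lookup. The
function names in_path_list and resolve_verify are hypothetical, the lookup
assumes ':' as the separator and a fixed-size buffer, and it ignores the
executable-suffix handling done by program_available in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: is PROG an executable regular file in one of the
   ':'-separated directories of PATH_LIST?  An empty element stands for the
   current directory.  Elements too long for the buffer are skipped.  */
bool
in_path_list (const char *path_list, const char *prog)
{
  const char *start = path_list;
  for (;;)
    {
      const char *end = strchr (start, ':');
      size_t dirlen = end ? (size_t) (end - start) : strlen (start);
      char buf[4096];
      struct stat st;
      int n;

      if (dirlen == 0)
	n = snprintf (buf, sizeof buf, "./%s", prog);
      else
	n = snprintf (buf, sizeof buf, "%.*s/%s", (int) dirlen, start, prog);

      if (n > 0 && (size_t) n < sizeof buf
	  && access (buf, X_OK) == 0
	  && stat (buf, &st) == 0 && S_ISREG (st.st_mode))
	return true;

      if (end == NULL)
	return false;
      start = end + 1;
    }
}

/* Resolve the tri-state setting.  REQUESTED is -1, 0 or 1 as described
   above.  The result is 1 (run ptxas), 0 (no verification) or -1 (only the
   minimal ".version comes first" check).  */
int
resolve_verify (int requested, bool have_outname, bool ptxas_found)
{
  assert (requested >= -1 && requested <= 1);
  if (!have_outname)
    return 0;		/* Output goes to stdout; nothing to verify.  */
  if (requested == -1 && ptxas_found)
    return 1;		/* Default: verify when ptxas is available.  */
  return requested;	/* May remain -1 when ptxas is missing.  */
}
```

A caller would combine the two as
resolve_verify (verify, outname != NULL, in_path_list (getenv ("PATH"), "ptxas"))
after checking that getenv did not return NULL.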