The attached patch teaches libgomp how to use the CUDA thread occupancy calculator built into the CUDA driver. Although both are based on the CUDA thread occupancy spreadsheet distributed with CUDA, the built-in occupancy calculator differs from og8's occupancy calculator in two key ways.

First, og8 launches twice as many gangs as the driver's thread occupancy calculator recommends. That was my attempt at preventing threads from idling; it operates on the same principle as running 'make -jN' with N set to twice the number of CPU threads. Second, whereas og8 always attempts to maximize the CUDA block size, the driver may select a smaller block size, which effectively decreases num_workers.

In terms of performance, there isn't much difference between the CUDA driver's occupancy calculator and og8's. On the tests that are affected, the two are generally within a factor of two of one another, with some tests running faster with the driver's occupancy calculator and others with og8's.

Unfortunately, the driver occupancy API isn't universally available; it only exists in CUDA 6.5 (driver version 6050) and newer. This patch exploits the fact that init_cuda_lib only checks for errors on the last library function it initializes, and guards each use of cuOccupancyMaxPotentialBlockSizeWithFlags with a check of driver_version. If the driver occupancy calculator isn't available, the plugin falls back to the existing defaults. Maybe the og8 thread occupancy calculator would make a better default for older versions of CUDA, but that's a patch for another day.

Is this patch OK for trunk? I bootstrapped and regression-tested it on x86_64 with nvptx offloading.

Thanks,
Cesar
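
For reference, here is a minimal sketch of the guarded call pattern described above. cuDriverGetVersion and cuOccupancyMaxPotentialBlockSizeWithFlags are the real driver API entry points; everything else (the choose_launch_dims wrapper, the fallback constants, and calling libcuda directly rather than through the plugin's dlopened function table) is a hypothetical simplification, not the actual libgomp code:

#include <cuda.h>

/* Hypothetical helper: pick launch dimensions for FUNCTION, using the
   driver's occupancy calculator when the driver is new enough and
   falling back to placeholder defaults otherwise.  */

static void
choose_launch_dims (CUfunction function, int *num_gangs, int *num_workers)
{
  int driver_version = 0;

  cuDriverGetVersion (&driver_version);

  /* cuOccupancyMaxPotentialBlockSizeWithFlags was added in CUDA 6.5,
     i.e. driver version 6050.  */
  if (driver_version >= 6050)
    {
      int min_grid_size, block_size;

      /* NULL: no per-block-size dynamic shared memory callback;
	 0: no dynamic shared memory; 0: no block size limit.  */
      if (cuOccupancyMaxPotentialBlockSizeWithFlags (&min_grid_size,
						     &block_size, function,
						     NULL, 0, 0,
						     CU_OCCUPANCY_DEFAULT)
	  == CUDA_SUCCESS)
	{
	  *num_gangs = min_grid_size;
	  /* A real plugin would further split the block size between
	     workers and vector lanes; this sketch glosses over that.  */
	  *num_workers = block_size;
	  return;
	}
    }

  /* Driver too old (or the call failed): fall back to the existing
     defaults.  These constants are placeholders, not the plugin's
     real values.  */
  *num_gangs = 32;
  *num_workers = 32;
}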