The attached patch teaches libgomp how to use the CUDA thread occupancy calculator built into the CUDA driver. Although both are based on the CUDA thread occupancy spreadsheet distributed with CUDA, the built-in occupancy calculator differs from og8's occupancy calculator in two key ways.

First, og8 launches twice as many gangs as the driver's thread occupancy calculator recommends. That was my attempt at preventing threads from idling; it operates on the same principle as running 'make -jN' with N set to twice the number of CPU threads. Second, whereas og8 always attempts to maximize the CUDA block size, the driver may select a smaller block size, which effectively decreases num_workers.

In terms of performance, there isn't much difference between the CUDA driver's occupancy calculator and og8's. On the tests that are affected, the two are generally within a factor of two of one another, with some tests running faster with the driver's occupancy calculator and others with og8's.

Unfortunately, the driver occupancy API isn't universally available; it only exists in CUDA 6.5 (driver version 6050) and newer. This patch exploits the fact that init_cuda_lib only checks for errors on the last library function it initializes, and guards each use of cuOccupancyMaxPotentialBlockSizeWithFlags with a check of driver_version. If the driver occupancy calculator isn't available, the plugin falls back to the existing defaults. Maybe the og8 thread occupancy calculator would make a better default for older versions of CUDA, but that's a patch for another day.

Is this patch OK for trunk? I bootstrapped and regression-tested it on x86_64 with nvptx offloading.

Thanks,
Cesar
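
For reference, here is a minimal sketch of the guarded call pattern described above. cuDriverGetVersion and cuOccupancyMaxPotentialBlockSizeWithFlags are the real driver API entry points; everything else (the choose_launch_dims wrapper, the fallback constants, and calling libcuda directly rather than through the plugin's dlopened function table) is a hypothetical simplification, not the actual libgomp code:

#include <cuda.h>

/* Hypothetical helper: pick launch dimensions for FUNCTION, using the
   driver's occupancy calculator when the driver is new enough and
   falling back to placeholder defaults otherwise.  */

static void
choose_launch_dims (CUfunction function, int *num_gangs, int *num_workers)
{
  int driver_version = 0;

  cuDriverGetVersion (&driver_version);

  /* cuOccupancyMaxPotentialBlockSizeWithFlags was added in CUDA 6.5,
     i.e. driver version 6050.  */
  if (driver_version >= 6050)
    {
      int min_grid_size, block_size;

      /* NULL: no per-block-size dynamic shared memory callback;
	 0: no dynamic shared memory; 0: no block size limit.  */
      if (cuOccupancyMaxPotentialBlockSizeWithFlags (&min_grid_size,
						     &block_size, function,
						     NULL, 0, 0,
						     CU_OCCUPANCY_DEFAULT)
	  == CUDA_SUCCESS)
	{
	  *num_gangs = min_grid_size;
	  /* A real plugin would further split the block size between
	     workers and vector lanes; this sketch glosses over that.  */
	  *num_workers = block_size;
	  return;
	}
    }

  /* Driver too old (or the call failed): fall back to the existing
     defaults.  These constants are placeholders, not the plugin's
     real values.  */
  *num_gangs = 32;
  *num_workers = 32;
}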