On 07/02/2018 07:14 AM, Tom de Vries wrote:
> On 06/21/2018 03:58 PM, Cesar Philippidis wrote:
>> On 06/20/2018 03:15 PM, Tom de Vries wrote:
>>> On 06/20/2018 11:59 PM, Cesar Philippidis wrote:
>>>> Now it follows the formula contained in
>>>> the "CUDA Occupancy Calculator" spreadsheet that's distributed with
>>>> CUDA.
>>>
>>> Any reason we're not using the cuda runtime functions to get the
>>> occupancy (see PR85590 - [nvptx, libgomp, openacc] Use cuda runtime fns
>>> to determine launch configuration in nvptx ) ?
>>
>> There are two reasons:
>>
>> 1) cuda_occupancy.h depends on the CUDA runtime to extract the device
>>    properties instead of the CUDA driver API. However, we can always
>>    teach libgomp how to populate the cudaDeviceProp struct using the
>>    driver API.
>>
>> 2) CUDA is not always present on the build host, and that's why
>>    libgomp maintains its own cuda.h. So at the very least, this
>>    functionality would be good to have in libgomp as a fallback
>>    implementation;
>
> Libgomp maintains its own cuda.h to "allow building GCC with PTX
> offloading even without CUDA being installed" (
> https://gcc.gnu.org/ml/gcc-patches/2017-01/msg00980.html ).
>
> The libgomp nvptx plugin however uses the cuda driver API to launch
> kernels etc, so we can assume that's always available at launch time.
> And according to the "CUDA Pro Tip: Occupancy API Simplifies Launch
> Configuration", the occupancy API is also available in the driver API.
>
> What we cannot assume to be available is the occupancy API pre cuda-6.5.
> So it's fine to have a fallback for that (properly isolated in utility
> functions), but for cuda 6.5 and up we want to use the occupancy API.

Here's revision 2 of the patch. I replaced all of my thread occupancy
heuristics with calls to the CUDA driver as you suggested.
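For reference, the driver-side occupancy API discussed above can be used
along these lines (a sketch only, not the patch itself; the function name
`suggest_launch_geometry' and the omitted error handling are mine):

```c
#include <cuda.h>

/* Sketch: ask the driver for a launch geometry that fully occupies the
   device (cuOccupancyMaxPotentialBlockSize is available since CUDA 6.5).
   `function' is a CUfunction obtained via cuModuleGetFunction; a real
   caller would check the returned CUresult.  */
static void
suggest_launch_geometry (CUfunction function, int *min_grid, int *block_size)
{
  /* No dynamic shared memory, no upper bound on the block size.  */
  cuOccupancyMaxPotentialBlockSize (min_grid, block_size, function,
				    NULL, 0, 0);
}
```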
The performance is worse than with my heuristics, but that's to be
expected because the CUDA driver only guarantees the minimal launch
geometry to fully utilize the hardware, not the optimal value. I'll
reintroduce my heuristics later in a follow-up patch.

The major advantage of the CUDA thread occupancy calculator is that it
allows the runtime to select a sensible default num_workers, avoiding
those annoying runtime failures due to insufficient GPU hardware
resources.

One thing that may stick out in this patch is how it probes for the
driver version instead of the API version. It turns out that the API
version corresponds to the SM version declared in the PTX sources,
whereas the driver version corresponds to the latest version of CUDA
supported by the driver. At least that's the case with driver version
396.24.

>> it's not good to have a program fail due to
>> insufficient hardware resources errors when it is avoidable.
>>
>
> Right, in fact there are two separate things you're trying to address
> here: launch failure and occupancy heuristic, so split the patch.

That hunk was small, so I included it with this patch. Although if you
insist, I can remove it.

Is this patch OK for trunk? I tested it on x86_64 with nvptx offloading.

Cesar