Subject: Tips for improving OpenMP offload performance on NVIDIA GPU
Date: 2019-11-25 15:53 UTC
From: Eric Raut
To: gcc-help

Hello,

I am having trouble getting good performance with GCC’s OpenMP
offloading to NVPTX targets. My test case is a Jacobi iteration, an
example commonly used to demonstrate OpenMP offloading.

Computer:
* CPU: 2 x Intel Xeon Gold 5115 (20 cores, 40 threads total)
* System memory: 192 GB
* GPU: NVIDIA Geforce RTX 2080 (CUDA compute capability 7.5)
* OS: Fedora 30 (Linux kernel 5.3.11)

GCC version (trunk):
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/home/eraut/software/gcc-trunk/gcc_nvptx_offload/libexec/gcc/x86_64-pc-linux-gnu/10.0.0/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
Target: x86_64-pc-linux-gnu
Configured with: ../gcc_source/configure
--prefix=/home/eraut/software/gcc-trunk/gcc_nvptx_offload
--disable-multilib --disable-bootstrap --enable-checking=release
--enable-languages=c,c++,fortran,lto
--enable-offload-targets=nvptx-none
--with-cuda-driver-lib=/opt/cuda/10.1.168/lib64
--with-cuda-driver-include=/opt/cuda/10.1.168/include
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 10.0.0 20191117 (experimental) (GCC)
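
For reference, the code is compiled roughly like this ("jacobi.c" and
the output name are illustrative; the relevant parts are -O3, -fopenmp,
and offloading to nvptx-none):

    gcc -O3 -fopenmp -foffload=nvptx-none jacobi.c -o jacobi -lm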

The code looks like this:

#pragma omp target data map(to:A[0:N]) map(alloc:Anew[0:N])
{
    const int iter_max = 10000;
    int iter = 0;
    while ( error > tol && iter < iter_max ) {
        error = 0.0;
        #pragma omp target map(error)
        #pragma omp teams distribute parallel for simd reduction(max:error) collapse(2)
        for (int i = 1; i < n-1; i++) {
            for (int j = 1; j < n-1; j++) {
                Anew[i*n+j] = 0.25 * ( A[i*n+j+1] + A[i*n+j-1] +
                                      A[(i-1)*n+j] + A[(i+1)*n+j] );
                error = fmax(error, fabs(Anew[i*n+j] - A[i*n+j]));
            }
        }
        // Update A
        #pragma omp target teams distribute parallel for simd collapse(2)
        for (int i = 1; i < n-1; i++) {
            for (int j = 1; j < n-1; j++) {
                A[i*n+j] = Anew[i*n+j];
            }
        }
        ++iter;
        if (iter % 100 == 0)
            printf("Iteration %d, error = %g\n", iter, error);
    }
}

With a matrix size of 1000x1000 and the code compiled with "-O3", this
takes about 5 seconds with Clang but about 24 seconds with GCC. For
reference, it takes only about 0.78 seconds with PGI+OpenACC using the
corresponding OpenACC directives (sketched below).
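
For concreteness, the OpenACC version uses directives roughly like
these (a sketch based on the loops above, not the verbatim OpenACC
source):

#pragma acc data copyin(A[0:N]) create(Anew[0:N])
{
    while ( error > tol && iter < iter_max ) {
        error = 0.0;
        #pragma acc parallel loop collapse(2) reduction(max:error)
        for (int i = 1; i < n-1; i++) {
            for (int j = 1; j < n-1; j++) {
                Anew[i*n+j] = 0.25 * ( A[i*n+j+1] + A[i*n+j-1] +
                                       A[(i-1)*n+j] + A[(i+1)*n+j] );
                error = fmax(error, fabs(Anew[i*n+j] - A[i*n+j]));
            }
        }
        // Update A
        #pragma acc parallel loop collapse(2)
        for (int i = 1; i < n-1; i++) {
            for (int j = 1; j < n-1; j++) {
                A[i*n+j] = Anew[i*n+j];
            }
        }
        ++iter;
    }
}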

Output from nvprof, as shown below, indicates that about half of the
time (in the "API calls" section) is being spent on "cuMemAlloc" and
"cuMemFree". This is a bit surprising since this kernel is structured
such that all the data is kept on the GPU and only the "error" scalar
is transferred to and from the device at each iteration. Investigation
with the visual profiler shows that cuMemAlloc and cuMemFree are being
called before and after (respectively) each omp target region, and
that these calls take longer than the kernel itself.

    ==5858== Profiling result:
                Type  Time(%)      Time     Calls       Avg       Min       Max  Name
     GPU activities:   77.87%  8.65573s      4467  1.9377ms  1.8907ms  2.4392ms  main$_omp_fn$0
                       22.00%  2.44538s      4467  547.43us  511.97us  669.15us  main$_omp_fn$2
                        0.08%  8.7558ms      8935     979ns     672ns  1.6478ms  [CUDA memcpy HtoD]
                        0.05%  5.3973ms      4467  1.2080us     960ns  13.344us  [CUDA memcpy DtoH]
          API calls:   48.06%  11.1161s      8934  1.2442ms  512.74us  2.4412ms  cuCtxSynchronize
                       28.45%  6.58100s     17869  368.29us  6.4930us  162.56ms  cuMemAlloc
                       21.10%  4.88023s     17869  273.11us  3.3220us  14.474ms  cuMemFree
                        0.87%  201.07ms         1  201.07ms  201.07ms  201.07ms  cuCtxCreate
                        0.48%  111.48ms      8934  12.478us  9.4160us  599.79us  cuLaunchKernel
    ...
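
One thing I wondered while reading the profile: would keeping "error"
mapped for the whole data region, with an "always" map on the kernel to
force its transfer, avoid some of this per-kernel cuMemAlloc/cuMemFree,
or is the allocation coming from somewhere else entirely? Untested
sketch (only the map clauses differ from the code above):

#pragma omp target data map(to:A[0:N]) map(alloc:Anew[0:N]) map(tofrom:error)
{
    const int iter_max = 10000;
    int iter = 0;
    while ( error > tol && iter < iter_max ) {
        error = 0.0;
        // "error" is already allocated on the device by the enclosing data
        // region; "always" forces the host value (0.0) in and the reduced
        // result back out without mapping/unmapping it each iteration.
        #pragma omp target map(always, tofrom: error)
        #pragma omp teams distribute parallel for simd reduction(max:error) collapse(2)
        for (int i = 1; i < n-1; i++) {
            for (int j = 1; j < n-1; j++) {
                Anew[i*n+j] = 0.25 * ( A[i*n+j+1] + A[i*n+j-1] +
                                       A[(i-1)*n+j] + A[(i+1)*n+j] );
                error = fmax(error, fabs(Anew[i*n+j] - A[i*n+j]));
            }
        }
        // Update A
        #pragma omp target teams distribute parallel for simd collapse(2)
        for (int i = 1; i < n-1; i++) {
            for (int j = 1; j < n-1; j++) {
                A[i*n+j] = Anew[i*n+j];
            }
        }
        ++iter;
    }
}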

Does anyone have ideas about what the problem could be, or suggestions
that might improve the performance here? I've tried moving the
"parallel for simd" to the inner for loop and similar changes (one such
variant is sketched below), but none of these seem to help very much.
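
For example, the inner-loop variant looked roughly like this
(reconstructed, so the clause placement may not be exactly what I ran):

#pragma omp target map(error)
#pragma omp teams distribute reduction(max:error)
for (int i = 1; i < n-1; i++) {
    #pragma omp parallel for simd reduction(max:error)
    for (int j = 1; j < n-1; j++) {
        Anew[i*n+j] = 0.25 * ( A[i*n+j+1] + A[i*n+j-1] +
                               A[(i-1)*n+j] + A[(i+1)*n+j] );
        error = fmax(error, fabs(Anew[i*n+j] - A[i*n+j]));
    }
}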

Thanks very much!
Eric
