* Tips for improving OpenMP offload performance on NVIDIA GPU
@ 2019-11-25 15:53 Eric Raut
From: Eric Raut @ 2019-11-25 15:53 UTC (permalink / raw)
To: gcc-help
Hello,
I am having trouble getting good performance with GCC’s OpenMP
offloading to NVPTX targets. I am using the Jacobi iteration, an
example often used for OpenMP offloading.
Computer:
* CPU: 2 x Intel Xeon Gold 5115 (20 cores, 40 threads total)
* System memory: 192 GB
* GPU: NVIDIA Geforce RTX 2080 (CUDA compute capability 7.5)
* OS: Fedora 30 (Linux kernel 5.3.11)
GCC version (trunk):
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/home/eraut/software/gcc-trunk/gcc_nvptx_offload/libexec/gcc/x86_64-pc-linux-gnu/10.0.0/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
Target: x86_64-pc-linux-gnu
Configured with: ../gcc_source/configure
--prefix=/home/eraut/software/gcc-trunk/gcc_nvptx_offload
--disable-multilib --disable-bootstrap --enable-checking=release
--enable-languages=c,c++,fortran,lto
--enable-offload-targets=nvptx-none
--with-cuda-driver-lib=/opt/cuda/10.1.168/lib64
--with-cuda-driver-include=/opt/cuda/10.1.168/include
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 10.0.0 20191117 (experimental) (GCC)
The code looks like this:
#pragma omp target data map(to:A[0:N]) map(alloc:Anew[0:N])
{
    const int iter_max = 10000;
    int iter = 0;
    while ( error > tol && iter < iter_max ) {
        error = 0.0;

        #pragma omp target map(error)
        #pragma omp teams distribute parallel for simd \
                reduction(max:error) collapse(2)
        for (int i = 1; i < n-1; i++) {
            for (int j = 1; j < n-1; j++) {
                Anew[i*n+j] = 0.25 * ( A[i*n+j+1] + A[i*n+j-1] +
                                       A[(i-1)*n+j] + A[(i+1)*n+j] );
                error = fmax(error, fabs(Anew[i*n+j] - A[i*n+j]));
            }
        }

        // Update A
        #pragma omp target teams distribute parallel for simd collapse(2)
        for (int i = 1; i < n-1; i++) {
            for (int j = 1; j < n-1; j++) {
                A[i*n+j] = Anew[i*n+j];
            }
        }

        ++iter;
        if (iter % 100 == 0)
            printf("Iteration %d, error = %g\n", iter, error);
    }
}
With a 1000x1000 matrix and the code compiled with "-O3", this takes
about 5 seconds with Clang but about 24 seconds with GCC. For
reference, it takes only about 0.78 seconds with PGI+OpenACC using the
corresponding OpenACC directives.
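For completeness, the builds were along these lines; the exact offload
flags reflect my configuration, and the Clang target triple in
particular may need adjusting for other installs:

```shell
# GCC with nvptx offloading enabled (requires an offload-enabled build)
gcc   -O3 -fopenmp -foffload=nvptx-none jacobi.c -o jacobi_gcc -lm
# Clang with its NVPTX OpenMP target
clang -O3 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda jacobi.c -o jacobi_clang -lm
```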
Output from nvprof, as shown below, indicates that about half of the
time (in the "API calls" section) is being spent on "cuMemAlloc" and
"cuMemFree". This is surprising, since the kernel is structured so that
all the data stays on the GPU and only the "error" scalar is
transferred to and from the device in each iteration. Investigation
with the visual profiler shows that cuMemAlloc and cuMemFree are being
called before and after (respectively) each omp target region, and
that these calls take longer than the kernel itself.
==5858== Profiling result:
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:   77.87%  8.65573s   4467  1.9377ms  1.8907ms  2.4392ms  main$_omp_fn$0
                   22.00%  2.44538s   4467  547.43us  511.97us  669.15us  main$_omp_fn$2
                    0.08%  8.7558ms   8935     979ns     672ns  1.6478ms  [CUDA memcpy HtoD]
                    0.05%  5.3973ms   4467  1.2080us     960ns  13.344us  [CUDA memcpy DtoH]
      API calls:   48.06%  11.1161s   8934  1.2442ms  512.74us  2.4412ms  cuCtxSynchronize
                   28.45%  6.58100s  17869  368.29us  6.4930us  162.56ms  cuMemAlloc
                   21.10%  4.88023s  17869  273.11us  3.3220us  14.474ms  cuMemFree
                    0.87%  201.07ms      1  201.07ms  201.07ms  201.07ms  cuCtxCreate
                    0.48%  111.48ms   8934  12.478us  9.4160us  599.79us  cuLaunchKernel
...
Does anyone have any ideas about what the problem could be, or
suggestions that might improve performance here? I've tried moving the
"parallel for simd" onto the inner loop and similar changes, but none
of them helped much.
Thanks very much!
Eric