public inbox for fortran@gcc.gnu.org
* GSoC - Accelerating Fortran DO CONCURRENT in GCC
@ 2022-04-21 19:26 Wileam Yonatan Phan
  0 siblings, 0 replies; only message in thread
From: Wileam Yonatan Phan @ 2022-04-21 19:26 UTC (permalink / raw)
  To: gcc, fortran; +Cc: rouson, mjambor, tobias, thomas, jlarkin

Hi everyone,

I submitted a very short proposal for the GCC GSoC this year, specifically to
work on DO CONCURRENT GPU offloading support. I found out about this literally
three days ago (Apr 18) from Thomas Schwinge's post on OpenACC community Slack.
I wish I’d come across this sooner than mere hours before the GSoC proposal
deadline on Apr 19. But I guess almost late is better than late -- hopefully
y’all will forgive me for this transgression.

The submitted version of the proposal can be accessed here on my personal
website:
https://wyphan.github.io/assets/pdf/20220419-AcceleratingFortranDoConcurrentInGCC-GSoC2022.pdf

I personally think that DO CONCURRENT GPU offloading is an ambitious but very
doable project, especially when the plan of action has been laid out:
1. Support Fortran 2018 DO CONCURRENT locality specifiers (LOCAL, LOCAL_INIT,
SHARED) and the DEFAULT clause in the parser.
2. Support Fortran 202X DO CONCURRENT REDUCTION clause in the parser.
3. Implement the actual parallelization, controlled by a `-fdo-concurrent=`
compiler flag with five backends (serial, openmp, parallel, openmp-target,
openacc).
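To make steps 1 and 2 concrete, here is a toy example (my own, not taken from the proposal) of the syntax the parser work would need to accept. Note that the Fortran 202X draft spells the reduction clause REDUCE, with an operator before the variable:

```fortran
program dc_locality
  implicit none
  integer, parameter :: n = 1000
  real :: a(n), s, tmp
  integer :: i

  a = 1.0
  s = 0.0

  ! F2018 locality specifiers (LOCAL, SHARED) plus the F202X REDUCE clause.
  ! The loop index i is implicitly local to the construct.
  do concurrent (i = 1:n) local(tmp) shared(a) reduce(+:s)
     tmp = 2.0 * a(i)
     s = s + tmp
  end do

  print *, s
end program dc_locality
```

Once step 3 lands, the same source would ideally compile unchanged under each of the five proposed backends.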

I’ll be honest: the last two backends in step 3 (OpenMP target offload and
OpenACC) get me excited.

At the moment DO CONCURRENT GPU offloading is exclusive to NVIDIA nvfortran and
(obviously) NVIDIA GPUs. I think gfortran holds a special place here.

GCC can already offload OpenACC to AMD GPUs. The timing couldn't be more
perfect -- the upcoming Frontier exascale system at ORNL will use AMD GPUs, and
the _only_ compilers that will support OpenACC on that platform are GCC and
Cray! (Though I recall reading that Cray has already pulled the plug on OpenACC
for C/C++ in cc and CC, so it only works in ftn.) This is a strong point for
GCC _and_ AMD to capitalize on, and one I wish more people knew about,
especially as OpenMP target offload support is still maturing across all
compilers.

Some background on myself: I recently graduated with an MS in Physics from U of
Tennessee, Knoxville. My thesis work involved porting my advisor's Gordon Bell-
winning (2010) density response code (based on Elk FP-LAPW DFT package) to
Summit at ORNL. The code is modern Fortran (mostly 90/95 but uses some 03/08
features) with MPI and OpenMP (CPU-only); I added OpenACC and calls to the
MAGMA [icl.utk.edu/magma] library. After participating in the OLCF GPU
Hackathon 2020 with the code, as well as adding further optimizations, we were
able to reach up to 12x wall clock time speedup for a test case inspired by the
one used in the CPU-only Gordon Bell version [doi:10.1109/SC.2010.55]. We
ported the hotspot identified by initial CPU-only profiling: a nested loop
within _one single subroutine_. The defining characteristic of this hotspot is
a small-ish (~200x~200 times ~200x~50), batched (batch sizes ~500 to ~4000)
double complex matrix-matrix multiply (ZGEMM), implemented with calls to the
MAGMA library (especially since MAGMA can interface with both cuBLAS and
rocBLAS). OpenACC is used to manage device memory and host<->device transfers.
I also wrote three OpenACC kernels to support the batching mechanism (one to
index the batches, one to fill the batches with input data, and one to copy
the results back out of the batches).
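A highly simplified sketch of that batching pattern (this is illustrative only, not the actual thesis code; array shapes are taken from the sizes mentioned above, and the MAGMA call itself is elided):

```fortran
program batch_sketch
  implicit none
  integer, parameter :: m = 200, k = 200, nrhs = 50, nbatch = 500
  complex(8) :: amat(m,k), bmat(k,nrhs,nbatch), cmat(m,nrhs,nbatch)
  integer :: ib

  amat = (1.0d0, 0.0d0)
  bmat = (1.0d0, 0.0d0)

  ! OpenACC manages device memory and host<->device transfers
  !$acc data copyin(amat, bmat) copyout(cmat)

  ! One of the small supporting kernels: initialize the result batches
  ! on the device before the batched multiply
  !$acc parallel loop
  do ib = 1, nbatch
     cmat(:,:,ib) = (0.0d0, 0.0d0)
  end do

  ! The real code calls a batched ZGEMM here through MAGMA, which can
  ! dispatch to either cuBLAS or rocBLAS; elided in this sketch.

  !$acc end data
end program batch_sketch
```

The point of the pattern is that all three supporting kernels operate entirely on device-resident data, so the only transfers are the initial copyin and final copyout around the batched GEMM.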

As an OpenACC practitioner, I was originally attracted to its simplicity and
portability, but have become an advocate for it due to its competitive
performance against native programming models (NVIDIA CUDA and AMD HIP). This
practical knowledge and experience led to joint research with Emeritus
Professor Lenore Mullin (SUNY Albany), presented as a “lessons learned”-style
talk at the OpenACC Summit 2021, about our experience using OpenACC to port her
FFT algorithm [arXiv:0811.2535] from a CPU-only OpenMP Fortran code to target
NVIDIA GPUs. Since the talk, we've also collaborated on porting her GEMM code
[NREL/CP-2C00-80232] from a CPU-only C code to target NVIDIA V100 and A100
GPUs.

Previously, I submitted a bug to the GCC Bugzilla, which unfortunately turned
out to be not-a-bug. Currently I’m working with Damian Rouson at Sourcery
Institute on isolating gfortran bugs in derived type finalization and helping
him with reproducer codes. My main motivation for this GSoC project is to learn
not only how to break the compiler, but also how to fix it.

I also started a thread on the Fortran language Discourse forum to discuss
this topic further:
https://fortran-lang.discourse.group/t/gsoc-2022-accelerating-fortran-do-concurrent-in-gcc/3269

Once again, sorry if I started off on the wrong foot. I’m just trying not to
cramp too hard and to move along with this project.

Thanks,
Wileam Y. Phan
GitHub: @wyphan
https://phan.codes/

