Here's the same patch, but backported to the OG13 branch.

There was one "difficult" conflict, but after reading around the problem
I don't think that any actual code changes are required, and I've
updated the comment to explain (see second patch).

Both patches committed to devel/omp/gcc-13.

Andrew

On 12/09/2023 15:27, Andrew Stubbs wrote:
> Hi all,
>
> This patch implements parallel execution of OpenMP reverse-offload
> kernels.
>
> The first problem was that GPU device kernels may request reverse
> offload (via the "ancestor" clause) once for each running offload
> thread -- of which there may be thousands -- and the existing
> implementation ran each request serially, whilst blocking all other
> I/O from that device kernel.
>
> The second problem was that the NVPTX plugin ran the reverse-offload
> kernel in the context of whichever host thread saw the request first,
> regardless of which kernel originated the request. This is probably
> logically harmless, but it may lead to surprising timing when it
> blocks the wrong kernel from exiting until the reverse offload is
> done. It was also only capable of receiving and processing a single
> request at a time, across all running kernels. (GCN did not have
> these problems.)
>
> Both problems are now solved by making the reverse-offload requests
> asynchronous. The host threads still receive the requests in the same
> way, but instead of running them inline each request is queued for
> execution later in another thread. The requests are then consumed
> from the message-passing buffer immediately (allowing I/O to continue,
> in the case of GCN). The device threads that sent requests are still
> blocked waiting for the completion signal, but any other threads may
> continue as usual.
>
> The queued requests are processed by a thread pool that is created on
> demand and limited by a new environment variable,
> GOMP_REVERSE_OFFLOAD_THREADS. By this means reverse offload should
> become much less of a bottleneck.
>
> In the course of this work I have found and fixed a couple of
> target-specific issues. NVPTX asynchronous streams were independent
> of each other, but still synchronous w.r.t. the default NULL stream.
> Some GCN devices (at least gfx908) seem to have a race condition in
> the message-passing system whereby the cache write-back triggered by
> __ATOMIC_RELEASE completes more slowly than the atomically written
> value becomes visible.
>
> OK for mainline?
>
> Andrew
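
For reference, the "ancestor" clause mentioned above is what triggers a
reverse-offload request: each device thread that reaches such a region
asks the host to run it. A minimal sketch of the many-requests workload
(the loop size and printf payload are illustrative, not taken from the
patch):

#include <stdio.h>

#pragma omp requires reverse_offload

int
main (void)
{
  #pragma omp target teams distribute parallel for
  for (int i = 0; i < 10000; i++)
    {
      /* Each of the (potentially thousands of) device threads issues
         its own reverse-offload request; previously these were
         serviced one at a time.  */
      #pragma omp target device (ancestor: 1) firstprivate (i)
      printf ("host handling request from device iteration %d\n", i);
    }
  return 0;
}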
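
The queuing scheme might look roughly like the sketch below. This is a
simplified model under invented names (enqueue_reverse_offload, the
default pool size, and the LIFO queue are all assumptions, not the
libgomp code); only the environment variable name is real:

#include <pthread.h>
#include <stdlib.h>

struct request
{
  void (*fn) (void *);  /* Host-side reverse-offload handler.  */
  void *data;
  struct request *next;
};

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static struct request *queue;
static int nthreads, idle_threads;
static int max_threads = -1;

static void *
worker (void *arg)
{
  (void) arg;
  pthread_mutex_lock (&lock);
  for (;;)
    {
      while (!queue)
        {
          idle_threads++;
          pthread_cond_wait (&cond, &lock);
          idle_threads--;
        }
      struct request *r = queue;
      queue = r->next;
      pthread_mutex_unlock (&lock);
      r->fn (r->data);  /* Run the request; the device thread that sent
                           it is still blocked on the completion
                           signal, but its siblings keep running.  */
      free (r);
      pthread_mutex_lock (&lock);
    }
}

void
enqueue_reverse_offload (void (*fn) (void *), void *data)
{
  struct request *r = malloc (sizeof *r);
  r->fn = fn;
  r->data = data;

  pthread_mutex_lock (&lock);
  r->next = queue;
  queue = r;
  if (max_threads < 0)
    {
      /* Pool size is capped by the new environment variable; the
         fallback default here is an assumption.  */
      const char *env = getenv ("GOMP_REVERSE_OFFLOAD_THREADS");
      max_threads = (env && atoi (env) > 0) ? atoi (env) : 8;
    }
  /* Grow the pool on demand, up to the configured limit.  */
  if (idle_threads == 0 && nthreads < max_threads)
    {
      pthread_t tid;
      if (pthread_create (&tid, NULL, worker, NULL) == 0)
        nthreads++;
    }
  pthread_cond_signal (&cond);
  pthread_mutex_unlock (&lock);
}

A run can then be capped with, e.g., GOMP_REVERSE_OFFLOAD_THREADS=16
in the environment.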
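
The NVPTX stream behaviour corresponds to the standard CUDA distinction
between blocking and non-blocking streams: a stream created with the
default flags still synchronizes with the legacy NULL stream. The
obvious shape of a fix (whether the plugin uses exactly this driver-API
call is an assumption on my part):

#include <cuda.h>

CUresult
make_independent_stream (CUstream *stream)
{
  /* CU_STREAM_DEFAULT makes the new stream synchronize with the NULL
     stream; CU_STREAM_NON_BLOCKING removes that implicit dependency,
     so the "asynchronous" streams really are independent.  */
  return cuStreamCreate (stream, CU_STREAM_NON_BLOCKING);
}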
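
The gfx908 issue is easiest to see against the usual release/acquire
message-passing pattern. The sketch below shows that generic pattern,
not the GCN plugin code; on an affected device the atomically stored
flag can become visible before the cache write-back of the payload
completes, so the consumer can read stale data:

struct slot
{
  int payload;
  int ready;
};

void
producer (struct slot *s, int value)
{
  s->payload = value;  /* Plain store, meant to be flushed by the
                          release below.  */
  __atomic_store_n (&s->ready, 1, __ATOMIC_RELEASE);
}

int
consumer (struct slot *s)
{
  while (!__atomic_load_n (&s->ready, __ATOMIC_ACQUIRE))
    ;  /* Spin until the flag is published.  */
  return s->payload;  /* May be stale on an affected device.  */
}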