public inbox for gcc-patches@gcc.gnu.org
* [PATCH, nvptx, 1/2] Reimplement libgomp barriers for nvptx
@ 2022-09-21  7:45 Chung-Lin Tang
  2022-09-21  9:01 ` Jakub Jelinek
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Chung-Lin Tang @ 2022-09-21  7:45 UTC (permalink / raw)
  To: gcc-patches, Tom de Vries, Catherine Moore

[-- Attachment #1: Type: text/plain, Size: 2821 bytes --]

Hi Tom,
I submitted a patch earlier, where I reported that the current implementation of
barriers in libgomp on nvptx causes a quite significant performance drop on some
SPEChpc 2021 benchmarks:
https://gcc.gnu.org/pipermail/gcc-patches/2022-September/600818.html

That previous patch wasn't well received (admittedly, it was kind of a hack).
So in this patch, I have tried to (mostly) re-implement team barriers for nvptx.

Basically, instead of trying to have the GPU do CPU-with-OS-like things that it isn't suited for,
barriers are now implemented simply with bar.* synchronization instructions.
Tasks are processed after threads have joined, and only if team->task_count != 0.

(Arguably, a little performance is forfeited where early-arriving threads could have
been used to process tasks ahead of other threads. But that again would require
implementing complex futex-wait/wake-like behavior. Really, that kind of tasking is not
what target offloading is usually used for.)

Implementation highlight notes:
1. gomp_team_barrier_wake() is now an empty function (threads never "wake" in the usual manner).
2. gomp_team_barrier_cancel() now uses the "exit" PTX instruction.
3. gomp_barrier_wait_last() is now implemented using "bar.arrive".

4. gomp_team_barrier_wait_end()/gomp_team_barrier_wait_cancel_end():
    The main synchronization is done using a 'bar.red' instruction. This reduces
    the condition (team->task_count != 0) across all threads, enabling the task
    processing below if any thread created a task. (This bar.red usage is what
    requires the second GCC patch in this series.)
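Sketched in pseudocode (simplified from the attached patch; the cancellation
path and the total > 1 guards are omitted), the new team-barrier wait boils
down to:

```
gomp_team_barrier_wait_end (bar, state):
    # bar.red: every thread arrives at the barrier, and the per-thread
    # predicate (team->task_count != 0) is OR-reduced across all of them.
    run_tasks := bar.red.or (team->task_count != 0)

    if state & BAR_WAS_LAST:            # last arriver resets the barrier
        bar->awaited := bar->total

    if run_tasks:                       # some thread created a task
        while bar->generation & BAR_TASK_PENDING:
            gomp_barrier_handle_tasks (state)
        bar.sync                        # rejoin after task processing
```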

This patch has been tested on x86_64 and powerpc64le with nvptx offloading, using the
libgomp, ovo, omptests, and sollve_vv testsuites, all without regressions. I also verified
that the SPEChpc 2021 521.miniswp_t and 534.hpgmgfv_t performance regressions that occurred
in the GCC 12 cycle have been restored to devel/omp/gcc-11 (OG11) branch levels. Is this
okay for trunk?

(I also suggest backporting this to the GCC 12 branch, if the performance regression can be considered a defect.)

Thanks,
Chung-Lin

libgomp/ChangeLog:

2022-09-21  Chung-Lin Tang  <cltang@codesourcery.com>

	* config/nvptx/bar.c (generation_to_barrier): Remove.
	(futex_wait, futex_wake, do_spin, do_wait): Remove.
	(GOMP_WAIT_H): Remove.
	(#include "../linux/bar.c"): Remove.
	(gomp_barrier_wait_end): New function.
	(gomp_barrier_wait): Likewise.
	(gomp_barrier_wait_last): Likewise.
	(gomp_team_barrier_wait_end): Likewise.
	(gomp_team_barrier_wait): Likewise.
	(gomp_team_barrier_wait_final): Likewise.
	(gomp_team_barrier_wait_cancel_end): Likewise.
	(gomp_team_barrier_wait_cancel): Likewise.
	(gomp_team_barrier_cancel): Likewise.
	* config/nvptx/bar.h (gomp_team_barrier_wake): Remove
	prototype, add new static inline function.

[-- Attachment #2: nvptx-libgomp-barrier.patch --]
[-- Type: text/plain, Size: 8681 bytes --]

diff --git a/libgomp/config/nvptx/bar.c b/libgomp/config/nvptx/bar.c
index eee2107..0b958ed 100644
--- a/libgomp/config/nvptx/bar.c
+++ b/libgomp/config/nvptx/bar.c
@@ -30,137 +30,143 @@
 #include <limits.h>
 #include "libgomp.h"
 
-/* For cpu_relax.  */
-#include "doacross.h"
-
-/* Assuming ADDR is &bar->generation, return bar.  Copied from
-   rtems/bar.c.  */
+void
+gomp_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
+{
+  if (__builtin_expect (state & BAR_WAS_LAST, 0))
+    {
+      /* Next time we'll be awaiting TOTAL threads again.  */
+      bar->awaited = bar->total;
+      __atomic_store_n (&bar->generation, bar->generation + BAR_INCR,
+			MEMMODEL_RELEASE);
+    }
+  if (bar->total > 1)
+    asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
+}
 
-static gomp_barrier_t *
-generation_to_barrier (int *addr)
+void
+gomp_barrier_wait (gomp_barrier_t *bar)
 {
-  char *bar
-    = (char *) addr - __builtin_offsetof (gomp_barrier_t, generation);
-  return (gomp_barrier_t *)bar;
+  gomp_barrier_wait_end (bar, gomp_barrier_wait_start (bar));
 }
 
-/* Implement futex_wait-like behaviour to plug into the linux/bar.c
-   implementation.  Assumes ADDR is &bar->generation.   */
+/* Like gomp_barrier_wait, except that if the encountering thread
+   is not the last one to hit the barrier, it returns immediately.
+   The intended usage is that a thread which intends to gomp_barrier_destroy
+   this barrier calls gomp_barrier_wait, while all other threads
+   call gomp_barrier_wait_last.  When gomp_barrier_wait returns,
+   the barrier can be safely destroyed.  */
 
-static inline void
-futex_wait (int *addr, int val)
+void
+gomp_barrier_wait_last (gomp_barrier_t *bar)
 {
-  gomp_barrier_t *bar = generation_to_barrier (addr);
+  /* The above described behavior matches 'bar.arrive' perfectly.  */
+  if (bar->total > 1)
+    asm ("bar.arrive 1, %0;" : : "r" (32 * bar->total));
+}
 
-  if (bar->total < 2)
-    /* A barrier with less than two threads, nop.  */
-    return;
+void
+gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
+{
+  struct gomp_thread *thr = gomp_thread ();
+  struct gomp_team *team = thr->ts.team;
 
-  gomp_mutex_lock (&bar->lock);
+  bool run_tasks = (team->task_count != 0);
+  if (bar->total > 1)
+    run_tasks = __builtin_nvptx_bar_red_or (1, 32 * bar->total, true,
+					    (team->task_count != 0));
 
-  /* Futex semantics: only go to sleep if *addr == val.  */
-  if (__builtin_expect (__atomic_load_n (addr, MEMMODEL_ACQUIRE) != val, 0))
+  if (__builtin_expect (state & BAR_WAS_LAST, 0))
     {
-      gomp_mutex_unlock (&bar->lock);
-      return;
+      /* Next time we'll be awaiting TOTAL threads again.  */
+      bar->awaited = bar->total;
+      team->work_share_cancelled = 0;
     }
 
-  /* Register as waiter.  */
-  unsigned int waiters
-    = __atomic_add_fetch (&bar->waiters, 1, MEMMODEL_ACQ_REL);
-  if (waiters == 0)
-    __builtin_abort ();
-  unsigned int waiter_id = waiters;
-
-  if (waiters > 1)
+  if (__builtin_expect (run_tasks == true, 0))
     {
-      /* Wake other threads in bar.sync.  */
-      asm volatile ("bar.sync 1, %0;" : : "r" (32 * waiters));
+      while (__atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE)
+	     & BAR_TASK_PENDING)
+	gomp_barrier_handle_tasks (state);
 
-      /* Ensure that they have updated waiters.  */
-      asm volatile ("bar.sync 1, %0;" : : "r" (32 * waiters));
+      if (bar->total > 1)
+	asm volatile ("bar.sync 1, %0;" : : "r" (32 * bar->total));
     }
+}
 
-  gomp_mutex_unlock (&bar->lock);
-
-  while (1)
-    {
-      /* Wait for next thread in barrier.  */
-      asm volatile ("bar.sync 1, %0;" : : "r" (32 * (waiters + 1)));
-
-      /* Get updated waiters.  */
-      unsigned int updated_waiters
-	= __atomic_load_n (&bar->waiters, MEMMODEL_ACQUIRE);
-
-      /* Notify that we have updated waiters.  */
-      asm volatile ("bar.sync 1, %0;" : : "r" (32 * (waiters + 1)));
-
-      waiters = updated_waiters;
-
-      if (waiter_id > waiters)
-	/* A wake happened, and we're in the group of woken threads.  */
-	break;
-
-      /* Continue waiting.  */
-    }
+void
+gomp_team_barrier_wait (gomp_barrier_t *bar)
+{
+  gomp_team_barrier_wait_end (bar, gomp_barrier_wait_start (bar));
 }
 
-/* Implement futex_wake-like behaviour to plug into the linux/bar.c
-   implementation.  Assumes ADDR is &bar->generation.  */
+void
+gomp_team_barrier_wait_final (gomp_barrier_t *bar)
+{
+  gomp_barrier_state_t state = gomp_barrier_wait_final_start (bar);
+  if (__builtin_expect (state & BAR_WAS_LAST, 0))
+    bar->awaited_final = bar->total;
+  gomp_team_barrier_wait_end (bar, state);
+}
 
-static inline void
-futex_wake (int *addr, int count)
+bool
+gomp_team_barrier_wait_cancel_end (gomp_barrier_t *bar,
+				   gomp_barrier_state_t state)
 {
-  gomp_barrier_t *bar = generation_to_barrier (addr);
+  struct gomp_thread *thr = gomp_thread ();
+  struct gomp_team *team = thr->ts.team;
 
-  if (bar->total < 2)
-    /* A barrier with less than two threads, nop.  */
-    return;
+  bool run_tasks = (team->task_count != 0);
+  if (bar->total > 1)
+    run_tasks = __builtin_nvptx_bar_red_or (1, 32 * bar->total, true,
+					    (team->task_count != 0));
+  if (state & BAR_CANCELLED)
+    return true;
 
-  gomp_mutex_lock (&bar->lock);
-  unsigned int waiters = __atomic_load_n (&bar->waiters, MEMMODEL_ACQUIRE);
-  if (waiters == 0)
+  if (__builtin_expect (state & BAR_WAS_LAST, 0))
     {
-      /* No threads to wake.  */
-      gomp_mutex_unlock (&bar->lock);
-      return;
+      /* Note: BAR_CANCELLED should never be set in state here, because
+	 cancellation means that at least one of the threads has been
+	 cancelled, and thus on a cancellable barrier we should never see
+	 all threads arrive.  */
+
+      /* Next time we'll be awaiting TOTAL threads again.  */
+      bar->awaited = bar->total;
+      team->work_share_cancelled = 0;
     }
 
-  if (count == INT_MAX)
-    /* Release all threads.  */
-    __atomic_store_n (&bar->waiters, 0, MEMMODEL_RELEASE);
-  else if (count < bar->total)
-    /* Release count threads.  */
-    __atomic_add_fetch (&bar->waiters, -count, MEMMODEL_ACQ_REL);
-  else
-    /* Count has an illegal value.  */
-    __builtin_abort ();
-
-  /* Wake other threads in bar.sync.  */
-  asm volatile ("bar.sync 1, %0;" : : "r" (32 * (waiters + 1)));
+  if (__builtin_expect (run_tasks == true, 0))
+    {
+      while (__atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE)
+	     & BAR_TASK_PENDING)
+	gomp_barrier_handle_tasks (state);
 
-  /* Let them get the updated waiters.  */
-  asm volatile ("bar.sync 1, %0;" : : "r" (32 * (waiters + 1)));
+      if (bar->total > 1)
+	asm volatile ("bar.sync 1, %0;" : : "r" (32 * bar->total));
+    }
 
-  gomp_mutex_unlock (&bar->lock);
+  return false;
 }
 
-/* Copied from linux/wait.h.  */
-
-static inline int do_spin (int *addr, int val)
+bool
+gomp_team_barrier_wait_cancel (gomp_barrier_t *bar)
 {
-  /* The current implementation doesn't spin.  */
-  return 1;
+  return gomp_team_barrier_wait_cancel_end (bar, gomp_barrier_wait_start (bar));
 }
 
-/* Copied from linux/wait.h.  */
-
-static inline void do_wait (int *addr, int val)
+void
+gomp_team_barrier_cancel (struct gomp_team *team)
 {
-  if (do_spin (addr, val))
-    futex_wait (addr, val);
-}
+  gomp_mutex_lock (&team->task_lock);
+  if (team->barrier.generation & BAR_CANCELLED)
+    {
+      gomp_mutex_unlock (&team->task_lock);
+      return;
+    }
+  team->barrier.generation |= BAR_CANCELLED;
+  gomp_mutex_unlock (&team->task_lock);
 
-/* Reuse the linux implementation.  */
-#define GOMP_WAIT_H 1
-#include "../linux/bar.c"
+  /* The 'exit' instruction cancels this thread and also satisfies any
+     barriers that other CTA threads are waiting on.  */
+  asm volatile ("exit;");
+}
diff --git a/libgomp/config/nvptx/bar.h b/libgomp/config/nvptx/bar.h
index 28bf7f4..ddda33e 100644
--- a/libgomp/config/nvptx/bar.h
+++ b/libgomp/config/nvptx/bar.h
@@ -83,10 +83,16 @@ extern void gomp_team_barrier_wait_end (gomp_barrier_t *,
 extern bool gomp_team_barrier_wait_cancel (gomp_barrier_t *);
 extern bool gomp_team_barrier_wait_cancel_end (gomp_barrier_t *,
 					       gomp_barrier_state_t);
-extern void gomp_team_barrier_wake (gomp_barrier_t *, int);
 struct gomp_team;
 extern void gomp_team_barrier_cancel (struct gomp_team *);
 
+static inline void
+gomp_team_barrier_wake (gomp_barrier_t *bar, int count)
+{
+  /* We never "wake up" threads on nvptx.  Threads wait at barrier
+     instructions until the barrier is fulfilled.  Do nothing here.  */
+}
+
 static inline gomp_barrier_state_t
 gomp_barrier_wait_start (gomp_barrier_t *bar)
 {

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH, nvptx, 1/2] Reimplement libgomp barriers for nvptx
  2022-09-21  7:45 [PATCH, nvptx, 1/2] Reimplement libgomp barriers for nvptx Chung-Lin Tang
@ 2022-09-21  9:01 ` Jakub Jelinek
  2022-09-21 10:02   ` Chung-Lin Tang
  2022-10-17 14:29 ` Chung-Lin Tang
  2022-12-16 14:51 ` Tom de Vries
  2 siblings, 1 reply; 11+ messages in thread
From: Jakub Jelinek @ 2022-09-21  9:01 UTC (permalink / raw)
  To: Chung-Lin Tang; +Cc: gcc-patches, Tom de Vries, Catherine Moore

On Wed, Sep 21, 2022 at 03:45:36PM +0800, Chung-Lin Tang via Gcc-patches wrote:
> Hi Tom,
> I had a patch submitted earlier, where I reported that the current way of implementing
> barriers in libgomp on nvptx created a quite significant performance drop on some SPEChpc2021
> benchmarks:
> https://gcc.gnu.org/pipermail/gcc-patches/2022-September/600818.html
> 
> That previous patch wasn't accepted well (admittedly, it was kind of a hack).
> So in this patch, I tried to (mostly) re-implement team-barriers for NVPTX.
> 
> Basically, instead of trying to have the GPU do CPU-with-OS-like things that it isn't suited for,
> barriers are implemented simplistically with bar.* synchronization instructions.
> Tasks are processed after threads have joined, and only if team->task_count != 0
> 
> (arguably, there might be a little bit of performance forfeited where earlier arriving threads
> could've been used to process tasks ahead of other threads. But that again falls into requiring
> implementing complex futex-wait/wake like behavior. Really, that kind of tasking is not what target
> offloading is usually used for)

I admit I don't have a good picture if people in real-world actually use
tasking in offloading regions and how much and in what way, but the above
definitely would be a show-stopper for typical tasking workloads, where
one thread (usually from master/masked/single construct's body) creates lots
of tasks and can spend considerable amount of time in those preparations,
while other threads are expected to handle those tasks.

Do we have an idea of how other implementations handle this?
I think it should be easily observable with atomics: have a
master/masked/single that creates lots of tasks and then spends a long time
doing something, have very small task bodies that just increment some atomic
counter, and at the end of the master/masked/single see how many tasks were
already encountered.

Note, I don't have any smart ideas how to handle this instead and what
you posted might be ok for what people usually do on offloading targets
in OpenMP if they use tasking at all, just wanted to mention that there
could be workloads where the above is a serious problem.  If there are
say hundreds of threads doing nothing until a single thread reaches a
barrier and there are hundreds of pending tasks...
E.g. note that we have the 64 pending task limit after which we start to
create undeferred tasks, so if we never start handling tasks until one
thread is done creating them, the single thread would create 64 deferred
tasks and then handle all the others itself, making it even longer until
the other threads can deal with them.

	Jakub



* Re: [PATCH, nvptx, 1/2] Reimplement libgomp barriers for nvptx
  2022-09-21  9:01 ` Jakub Jelinek
@ 2022-09-21 10:02   ` Chung-Lin Tang
  0 siblings, 0 replies; 11+ messages in thread
From: Chung-Lin Tang @ 2022-09-21 10:02 UTC (permalink / raw)
  To: Jakub Jelinek, Chung-Lin Tang; +Cc: gcc-patches, Tom de Vries, Catherine Moore



On 2022/9/21 5:01 PM, Jakub Jelinek wrote:
> On Wed, Sep 21, 2022 at 03:45:36PM +0800, Chung-Lin Tang via Gcc-patches wrote:
>> Hi Tom,
>> I had a patch submitted earlier, where I reported that the current way of implementing
>> barriers in libgomp on nvptx created a quite significant performance drop on some SPEChpc2021
>> benchmarks:
>> https://gcc.gnu.org/pipermail/gcc-patches/2022-September/600818.html
>>
>> That previous patch wasn't accepted well (admittedly, it was kind of a hack).
>> So in this patch, I tried to (mostly) re-implement team-barriers for NVPTX.
>>
>> Basically, instead of trying to have the GPU do CPU-with-OS-like things that it isn't suited for,
>> barriers are implemented simplistically with bar.* synchronization instructions.
>> Tasks are processed after threads have joined, and only if team->task_count != 0
>>
>> (arguably, there might be a little bit of performance forfeited where earlier arriving threads
>> could've been used to process tasks ahead of other threads. But that again falls into requiring
>> implementing complex futex-wait/wake like behavior. Really, that kind of tasking is not what target
>> offloading is usually used for)
> 
> I admit I don't have a good picture if people in real-world actually use
> tasking in offloading regions and how much and in what way, but the above
> definitely would be a show-stopper for typical tasking workloads, where
> one thread (usually from master/masked/single construct's body) creates lots
> of tasks and can spend considerable amount of time in those preparations,
> while other threads are expected to handle those tasks.

I think the most common use case for target offloading is "parallel for".

Really, the specification not simply removing tasking from target regions altogether is just asking for trouble.

If asynchronous offloaded tasks are to be supported, something at the level of the whole
GPU offload region is much more reasonable, like the async clause functionality in OpenACC.

> Do we have an idea how are other implementations handling this?
> I think it should be easily observable with atomics, have
> master/masked/single that creates lots of tasks and then spends a long time
> doing something, have very small task bodies that just increment some atomic
> counter and at the end of the master/masked/single see how many tasks were
> already encountered.

This could be an interesting test...

> Note, I don't have any smart ideas how to handle this instead and what
> you posted might be ok for what people usually do on offloading targets
> in OpenMP if they use tasking at all, just wanted to mention that there
> could be workloads where the above is a serious problem.  If there are
> say hundreds of threads doing nothing until a single thread reaches a
> barrier and there are hundreds of pending tasks...

I think it might still be doable, just not in the very fine-grained "wake one thread"
style of the Linux-based implementation.

> E.g. note we have that 64 pending task limit after which we start to
> create undeferred tasks, so if we never start handling tasks until
> one thread is done with them, that would mean the single thread
> would create 64 deferred tasks and then handle all the others itself
> making it even longer until the other tasks can deal with it.

Okay, thanks for reminding that.

Chung-Lin


* Re: [PATCH, nvptx, 1/2] Reimplement libgomp barriers for nvptx
  2022-09-21  7:45 [PATCH, nvptx, 1/2] Reimplement libgomp barriers for nvptx Chung-Lin Tang
  2022-09-21  9:01 ` Jakub Jelinek
@ 2022-10-17 14:29 ` Chung-Lin Tang
  2022-10-31 14:18   ` [Ping x2] " Chung-Lin Tang
  2022-12-16 14:51 ` Tom de Vries
  2 siblings, 1 reply; 11+ messages in thread
From: Chung-Lin Tang @ 2022-10-17 14:29 UTC (permalink / raw)
  To: Chung-Lin Tang, gcc-patches, Tom de Vries, Catherine Moore

Ping.


* [Ping x2] Re: [PATCH, nvptx, 1/2] Reimplement libgomp barriers for nvptx
  2022-10-17 14:29 ` Chung-Lin Tang
@ 2022-10-31 14:18   ` Chung-Lin Tang
  2022-11-07 16:34     ` [Ping x3] " Chung-Lin Tang
  0 siblings, 1 reply; 11+ messages in thread
From: Chung-Lin Tang @ 2022-10-31 14:18 UTC (permalink / raw)
  To: Chung-Lin Tang, gcc-patches, Tom de Vries, Catherine Moore

Ping x2.


* [Ping x3] Re: [PATCH, nvptx, 1/2] Reimplement libgomp barriers for nvptx
  2022-10-31 14:18   ` [Ping x2] " Chung-Lin Tang
@ 2022-11-07 16:34     ` Chung-Lin Tang
  2022-11-21 16:24       ` [Ping x4] " Chung-Lin Tang
  0 siblings, 1 reply; 11+ messages in thread
From: Chung-Lin Tang @ 2022-11-07 16:34 UTC (permalink / raw)
  To: Chung-Lin Tang, Chung-Lin Tang, gcc-patches, Tom de Vries,
	Catherine Moore

Ping x3.



* [Ping x4] Re: [PATCH, nvptx, 1/2] Reimplement libgomp barriers for nvptx
  2022-11-07 16:34     ` [Ping x3] " Chung-Lin Tang
@ 2022-11-21 16:24       ` Chung-Lin Tang
  2022-12-05 16:21         ` [Ping x5] " Chung-Lin Tang
  0 siblings, 1 reply; 11+ messages in thread
From: Chung-Lin Tang @ 2022-11-21 16:24 UTC (permalink / raw)
  To: Chung-Lin Tang, Chung-Lin Tang, gcc-patches, Tom de Vries,
	Catherine Moore

Ping x4

On 2022/11/8 12:34 AM, Chung-Lin Tang wrote:
> Ping x3.
> 
> On 2022/10/31 10:18 PM, Chung-Lin Tang wrote:
>> Ping x2.
>>
>> On 2022/10/17 10:29 PM, Chung-Lin Tang wrote:
>>> Ping.
>>>
>>> On 2022/9/21 3:45 PM, Chung-Lin Tang via Gcc-patches wrote:
>>>> Hi Tom,
>>>> I had a patch submitted earlier, where I reported that the current way of implementing
>>>> barriers in libgomp on nvptx created a quite significant performance drop on some SPEChpc2021
>>>> benchmarks:
>>>> https://gcc.gnu.org/pipermail/gcc-patches/2022-September/600818.html>>>>>
>>>> That previous patch wasn't accepted well (admittedly, it was kind of a hack).
>>>> So in this patch, I tried to (mostly) re-implement team-barriers for NVPTX.
>>>>
>>>> Basically, instead of trying to have the GPU do CPU-with-OS-like things that it isn't suited for,
>>>> barriers are implemented simplistically with bar.* synchronization instructions.
>>>> Tasks are processed after threads have joined, and only if team->task_count != 0
>>>>
>>>> (arguably, there might be a little bit of performance forfeited where earlier arriving threads
>>>> could've been used to process tasks ahead of other threads. But that again falls into requiring
>>>> implementing complex futex-wait/wake like behavior. Really, that kind of tasking is not what target
>>>> offloading is usually used for)
>>>>
>>>> Implementation highlight notes:
>>>> 1. gomp_team_barrier_wake() is now an empty function (threads never "wake" in the usual manner)
>>>> 2. gomp_team_barrier_cancel() now uses the "exit" PTX instruction.
>>>> 3. gomp_barrier_wait_last() now is implemented using "bar.arrive"
>>>>
>>>> 4. gomp_team_barrier_wait_end()/gomp_team_barrier_wait_cancel_end():
>>>>     The main synchronization is done using a 'bar.red' instruction. This reduces across all threads
>>>>     the condition (team->task_count != 0), to enable the task processing down below if any thread
>>>>     created a task. (this bar.red usage is what requires the second GCC patch in this series)
>>>>
>>>> This patch has been tested on x86_64/powerpc64le with nvptx offloading, using libgomp, ovo, omptests,
>>>> and sollve_vv testsuites, all without regressions. Also verified that the SPEChpc 2021 521.miniswp_t
>>>> and 534.hpgmgfv_t performance regressions that occurred in the GCC12 cycle have been restored to
>>>> devel/omp/gcc-11 (OG11) branch levels. Is this okay for trunk?
>>>>
>>>> (I also suggest backporting to the GCC12 branch, if the performance regression can be considered a defect)
>>>>
>>>> Thanks,
>>>> Chung-Lin
>>>>
>>>> libgomp/ChangeLog:
>>>>
>>>> 2022-09-21  Chung-Lin Tang  <cltang@codesourcery.com>
>>>>
>>>> 	* config/nvptx/bar.c (generation_to_barrier): Remove.
>>>> 	(futex_wait,futex_wake,do_spin,do_wait): Remove.
>>>> 	(GOMP_WAIT_H): Remove.
>>>> 	(#include "../linux/bar.c"): Remove.
>>>> 	(gomp_barrier_wait_end): New function.
>>>> 	(gomp_barrier_wait): Likewise.
>>>> 	(gomp_barrier_wait_last): Likewise.
>>>> 	(gomp_team_barrier_wait_end): Likewise.
>>>> 	(gomp_team_barrier_wait): Likewise.
>>>> 	(gomp_team_barrier_wait_final): Likewise.
>>>> 	(gomp_team_barrier_wait_cancel_end): Likewise.
>>>> 	(gomp_team_barrier_wait_cancel): Likewise.
>>>> 	(gomp_team_barrier_cancel): Likewise.
>>>> 	* config/nvptx/bar.h (gomp_team_barrier_wake): Remove
>>>> 	prototype, add new static inline function.
> 


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Ping x5] Re: [PATCH, nvptx, 1/2] Reimplement libgomp barriers for nvptx
  2022-11-21 16:24       ` [Ping x4] " Chung-Lin Tang
@ 2022-12-05 16:21         ` Chung-Lin Tang
  2022-12-12 11:13           ` [Ping x6] " Chung-Lin Tang
  0 siblings, 1 reply; 11+ messages in thread
From: Chung-Lin Tang @ 2022-12-05 16:21 UTC (permalink / raw)
  To: Chung-Lin Tang, Chung-Lin Tang, gcc-patches, Tom de Vries,
	Catherine Moore

Ping x5

On 2022/11/22 12:24 AM, Chung-Lin Tang wrote:
> Ping x4
> 
> [...]


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Ping x6] Re: [PATCH, nvptx, 1/2] Reimplement libgomp barriers for nvptx
  2022-12-05 16:21         ` [Ping x5] " Chung-Lin Tang
@ 2022-12-12 11:13           ` Chung-Lin Tang
  0 siblings, 0 replies; 11+ messages in thread
From: Chung-Lin Tang @ 2022-12-12 11:13 UTC (permalink / raw)
  To: Chung-Lin Tang, Chung-Lin Tang, gcc-patches, Tom de Vries,
	Catherine Moore

Ping x6

On 2022/12/6 12:21 AM, Chung-Lin Tang wrote:
> Ping x5
> 
> [...]


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH, nvptx, 1/2] Reimplement libgomp barriers for nvptx
  2022-09-21  7:45 [PATCH, nvptx, 1/2] Reimplement libgomp barriers for nvptx Chung-Lin Tang
  2022-09-21  9:01 ` Jakub Jelinek
  2022-10-17 14:29 ` Chung-Lin Tang
@ 2022-12-16 14:51 ` Tom de Vries
  2022-12-19 12:13   ` Thomas Schwinge
  2 siblings, 1 reply; 11+ messages in thread
From: Tom de Vries @ 2022-12-16 14:51 UTC (permalink / raw)
  To: Chung-Lin Tang, gcc-patches, Catherine Moore

On 9/21/22 09:45, Chung-Lin Tang wrote:
> Hi Tom,
> I had a patch submitted earlier, where I reported that the current way 
> of implementing
> barriers in libgomp on nvptx created a quite significant performance 
> drop on some SPEChpc2021
> benchmarks:
> https://gcc.gnu.org/pipermail/gcc-patches/2022-September/600818.html
> 
> That previous patch wasn't accepted well (admittedly, it was kind of a 
> hack).
> So in this patch, I tried to (mostly) re-implement team-barriers for NVPTX.
> 

Ack.

> Basically, instead of trying to have the GPU do CPU-with-OS-like things 
> that it isn't suited for,
> barriers are implemented simplistically with bar.* synchronization 
> instructions.
> Tasks are processed after threads have joined, and only if 
> team->task_count != 0
> 
> (arguably, there might be a little bit of performance forfeited where 
> earlier arriving threads
> could've been used to process tasks ahead of other threads. But that 
> again falls into requiring
> implementing complex futex-wait/wake like behavior. Really, that kind of 
> tasking is not what target
> offloading is usually used for)
> 

Please try to add this insight somewhere as a comment in the code, f.i. 
in the header comment of bar.h.

> Implementation highlight notes:
> 1. gomp_team_barrier_wake() is now an empty function (threads never 
> "wake" in the usual manner)
> 2. gomp_team_barrier_cancel() now uses the "exit" PTX instruction.
> 3. gomp_barrier_wait_last() is now implemented using "bar.arrive".
> 
> 4. gomp_team_barrier_wait_end()/gomp_team_barrier_wait_cancel_end():
>     The main synchronization is done using a 'bar.red' instruction. This 
> reduces across all threads
>     the condition (team->task_count != 0), to enable the task processing 
> down below if any thread
>     created a task. (this bar.red usage is what requires the second 
> GCC patch in this series)
> 
> This patch has been tested on x86_64/powerpc64le with nvptx offloading, 
> using libgomp, ovo, omptests,
> and sollve_vv testsuites, all without regressions. Also verified that 
> the SPEChpc 2021 521.miniswp_t
> and 534.hpgmgfv_t performance regressions that occurred in the GCC12 
> cycle have been restored to 
> devel/omp/gcc-11 (OG11) branch levels. Is this okay for trunk?
> 

AFAIU the waiters and lock fields are no longer used, so they can be removed.

Yes, LGTM, please apply (after the other one).

Thanks for addressing this.

FWIW, tested on NVIDIA RTX A2000 with driver 525.60.11.

> (also suggest backporting to GCC12 branch, if performance regression can 
> be considered a defect)
> 

That's ok, but wait a while after applying on trunk before doing that, 
say a month.

Thanks,
- Tom

> Thanks,
> Chung-Lin
> 
> libgomp/ChangeLog:
> 
> 2022-09-21  Chung-Lin Tang  <cltang@codesourcery.com>
> 
>      * config/nvptx/bar.c (generation_to_barrier): Remove.
>      (futex_wait,futex_wake,do_spin,do_wait): Remove.
>      (GOMP_WAIT_H): Remove.
>      (#include "../linux/bar.c"): Remove.
>      (gomp_barrier_wait_end): New function.
>      (gomp_barrier_wait): Likewise.
>      (gomp_barrier_wait_last): Likewise.
>      (gomp_team_barrier_wait_end): Likewise.
>      (gomp_team_barrier_wait): Likewise.
>      (gomp_team_barrier_wait_final): Likewise.
>      (gomp_team_barrier_wait_cancel_end): Likewise.
>      (gomp_team_barrier_wait_cancel): Likewise.
>      (gomp_team_barrier_cancel): Likewise.
>      * config/nvptx/bar.h (gomp_team_barrier_wake): Remove
>      prototype, add new static inline function.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH, nvptx, 1/2] Reimplement libgomp barriers for nvptx
  2022-12-16 14:51 ` Tom de Vries
@ 2022-12-19 12:13   ` Thomas Schwinge
  0 siblings, 0 replies; 11+ messages in thread
From: Thomas Schwinge @ 2022-12-19 12:13 UTC (permalink / raw)
  To: Tom de Vries, Chung-Lin Tang, Chung-Lin Tang; +Cc: gcc-patches, Catherine Moore

Hi!

On 2022-12-16T15:51:35+0100, Tom de Vries via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> On 9/21/22 09:45, Chung-Lin Tang wrote:
>> I had a patch submitted earlier, where I reported that the current way
>> of implementing
>> barriers in libgomp on nvptx created a quite significant performance
>> drop on some SPEChpc2021
>> benchmarks:
>> https://gcc.gnu.org/pipermail/gcc-patches/2022-September/600818.html

Also: my 2022-03-17 report in <https://gcc.gnu.org/PR99555>
"[OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs":
<https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99555#c13>.

>> This patch has been tested on x86_64/powerpc64le with nvptx offloading,
>> using libgomp, ovo, omptests,
>> and sollve_vv testsuites, all without regressions. Also verified that
>> the SPEChpc 2021 521.miniswp_t
>> and 534.hpgmgfv_t performance regressions that occurred in the GCC12
>> cycle has been restored to
>> devel/omp/gcc-11 (OG11) branch levels.

I'm happy to confirm that this also does resolve the PR99555 issue
mentioned above, so please do reference PR99555 in the commit log.

>> Is this okay for trunk?

> Yes, LGTM, please apply (after the other one).
>
> Thanks for addressing this.

Thanks, Chung-Lin and Tom!

>> (also suggest backporting to GCC12 branch, if performance regression can
>> be considered a defect)
>
> That's ok, but wait a while after applying on trunk before doing that,
> say a month.


Grüße
 Thomas
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2022-12-19 12:13 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-21  7:45 [PATCH, nvptx, 1/2] Reimplement libgomp barriers for nvptx Chung-Lin Tang
2022-09-21  9:01 ` Jakub Jelinek
2022-09-21 10:02   ` Chung-Lin Tang
2022-10-17 14:29 ` Chung-Lin Tang
2022-10-31 14:18   ` [Ping x2] " Chung-Lin Tang
2022-11-07 16:34     ` [Ping x3] " Chung-Lin Tang
2022-11-21 16:24       ` [Ping x4] " Chung-Lin Tang
2022-12-05 16:21         ` [Ping x5] " Chung-Lin Tang
2022-12-12 11:13           ` [Ping x6] " Chung-Lin Tang
2022-12-16 14:51 ` Tom de Vries
2022-12-19 12:13   ` Thomas Schwinge

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).