public inbox for gcc-patches@gcc.gnu.org
* PR libgomp/56073: benchmark regression due to PR libgomp/51376 fix
@ 2013-01-22 11:03 Alan Modra
  2013-01-22 11:19 ` Jakub Jelinek
  2013-01-22 13:53 ` Torvald Riegel
  0 siblings, 2 replies; 8+ messages in thread
From: Alan Modra @ 2013-01-22 11:03 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek

SPEComp2012 376.kdtree shows a huge regression after my PR51376 fix,
with the benchmark, which finishes normally in a few minutes, failing
to complete after hours of running on a power7 machine.  Using a
reduced data set showed typically over a 200x slowdown.  So, why is
this happening?  Well, it appears that the benchmark hits exactly the
libgomp code paths that previously accessed task->children without
locking and now we need to obtain task_lock.  The slowdown is due to
massive task_lock contention.

I tried the solution in
http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00235.html and found it
gave a small improvement, but there was still far too much
contention due to hitting task_lock in GOMP_task.  The question then
was whether locking was really necessary there, and on analysing it
properly I believe I was far too conservative in forcing the
"children" access to be inside a task_lock held region.  That
particular task.children will only be set if the current thread
creates new tasks in its task work function!  I.e., some other thread
won't set it, so there's no need to worry that the current thread
might see a stale NULL value and miss calling gomp_clear_parent.
(It's true that the current thread might see a stale non-NULL value,
but that doesn't matter.  Gaining the lock will ensure
gomp_clear_parent sees the real value of task.children.)

With this patch, PR51376 stays fixed and we're back to a reasonable
time for kdtree.  I'm seeing a 20% slowdown in my quick and dirty
testing, but some of that will be due to different optimisation and
tuning in the libgomp builds.

I did consider (and test) another small refinement.  The release
barrier in gomp_sem_post is sufficient to ensure correct memory
ordering, so you can write:

		      if (parent->in_taskwait)
			{
			  gomp_sem_post (&parent->taskwait_sem);
			  parent->children = NULL;
			}
		      else
			__atomic_store_n (&parent->children, NULL,
					  MEMMODEL_RELEASE);

However, this reorders posting the semaphore and writing
parent->children.  I think doing so is OK but am wary of trying to be
too clever where multiple threads are involved.

Bootstrapped and regression tested powerpc64-linux.  OK to apply?

	PR libgomp/51376
	PR libgomp/56073
	* task.c (GOMP_task): Revert 2011-12-09 change.
	(GOMP_taskwait): Likewise.  Instead use atomic load with acquire
	barrier to read task->children..
	(gomp_barrier_handle_tasks): ..and matching atomic store with
	release barrier here when setting parent->children to NULL.

Index: libgomp/task.c
===================================================================
--- libgomp/task.c	(revision 195354)
+++ libgomp/task.c	(working copy)
@@ -116,11 +116,10 @@ GOMP_task (void (*fn) (void *), void *data, void (
 	}
       else
 	fn (data);
-      if (team != NULL)
+      if (task.children != NULL)
 	{
 	  gomp_mutex_lock (&team->task_lock);
-	  if (task.children != NULL)
-	    gomp_clear_parent (task.children);
+	  gomp_clear_parent (task.children);
 	  gomp_mutex_unlock (&team->task_lock);
 	}
       gomp_end_task ();
@@ -258,7 +257,13 @@ gomp_barrier_handle_tasks (gomp_barrier_state_t st
 		    parent->children = child_task->next_child;
 		  else
 		    {
-		      parent->children = NULL;
+		      /* We access task->children in GOMP_taskwait
+			 outside of the task lock mutex region, so
+			 need a release barrier here to ensure memory
+			 written by child_task->fn above is flushed
+			 before the NULL is written.  */
+		      __atomic_store_n (&parent->children, NULL,
+					MEMMODEL_RELEASE);
 		      if (parent->in_taskwait)
 			gomp_sem_post (&parent->taskwait_sem);
 		    }
@@ -291,7 +296,8 @@ GOMP_taskwait (void)
   struct gomp_task *child_task = NULL;
   struct gomp_task *to_free = NULL;
 
-  if (task == NULL || team == NULL)
+  if (task == NULL
+      || __atomic_load_n (&task->children, MEMMODEL_ACQUIRE) == NULL)
     return;
 
   gomp_mutex_lock (&team->task_lock);

-- 
Alan Modra
Australia Development Lab, IBM


* Re: PR libgomp/56073: benchmark regression due to PR libgomp/51376 fix
  2013-01-22 11:03 PR libgomp/56073: benchmark regression due to PR libgomp/51376 fix Alan Modra
@ 2013-01-22 11:19 ` Jakub Jelinek
  2013-01-22 11:52   ` Alan Modra
  2013-01-22 13:53 ` Torvald Riegel
  1 sibling, 1 reply; 8+ messages in thread
From: Jakub Jelinek @ 2013-01-22 11:19 UTC (permalink / raw)
  To: gcc-patches; +Cc: Torvald Riegel, Richard Henderson

On Tue, Jan 22, 2013 at 09:33:25PM +1030, Alan Modra wrote:
> Bootstrapped and regression tested powerpc64-linux.  OK to apply?
> 
> 	PR libgomp/51376
> 	PR libgomp/56073
> 	* task.c (GOMP_task): Revert 2011-12-09 change.
> 	(GOMP_taskwait): Likewise.  Instead use atomic load with acquire
> 	barrier to read task->children..
> 	(gomp_barrier_handle_tasks): ..and matching atomic store with
> 	release barrier here when setting parent->children to NULL.

Looks good to me.

	Jakub


* Re: PR libgomp/56073: benchmark regression due to PR libgomp/51376 fix
  2013-01-22 11:19 ` Jakub Jelinek
@ 2013-01-22 11:52   ` Alan Modra
  2013-01-22 12:01     ` Jakub Jelinek
  0 siblings, 1 reply; 8+ messages in thread
From: Alan Modra @ 2013-01-22 11:52 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: gcc-patches, Torvald Riegel, Richard Henderson

On Tue, Jan 22, 2013 at 12:19:21PM +0100, Jakub Jelinek wrote:
> Looks good to me.

Thanks for the amazingly quick review!  Committed revision 195370.
Is the patch OK for 4.7 too?

-- 
Alan Modra
Australia Development Lab, IBM


* Re: PR libgomp/56073: benchmark regression due to PR libgomp/51376 fix
  2013-01-22 11:52   ` Alan Modra
@ 2013-01-22 12:01     ` Jakub Jelinek
  2013-01-22 14:27       ` Jakub Jelinek
  0 siblings, 1 reply; 8+ messages in thread
From: Jakub Jelinek @ 2013-01-22 12:01 UTC (permalink / raw)
  To: gcc-patches, Torvald Riegel, Richard Henderson

On Tue, Jan 22, 2013 at 10:22:00PM +1030, Alan Modra wrote:
> On Tue, Jan 22, 2013 at 12:19:21PM +0100, Jakub Jelinek wrote:
> > Looks good to me.
> 
> Thanks for the amazingly quick review!  Committed revision 195370.

Actually, there is one thing I'm worried about, -lgomp doesn't link against
-latomic, and for !HAVE_SYNC_BUILTINS targets supposedly __atomic_load_n
resp. __atomic_store_n might not be supported.  Not sure what targets
are still !HAVE_SYNC_BUILTINS targets, but if there are any that support
libgomp, either we should use normal loads/stores for those (on the
assumption that targets without sync builtins supposedly don't have very
relaxed consistency model), or would need to take the lock always for
!HAVE_SYNC_BUILTINS and use normal loads/stores.

	Jakub


* Re: PR libgomp/56073: benchmark regression due to PR libgomp/51376 fix
  2013-01-22 11:03 PR libgomp/56073: benchmark regression due to PR libgomp/51376 fix Alan Modra
  2013-01-22 11:19 ` Jakub Jelinek
@ 2013-01-22 13:53 ` Torvald Riegel
  2013-01-23  1:12   ` Alan Modra
  1 sibling, 1 reply; 8+ messages in thread
From: Torvald Riegel @ 2013-01-22 13:53 UTC (permalink / raw)
  To: Alan Modra; +Cc: gcc-patches, Jakub Jelinek, Richard Henderson

On Tue, 2013-01-22 at 21:33 +1030, Alan Modra wrote:
> SPEComp2012 376.kdtree shows a huge regression after my PR51376 fix,
> with the benchmark, which finishes normally in a few minutes, failing
> to complete after hours of running on a power7 machine.  Using a
> reduced data set showed typically over a 200x slowdown.  So, why is
> this happening?  Well, it appears that the benchmark hits exactly the
> libgomp code paths that previously accessed task->children without
> locking and now we need to obtain task_lock.  The slowdown is due to
> massive task_lock contention.
> 
> I tried the solution in
> http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00235.html and found it
> gave a small improvement, but there was still far too much
> contention due to hitting task_lock in GOMP_task.  The question then
> was whether locking was really necessary there, and on analysing it
> properly I believe I was far too conservative in forcing the
> "children" access to be inside a task_lock held region.  That
> particular task.children will only be set if the current thread
> creates new tasks in its task work function!  I.e., some other thread
> won't set it, so there's no need to worry that the current thread
> might see a stale NULL value and miss calling gomp_clear_parent.
> (It's true that the current thread might see a stale non-NULL value,
> but that doesn't matter.  Gaining the lock will ensure
> gomp_clear_parent sees the real value of task.children.)
> 
> With this patch, PR51376 stays fixed and we're back to a reasonable
> time for kdtree.  I'm seeing a 20% slowdown in my quick and dirty
> testing, but some of that will be due to different optimisation and
> tuning in the libgomp builds.
> 
> I did consider (and test) another small refinement.  The release
> barrier in gomp_sem_post is sufficient to ensure correct memory
> ordering, so you can write:
> 
> 		      if (parent->in_taskwait)
> 			{
> 			  gomp_sem_post (&parent->taskwait_sem);
> 			  parent->children = NULL;
> 			}
> 		      else
> 			__atomic_store_n (&parent->children, NULL,
> 					  MEMMODEL_RELEASE);
> 
> However, this reorders posting the semaphore and writing
> parent->children.  I think doing so is OK but am wary of trying to be
> too clever where multiple threads are involved.
> 
> Bootstrapped and regression tested powerpc64-linux.  OK to apply?

Could you please document the synchronization bits in more detail?  For
example, if you have release stores, document which acquire loads they
synchronize with, and vice versa.  This makes it much easier for
others to read and review the code, and check the intent of the
synchronization bits.  Also, it is often useful to document which cases
cannot happen (and thus would allow for using weaker memory orders, for
example); this absence of problems is usually difficult to prove, so
it's easier for others if you document this once you've already checked
it.  Thanks.

More comments below.

> 	PR libgomp/51376
> 	PR libgomp/56073
> 	* task.c (GOMP_task): Revert 2011-12-09 change.
> 	(GOMP_taskwait): Likewise.  Instead use atomic load with acquire
> 	barrier to read task->children..
> 	(gomp_barrier_handle_tasks): ..and matching atomic store with
> 	release barrier here when setting parent->children to NULL.
> 
> Index: libgomp/task.c
> ===================================================================
> --- libgomp/task.c	(revision 195354)
> +++ libgomp/task.c	(working copy)
> @@ -116,11 +116,10 @@ GOMP_task (void (*fn) (void *), void *data, void (
>  	}
>        else
>  	fn (data);
> -      if (team != NULL)
> +      if (task.children != NULL)

Why does this not need an acquire barrier?

>  	{
>  	  gomp_mutex_lock (&team->task_lock);
> -	  if (task.children != NULL)

Are you sure that you can remove this check of task.children here,
especially if you don't use an acquire barrier previously?

Can there be several threads trying to call gomp_clear_parent?

> -	    gomp_clear_parent (task.children);
> +	  gomp_clear_parent (task.children);
>  	  gomp_mutex_unlock (&team->task_lock);

Please add comments or reference the comments in the other pieces of
synchronizing code related to this.

>  	}
>        gomp_end_task ();
> @@ -258,7 +257,13 @@ gomp_barrier_handle_tasks (gomp_barrier_state_t st
>  		    parent->children = child_task->next_child;
>  		  else
>  		    {
> -		      parent->children = NULL;
> +		      /* We access task->children in GOMP_taskwait
> +			 outside of the task lock mutex region, so
> +			 need a release barrier here to ensure memory
> +			 written by child_task->fn above is flushed
> +			 before the NULL is written.  */
> +		      __atomic_store_n (&parent->children, NULL,
> +					MEMMODEL_RELEASE);
>  		      if (parent->in_taskwait)
>  			gomp_sem_post (&parent->taskwait_sem);
>  		    }
> @@ -291,7 +296,8 @@ GOMP_taskwait (void)
>    struct gomp_task *child_task = NULL;
>    struct gomp_task *to_free = NULL;
>  
> -  if (task == NULL || team == NULL)
> +  if (task == NULL
> +      || __atomic_load_n (&task->children, MEMMODEL_ACQUIRE) == NULL)

Please mention with which stores this acquire load synchronizes.

>      return;
>  
>    gomp_mutex_lock (&team->task_lock);
> 

Torvald


* Re: PR libgomp/56073: benchmark regression due to PR libgomp/51376 fix
  2013-01-22 12:01     ` Jakub Jelinek
@ 2013-01-22 14:27       ` Jakub Jelinek
  0 siblings, 0 replies; 8+ messages in thread
From: Jakub Jelinek @ 2013-01-22 14:27 UTC (permalink / raw)
  To: gcc-patches, Torvald Riegel, Richard Henderson

On Tue, Jan 22, 2013 at 01:01:24PM +0100, Jakub Jelinek wrote:
> On Tue, Jan 22, 2013 at 10:22:00PM +1030, Alan Modra wrote:
> > On Tue, Jan 22, 2013 at 12:19:21PM +0100, Jakub Jelinek wrote:
> > > Looks good to me.
> > 
> > Thanks for the amazingly quick review!  Committed revision 195370.
> 
> Actually, there is one thing I'm worried about, -lgomp doesn't link against
> -latomic, and for !HAVE_SYNC_BUILTINS targets supposedly __atomic_load_n
> resp. __atomic_store_n might not be supported.  Not sure what targets
> are still !HAVE_SYNC_BUILTINS targets, but if there are any that support
> libgomp, either we should use normal loads/stores for those (on the
> assumption that targets without sync builtins supposedly don't have very
> relaxed consistency model), or would need to take the lock always for
> !HAVE_SYNC_BUILTINS and use normal loads/stores.

Seems for loads/stores <= wordsize we just assume they are atomic and expand
it as normal load or store (with optional barriers if target has any).
So supposedly it can work as is.

	Jakub


* Re: PR libgomp/56073: benchmark regression due to PR libgomp/51376 fix
  2013-01-22 13:53 ` Torvald Riegel
@ 2013-01-23  1:12   ` Alan Modra
  2013-01-23 10:39     ` Torvald Riegel
  0 siblings, 1 reply; 8+ messages in thread
From: Alan Modra @ 2013-01-23  1:12 UTC (permalink / raw)
  To: Torvald Riegel; +Cc: gcc-patches, Jakub Jelinek, Richard Henderson

On Tue, Jan 22, 2013 at 02:20:23PM +0100, Torvald Riegel wrote:
> > @@ -116,11 +116,10 @@ GOMP_task (void (*fn) (void *), void *data, void (
> >  	}
> >        else
> >  	fn (data);
> > -      if (team != NULL)
> > +      if (task.children != NULL)
> 
> Why does this not need an acquire barrier?

I explained that in my patch submission.  Briefly, the only way this
particular task.children can be set is if this thread's task work
function creates children.  So since the setter is *this* thread, we
need no barriers.  We can have task.children set by the current
thread then changed by a child thread, but seeing a stale non-NULL
value is not a problem here.  Once past the task_lock acquisition,
this thread will see the real value of task.children.

> >  	{
> >  	  gomp_mutex_lock (&team->task_lock);
> > -	  if (task.children != NULL)
> 
> Are you sure that you can remove this check to task.children here,
> especially if you don't use an acquire barrier previously?
> 
> Can there be several threads trying to call gomp_clear_parent?

Yes and yes.  gomp_clear_parent handles a NULL arg, and as you can
see, runs inside a task_lock mutex region here.

> > -	    gomp_clear_parent (task.children);
> > +	  gomp_clear_parent (task.children);
> >  	  gomp_mutex_unlock (&team->task_lock);
> 
> Please add comments or reference the comments in the other pieces of
> synchronizing code related to this.

Please note that all of the above is a reversion!  Adding a comment
carries the risk that the comment is *wrong*.  That said, I appreciate
the need to document threaded code well.

> > -  if (task == NULL || team == NULL)
> > +  if (task == NULL
> > +      || __atomic_load_n (&task->children, MEMMODEL_ACQUIRE) == NULL)
> 
> Please mention with which stores this acquire load synchronizes.

How does this look?

	* task.c (GOMP_task, GOMP_taskwait): Comment.

Index: libgomp/task.c
===================================================================
--- libgomp/task.c	(revision 195370)
+++ libgomp/task.c	(working copy)
@@ -116,6 +116,15 @@
 	}
       else
 	fn (data);
+      /* Access to "children" is normally done inside a task_lock
+	 mutex region, but the only way this particular task.children
+	 can be set is if this thread's task work function (fn)
+	 creates children.  So since the setter is *this* thread, we
+	 need no barriers here when testing for non-NULL.  We can have
+	 task.children set by the current thread then changed by a
+	 child thread, but seeing a stale non-NULL value is not a
+	 problem.  Once past the task_lock acquisition, this thread
+	 will see the real value of task.children.  */
       if (task.children != NULL)
 	{
 	  gomp_mutex_lock (&team->task_lock);
@@ -296,6 +305,12 @@
   struct gomp_task *child_task = NULL;
   struct gomp_task *to_free = NULL;
 
+  /* The acquire barrier on load of task->children here synchronizes
+     with the write of a NULL in gomp_barrier_handle_tasks.  It is
+     not necessary that we synchronize with other non-NULL writes at
+     this point, but we must ensure that all writes to memory by a
+     child thread task work function are seen before we exit from
+     GOMP_taskwait.  */
   if (task == NULL
       || __atomic_load_n (&task->children, MEMMODEL_ACQUIRE) == NULL)
     return;

-- 
Alan Modra
Australia Development Lab, IBM


* Re: PR libgomp/56073: benchmark regression due to PR libgomp/51376 fix
  2013-01-23  1:12   ` Alan Modra
@ 2013-01-23 10:39     ` Torvald Riegel
  0 siblings, 0 replies; 8+ messages in thread
From: Torvald Riegel @ 2013-01-23 10:39 UTC (permalink / raw)
  To: Alan Modra; +Cc: gcc-patches, Jakub Jelinek, Richard Henderson

On Wed, 2013-01-23 at 11:41 +1030, Alan Modra wrote:
> On Tue, Jan 22, 2013 at 02:20:23PM +0100, Torvald Riegel wrote:
> > > @@ -116,11 +116,10 @@ GOMP_task (void (*fn) (void *), void *data, void (
> > >  	}
> > >        else
> > >  	fn (data);
> > > -      if (team != NULL)
> > > +      if (task.children != NULL)
> > 
> > Why does this not need an acquire barrier?
> 
> I explained that in my patch submission.

Right, I missed this when looking at just the code later.  Sorry.

> Briefly, the only way this
> particular task.children can be set is if this thread's task work
> function creates children.  So since the setter is *this* thread, we
> need no barriers.  We can have task.children set by the current
> thread then changed by a child thread, but seeing a stale non-NULL
> value is not a problem here.  Once past the task_lock acquisition,
> this thread will see the real value of task.children.
> 
> > >  	{
> > >  	  gomp_mutex_lock (&team->task_lock);
> > > -	  if (task.children != NULL)
> > 
> > Are you sure that you can remove this check to task.children here,
> > especially if you don't use an acquire barrier previously?
> > 
> > Can there be several threads trying to call gomp_clear_parent?
> 
> Yes and yes.  gomp_clear_parent handles a NULL arg, and as you can
> see, runs inside a task_lock mutex region here.
> 
> > > -	    gomp_clear_parent (task.children);
> > > +	  gomp_clear_parent (task.children);
> > >  	  gomp_mutex_unlock (&team->task_lock);
> > 
> > Please add comments or reference the comments in the other pieces of
> > synchronizing code related to this.
> 
> Please note that all of the above is a reversion!

I understand.  But you took the time to look at this in detail, so it
would be a shame if it were "hidden" in the email for the patch rather
than in the code.

> Adding a comment
> has the risk that the comment is *wrong*.

That's true, but I think this would still be better than adding no
comment.  If the comments summarize the intent/understanding of you or
someone else, and the code doesn't match, then it's easier for everyone
to see that something might be wrong.  This might be unnecessary or even
counter-productive for sequential code, but I think that concurrent code
is often sufficiently complex for the additional redundancy to be
worthwhile (e.g., because you often need global reasoning in contrast to
typically local reasoning for sequential code, etc.).

> That said, I appreciate the
> need to document threaded code well.
> 
> > > -  if (task == NULL || team == NULL)
> > > +  if (task == NULL
> > > +      || __atomic_load_n (&task->children, MEMMODEL_ACQUIRE) == NULL)
> > 
> > Please mention with which stores this acquire load synchronizes.
> 
> How does this look?

Thanks!  This looks good.

Torvald

> 	* task.c (GOMP_task, GOMP_taskwait): Comment.
> 
> Index: libgomp/task.c
> ===================================================================
> --- libgomp/task.c	(revision 195370)
> +++ libgomp/task.c	(working copy)
> @@ -116,6 +116,15 @@
>  	}
>        else
>  	fn (data);
> +      /* Access to "children" is normally done inside a task_lock
> +	 mutex region, but the only way this particular task.children
> +	 can be set is if this thread's task work function (fn)
> +	 creates children.  So since the setter is *this* thread, we
> +	 need no barriers here when testing for non-NULL.  We can have
> +	 task.children set by the current thread then changed by a
> +	 child thread, but seeing a stale non-NULL value is not a
> +	 problem.  Once past the task_lock acquisition, this thread
> +	 will see the real value of task.children.  */
>        if (task.children != NULL)
>  	{
>  	  gomp_mutex_lock (&team->task_lock);
> @@ -296,6 +305,12 @@
>    struct gomp_task *child_task = NULL;
>    struct gomp_task *to_free = NULL;
>  
> +  /* The acquire barrier on load of task->children here synchronizes
> +     with the write of a NULL in gomp_barrier_handle_tasks.  It is
> +     not necessary that we synchronize with other non-NULL writes at
> +     this point, but we must ensure that all writes to memory by a
> +     child thread task work function are seen before we exit from
> +     GOMP_taskwait.  */
>    if (task == NULL
>        || __atomic_load_n (&task->children, MEMMODEL_ACQUIRE) == NULL)
>      return;
> 



