public inbox for gcc-patches@gcc.gnu.org
* PR libgomp/56073: benchmark regression due to PR libgomp/51376 fix
@ 2013-01-22 11:03 Alan Modra
  2013-01-22 11:19 ` Jakub Jelinek
  2013-01-22 13:53 ` Torvald Riegel
  0 siblings, 2 replies; 8+ messages in thread
From: Alan Modra @ 2013-01-22 11:03 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek

SPEComp2012 376.kdtree shows a huge regression after my PR51376 fix,
with the benchmark, which finishes normally in a few minutes, failing
to complete after hours of running on a power7 machine.  Using a
reduced data set showed typically over a 200x slowdown.  So, why is
this happening?  Well, it appears that the benchmark hits exactly the
libgomp code paths that previously accessed task->children without
locking and now we need to obtain task_lock.  The slowdown is due to
massive task_lock contention.

I tried the solution in
http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00235.html and found it
gave a small improvement, but there was still far too much
contention due to hitting task_lock in GOMP_task.  The question then
was whether locking was really necessary there, and on analysing it
properly I believe I was far too conservative in forcing the
"children" access to be inside a task_lock held region.  That
particular task.children will only be set if the current thread
creates new tasks in its task work function!  I.e., some other thread
won't set it, so there's no need to worry that the current thread
might see a stale NULL value and miss calling gomp_clear_parent.
(It's true that the current thread might see a stale non-NULL value,
but that doesn't matter.  Gaining the lock will ensure
gomp_clear_parent sees the real value of task.children.)

With this patch, PR51376 stays fixed and we're back to a reasonable
time for kdtree.  I'm seeing a 20% slowdown in my quick and dirty
testing, but some of that will be due to different optimisation and
tuning in the libgomp builds.

I did consider (and test) another small refinement.  The release
barrier in gomp_sem_post is sufficient to ensure correct memory
ordering, so you can write:

		      if (parent->in_taskwait)
			{
			  gomp_sem_post (&parent->taskwait_sem);
			  parent->children = NULL;
			}
		      else
			__atomic_store_n (&parent->children, NULL,
					  MEMMODEL_RELEASE);

However, this reorders posting the semaphore and writing
parent->children.  I think doing so is OK but am wary of trying to be
too clever where multiple threads are involved.

Bootstrapped and regression tested powerpc64-linux.  OK to apply?

	PR libgomp/51376
	PR libgomp/56073
	* task.c (GOMP_task): Revert 2011-12-09 change.
	(GOMP_taskwait): Likewise.  Instead use atomic load with acquire
	barrier to read task->children..
	(gomp_barrier_handle_tasks): ..and matching atomic store with
	release barrier here when setting parent->children to NULL.

Index: libgomp/task.c
===================================================================
--- libgomp/task.c	(revision 195354)
+++ libgomp/task.c	(working copy)
@@ -116,11 +116,10 @@ GOMP_task (void (*fn) (void *), void *data, void (
 	}
       else
 	fn (data);
-      if (team != NULL)
+      if (task.children != NULL)
 	{
 	  gomp_mutex_lock (&team->task_lock);
-	  if (task.children != NULL)
-	    gomp_clear_parent (task.children);
+	  gomp_clear_parent (task.children);
 	  gomp_mutex_unlock (&team->task_lock);
 	}
       gomp_end_task ();
@@ -258,7 +257,13 @@ gomp_barrier_handle_tasks (gomp_barrier_state_t st
 		    parent->children = child_task->next_child;
 		  else
 		    {
-		      parent->children = NULL;
+		      /* We access task->children in GOMP_taskwait
+			 outside of the task lock mutex region, so
+			 need a release barrier here to ensure memory
+			 written by child_task->fn above is flushed
+			 before the NULL is written.  */
+		      __atomic_store_n (&parent->children, NULL,
+					MEMMODEL_RELEASE);
 		      if (parent->in_taskwait)
 			gomp_sem_post (&parent->taskwait_sem);
 		    }
@@ -291,7 +296,8 @@ GOMP_taskwait (void)
   struct gomp_task *child_task = NULL;
   struct gomp_task *to_free = NULL;
 
-  if (task == NULL || team == NULL)
+  if (task == NULL
+      || __atomic_load_n (&task->children, MEMMODEL_ACQUIRE) == NULL)
     return;
 
   gomp_mutex_lock (&team->task_lock);

-- 
Alan Modra
Australia Development Lab, IBM
