public inbox for gcc-patches@gcc.gnu.org
* [gomp4.1] Doacross library implementation
@ 2015-09-24 20:04 Jakub Jelinek
  2015-09-29 18:33 ` [gomp4.1] Fixup handling of doacross loops with noreturn body Jakub Jelinek
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Jakub Jelinek @ 2015-09-24 20:04 UTC (permalink / raw)
  To: gcc-patches; +Cc: Torvald Riegel, Aldy Hernandez, Richard Henderson

Hi!

This patch implements DOACROSS in the library, so far only as busy waiting
and even without exponential (or some distance based guess) backoff.
Torvald, can you please have a look at it and check whether I got all the
atomics / memory models right?  The testcase obviously is not a good
benchmark, we'll need a more realistic one.  But obviously when asking for
oversubscription, it is quite expensive.  The question is how to implement
a non-busy waiting fallback: whether to put some mutex and a queue guarded
by that mutex into the same (or some other?) cache-line, or just use
atomics to queue waiters, and how to make it cheap for the case where busy
waiting is sufficient.  I'd say it should be sufficient to implement
non-busy waiting in the flattened variant.
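
To make the non-busy waiting question a bit more concrete, below is a
rough standalone sketch (not part of the patch) of one possible fallback
for the flattened variant: spin a bounded number of times on the per-entry
counter, then block on a mutex / condition variable pair that posters
signal after publishing a new value.  It uses plain pthreads instead of
libgomp's own mutex/futex primitives, and all the names are made up just
for illustration:

/* Hypothetical sketch only, not part of the patch.  */

#include <pthread.h>

struct doacross_slot
{
  unsigned long cur;        /* Last posted flattened iteration + 1.  */
  pthread_mutex_t lock;     /* Guards sleepers and cond.  */
  pthread_cond_t cond;
  int sleepers;             /* Number of blocked waiters.  */
};

static void
slot_post (struct doacross_slot *s, unsigned long val)
{
  __atomic_store_n (&s->cur, val, __ATOMIC_RELEASE);
  /* Taking the lock unconditionally here is exactly the cost we would
     like to avoid whenever busy waiting would have been sufficient.  */
  pthread_mutex_lock (&s->lock);
  if (s->sleepers)
    pthread_cond_broadcast (&s->cond);
  pthread_mutex_unlock (&s->lock);
}

static void
slot_wait (struct doacross_slot *s, unsigned long expected)
{
  int i;
  for (i = 0; i < 1000; i++)    /* Bounded busy waiting first.  */
    if (__atomic_load_n (&s->cur, __ATOMIC_ACQUIRE) > expected)
      return;
  pthread_mutex_lock (&s->lock);
  s->sleepers++;
  while (__atomic_load_n (&s->cur, __ATOMIC_ACQUIRE) <= expected)
    pthread_cond_wait (&s->cond, &s->lock);
  s->sleepers--;
  pthread_mutex_unlock (&s->lock);
}

A futex based variant on Linux could avoid the condition variable, but the
question how to keep the fast path cheap is the same.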

As for the compiler side, I'll first adjust for the pending ticket (which
changes the meaning of the ordered(n) clause if collapse(m) with m > 1 is
present), then there is a bug with ordered loops that have noreturn body
(need to add some edge for that case and condition checking), lastprivate
also needs checking for all the cases, and finally more thinking is needed
on the conservative dependence folding, where there are just too many
unresolved issues right now.

2015-09-24  Jakub Jelinek  <jakub@redhat.com>

	* gimplify.c (gimplify_omp_for): Don't adjust lastprivate
	on ordered loops above collapse.
	* omp-low.c (expand_omp_ordered_source): Rewritten to pass
	address of an array of indices.
	(expand_omp_ordered_source_sink): Create the VAR_DECL for it.
	(expand_omp_for_ordered_loops): Initialize and update the
	array elements.
	(expand_omp_for_generic): Likewise.  Move counts array one
	element back, so that collapsed loops are multiplied by correct
	counts.
	(lower_omp_ordered): Avoid the conservative dependence folding
	for now, it has too many issues.
	* omp-builtins.def (BUILT_IN_GOMP_DOACROSS_POST): Change
	type to BT_FN_VOID_PTR.
gcc/testsuite/
	* gcc.dg/gomp/sink-fold-1.c: Xfail.
	* gcc.dg/gomp/sink-fold-2.c: Likewise.
libgomp/
	* ordered.c: Include string.h and doacross.h.
	(gomp_doacross_init): New function.
	(GOMP_doacross_wait): Implement.
	(GOMP_doacross_post): Likewise.  Change arguments to
	pointer to long array.
	* loop.c (gomp_loop_doacross_static_start,
	gomp_loop_doacross_dynamic_start,
	gomp_loop_doacross_guided_start): Call gomp_doacross_init.
	* libgomp_g.h (GOMP_doacross_post): Adjust prototype.
	* libgomp.h (struct gomp_doacross_work_share): New type.
	(struct gomp_work_share): Put ordered_team_ids into anonymous
	union with new doacross field.
	* config/linux/doacross.h: New file.
	* config/posix/doacross.h: New file.
	* testsuite/libgomp.c/doacross-1.c: New test.

--- gcc/gimplify.c.jj	2015-09-18 18:38:17.000000000 +0200
+++ gcc/gimplify.c	2015-09-24 19:05:43.607556246 +0200
@@ -7788,6 +7788,10 @@ gimplify_omp_for (tree *expr_p, gimple_s
 						 (OMP_FOR_INIT (for_stmt))
 					       * 2);
     }
+  int collapse = 1;
+  c = find_omp_clause (OMP_FOR_CLAUSES (for_stmt), OMP_CLAUSE_COLLAPSE);
+  if (c)
+    collapse = tree_to_shwi (OMP_CLAUSE_COLLAPSE_EXPR (c));
   for (i = 0; i < TREE_VEC_LENGTH (OMP_FOR_INIT (for_stmt)); i++)
     {
       t = TREE_VEC_ELT (OMP_FOR_INIT (for_stmt), i);
@@ -8104,8 +8108,9 @@ gimplify_omp_for (tree *expr_p, gimple_s
 	  OMP_CLAUSE_LINEAR_STEP (c2) = OMP_CLAUSE_LINEAR_STEP (c);
 	}
 
-      if ((var != decl || TREE_VEC_LENGTH (OMP_FOR_INIT (for_stmt)) > 1)
-	  && orig_for_stmt == for_stmt)
+      if ((var != decl || collapse > 1)
+	  && orig_for_stmt == for_stmt
+	  && i < collapse)
 	{
 	  for (c = OMP_FOR_CLAUSES (for_stmt); c ; c = OMP_CLAUSE_CHAIN (c))
 	    if (((OMP_CLAUSE_CODE (c) == OMP_CLAUSE_LASTPRIVATE
--- gcc/omp-low.c.jj	2015-09-18 18:38:17.000000000 +0200
+++ gcc/omp-low.c	2015-09-24 18:06:31.174495644 +0200
@@ -7071,26 +7071,11 @@ static void
 expand_omp_ordered_source (gimple_stmt_iterator *gsi, struct omp_for_data *fd,
 			   tree *counts, location_t loc)
 {
-  auto_vec<tree, 10> args;
   enum built_in_function source_ix = BUILT_IN_GOMP_DOACROSS_POST;
-  tree t;
-  int i;
-
-  for (i = fd->collapse - 1; i < fd->collapse + fd->ordered - 1; i++)
-    if (i == fd->collapse - 1 && fd->collapse > 1)
-      args.quick_push (fd->loop.v);
-    else if (counts[i])
-      args.safe_push (counts[i]);
-    else
-      {
-	t = fold_build2_loc (loc, MINUS_EXPR, TREE_TYPE (fd->loops[i].v),
-			     fd->loops[i].v, fd->loops[i].n1);
-	t = fold_convert_loc (loc, fd->iter_type, t);
-	t = force_gimple_operand_gsi (gsi, t, true, NULL_TREE,
-				      true, GSI_SAME_STMT);
-	args.safe_push (t);
-      }
-  gimple g = gimple_build_call_vec (builtin_decl_explicit (source_ix), args);
+  gimple g
+    = gimple_build_call (builtin_decl_explicit (source_ix), 1,
+			 build_fold_addr_expr (counts[fd->collapse
+						      + fd->ordered - 1]));
   gimple_set_location (g, loc);
   gsi_insert_before (gsi, g, GSI_SAME_STMT);
 }
@@ -7276,6 +7261,9 @@ expand_omp_ordered_source_sink (struct o
       counts[i] = NULL_TREE;
     else
       counts[i] = create_tmp_var (fd->iter_type, ".orditer");
+  tree atype = build_array_type_nelts (fd->iter_type, fd->ordered);
+  counts[fd->collapse + fd->ordered - 1] = create_tmp_var (atype, ".orditera");
+  TREE_ADDRESSABLE (counts[fd->collapse + fd->ordered - 1]) = 1;
 
   for (inner = region->inner; inner; inner = inner->next)
     if (inner->type == GIMPLE_OMP_ORDERED)
@@ -7315,6 +7303,11 @@ expand_omp_for_ordered_loops (struct omp
 	  tree type = TREE_TYPE (fd->loops[i].v);
 	  tree n1 = fold_convert (type, fd->loops[i].n1);
 	  expand_omp_build_assign (&gsi, fd->loops[i].v, n1);
+	  tree aref = build4 (ARRAY_REF, fd->iter_type,
+			      counts[fd->collapse + fd->ordered - 1],
+			      size_int (i - fd->collapse + 1),
+			      NULL_TREE, NULL_TREE);
+	  expand_omp_build_assign (&gsi, aref, build_zero_cst (fd->iter_type));
 	}
       return NULL;
     }
@@ -7328,6 +7321,11 @@ expand_omp_for_ordered_loops (struct omp
       if (counts[i])
 	expand_omp_build_assign (&gsi, counts[i],
 				 build_zero_cst (fd->iter_type));
+      tree aref = build4 (ARRAY_REF, fd->iter_type,
+			  counts[fd->collapse + fd->ordered - 1],
+			  size_int (i - fd->collapse + 1),
+			  NULL_TREE, NULL_TREE);
+      expand_omp_build_assign (&gsi, aref, build_zero_cst (fd->iter_type));
       if (!gsi_end_p (gsi))
 	gsi_prev (&gsi);
       else
@@ -7350,7 +7348,20 @@ expand_omp_for_ordered_loops (struct omp
 	  t = fold_build2 (PLUS_EXPR, fd->iter_type, counts[i],
 			   build_int_cst (fd->iter_type, 1));
 	  expand_omp_build_assign (&gsi, counts[i], t);
+	  t = counts[i];
 	}
+      else
+	{
+	  t = fold_build2 (MINUS_EXPR, TREE_TYPE (fd->loops[i].v),
+			   fd->loops[i].v, fd->loops[i].n1);
+	  t = fold_convert (fd->iter_type, t);
+	  t = force_gimple_operand_gsi (&gsi, t, true, NULL_TREE,
+					true, GSI_SAME_STMT);
+	}
+      aref = build4 (ARRAY_REF, fd->iter_type,
+		     counts[fd->collapse + fd->ordered - 1],
+		     size_int (i - fd->collapse + 1), NULL_TREE, NULL_TREE);
+      expand_omp_build_assign (&gsi, aref, t);
       gsi_prev (&gsi);
       edge e2 = split_block (cont_bb, gsi_stmt (gsi));
       basic_block new_header = e2->dest;
@@ -7525,7 +7536,7 @@ expand_omp_for_generic (struct omp_regio
       basic_block zero_iter1_bb = NULL, zero_iter2_bb = NULL, l2_dom_bb = NULL;
 
       counts = XALLOCAVEC (tree, fd->collapse
-				 + (fd->ordered ? fd->ordered - 1 : 0));
+				 + (fd->ordered ? fd->ordered - 1 + 1 : 0));
       expand_omp_for_init_counts (fd, &gsi, entry_bb, counts,
 				  zero_iter1_bb, first_zero_iter1,
 				  zero_iter2_bb, first_zero_iter2, l2_dom_bb);
@@ -7910,17 +7921,50 @@ expand_omp_for_generic (struct omp_regio
     expand_omp_for_init_vars (fd, &gsi, counts, inner_stmt, startvar);
 
   if (fd->ordered)
-    expand_omp_ordered_source_sink (region, fd, counts, cont_bb);
-  cont_bb = expand_omp_for_ordered_loops (fd, counts, cont_bb, l1_bb);
-  if (fd->ordered && counts[fd->collapse - 1])
     {
-      gcc_assert (fd->collapse == 1);
+      /* Until now, counts array contained number of iterations or
+	 variable containing it for ith loop.  From now on, we need
+	 those counts only for collapsed loops, and only for the 2nd
+	 till the last collapsed one.  Move those one element earlier,
+	 we'll use counts[fd->collapse - 1] for the first source/sink
+	 iteration counter and so on and counts[fd->collapse + fd->ordered - 1]
+	 as the array holding the current counter values for
+	 depend(source).  */
+      if (fd->collapse > 1)
+	memmove (counts, counts + 1, (fd->collapse - 1) * sizeof (counts[0]));
+      expand_omp_ordered_source_sink (region, fd, counts, cont_bb);
+      cont_bb = expand_omp_for_ordered_loops (fd, counts, cont_bb, l1_bb);
+      if (counts[fd->collapse - 1])
+	{
+	  gcc_assert (fd->collapse == 1);
+	  gsi = gsi_last_bb (l0_bb);
+	  expand_omp_build_assign (&gsi, counts[fd->collapse - 1],
+				   istart0, true);
+	  gsi = gsi_last_bb (cont_bb);
+	  t = fold_build2 (PLUS_EXPR, fd->iter_type, counts[fd->collapse - 1],
+			   build_int_cst (fd->iter_type, 1));
+	  expand_omp_build_assign (&gsi, counts[fd->collapse - 1], t);
+	  tree aref = build4 (ARRAY_REF, fd->iter_type,
+			      counts[fd->collapse + fd->ordered - 1],
+			      size_zero_node, NULL_TREE, NULL_TREE);
+	  expand_omp_build_assign (&gsi, aref, counts[fd->collapse - 1]);
+	  t = counts[fd->collapse - 1];
+	}
+      else if (fd->collapse > 1)
+	t = fd->loop.v;
+      else
+	{
+	  t = fold_build2 (MINUS_EXPR, TREE_TYPE (fd->loops[0].v),
+			   fd->loops[0].v, fd->loops[0].n1);
+	  t = fold_convert (fd->iter_type, t);
+	}
       gsi = gsi_last_bb (l0_bb);
-      expand_omp_build_assign (&gsi, counts[fd->collapse - 1], istart0, true);
-      gsi = gsi_last_bb (cont_bb);
-      t = fold_build2 (PLUS_EXPR, fd->iter_type, counts[fd->collapse - 1],
-		       build_int_cst (fd->iter_type, 1));
-      expand_omp_build_assign (&gsi, counts[fd->collapse - 1], t);
+      tree aref = build4 (ARRAY_REF, fd->iter_type,
+			  counts[fd->collapse + fd->ordered - 1],
+			  size_zero_node, NULL_TREE, NULL_TREE);
+      t = force_gimple_operand_gsi (&gsi, t, true, NULL_TREE,
+				    false, GSI_CONTINUE_LINKING);
+      expand_omp_build_assign (&gsi, aref, t, true);
     }
 
   if (!broken_loop)
@@ -7946,6 +7990,24 @@ expand_omp_for_generic (struct omp_regio
 	  assign_stmt = gimple_build_assign (vback, t);
 	  gsi_insert_before (&gsi, assign_stmt, GSI_SAME_STMT);
 
+	  if (fd->ordered && counts[fd->collapse - 1] == NULL_TREE)
+	    {
+	      if (fd->collapse > 1)
+		t = fd->loop.v;
+	      else
+		{
+		  t = fold_build2 (MINUS_EXPR, TREE_TYPE (fd->loops[0].v),
+				   fd->loops[0].v, fd->loops[0].n1);
+		  t = fold_convert (fd->iter_type, t);
+		}
+	      tree aref = build4 (ARRAY_REF, fd->iter_type,
+				  counts[fd->collapse + fd->ordered - 1],
+				  size_zero_node, NULL_TREE, NULL_TREE);
+	      t = force_gimple_operand_gsi (&gsi, t, true, NULL_TREE,
+					    true, GSI_SAME_STMT);
+	      expand_omp_build_assign (&gsi, aref, t);
+	    }
+
 	  t = build2 (fd->loop.cond_code, boolean_type_node,
 		      DECL_P (vback) && TREE_ADDRESSABLE (vback) ? t : vback,
 		      iend);
@@ -7986,6 +8048,14 @@ expand_omp_for_generic (struct omp_regio
   if (gimple_omp_return_lhs (gsi_stmt (gsi)))
     gimple_call_set_lhs (call_stmt, gimple_omp_return_lhs (gsi_stmt (gsi)));
   gsi_insert_after (&gsi, call_stmt, GSI_SAME_STMT);
+  if (fd->ordered)
+    {
+      tree arr = counts[fd->collapse + fd->ordered - 1];
+      tree clobber = build_constructor (TREE_TYPE (arr), NULL);
+      TREE_THIS_VOLATILE (clobber) = 1;
+      gsi_insert_after (&gsi, gimple_build_assign (arr, clobber),
+			GSI_SAME_STMT);
+    }
   gsi_remove (&gsi, true);
 
   /* Connect the new blocks.  */
@@ -12801,7 +12871,12 @@ lower_omp_ordered (gimple_stmt_iterator
   if (find_omp_clause (gimple_omp_ordered_clauses (ord_stmt),
 		       OMP_CLAUSE_DEPEND))
     {
-      lower_omp_ordered_clauses (gsi_p, ord_stmt, ctx);
+      /* FIXME: This needs to be moved to the expansion to verify various
+	 conditions only testable on cfg with dominators computed, and also
+	 all the depend clauses to be merged still might need to be available
+	 for the runtime checks.  */
+      if (0)
+	lower_omp_ordered_clauses (gsi_p, ord_stmt, ctx);
       return;
     }
 
--- gcc/omp-builtins.def.jj	2015-09-17 09:23:39.000000000 +0200
+++ gcc/omp-builtins.def	2015-09-24 13:33:02.542783166 +0200
@@ -247,7 +247,7 @@ DEF_GOMP_BUILTIN (BUILT_IN_GOMP_ORDERED_
 DEF_GOMP_BUILTIN (BUILT_IN_GOMP_ORDERED_END, "GOMP_ordered_end",
 		  BT_FN_VOID, ATTR_NOTHROW_LEAF_LIST)
 DEF_GOMP_BUILTIN (BUILT_IN_GOMP_DOACROSS_POST, "GOMP_doacross_post",
-		  BT_FN_VOID_LONG_VAR, ATTR_NOTHROW_LEAF_LIST)
+		  BT_FN_VOID_PTR, ATTR_NOTHROW_LEAF_LIST)
 DEF_GOMP_BUILTIN (BUILT_IN_GOMP_DOACROSS_WAIT, "GOMP_doacross_wait",
 		  BT_FN_VOID_LONG_VAR, ATTR_NOTHROW_LEAF_LIST)
 DEF_GOMP_BUILTIN (BUILT_IN_GOMP_PARALLEL, "GOMP_parallel",
--- gcc/testsuite/gcc.dg/gomp/sink-fold-1.c.jj	2015-09-18 18:34:26.000000000 +0200
+++ gcc/testsuite/gcc.dg/gomp/sink-fold-1.c	2015-09-24 19:35:17.463625989 +0200
@@ -28,4 +28,4 @@ funk ()
     }
 }
 
-/* { dg-final { scan-tree-dump-times "omp ordered depend\\(sink:i-2,j-2,k\\+2\\)" 1 "omplower" } } */
+/* { dg-final { scan-tree-dump-times "omp ordered depend\\(sink:i-2,j-2,k\\+2\\)" 1 "omplower" { xfail *-*-* } } } */
--- gcc/testsuite/gcc.dg/gomp/sink-fold-2.c.jj	2015-08-24 14:32:06.000000000 +0200
+++ gcc/testsuite/gcc.dg/gomp/sink-fold-2.c	2015-09-24 19:37:06.054038766 +0200
@@ -11,8 +11,8 @@ funk ()
   for (i=0; i < N; i += 3)
     for (j=0; j < N; ++j)
     {
-#pragma omp ordered depend(sink:i-8,j-1) /* { dg-warning "ignoring sink clause with offset that is not a multiple" } */
-#pragma omp ordered depend(sink:i+3,j-1) /* { dg-error "first offset must be in opposite direction" } */
+#pragma omp ordered depend(sink:i-8,j-1) /* { dg-warning "ignoring sink clause with offset that is not a multiple" "" { xfail *-*-* } } */
+#pragma omp ordered depend(sink:i+3,j-1) /* { dg-error "first offset must be in opposite direction" "" { xfail *-*-* } } */
         bar();
 #pragma omp ordered depend(source)
     }
--- libgomp/ordered.c.jj	2015-09-18 18:36:42.000000000 +0200
+++ libgomp/ordered.c	2015-09-24 18:20:28.286244397 +0200
@@ -27,6 +27,8 @@
 
 #include "libgomp.h;"
 #include <stdarg.h>
+#include <string.h>
+#include "doacross.h"
 
 
 /* This function is called when first allocating an iteration block.  That
@@ -252,14 +254,146 @@ GOMP_ordered_end (void)
 {
 }
 
+/* DOACROSS initialization.  */
+
+#define MAX_COLLAPSED_BITS (__SIZEOF_LONG__ * __CHAR_BIT__)
+
+void
+gomp_doacross_init (unsigned ncounts, long *counts, long chunk_size)
+{
+  struct gomp_thread *thr = gomp_thread ();
+  struct gomp_team *team = thr->ts.team;
+  struct gomp_work_share *ws = thr->ts.work_share;
+  unsigned int i, bits[MAX_COLLAPSED_BITS], num_bits = 0;
+  unsigned long ent, num_ents, elt_sz, shift_sz;
+  struct gomp_doacross_work_share *doacross;
+
+  if (team == NULL || team->nthreads == 1)
+    return;
+
+  for (i = 0; i < ncounts; i++)
+    {
+      /* If any count is 0, GOMP_doacross_{post,wait} can't be called.  */
+      if (counts[i] == 0)
+	return;
+
+      if (num_bits <= MAX_COLLAPSED_BITS)
+	{
+	  unsigned int this_bits;
+	  if (counts[i] == 1)
+	    this_bits = 1;
+	  else
+	    this_bits = __SIZEOF_LONG__ * __CHAR_BIT__
+			- __builtin_clzl (counts[i] - 1);
+	  if (num_bits + this_bits <= MAX_COLLAPSED_BITS)
+	    {
+	      bits[i] = this_bits;
+	      num_bits += this_bits;
+	    }
+	  else
+	    num_bits = MAX_COLLAPSED_BITS + 1;
+	}
+    }
+
+  if (ws->sched == GFS_STATIC)
+    num_ents = team->nthreads;
+  else
+    num_ents = (counts[0] - 1) / chunk_size + 1;
+  if (num_bits <= MAX_COLLAPSED_BITS)
+    {
+      elt_sz = sizeof (unsigned long);
+      shift_sz = ncounts * sizeof (unsigned int);
+    }
+  else
+    {
+      elt_sz = sizeof (unsigned long) * ncounts;
+      shift_sz = 0;
+    }
+  elt_sz = (elt_sz + 63) & ~63UL;
+
+  doacross = gomp_malloc (sizeof (*doacross) + 63 + num_ents * elt_sz
+			  + shift_sz);
+  doacross->chunk_size = chunk_size;
+  doacross->elt_sz = elt_sz;
+  doacross->ncounts = ncounts;
+  doacross->flattened = false;
+  doacross->boundary = 0;
+  doacross->array = (unsigned char *)
+		    ((((uintptr_t) (doacross + 1)) + 63 + shift_sz)
+		     & ~(uintptr_t) 63);
+  if (num_bits <= MAX_COLLAPSED_BITS)
+    {
+      unsigned int shift_count = 0;
+      doacross->flattened = true;
+      for (i = ncounts; i > 0; i--)
+	{
+	  doacross->shift_counts[i - 1] = shift_count;
+	  shift_count += bits[i - 1];
+	}
+      for (ent = 0; ent < num_ents; ent++)
+	*(unsigned long *) (doacross->array + ent * elt_sz) = 0;
+    }
+  else
+    for (ent = 0; ent < num_ents; ent++)
+      memset (doacross->array + ent * elt_sz, '\0',
+	      sizeof (unsigned long) * ncounts);
+  if (ws->sched == GFS_STATIC && chunk_size == 0)
+    {
+      unsigned long q = counts[0] / num_ents;
+      unsigned long t = counts[0] % num_ents;
+      doacross->boundary = t * (q + 1);
+      doacross->q = q;
+      doacross->t = t;
+    }
+  ws->doacross = doacross;
+}
+
 /* DOACROSS POST operation.  */
 
 void
-GOMP_doacross_post (long first, ...)
+GOMP_doacross_post (long *counts)
 {
-  va_list ap;
-  va_start (ap, first);
-  va_end (ap);
+  struct gomp_thread *thr = gomp_thread ();
+  struct gomp_work_share *ws = thr->ts.work_share;
+  struct gomp_doacross_work_share *doacross = ws->doacross;
+  unsigned long ent;
+  unsigned int i;
+
+  if (__builtin_expect (doacross == NULL, 0))
+    {
+      __sync_synchronize ();
+      return;
+    }
+
+  if (__builtin_expect (ws->sched == GFS_STATIC, 1))
+    ent = thr->ts.team_id;
+  else
+    ent = counts[0] / doacross->chunk_size;
+  unsigned long *array = (unsigned long *) (doacross->array
+					    + ent * doacross->elt_sz);
+
+  if (__builtin_expect (doacross->flattened, 1))
+    {
+      unsigned long flattened
+	= (unsigned long) counts[0] << doacross->shift_counts[0];
+
+      for (i = 1; i < doacross->ncounts; i++)
+	flattened |= (unsigned long) counts[i]
+		     << doacross->shift_counts[i];
+      flattened++;
+      if (flattened == __atomic_load_n (array, MEMMODEL_ACQUIRE))
+	__atomic_thread_fence (MEMMODEL_RELEASE);
+      else
+	__atomic_store_n (array, flattened, MEMMODEL_RELEASE);
+      return;
+    }
+
+  __atomic_thread_fence (MEMMODEL_ACQUIRE);
+  for (i = doacross->ncounts; i-- > 0; )
+    {
+      if (counts[i] + 1UL != __atomic_load_n (&array[i], MEMMODEL_RELAXED))
+	__atomic_store_n (&array[i], counts[i] + 1UL, MEMMODEL_RELEASE);
+    }
 }
 
 /* DOACROSS WAIT operation.  */
@@ -267,7 +401,81 @@ GOMP_doacross_post (long first, ...)
 void
 GOMP_doacross_wait (long first, ...)
 {
+  struct gomp_thread *thr = gomp_thread ();
+  struct gomp_work_share *ws = thr->ts.work_share;
+  struct gomp_doacross_work_share *doacross = ws->doacross;
   va_list ap;
-  va_start (ap, first);
-  va_end (ap);
+  unsigned long ent;
+  unsigned int i;
+
+  if (__builtin_expect (doacross == NULL, 0))
+    {
+      __sync_synchronize ();
+      return;
+    }
+
+  if (__builtin_expect (ws->sched == GFS_STATIC, 1))
+    {
+      if (ws->chunk_size == 0)
+	{
+	  if (first < doacross->boundary)
+	    ent = first / (doacross->q + 1);
+	  else
+	    ent = (first - doacross->boundary) / doacross->q
+		  + doacross->t;
+	}
+      else
+	ent = first / ws->chunk_size % thr->ts.team->nthreads;
+    }
+  else
+    ent = first / doacross->chunk_size;
+  unsigned long *array = (unsigned long *) (doacross->array
+					    + ent * doacross->elt_sz);
+
+  if (__builtin_expect (doacross->flattened, 1))
+    {
+      unsigned long flattened
+	= (unsigned long) first << doacross->shift_counts[0];
+      unsigned long cur;
+
+      va_start (ap, first);
+      for (i = 1; i < doacross->ncounts; i++)
+	flattened |= (unsigned long) va_arg (ap, long)
+		     << doacross->shift_counts[i];
+      cur = __atomic_load_n (array, MEMMODEL_ACQUIRE);
+      if (flattened < cur)
+	{
+	  __atomic_thread_fence (MEMMODEL_RELEASE);
+	  va_end (ap);
+	  return;
+	}
+      doacross_spin (array, flattened, cur);
+      __atomic_thread_fence (MEMMODEL_RELEASE);
+      va_end (ap);
+      return;
+    }
+
+  do
+    {
+      va_start (ap, first);
+      for (i = 0; i < doacross->ncounts; i++)
+	{
+	  unsigned long thisv
+	    = (unsigned long) (i ? va_arg (ap, long) : first) + 1;
+	  unsigned long cur = __atomic_load_n (&array[i], MEMMODEL_RELAXED);
+	  if (thisv < cur)
+	    {
+	      i = doacross->ncounts;
+	      break;
+	    }
+	  if (thisv > cur)
+	    break;
+	}
+      va_end (ap);
+      if (i == doacross->ncounts)
+	break;
+      cpu_relax ();
+    }
+  while (1);
+  __sync_synchronize ();
 }
--- libgomp/loop.c.jj	2015-09-16 14:21:10.000000000 +0200
+++ libgomp/loop.c	2015-09-22 15:29:40.016583743 +0200
@@ -306,7 +306,7 @@ gomp_loop_doacross_static_start (unsigne
     {
       gomp_loop_init (thr->ts.work_share, 0, counts[0], 1,
 		      GFS_STATIC, chunk_size);
-      /* gomp_ordered_static_init (); */
+      gomp_doacross_init (ncounts, counts, chunk_size);
       gomp_work_share_init_done ();
     }
 
@@ -324,6 +324,7 @@ gomp_loop_doacross_dynamic_start (unsign
     {
       gomp_loop_init (thr->ts.work_share, 0, counts[0], 1,
 		      GFS_DYNAMIC, chunk_size);
+      gomp_doacross_init (ncounts, counts, chunk_size);
       gomp_work_share_init_done ();
     }
 
@@ -349,6 +350,7 @@ gomp_loop_doacross_guided_start (unsigne
     {
       gomp_loop_init (thr->ts.work_share, 0, counts[0], 1,
 		      GFS_GUIDED, chunk_size);
+      gomp_doacross_init (ncounts, counts, chunk_size);
       gomp_work_share_init_done ();
     }
 
--- libgomp/libgomp_g.h.jj	2015-09-17 09:25:23.000000000 +0200
+++ libgomp/libgomp_g.h	2015-09-24 13:33:32.726324481 +0200
@@ -177,7 +177,7 @@ extern bool GOMP_loop_ull_ordered_runtim
 
 extern void GOMP_ordered_start (void);
 extern void GOMP_ordered_end (void);
-extern void GOMP_doacross_post (long, ...);
+extern void GOMP_doacross_post (long *);
 extern void GOMP_doacross_wait (long, ...);
 
 /* parallel.c */
--- libgomp/libgomp.h.jj	2015-09-08 10:18:44.000000000 +0200
+++ libgomp/libgomp.h	2015-09-23 12:25:51.866847300 +0200
@@ -78,6 +78,36 @@ enum gomp_schedule_type
   GFS_AUTO
 };
 
+struct gomp_doacross_work_share
+{
+  union {
+    /* chunk_size copy, as ws->chunk_size is multiplied by incr for
+       GFS_DYNAMIC.  */
+    long chunk_size;
+    /* For schedule(static,0) this is the number
+       of iterations assigned to the last thread, i.e. number of
+       iterations / number of threads.  */
+    long q;
+  };
+  /* Size of each array entry (padded to cache line size).  */
+  unsigned long elt_sz;
+  /* Number of dimensions in sink vectors.  */
+  unsigned int ncounts;
+  /* True if the iterations can be flattened.  */
+  bool flattened;
+  /* Actual array (of elt_sz sized units), aligned to cache line size.
+     This is indexed by team_id for GFS_STATIC and outermost iteration
+     / chunk_size for other schedules.  */
+  unsigned char *array;
+  /* These two are only used for schedule(static,0).  */
+  /* This one is number of iterations % number of threads.  */
+  long t;
+  /* And this one is cached t * (q + 1).  */
+  long boundary;
+  /* Array of shift counts for each dimension if they can be flattened.  */
+  unsigned int shift_counts[];
+};
+
 struct gomp_work_share
 {
   /* This member records the SCHEDULE clause to be used for this construct.
@@ -109,13 +139,18 @@ struct gomp_work_share
     };
   };
 
-  /* This is a circular queue that details which threads will be allowed
-     into the ordered region and in which order.  When a thread allocates
-     iterations on which it is going to work, it also registers itself at
-     the end of the array.  When a thread reaches the ordered region, it
-     checks to see if it is the one at the head of the queue.  If not, it
-     blocks on its RELEASE semaphore.  */
-  unsigned *ordered_team_ids;
+  union {
+    /* This is a circular queue that details which threads will be allowed
+       into the ordered region and in which order.  When a thread allocates
+       iterations on which it is going to work, it also registers itself at
+       the end of the array.  When a thread reaches the ordered region, it
+       checks to see if it is the one at the head of the queue.  If not, it
+       blocks on its RELEASE semaphore.  */
+    unsigned *ordered_team_ids;
+
+    /* This is a pointer to DOACROSS work share data.  */
+    struct gomp_doacross_work_share *doacross;
+  };
 
   /* This is the number of threads that have registered themselves in
      the circular queue ordered_team_ids.  */
@@ -647,6 +682,7 @@ extern void gomp_ordered_next (void);
 extern void gomp_ordered_static_init (void);
 extern void gomp_ordered_static_next (void);
 extern void gomp_ordered_sync (void);
+extern void gomp_doacross_init (unsigned, long *, long);
 
 /* parallel.c */
 
--- libgomp/config/linux/doacross.h.jj	2015-09-23 12:10:32.039275447 +0200
+++ libgomp/config/linux/doacross.h	2015-09-23 12:33:03.861556163 +0200
@@ -0,0 +1,57 @@
+/* Copyright (C) 2015 Free Software Foundation, Inc.
+   Contributed by Jakub Jelinek <jakub@redhat.com>.
+
+   This file is part of the GNU Offloading and Multi Processing Library
+   (libgomp).
+
+   Libgomp is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* This is a Linux specific implementation of doacross spinning.  */
+
+#ifndef GOMP_DOACROSS_H
+#define GOMP_DOACROSS_H 1
+
+#include "libgomp.h"
+#include <errno.h>
+#include "wait.h"
+
+#ifdef HAVE_ATTRIBUTE_VISIBILITY
+# pragma GCC visibility push(hidden)
+#endif
+
+static inline void doacross_spin (unsigned long *addr, unsigned long expected,
+				  unsigned long cur)
+{
+  /* FIXME: back off depending on how large expected - cur is.  */
+  do
+    {
+      cpu_relax ();
+      cur = __atomic_load_n (addr, MEMMODEL_RELAXED);
+      if (expected < cur)
+	return;
+    }
+  while (1);
+}
+
+#ifdef HAVE_ATTRIBUTE_VISIBILITY
+# pragma GCC visibility pop
+#endif
+
+#endif /* GOMP_DOACROSS_H */
--- libgomp/config/posix/doacross.h.jj	2015-09-23 12:17:53.217834221 +0200
+++ libgomp/config/posix/doacross.h	2015-09-24 10:51:18.310081801 +0200
@@ -0,0 +1,62 @@
+/* Copyright (C) 2015 Free Software Foundation, Inc.
+   Contributed by Jakub Jelinek <jakub@redhat.com>.
+
+   This file is part of the GNU Offloading and Multi Processing Library
+   (libgomp).
+
+   Libgomp is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* This is a generic implementation of doacross spinning.  */
+
+#ifndef GOMP_DOACROSS_H
+#define GOMP_DOACROSS_H 1
+
+#include "libgomp.h"
+#include <errno.h>
+
+#ifdef HAVE_ATTRIBUTE_VISIBILITY
+# pragma GCC visibility push(hidden)
+#endif
+
+static inline void
+cpu_relax (void)
+{
+  __asm volatile ("" : : : "memory");
+}
+
+static inline void doacross_spin (unsigned long *addr, unsigned long expected,
+				  unsigned long cur)
+{
+  /* FIXME: back off depending on how large expected - cur is.  */
+  do
+    {
+      cpu_relax ();
+      cur = __atomic_load_n (addr, MEMMODEL_RELAXED);
+      if (expected < cur)
+	return;
+    }
+  while (1);
+}
+
+#ifdef HAVE_ATTRIBUTE_VISIBILITY
+# pragma GCC visibility pop
+#endif
+
+#endif /* GOMP_DOACROSS_H */
--- libgomp/testsuite/libgomp.c/doacross-1.c.jj	2015-09-23 13:38:24.270726352 +0200
+++ libgomp/testsuite/libgomp.c/doacross-1.c	2015-09-24 19:32:40.371922136 +0200
@@ -0,0 +1,182 @@
+extern void abort (void);
+
+#define N 256
+int a[N], b[N / 16][8][4], c[N / 32][8][8];
+volatile int d, e;
+
+int
+main ()
+{
+  int i, j, k, l, m;
+  #pragma omp parallel private (l)
+  {
+    #pragma omp for schedule(static, 1) ordered (1) nowait
+    for (i = 0; i < N; i++)
+      {
+	#pragma omp atomic write
+	a[i] = 1;
+	#pragma omp ordered depend(sink: i - 1)
+	if (i)
+	  {
+	    #pragma omp atomic read
+	    l = a[i - 1];
+	    if (l < 2)
+	      abort ();
+	  }
+	#pragma omp atomic write
+	a[i] = 2;
+	if (i < N - 1)
+	  {
+	    #pragma omp atomic read
+	    l = a[i + 1];
+	    if (l == 3)
+	      abort ();
+	  }
+	#pragma omp ordered depend(source)
+	#pragma omp atomic write
+	a[i] = 3;
+      }
+    #pragma omp for schedule(static, 0) ordered (3) nowait
+    for (i = 2; i < N / 16 - 1; i++)
+      for (j = 0; j < 8; j += 2)
+	for (k = 1; k <= 3; k++)
+	  {
+	    #pragma omp atomic write
+	    b[i][j][k] = 1;
+	    #pragma omp ordered depend(sink: i, j - 2, k - 1) \
+				depend(sink: i - 2, j - 2, k + 1)
+	    #pragma omp ordered depend(sink: i - 3, j + 2, k - 2)
+	    if (j >= 2 && k > 1)
+	      {
+		#pragma omp atomic read
+		l = b[i][j - 2][k - 1];
+		if (l < 2)
+		  abort ();
+	      }
+	    #pragma omp atomic write
+	    b[i][j][k] = 2;
+	    if (i >= 4 && j >= 2 && k < 3)
+	      {
+		#pragma omp atomic read
+		l = b[i - 2][j - 2][k + 1];
+		if (l < 2)
+		  abort ();
+	      }
+	    if (i >= 5 && j < N / 16 - 3 && k == 3)
+	      {
+		#pragma omp atomic read
+		l = b[i - 3][j + 2][k - 2];
+		if (l < 2)
+		  abort ();
+	      }
+	    #pragma omp ordered depend(source)
+	    #pragma omp atomic write
+	    b[i][j][k] = 3;
+	  }
+#define A(n) int n;
+#define B(n) A(n##0) A(n##1) A(n##2) A(n##3)
+#define C(n) B(n##0) B(n##1) B(n##2) B(n##3)
+#define D(n) C(n##0) C(n##1) C(n##2) C(n##3)
+    D(m)
+#undef A
+    #pragma omp for collapse (2) ordered(60) schedule(dynamic, 15)
+    for (i = 0; i < N / 32; i++)
+      for (j = 7; j > 1; j--)
+	for (k = 6; k >= 0; k -= 2)
+#define A(n) for (n = 4; n < 5; n++)
+	  D(m)
+#undef A
+	    {
+	      #pragma omp atomic write
+	      c[i][j][k] = 1;
+#define A(n) ,n
+#define E(n) C(n##0) C(n##1) C(n##2) B(n##30) B(n##31) A(n##320) A(n##321)
+	      #pragma omp ordered depend (sink: i, j, k + 2 E(m)) \
+				  depend (sink:i - 2, j + 1, k - 4 E(m)) \
+				  depend(sink: i - 1, j - 2, k - 2 E(m))
+	      if (k <= 4)
+		{
+		  l = c[i][j][k + 2];
+		  if (l < 2)
+		    abort ();
+		}
+	      #pragma omp atomic write
+	      c[i][j][k] = 2;
+	      if (i >= 2 && j < 7 && k >= 4)
+		{
+		  l = c[i - 2][j + 1][k - 4];
+		  if (l < 2)
+		    abort ();
+		}
+	      if (i >= 1 && j >= 4 && k >= 2)
+		{
+		  l = c[i - 1][j - 2][k - 2];
+		  if (l < 2)
+		    abort ();
+		}
+	      #pragma omp ordered depend (source)
+	      #pragma omp atomic write
+	      c[i][j][k] = 3;
+	    }
+
+    #pragma omp for collapse(2) ordered(3) lastprivate (i, j, k)
+    for (i = 0; i < d + 1; i++)
+      for (j = d + 1; j >= 0; j--)
+	for (k = 0; k < d; k++)
+	  for (l = 0; l < d + 2; l++)
+	    {
+	      #pragma omp ordered depend (source)
+	      #pragma omp ordered depend (sink:i - 2, j + 2, k - 2, l)
+	      if (!e)
+		abort ();
+	    }
+    #pragma omp single
+    {
+      if (i != 1 || j != -1 || k != 0)
+	abort ();
+      i = 8; j = 9; k = 10;
+    }
+    #pragma omp for collapse(2) ordered(3) lastprivate (i, j, k, m)
+    for (i = 0; i < d + 1; i++)
+      for (j = d + 1; j >= 0; j--)
+	for (k = 0; k < d + 2; k++)
+	  for (m = 0; m < d; m++)
+	    {
+	      #pragma omp ordered depend (source)
+	      #pragma omp ordered depend (sink:i - 2, j + 2, k - 2, m)
+	      if (!e)
+		abort ();
+	    }
+    #pragma omp single
+    if (i != 1 || j != -1 || k != 2 || m != 0)
+      abort ();
+    #pragma omp for collapse(2) ordered(3) nowait
+    for (i = 0; i < d + 1; i++)
+      for (j = d; j > 0; j--)
+	for (k = 0; k < d + 2; k++)
+	  for (l = 0; l < d + 4; l++)
+	    {
+	      #pragma omp ordered depend (source)
+	      #pragma omp ordered depend (sink:i - 2, j + 2, k - 2, l)
+	      if (!e)
+		abort ();
+	    }
+    #pragma omp for nowait
+    for (i = 0; i < N; i++)
+      if (a[i] != 3)
+	abort ();
+    #pragma omp for collapse(2) private(k) nowait
+    for (i = 0; i < N / 16; i++)
+      for (j = 0; j < 8; j++)
+	for (k = 0; k < 4; k++)
+	  if (b[i][j][k] != 3 * (i >= 2 && i < N / 16 - 1 && (j & 1) == 0 && k >= 1))
+	    abort ();
+    #pragma omp for collapse(3) nowait
+    for (i = 0; i < N / 32; i++)
+      for (j = 0; j < 8; j++)
+	for (k = 0; k < 8; k++)
+	  if (c[i][j][k] != 3 * (j >= 2 && (k & 1) == 0))
+	    abort ();
+  }
+  return 0;
+}

	Jakub


* [gomp4.1] Fixup handling of doacross loops with noreturn body
  2015-09-24 20:04 [gomp4.1] Doacross library implementation Jakub Jelinek
@ 2015-09-29 18:33 ` Jakub Jelinek
  2015-10-01 12:08 ` [gomp4.1] Fixup doacross lastprivate handling Jakub Jelinek
  2015-10-08 12:48 ` [gomp4.1] Doacross library implementation Torvald Riegel
  2 siblings, 0 replies; 5+ messages in thread
From: Jakub Jelinek @ 2015-09-29 18:33 UTC (permalink / raw)
  To: gcc-patches

On Thu, Sep 24, 2015 at 08:32:10PM +0200, Jakub Jelinek wrote:
> then there is a bug with ordered loops that have noreturn body (need to add
> some edge for that case and condition checking),

This patch fixes the above issue.  If any of the ordered > collapse loops
might have zero iterations, we need to deal with the !cont_bb (aka
broken_loop) case; for lastprivate reasons it is not enough to just check
the conditions and fall through into the cont_bb case, we have to emit all
the loops and, only for the innermost one, handle the case that there is
no fallthru from the body to the cont_bb block.  The innermost loop could
have zero iterations while some of the outer ones have all non-zero
iterations, at which point we want lastprivate to contain the initial value
of the innermost iterator and the last iteration's values of the outer ones.
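
As a concrete (hand-written, not from the patch) example of the shape
being fixed: with n equal to 0 below, the collapsed i/j loops have one
iteration each, the innermost ordered k loop has none, and the loop body
consists solely of a noreturn call, so there is no fallthru edge from the
body to the continue block:

void
foo (int n)  /* Assume n == 0 at run time.  */
{
  int i, j, k;
  #pragma omp for collapse(2) ordered(3) lastprivate (i, j, k)
  for (i = 0; i < n + 1; i++)
    for (j = 0; j < n + 1; j++)
      for (k = 0; k < n; k++)  /* Zero iterations when n == 0.  */
        {
          #pragma omp ordered depend (source)
          #pragma omp ordered depend (sink: i - 1, j, k)
          __builtin_abort ();  /* noreturn body.  */
        }
}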

2015-09-29  Jakub Jelinek  <jakub@redhat.com>

	* omp-low.c (expand_omp_for_ordered_loops): Handle the case
	when cont_bb has no predecessors.
	(expand_omp_for_generic): If any of the ordered loops above
	collapsed loops could have zero iterations for broken_loop,
	create a cont_bb and continue as if the loop is not broken.

	* testsuite/libgomp.c/doacross-1.c (main): Adjust, so that one
	of the doacross loops has noreturn loop body.

--- gcc/omp-low.c.jj	2015-09-25 18:17:13.000000000 +0200
+++ gcc/omp-low.c	2015-09-29 19:07:25.366494422 +0200
@@ -7345,36 +7345,44 @@ expand_omp_for_ordered_loops (struct omp
       basic_block new_body = e1->dest;
       if (body_bb == cont_bb)
 	cont_bb = new_body;
-      gsi = gsi_last_bb (cont_bb);
-      if (POINTER_TYPE_P (type))
-	t = fold_build_pointer_plus (fd->loops[i].v,
-				     fold_convert (sizetype,
-						   fd->loops[i].step));
-      else
-	t = fold_build2 (PLUS_EXPR, type, fd->loops[i].v,
-			 fold_convert (type, fd->loops[i].step));
-      expand_omp_build_assign (&gsi, fd->loops[i].v, t);
-      if (counts[i])
-	{
-	  t = fold_build2 (PLUS_EXPR, fd->iter_type, counts[i],
-			   build_int_cst (fd->iter_type, 1));
-	  expand_omp_build_assign (&gsi, counts[i], t);
-	  t = counts[i];
-	}
-      else
+      edge e2 = NULL;
+      basic_block new_header;
+      if (EDGE_COUNT (cont_bb->preds) > 0)
 	{
-	  t = fold_build2 (MINUS_EXPR, TREE_TYPE (fd->loops[i].v),
-			   fd->loops[i].v, fd->loops[i].n1);
-	  t = fold_convert (fd->iter_type, t);
-	  t = force_gimple_operand_gsi (&gsi, t, true, NULL_TREE,
-					true, GSI_SAME_STMT);
+	  gsi = gsi_last_bb (cont_bb);
+	  if (POINTER_TYPE_P (type))
+	    t = fold_build_pointer_plus (fd->loops[i].v,
+					 fold_convert (sizetype,
+						       fd->loops[i].step));
+	  else
+	    t = fold_build2 (PLUS_EXPR, type, fd->loops[i].v,
+			     fold_convert (type, fd->loops[i].step));
+	  expand_omp_build_assign (&gsi, fd->loops[i].v, t);
+	  if (counts[i])
+	    {
+	      t = fold_build2 (PLUS_EXPR, fd->iter_type, counts[i],
+			       build_int_cst (fd->iter_type, 1));
+	      expand_omp_build_assign (&gsi, counts[i], t);
+	      t = counts[i];
+	    }
+	  else
+	    {
+	      t = fold_build2 (MINUS_EXPR, TREE_TYPE (fd->loops[i].v),
+			       fd->loops[i].v, fd->loops[i].n1);
+	      t = fold_convert (fd->iter_type, t);
+	      t = force_gimple_operand_gsi (&gsi, t, true, NULL_TREE,
+					    true, GSI_SAME_STMT);
+	    }
+	  aref = build4 (ARRAY_REF, fd->iter_type, counts[fd->ordered],
+			 size_int (i - fd->collapse + 1),
+			 NULL_TREE, NULL_TREE);
+	  expand_omp_build_assign (&gsi, aref, t);
+	  gsi_prev (&gsi);
+	  e2 = split_block (cont_bb, gsi_stmt (gsi));
+	  new_header = e2->dest;
 	}
-      aref = build4 (ARRAY_REF, fd->iter_type, counts[fd->ordered],
-		     size_int (i - fd->collapse + 1), NULL_TREE, NULL_TREE);
-      expand_omp_build_assign (&gsi, aref, t);
-      gsi_prev (&gsi);
-      edge e2 = split_block (cont_bb, gsi_stmt (gsi));
-      basic_block new_header = e2->dest;
+      else
+	new_header = cont_bb;
       gsi = gsi_after_labels (new_header);
       tree v = force_gimple_operand_gsi (&gsi, fd->loops[i].v, true, NULL_TREE,
 					 true, GSI_SAME_STMT);
@@ -7395,10 +7403,13 @@ expand_omp_for_ordered_loops (struct omp
       set_immediate_dominator (CDI_DOMINATORS, new_header, body_bb);
       set_immediate_dominator (CDI_DOMINATORS, new_body, new_header);
 
-      struct loop *loop = alloc_loop ();
-      loop->header = new_header;
-      loop->latch = e2->src;
-      add_loop (loop, body_bb->loop_father);
+      if (e2)
+	{
+	  struct loop *loop = alloc_loop ();
+	  loop->header = new_header;
+	  loop->latch = e2->src;
+	  add_loop (loop, body_bb->loop_father);
+	}
     }
   return cont_bb;
 }
@@ -7943,6 +7954,33 @@ expand_omp_for_generic (struct omp_regio
 	 depend(source).  */
       if (fd->collapse > 1)
 	memmove (counts, counts + 1, (fd->collapse - 1) * sizeof (counts[0]));
+      if (broken_loop)
+	{
+	  int i;
+	  for (i = fd->collapse; i < fd->ordered; i++)
+	    {
+	      tree type = TREE_TYPE (fd->loops[i].v);
+	      tree this_cond
+		= fold_build2 (fd->loops[i].cond_code, boolean_type_node,
+			       fold_convert (type, fd->loops[i].n1),
+			       fold_convert (type, fd->loops[i].n2));
+	      if (!integer_onep (this_cond))
+		break;
+	    }
+	  if (i < fd->ordered)
+	    {
+	      cont_bb
+		= create_empty_bb (EXIT_BLOCK_PTR_FOR_FN (cfun)->prev_bb);
+	      add_bb_to_loop (cont_bb, l1_bb->loop_father);
+	      gimple_stmt_iterator gsi = gsi_after_labels (cont_bb);
+	      gimple g = gimple_build_omp_continue (fd->loop.v, fd->loop.v);
+	      gsi_insert_before (&gsi, g, GSI_SAME_STMT);
+	      make_edge (cont_bb, l3_bb, EDGE_FALLTHRU);
+	      make_edge (cont_bb, l1_bb, 0);
+	      l2_bb = create_empty_bb (cont_bb);
+	      broken_loop = false;
+	    }
+	}
       expand_omp_ordered_source_sink (region, fd, counts, cont_bb);
       cont_bb = expand_omp_for_ordered_loops (fd, counts, cont_bb, l1_bb);
       if (counts[fd->collapse - 1])
--- libgomp/testsuite/libgomp.c/doacross-1.c.jj	2015-09-25 16:54:47.000000000 +0200
+++ libgomp/testsuite/libgomp.c/doacross-1.c	2015-09-29 16:36:26.131321339 +0200
@@ -144,8 +144,7 @@ main ()
 	    {
 	      #pragma omp ordered depend (source)
 	      #pragma omp ordered depend (sink:i - 2, j + 2, k - 2, m)
-	      if (!e)
-		abort ();
+	      abort ();
 	    }
     #pragma omp single
     if (i != 1 || j != -1 || k != 2 || m != 0)


	Jakub


* [gomp4.1] Fixup doacross lastprivate handling
  2015-09-24 20:04 [gomp4.1] Doacross library implementation Jakub Jelinek
  2015-09-29 18:33 ` [gomp4.1] Fixup handling of doacross loops with noreturn body Jakub Jelinek
@ 2015-10-01 12:08 ` Jakub Jelinek
  2015-10-08 12:48 ` [gomp4.1] Doacross library implementation Torvald Riegel
  2 siblings, 0 replies; 5+ messages in thread
From: Jakub Jelinek @ 2015-10-01 12:08 UTC (permalink / raw)
  To: gcc-patches

On Thu, Sep 24, 2015 at 08:32:10PM +0200, Jakub Jelinek wrote:
> some edge for that case and condition checking), lastprivate also needs
> checking for all the cases,

This patch handles lastprivate in the doacross loops.  In certain cases
(C++ class iterators and addressable iterators) the user IVs are replaced
with artificial IVs, and the user IVs are assigned (non-class) or adjusted
(class iterators) inside of the body of the loop.  While for normal omp
for (both collapse == 1 and > 1) lastprivate is undefined if there are no
iterations, for doacross it is IMHO undefined only if the collapsed loops
have zero iterations; if they have non-zero iterations, but the ordered
loops nested in them have zero iterations, then the body might never
actually be invoked.  So we need slightly different lastprivate sequences
in that case.  And, to make it more complicated, for the collapse > 1
loops we need to add the step to the artificial IV before that, while for
collapse == 1 loops or for the loops at or beyond the collapse depth we
should not.
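
To spell out the intended semantics with a plain C illustration (hand
written, not from the patch; compare the lastprivate checks in
doacross-1.c): the collapsed loops below have iterations, the innermost
ordered loop does not, so the body never runs, yet lastprivate should
still give i and j the values they would have after sequential execution
while k keeps its initial value:

  int i, j, k;
  #pragma omp parallel for collapse(2) ordered(3) lastprivate (i, j, k)
  for (i = 0; i < 4; i++)
    for (j = 0; j < 4; j++)
      for (k = 0; k < 0; k++)  /* Zero iterations.  */
        {
          #pragma omp ordered depend (source)
          #pragma omp ordered depend (sink: i - 1, j, k)
          ;
        }
  /* Here i == 4 and j == 4, as after sequential execution of the
     collapsed loops, while k == 0, its initial value.  */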

2015-10-01  Jakub Jelinek  <jakub@redhat.com>

	* gimplify.c (gimplify_omp_for): Fix handling of lastprivate
	iterators in doacross loops.
	* omp-low.c (expand_omp_for_ordered_loops): Add ordered_lastprivate
	argument.  If true, add extra initializers for IVs starting with the
	one inner to the first >= collapse loop that could have zero
	iterations.
	(expand_omp_for_generic): Adjust caller.

	* tree-pretty-print.c (dump_omp_clause): Remove unused variable.
gcc/cp/
	* semantics.c (handle_omp_for_class_iterator): Add collapse and
	ordered arguments.  Fix handling of lastprivate iterators in
	doacross loops.
	(finish_omp_for): Adjust caller.
libgomp/
	* testsuite/libgomp.c++/doacross-1.C: New test.

--- gcc/gimplify.c.jj	2015-09-24 20:20:32.000000000 +0200
+++ gcc/gimplify.c	2015-10-01 12:55:44.955218974 +0200
@@ -8108,9 +8108,7 @@ gimplify_omp_for (tree *expr_p, gimple_s
 	  OMP_CLAUSE_LINEAR_STEP (c2) = OMP_CLAUSE_LINEAR_STEP (c);
 	}
 
-      if ((var != decl || collapse > 1)
-	  && orig_for_stmt == for_stmt
-	  && i < collapse)
+      if ((var != decl || collapse > 1) && orig_for_stmt == for_stmt)
 	{
 	  for (c = OMP_FOR_CLAUSES (for_stmt); c ; c = OMP_CLAUSE_CHAIN (c))
 	    if (((OMP_CLAUSE_CODE (c) == OMP_CLAUSE_LASTPRIVATE
@@ -8120,16 +8118,22 @@ gimplify_omp_for (tree *expr_p, gimple_s
 		     && OMP_CLAUSE_LINEAR_GIMPLE_SEQ (c) == NULL))
 		&& OMP_CLAUSE_DECL (c) == decl)
 	      {
-		t = TREE_VEC_ELT (OMP_FOR_INCR (for_stmt), i);
-		gcc_assert (TREE_CODE (t) == MODIFY_EXPR);
-		gcc_assert (TREE_OPERAND (t, 0) == var);
-		t = TREE_OPERAND (t, 1);
-		gcc_assert (TREE_CODE (t) == PLUS_EXPR
-			    || TREE_CODE (t) == MINUS_EXPR
-			    || TREE_CODE (t) == POINTER_PLUS_EXPR);
-		gcc_assert (TREE_OPERAND (t, 0) == var);
-		t = build2 (TREE_CODE (t), TREE_TYPE (decl), decl,
-			    TREE_OPERAND (t, 1));
+		if (is_doacross && (collapse == 1 || i >= collapse))
+		  t = var;
+		else
+		  {
+		    t = TREE_VEC_ELT (OMP_FOR_INCR (for_stmt), i);
+		    gcc_assert (TREE_CODE (t) == MODIFY_EXPR);
+		    gcc_assert (TREE_OPERAND (t, 0) == var);
+		    t = TREE_OPERAND (t, 1);
+		    gcc_assert (TREE_CODE (t) == PLUS_EXPR
+				|| TREE_CODE (t) == MINUS_EXPR
+				|| TREE_CODE (t) == POINTER_PLUS_EXPR);
+		    gcc_assert (TREE_OPERAND (t, 0) == var);
+		    t = build2 (TREE_CODE (t), TREE_TYPE (decl),
+				is_doacross ? var : decl,
+				TREE_OPERAND (t, 1));
+		  }
 		gimple_seq *seq;
 		if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_LASTPRIVATE)
 		  seq = &OMP_CLAUSE_LASTPRIVATE_GIMPLE_SEQ (c);
--- gcc/omp-low.c.jj	2015-09-29 19:07:25.000000000 +0200
+++ gcc/omp-low.c	2015-09-30 12:09:13.866406256 +0200
@@ -7303,7 +7303,8 @@ expand_omp_ordered_source_sink (struct o
 
 static basic_block
 expand_omp_for_ordered_loops (struct omp_for_data *fd, tree *counts,
-			      basic_block cont_bb, basic_block body_bb)
+			      basic_block cont_bb, basic_block body_bb,
+			      bool ordered_lastprivate)
 {
   if (fd->ordered == fd->collapse)
     return cont_bb;
@@ -7411,6 +7412,31 @@ expand_omp_for_ordered_loops (struct omp
 	  add_loop (loop, body_bb->loop_father);
 	}
     }
+
+  /* If there are any lastprivate clauses and it is possible some loops
+     might have zero iterations, ensure all the decls are initialized,
+     otherwise we could crash evaluating C++ class iterators with lastprivate
+     clauses.  */
+  bool need_inits = false;
+  for (int i = fd->collapse; ordered_lastprivate && i < fd->ordered; i++)
+    if (need_inits)
+      {
+	tree type = TREE_TYPE (fd->loops[i].v);
+	gimple_stmt_iterator gsi = gsi_after_labels (body_bb);
+	expand_omp_build_assign (&gsi, fd->loops[i].v,
+				 fold_convert (type, fd->loops[i].n1));
+      }
+    else
+      {
+	tree type = TREE_TYPE (fd->loops[i].v);
+	tree this_cond = fold_build2 (fd->loops[i].cond_code,
+				      boolean_type_node,
+				      fold_convert (type, fd->loops[i].n1),
+				      fold_convert (type, fd->loops[i].n2));
+	if (!integer_onep (this_cond))
+	  need_inits = true;
+      }
+
   return cont_bb;
 }
 
@@ -7524,6 +7550,7 @@ expand_omp_for_generic (struct omp_regio
   edge e, ne;
   tree *counts = NULL;
   int i;
+  bool ordered_lastprivate = false;
 
   gcc_assert (!broken_loop || !in_combined_parallel);
   gcc_assert (fd->iter_type == long_integer_type_node
@@ -7551,6 +7578,10 @@ expand_omp_for_generic (struct omp_regio
   gsi = gsi_last_bb (entry_bb);
 
   gcc_assert (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_FOR);
+  if (fd->ordered
+      && find_omp_clause (gimple_omp_for_clauses (gsi_stmt (gsi)),
+			  OMP_CLAUSE_LASTPRIVATE))
+    ordered_lastprivate = false;
   if (fd->collapse > 1 || fd->ordered)
     {
       int first_zero_iter1 = -1, first_zero_iter2 = -1;
@@ -7982,7 +8013,8 @@ expand_omp_for_generic (struct omp_regio
 	    }
 	}
       expand_omp_ordered_source_sink (region, fd, counts, cont_bb);
-      cont_bb = expand_omp_for_ordered_loops (fd, counts, cont_bb, l1_bb);
+      cont_bb = expand_omp_for_ordered_loops (fd, counts, cont_bb, l1_bb,
+					      ordered_lastprivate);
       if (counts[fd->collapse - 1])
 	{
 	  gcc_assert (fd->collapse == 1);
--- gcc/tree-pretty-print.c.jj	2015-09-25 15:04:46.000000000 +0200
+++ gcc/tree-pretty-print.c	2015-09-30 14:03:06.581614091 +0200
@@ -568,7 +568,6 @@ dump_omp_clause (pretty_printer *pp, tre
 		dump_generic_node (pp, TREE_VALUE (t), spc, flags, false);
 		if (TREE_PURPOSE (t) != integer_zero_node)
 		  {
-		    tree p = TREE_PURPOSE (t);
 		    if (OMP_CLAUSE_DEPEND_SINK_NEGATIVE (t))
 		      pp_minus (pp);
 		    else
--- gcc/cp/semantics.c.jj	2015-09-10 10:58:25.000000000 +0200
+++ gcc/cp/semantics.c	2015-10-01 12:40:48.322177598 +0200
@@ -7202,7 +7202,7 @@ static bool
 handle_omp_for_class_iterator (int i, location_t locus, enum tree_code code,
 			       tree declv, tree initv, tree condv, tree incrv,
 			       tree *body, tree *pre_body, tree &clauses,
-			       tree *lastp)
+			       tree *lastp, int collapse, int ordered)
 {
   tree diff, iter_init, iter_incr = NULL, last;
   tree incr_var = NULL, orig_pre_body, orig_body, c;
@@ -7386,7 +7386,8 @@ handle_omp_for_class_iterator (int i, lo
   last = create_temporary_var (TREE_TYPE (diff));
   pushdecl (last);
   add_decl_expr (last);
-  if (c && iter_incr == NULL && TREE_CODE (incr) != INTEGER_CST)
+  if (c && iter_incr == NULL && TREE_CODE (incr) != INTEGER_CST
+      && (!ordered || (i < collapse && collapse > 1)))
     {
       incr_var = create_temporary_var (TREE_TYPE (diff));
       pushdecl (incr_var);
@@ -7422,7 +7423,8 @@ handle_omp_for_class_iterator (int i, lo
 					   iter, NOP_EXPR, init,
 					   tf_warning_or_error));
   init = build_int_cst (TREE_TYPE (diff), 0);
-  if (c && iter_incr == NULL)
+  if (c && iter_incr == NULL
+      && (!ordered || (i < collapse && collapse > 1)))
     {
       if (incr_var)
 	{
@@ -7435,6 +7437,8 @@ handle_omp_for_class_iterator (int i, lo
 				       iter, PLUS_EXPR, incr,
 				       tf_warning_or_error);
     }
+  if (c && ordered && i < collapse && collapse > 1)
+    iter_incr = incr;
   finish_expr_stmt (build_x_modify_expr (elocus,
 					 last, NOP_EXPR, init,
 					 tf_warning_or_error));
@@ -7471,7 +7475,22 @@ handle_omp_for_class_iterator (int i, lo
   if (c)
     {
       OMP_CLAUSE_LASTPRIVATE_STMT (c) = push_stmt_list ();
-      finish_expr_stmt (iter_incr);
+      if (!ordered)
+	finish_expr_stmt (iter_incr);
+      else
+	{
+	  iter_init = decl;
+	  if (i < collapse && collapse > 1 && !error_operand_p (iter_incr))
+	    iter_init = build2 (PLUS_EXPR, TREE_TYPE (diff),
+				iter_init, iter_incr);
+	  iter_init = build2 (MINUS_EXPR, TREE_TYPE (diff), iter_init, last);
+	  iter_init = build_x_modify_expr (elocus,
+					   iter, PLUS_EXPR, iter_init,
+					   tf_warning_or_error);
+	  if (iter_init != error_mark_node)
+	    iter_init = build1 (NOP_EXPR, void_type_node, iter_init);
+	  finish_expr_stmt (iter_init);
+	}
       OMP_CLAUSE_LASTPRIVATE_STMT (c)
 	= pop_stmt_list (OMP_CLAUSE_LASTPRIVATE_STMT (c));
     }
@@ -7502,10 +7521,20 @@ finish_omp_for (location_t locus, enum t
   tree last = NULL_TREE;
   location_t elocus;
   int i;
+  int collapse = 1;
+  int ordered = 0;
 
   gcc_assert (TREE_VEC_LENGTH (declv) == TREE_VEC_LENGTH (initv));
   gcc_assert (TREE_VEC_LENGTH (declv) == TREE_VEC_LENGTH (condv));
   gcc_assert (TREE_VEC_LENGTH (declv) == TREE_VEC_LENGTH (incrv));
+  if (TREE_VEC_LENGTH (declv) > 1)
+    {
+      tree c = find_omp_clause (clauses, OMP_CLAUSE_COLLAPSE);
+      if (c)
+	collapse = tree_to_shwi (OMP_CLAUSE_COLLAPSE_EXPR (c));
+      if (collapse != TREE_VEC_LENGTH (declv))
+	ordered = TREE_VEC_LENGTH (declv);
+    }
   for (i = 0; i < TREE_VEC_LENGTH (declv); i++)
     {
       decl = TREE_VEC_ELT (declv, i);
@@ -7636,7 +7665,8 @@ finish_omp_for (location_t locus, enum t
 	    orig_decl = decl;
 	  if (handle_omp_for_class_iterator (i, locus, code, declv, initv,
 					     condv, incrv, &body, &pre_body,
-					     clauses, &last))
+					     clauses, &last, collapse,
+					     ordered))
 	    return NULL;
 	  continue;
 	}
--- libgomp/testsuite/libgomp.c++/doacross-1.C.jj	2015-10-01 13:00:31.636054501 +0200
+++ libgomp/testsuite/libgomp.c++/doacross-1.C	2015-10-01 13:03:13.079709288 +0200
@@ -0,0 +1,294 @@
+// { dg-do run }
+
+typedef __PTRDIFF_TYPE__ ptrdiff_t;
+extern "C" void abort ();
+
+template <typename T>
+class I
+{
+public:
+  typedef ptrdiff_t difference_type;
+  I ();
+  ~I ();
+  I (T *);
+  I (const I &);
+  T &operator * ();
+  T *operator -> ();
+  T &operator [] (const difference_type &) const;
+  I &operator = (const I &);
+  I &operator ++ ();
+  I operator ++ (int);
+  I &operator -- ();
+  I operator -- (int);
+  I &operator += (const difference_type &);
+  I &operator -= (const difference_type &);
+  I operator + (const difference_type &) const;
+  I operator - (const difference_type &) const;
+  template <typename S> friend bool operator == (I<S> &, I<S> &);
+  template <typename S> friend bool operator == (const I<S> &, const I<S> &);
+  template <typename S> friend bool operator < (I<S> &, I<S> &);
+  template <typename S> friend bool operator < (const I<S> &, const I<S> &);
+  template <typename S> friend bool operator <= (I<S> &, I<S> &);
+  template <typename S> friend bool operator <= (const I<S> &, const I<S> &);
+  template <typename S> friend bool operator > (I<S> &, I<S> &);
+  template <typename S> friend bool operator > (const I<S> &, const I<S> &);
+  template <typename S> friend bool operator >= (I<S> &, I<S> &);
+  template <typename S> friend bool operator >= (const I<S> &, const I<S> &);
+  template <typename S> friend typename I<S>::difference_type operator - (I<S> &, I<S> &);
+  template <typename S> friend typename I<S>::difference_type operator - (const I<S> &, const I<S> &);
+  template <typename S> friend I<S> operator + (typename I<S>::difference_type , const I<S> &);
+private:
+  T *p;
+};
+template <typename T> I<T>::I () : p (0) {}
+template <typename T> I<T>::~I () {}
+template <typename T> I<T>::I (T *x) : p (x) {}
+template <typename T> I<T>::I (const I &x) : p (x.p) {}
+template <typename T> T &I<T>::operator * () { return *p; }
+template <typename T> T *I<T>::operator -> () { return p; }
+template <typename T> T &I<T>::operator [] (const difference_type &x) const { return p[x]; }
+template <typename T> I<T> &I<T>::operator = (const I &x) { p = x.p; return *this; }
+template <typename T> I<T> &I<T>::operator ++ () { ++p; return *this; }
+template <typename T> I<T> I<T>::operator ++ (int) { return I (p++); }
+template <typename T> I<T> &I<T>::operator -- () { --p; return *this; }
+template <typename T> I<T> I<T>::operator -- (int) { return I (p--); }
+template <typename T> I<T> &I<T>::operator += (const difference_type &x) { p += x; return *this; }
+template <typename T> I<T> &I<T>::operator -= (const difference_type &x) { p -= x; return *this; }
+template <typename T> I<T> I<T>::operator + (const difference_type &x) const { return I (p + x); }
+template <typename T> I<T> I<T>::operator - (const difference_type &x) const { return I (p - x); }
+template <typename T> bool operator == (I<T> &x, I<T> &y) { return x.p == y.p; }
+template <typename T> bool operator == (const I<T> &x, const I<T> &y) { return x.p == y.p; }
+template <typename T> bool operator != (I<T> &x, I<T> &y) { return !(x == y); }
+template <typename T> bool operator != (const I<T> &x, const I<T> &y) { return !(x == y); }
+template <typename T> bool operator < (I<T> &x, I<T> &y) { return x.p < y.p; }
+template <typename T> bool operator < (const I<T> &x, const I<T> &y) { return x.p < y.p; }
+template <typename T> bool operator <= (I<T> &x, I<T> &y) { return x.p <= y.p; }
+template <typename T> bool operator <= (const I<T> &x, const I<T> &y) { return x.p <= y.p; }
+template <typename T> bool operator > (I<T> &x, I<T> &y) { return x.p > y.p; }
+template <typename T> bool operator > (const I<T> &x, const I<T> &y) { return x.p > y.p; }
+template <typename T> bool operator >= (I<T> &x, I<T> &y) { return x.p >= y.p; }
+template <typename T> bool operator >= (const I<T> &x, const I<T> &y) { return x.p >= y.p; }
+template <typename T> typename I<T>::difference_type operator - (I<T> &x, I<T> &y) { return x.p - y.p; }
+template <typename T> typename I<T>::difference_type operator - (const I<T> &x, const I<T> &y) { return x.p - y.p; }
+template <typename T> I<T> operator + (typename I<T>::difference_type x, const I<T> &y) { return I<T> (x + y.p); }
+
+int results[2048];
+
+template <typename T>
+void
+baz (I<T> &i, I<T> &j, I<T> &k, T &l)
+{
+  if (*i < 0 || *i >= 16)
+    abort ();
+  if (*j < 0 || *j >= 16)
+    abort ();
+  if (*k < 0 || *k >= 16)
+    abort ();
+  if (l < 0 || l >= 16)
+    abort ();
+  #pragma omp atomic
+    results[512 * *i + 64 * *j + 8 * *k + l]++;
+}
+
+template <typename T>
+void
+baz (T &i, T &j, T &k, T &l)
+{
+  if (i < 0 || i >= 16)
+    abort ();
+  if (j < 0 || j >= 16)
+    abort ();
+  if (k < 0 || k >= 16)
+    abort ();
+  if (l < 0 || l >= 16)
+    abort ();
+  #pragma omp atomic
+    results[512 * i + 64 * j + 8 * k + l]++;
+}
+
+void
+f1 (const I<int> &a, const I<int> &b, const I<int> &c, const I<int> &d,
+    const I<int> &e, const I<int> &f, int g, int h,
+    I<int> &r1, I<int> &r2, I<int> &r3)
+{
+  I<int> i, j, k;
+  int l;
+#pragma omp parallel for ordered(4) lastprivate (i, j, k) schedule(static, 1)
+  for (i = a; i <= b; i++)
+    for (j = c; j < d; j++)
+      for (k = e; k < f; k++)
+	for (l = g; l < h; l++)
+	  {
+	    #pragma omp ordered depend(sink: i - 1, j, k + 1, l - 2)
+	    baz (i, j, k, l);
+	    if (i > a && k < f - 1 && l > g + 1)
+	      {
+		int m;
+		#pragma omp atomic read
+		m = results[512 * *(i - 1) + 64 * *j + 8 * *(k + 1) + l - 2];
+		if (m == 0)
+		  abort ();
+	      }
+	    #pragma omp ordered depend(source)
+	  }
+  r1 = i;
+  r2 = j;
+  r3 = k;
+}
+
+void
+f2 (int a, int b, int c, int d, int e, int f, int g, int h, int &r1, int &r2, int &r3)
+{
+  int i, j, k, l;
+#pragma omp parallel for collapse (1) ordered(4) lastprivate (i, j, k) schedule(static, 2)
+  for (i = a; i <= b; i++)
+    for (j = c; j < d; j++)
+      for (k = e; k < f; k++)
+	for (l = g; l < h; l++)
+	  {
+	    #pragma omp ordered depend(sink: i - 1, j, k + 1, l - 2)
+	    baz (i, j, k, l);
+	    if (i > a && k < f - 1 && l > g + 1)
+	      {
+		int m;
+		#pragma omp atomic read
+		m = results[512 * (i - 1) + 64 * j + 8 * (k + 1) + l - 2];
+		if (m == 0)
+		  abort ();
+	      }
+	    #pragma omp ordered depend(source)
+	  }
+  r1 = i;
+  r2 = j;
+  r3 = k;
+}
+
+void
+f3 (const I<int> &a, const I<int> &b, const I<int> &c, const I<int> &d,
+    const I<int> &e, const I<int> &f, int g, int h,
+    I<int> &r1, I<int> &r2, I<int> &r3)
+{
+  I<int> i, j, k;
+  int l;
+#pragma omp parallel for collapse (2) ordered(4) lastprivate (i, j, k) schedule(static, 1)
+  for (i = a; i <= b; i++)
+    for (j = c; j < d; j++)
+      for (k = e; k < f; k++)
+	for (l = g; l < h; l++)
+	  {
+	    #pragma omp ordered depend(sink: i - 1, j, k + 1, l - 2)
+	    baz (i, j, k, l);
+	    if (i > a && k < f - 1 && l > g + 1)
+	      {
+		int m;
+		#pragma omp atomic read
+		m = results[512 * *(i - 1) + 64 * *j + 8 * *(k + 1) + l - 2];
+		if (m == 0)
+		  abort ();
+	      }
+	    #pragma omp ordered depend(source)
+	  }
+  r1 = i;
+  r2 = j;
+  r3 = k;
+}
+
+void
+f4 (int a, int b, int c, int d, int e, int f, int g, int h, int &r1, int &r2, int &r3)
+{
+  int i, j, k, l;
+#pragma omp parallel for collapse (2) ordered(4) lastprivate (i, j, k) schedule(static, 2)
+  for (i = a; i <= b; i++)
+    for (j = c; j < d; j++)
+      for (k = e; k < f; k++)
+	for (l = g; l < h; l++)
+	  {
+	    #pragma omp ordered depend(sink: i - 1, j, k + 1, l - 2)
+	    baz (i, j, k, l);
+	    if (i > a && k < f - 1 && l > g + 1)
+	      {
+		int m;
+		#pragma omp atomic read
+		m = results[512 * (i - 1) + 64 * j + 8 * (k + 1) + l - 2];
+		if (m == 0)
+		  abort ();
+	      }
+	    #pragma omp ordered depend(source)
+	  }
+  r1 = i;
+  r2 = j;
+  r3 = k;
+}
+
+#define check(expr) \
+  for (int i = 0; i < 2048; i++)			\
+    if (expr)						\
+      {							\
+	if (results[i] != 1)				\
+	  abort ();					\
+	results[i] = 0;					\
+      }							\
+    else if (results[i])				\
+      abort ()
+
+int
+main ()
+{
+  int a[16], s1, s2, s3;
+  I<int> r1, r2, r3;
+  for (int i = 0; i < 16; i++)
+    a[i] = i;
+  r1 = &a[15]; r2 = &a[15]; r3 = &a[15];
+  f1 (&a[1], &a[3], &a[2], &a[5], &a[1], &a[3], 0, 5, r1, r2, r3);
+  if (*r1 != 4 || *r2 != 5 || *r3 != 3)
+    abort ();
+  check ((i / 512) - 1U < 3U && ((i / 64) & 7) - 2U < 3U && ((i / 8) & 7) - 1U < 2U && (i & 7) < 5);
+  r1 = &a[15]; r2 = &a[15]; r3 = &a[15];
+  f1 (&a[1], &a[3], &a[1], &a[4], &a[1], &a[5], 1, 0, r1, r2, r3);
+  if (*r1 != 4 || *r2 != 4 || *r3 != 5)
+    abort ();
+  r1 = &a[15]; r2 = &a[15]; r3 = &a[15];
+  f1 (&a[1], &a[3], &a[1], &a[9], &a[7], &a[2], 0, 7, r1, r2, r3);
+  if (*r1 != 4 || *r2 != 9 || *r3 != 7)
+    abort ();
+  s1 = 15; s2 = 15; s3 = 15;
+  f2 (1, 3, 2, 5, 1, 3, 0, 5, s1, s2, s3);
+  if (s1 != 4 || s2 != 5 || s3 != 3)
+    abort ();
+  check ((i / 512) - 1U < 3U && ((i / 64) & 7) - 2U < 3U && ((i / 8) & 7) - 1U < 2U && (i & 7) < 5);
+  s1 = 15; s2 = 15; s3 = 15;
+  f2 (1, 3, 1, 4, 1, 5, 1, 0, s1, s2, s3);
+  if (s1 != 4 || s2 != 4 || s3 != 5)
+    abort ();
+  s1 = 15; s2 = 15; s3 = 15;
+  f2 (1, 3, 1, 9, 7, 2, 0, 7, s1, s2, s3);
+  if (s1 != 4 || s2 != 9 || s3 != 7)
+    abort ();
+  r1 = &a[15]; r2 = &a[15]; r3 = &a[15];
+  f3 (&a[1], &a[3], &a[2], &a[5], &a[1], &a[3], 0, 5, r1, r2, r3);
+  if (*r1 != 4 || *r2 != 5 || *r3 != 3)
+    abort ();
+  check ((i / 512) - 1U < 3U && ((i / 64) & 7) - 2U < 3U && ((i / 8) & 7) - 1U < 2U && (i & 7) < 5);
+  r1 = &a[15]; r2 = &a[15]; r3 = &a[15];
+  f3 (&a[1], &a[3], &a[1], &a[4], &a[1], &a[5], 1, 0, r1, r2, r3);
+  if (*r1 != 4 || *r2 != 4 || *r3 != 5)
+    abort ();
+  r1 = &a[15]; r2 = &a[15]; r3 = &a[15];
+  f3 (&a[1], &a[3], &a[1], &a[9], &a[7], &a[2], 0, 7, r1, r2, r3);
+  if (*r1 != 4 || *r2 != 9 || *r3 != 7)
+    abort ();
+  s1 = 15; s2 = 15; s3 = 15;
+  f4 (1, 3, 2, 5, 1, 3, 0, 5, s1, s2, s3);
+  if (s1 != 4 || s2 != 5 || s3 != 3)
+    abort ();
+  check ((i / 512) - 1U < 3U && ((i / 64) & 7) - 2U < 3U && ((i / 8) & 7) - 1U < 2U && (i & 7) < 5);
+  s1 = 15; s2 = 15; s3 = 15;
+  f4 (1, 3, 1, 4, 1, 5, 1, 0, s1, s2, s3);
+  if (s1 != 4 || s2 != 4 || s3 != 5)
+    abort ();
+  s1 = 15; s2 = 15; s3 = 15;
+  f4 (1, 3, 1, 9, 7, 2, 0, 7, s1, s2, s3);
+  if (s1 != 4 || s2 != 9 || s3 != 7)
+    abort ();
+  return 0;
+}


	Jakub


* Re: [gomp4.1] Doacross library implementation
  2015-09-24 20:04 [gomp4.1] Doacross library implementation Jakub Jelinek
  2015-09-29 18:33 ` [gomp4.1] Fixup handling of doacross loops with noreturn body Jakub Jelinek
  2015-10-01 12:08 ` [gomp4.1] Fixup doacross lastprivate handling Jakub Jelinek
@ 2015-10-08 12:48 ` Torvald Riegel
  2015-10-08 14:26   ` Aldy Hernandez
  2 siblings, 1 reply; 5+ messages in thread
From: Torvald Riegel @ 2015-10-08 12:48 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: gcc-patches, Aldy Hernandez, Richard Henderson

On Thu, 2015-09-24 at 20:32 +0200, Jakub Jelinek wrote:
> Torvald, can you please have a look at it, if I got all the atomics / memory
> models right?

More detailed comments below, but in general, I'd really suggest adding
more code comments for the synchronization parts.  In the end, the level
of detail of libgomp's documentation is your decision, but, for
example, the lack of comments in the synchronization code in glibc has
made maintaining this code and fixing issues in it very costly.  It has
also been hard for many to understand.

My suggestion would be both to (1) document the high-level, abstract
synchronization scheme and (2) how that scheme is implemented.  The
first point is important in my experience because typically, the
high-level scheme and the actual thinking behind it (or, IOW, the intent
of the original author) is much harder to reconstruct in case of
concurrent code than it is for sequential code; you can't just simply
follow the program along line by line, but have to consider
interleavings.

Even if the synchronization problem to solve is relatively
straightforward as in this case (i.e., one-directional waiting), it's
worth doing point (1), IMO.  If it is simple, the high-level description
will be simple, and it will assure others that one really just has to
solve that, and that the original author wasn't aware of any other
issues.

Regarding point (2), what we're doing in glibc now is basically to
document how the specific things we do in the code make sure we
implement the high-level scheme.  So we'd say things like "this CAS here
now ensures consensus among the threads A, B, C".  For memory orders
specifically, it helps to document why they are sufficient and
necessary; this helps others understand the code, so that they don't
need to go hunting through all of the code looking for other accesses to
the same memory locations to be able to reconstruct the intended
happens-before relations.  I have some examples below.
Also, given that you don't use explicit atomic types but just atomic
operations, it's IMO a good idea to document which variables are
supposed to be accessed atomically so it becomes easier to not violate
the data-race-freedom requirement accidentally.

> The testcase obviously is not a good benchmark, we'll need
> some more realistic one.  But obviously when asking for oversubscription, it
> is quite expensive.  The question is how to implement a non-busy waiting
> fallback, whether we put some mutex and queue guarded by the mutex into the
> same (or some other?) cache-line, or just use atomics to queue it and how to
> make it cheap for the case where busy waiting is sufficient.

Atomics and futexes are probably the best approach if you want
performance; at least we need some efficient way for post() to figure
out that there are indeed waiters, and we don't want to use a lock for
that.

What specific approach to use is a question of how much time we want to
spend on this.  It's hard to estimate how often we'd really need a
blocking wait in practice, though.
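
To make that concrete, here is a very rough sketch of the kind of
scheme I have in mind (all names are mine, nothing from the patch; it
assumes the futex_wait/futex_wake helpers from config/linux/futex.h
plus libgomp.h and limits.h, and the memory orders would need exactly
the same kind of review as the rest of this code):

struct doacross_ent
{
  unsigned long counter;   /* posted progress; atomic accesses only.  */
  unsigned int nwaiters;   /* number of blocked waiters; atomic.  */
  unsigned int gen;        /* futex word, bumped on wake-worthy posts.  */
};

/* post() side: publish the new value, then wake only if somebody is
   actually blocked, so the busy-waiting fast path stays lock-free.  */
static void
ent_post (struct doacross_ent *ent, unsigned long newval)
{
  __atomic_store_n (&ent->counter, newval, MEMMODEL_RELEASE);
  /* Dekker-style pairing with the seq-cst fence in ent_wait: either we
     see the waiter's increment of nwaiters, or its re-check of counter
     sees our store above.  */
  __atomic_thread_fence (MEMMODEL_SEQ_CST);
  if (__atomic_load_n (&ent->nwaiters, MEMMODEL_RELAXED) != 0)
    {
      __atomic_fetch_add (&ent->gen, 1, MEMMODEL_RELEASE);
      futex_wake ((int *) &ent->gen, INT_MAX);
    }
}

/* wait() side: spin for a while, then announce ourselves and block.  */
static void
ent_wait (struct doacross_ent *ent, unsigned long expected)
{
  int i;
  for (i = 0; i < 1000; i++)   /* FIXME: tune / back off.  */
    {
      if (__atomic_load_n (&ent->counter, MEMMODEL_ACQUIRE) > expected)
	return;
      cpu_relax ();
    }
  __atomic_fetch_add (&ent->nwaiters, 1, MEMMODEL_RELAXED);
  __atomic_thread_fence (MEMMODEL_SEQ_CST);
  while (1)
    {
      unsigned int g = __atomic_load_n (&ent->gen, MEMMODEL_ACQUIRE);
      if (__atomic_load_n (&ent->counter, MEMMODEL_ACQUIRE) > expected)
	break;
      /* FUTEX_WAIT returns immediately if gen moved on since we read
	 it, so a wake between the two loads above cannot be lost.  */
      futex_wait ((int *) &ent->gen, g);
    }
  __atomic_fetch_sub (&ent->nwaiters, 1, MEMMODEL_RELAXED);
}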

> I'd say
> it should be sufficient to implement non-busy waiting in the flattened
> variant.

Sounds fine to me.

> --- libgomp/ordered.c.jj	2015-09-18 18:36:42.000000000 +0200
> +++ libgomp/ordered.c	2015-09-24 18:20:28.286244397 +0200
> @@ -252,14 +254,146 @@ GOMP_ordered_end (void)
>  {
>  }
>  
> +/* DOACROSS initialization.  */
> +
> +#define MAX_COLLAPSED_BITS (__SIZEOF_LONG__ * __CHAR_BIT__)
> +
> +void
> +gomp_doacross_init (unsigned ncounts, long *counts, long chunk_size)
> +{
> +  struct gomp_thread *thr = gomp_thread ();
> +  struct gomp_team *team = thr->ts.team;
> +  struct gomp_work_share *ws = thr->ts.work_share;
> +  unsigned int i, bits[MAX_COLLAPSED_BITS], num_bits = 0;
> +  unsigned long ent, num_ents, elt_sz, shift_sz;
> +  struct gomp_doacross_work_share *doacross;
> +
> +  if (team == NULL || team->nthreads == 1)
> +    return;
> +
> +  for (i = 0; i < ncounts; i++)
> +    {
> +      /* If any count is 0, GOMP_doacross_{post,wait} can't be called.  */
> +      if (counts[i] == 0)
> +	return;
> +
> +      if (num_bits <= MAX_COLLAPSED_BITS)
> +	{
> +	  unsigned int this_bits;
> +	  if (counts[i] == 1)
> +	    this_bits = 1;
> +	  else
> +	    this_bits = __SIZEOF_LONG__ * __CHAR_BIT__
> +			- __builtin_clzl (counts[i] - 1);
> +	  if (num_bits + this_bits <= MAX_COLLAPSED_BITS)
> +	    {
> +	      bits[i] = this_bits;
> +	      num_bits += this_bits;
> +	    }
> +	  else
> +	    num_bits = MAX_COLLAPSED_BITS + 1;
> +	}
> +    }
> +
> +  if (ws->sched == GFS_STATIC)
> +    num_ents = team->nthreads;
> +  else
> +    num_ents = (counts[0] - 1) / chunk_size + 1;
> +  if (num_bits <= MAX_COLLAPSED_BITS)
> +    {
> +      elt_sz = sizeof (unsigned long);
> +      shift_sz = ncounts * sizeof (unsigned int);
> +    }
> +  else
> +    {
> +      elt_sz = sizeof (unsigned long) * ncounts;
> +      shift_sz = 0;
> +    }
> +  elt_sz = (elt_sz + 63) & ~63UL;
> +
> +  doacross = gomp_malloc (sizeof (*doacross) + 63 + num_ents * elt_sz
> +			  + shift_sz);
> +  doacross->chunk_size = chunk_size;
> +  doacross->elt_sz = elt_sz;
> +  doacross->ncounts = ncounts;
> +  doacross->flattened = false;
> +  doacross->boundary = 0;
> +  doacross->array = (unsigned char *)
> +		    ((((uintptr_t) (doacross + 1)) + 63 + shift_sz)
> +		     & ~(uintptr_t) 63);
> +  if (num_bits <= MAX_COLLAPSED_BITS)
> +    {
> +      unsigned int shift_count = 0;
> +      doacross->flattened = true;
> +      for (i = ncounts; i > 0; i--)
> +	{
> +	  doacross->shift_counts[i - 1] = shift_count;
> +	  shift_count += bits[i - 1];
> +	}
> +      for (ent = 0; ent < num_ents; ent++)
> +	*(unsigned long *) (doacross->array + ent * elt_sz) = 0;
> +    }
> +  else
> +    for (ent = 0; ent < num_ents; ent++)
> +      memset (doacross->array + ent * elt_sz, '\0',
> +	      sizeof (unsigned long) * ncounts);
> +  if (ws->sched == GFS_STATIC && chunk_size == 0)
> +    {
> +      unsigned long q = counts[0] / num_ents;
> +      unsigned long t = counts[0] % num_ents;
> +      doacross->boundary = t * (q + 1);
> +      doacross->q = q;
> +      doacross->t = t;
> +    }
> +  ws->doacross = doacross;
> +}
> +
>  /* DOACROSS POST operation.  */
>  
>  void
> -GOMP_doacross_post (long first, ...)
> +GOMP_doacross_post (long *counts)
>  {
> -  va_list ap;
> -  va_start (ap, first);
> -  va_end (ap);
> +  struct gomp_thread *thr = gomp_thread ();
> +  struct gomp_work_share *ws = thr->ts.work_share;
> +  struct gomp_doacross_work_share *doacross = ws->doacross;
> +  unsigned long ent;
> +  unsigned int i;
> +
> +  if (__builtin_expect (doacross == NULL, 0))
> +    {
> +      __sync_synchronize ();

Is this necessary for OpenMP no-ops, or what is the reasoning behind
having the full barrier here?  Full barriers aren't cheap.
Also, given that you use the new-style atomic ops elsewhere, any reason
to not use a seq-cst thread fence?
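
I.e., something like

  __atomic_thread_fence (MEMMODEL_SEQ_CST);

instead of the legacy __sync builtin, assuming the no-op case needs a
fence at all.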

> +      return;
> +    }
> +
> +  if (__builtin_expect (ws->sched == GFS_STATIC, 1))
> +    ent = thr->ts.team_id;
> +  else
> +    ent = counts[0] / doacross->chunk_size;
> +  unsigned long *array = (unsigned long *) (doacross->array
> +					    + ent * doacross->elt_sz);
> +
> +  if (__builtin_expect (doacross->flattened, 1))
> +    {
> +      unsigned long flattened
> +	= (unsigned long) counts[0] << doacross->shift_counts[0];
> +
> +      for (i = 1; i < doacross->ncounts; i++)
> +	flattened |= (unsigned long) counts[i]
> +		     << doacross->shift_counts[i];
> +      flattened++;
> +      if (flattened == __atomic_load_n (array, MEMMODEL_ACQUIRE))
> +	__atomic_thread_fence (MEMMODEL_RELEASE);

Can the same value actually be posted more than once?  If so, I suppose
this would be a program that has two equal post() statements,
unnecessarily?  If this isn't a common thing, I'd just avoid the
additional load.

But *another* thread can never store the same value, or can it?

Also, even if trying to avoid the load, it wouldn't work like this.  The
acquire MO load is fine for ensuring a synchronizes-with with the thread
that produced this value with a release MO store.  But the release MO
fence is in itself not effective, it only takes effect through other
atomic stores sequenced after the fence.
So, if this thread needs to make its application data changes
happen-before the accesses to that data after a wait that reads this value
of flattened, it needs to store to flattened.  But if it stores the same
value, the wait can't actually distinguish between the two stores to
flattened (unless you ensure that through other sync), so it can't rely on
seeing this thread's newer changes to application data anyway.

> +      else
> +	__atomic_store_n (array, flattened, MEMMODEL_RELEASE);

OK.  I'd add a code comment like this:

"We need release MO so that the respective acquire MO load in
GOMP_doacross_wait synchronizes with us; thise ensures that the changes
in application data produced by this thread before the post
happen-before the accesses to this data after a matching
GOMP_doacross_wait."

This tells readers of the code what this is intended to synchronize
with, and why we needs this happens-before relation to make the abstract
synchronization scheme work.

> +      return;
> +    }
> +
> +  __atomic_thread_fence (MEMMODEL_ACQUIRE);

What is the acquire fence for?  It would apply to atomic loads sequenced
before the fence, but I don't see any relevant ones, and you haven't
documented which these are supposed to be.  If it's supposed to be in
effect together with the relaxed MO in the loop below, then it has the
wrong position in the code.

> +  for (i = doacross->ncounts; i-- > 0; )
> +    {
> +      if (counts[i] + 1UL != __atomic_load_n (&array[i], MEMMODEL_RELAXED))

This looks okay, but it's worth documenting why relaxed MO is fine.  The
question to ask (and answer) is whether anyone reading this value tries
to rely on happening after app data changes by this call of post.  I
think this is not the case because (1) two different threads should not
attempt to store the same value in *all* elements of counts[] (ie, same
line of though as above), and (2) a wait() that does depend on seeing
our app data updates will check all elements of counts[] and will thus
read at least from one release MO store by us, thus synchronizing with
us.

Is counts[] always lexicographically increasing? (You start at the
"least-significant" element of it.)  Or is this just the case with
something like dependence folding in place on the compiler side?  If
it's not the case, we need to figure out something else to make the
atomic snapshot in wait() work without running into ABA issues.  See
below...

> +	__atomic_store_n (&array[i], counts[i] + 1UL, MEMMODEL_RELEASE);

OK.  I'd add a comment like for the release MO store above.
Additionally, I'd also explain why a release MO store is required for
each element to make the atomic snapshot in wait() work.

BTW, I hope this example clarifies why I think more detailed comments
can make things easier for readers of the code; showing that this
*really* works needs quite a bit more reasoning...

> +    }
>  }
>  
>  /* DOACROSS WAIT operation.  */
> @@ -267,7 +401,81 @@ GOMP_doacross_post (long first, ...)
>  void
>  GOMP_doacross_wait (long first, ...)
>  {
> +  struct gomp_thread *thr = gomp_thread ();
> +  struct gomp_work_share *ws = thr->ts.work_share;
> +  struct gomp_doacross_work_share *doacross = ws->doacross;
>    va_list ap;
> -  va_start (ap, first);
> -  va_end (ap);
> +  unsigned long ent;
> +  unsigned int i;
> +
> +  if (__builtin_expect (doacross == NULL, 0))
> +    {
> +      __sync_synchronize ();
> +      return;

See above.

> +    }
> +
> +  if (__builtin_expect (ws->sched == GFS_STATIC, 1))
> +    {
> +      if (ws->chunk_size == 0)
> +	{
> +	  if (first < doacross->boundary)
> +	    ent = first / (doacross->q + 1);
> +	  else
> +	    ent = (first - doacross->boundary) / doacross->q
> +		  + doacross->t;
> +	}
> +      else
> +	ent = first / ws->chunk_size % thr->ts.team->nthreads;
> +    }
> +  else
> +    ent = first / doacross->chunk_size;
> +  unsigned long *array = (unsigned long *) (doacross->array
> +					    + ent * doacross->elt_sz);
> +
> +  if (__builtin_expect (doacross->flattened, 1))
> +    {
> +      unsigned long flattened
> +	= (unsigned long) first << doacross->shift_counts[0];
> +      unsigned long cur;
> +
> +      va_start (ap, first);
> +      for (i = 1; i < doacross->ncounts; i++)
> +	flattened |= (unsigned long) va_arg (ap, long)
> +		     << doacross->shift_counts[i];
> +      cur = __atomic_load_n (array, MEMMODEL_ACQUIRE);

OK, but I'd add a small comment that this is supposed to synchronize
with just the release MO store in post(), and that post() describes why
this is necessary.

> +      if (flattened < cur)
> +	{
> +	  __atomic_thread_fence (MEMMODEL_RELEASE);

What is this trying to accomplish?  Where is the atomic store that is to
be combined with the fence, and which happens-before relation do you
want to enforce here?

> +	  va_end (ap);
> +	  return;
> +	}
> +      doacross_spin (array, flattened, cur);

doacross_spin uses relaxed MO.  But you need acquire MO here, I believe.
Either doacross_spin needs to issue an acquire MO fence after reading
the expected value with relaxed MO, or you need to issue an acquire MO
fence here.  The latter might allow for better reuse of doacross_spin.

A quick comment about the required acquire MO would be helpful here too,
IMO.
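
I.e., roughly (my sketch, untested):

      doacross_spin (array, flattened, cur);
      /* The loads in doacross_spin are relaxed MO; this acquire fence
	 makes the final successful load effectively acquire MO, so it
	 synchronizes with the release MO store in GOMP_doacross_post.  */
      __atomic_thread_fence (MEMMODEL_ACQUIRE);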

> +      __atomic_thread_fence (MEMMODEL_RELEASE);

See above.

> +      va_end (ap);
> +      return;
> +    }
> +

The following relies on array[] being lexicographically monotonically
increasing, and on post() changing array[] in such a way that the
least-significant element is modified first.  If that is not the case, it
will be prone to ABA issues, I think.  I really think it should be
documented how the abstract synchronization scheme is supposed to work,
and why it works.

I can provide text for that if you want.  If you or somebody else wants
to practise reasoning about concurrent code, this piece of
synchronization would be a good exercise.  Start with two elements,
perhaps use cppmem, and explore which values can be written and read.
Don't forget to work with the assumptions we make.

Documenting why it works in clear terms is a good test for whether the
reasoning is sound.  (And that's another reason for why I think that
documenting the concurrent code is helpful; it's like a test for your
code.  I've often found bugs in my concurrent code just by noticing that
I couldn't really thoroughly explain why it would have to work...)

> +  do
> +    {
> +      va_start (ap, first);
> +      for (i = 0; i < doacross->ncounts; i++)
> +	{
> +	  unsigned long thisv
> +	    = (unsigned long) (i ? va_arg (ap, long) : first) + 1;
> +	  unsigned long cur = __atomic_load_n (&array[i], MEMMODEL_RELAXED);

This needs to be acquire MO.  The key here is to ensure that changes to
less significant elements of array[] by the same thread that did the
store we read from here do happen before our load.  I'll leave the
reason for that as an exercise (see above) :)
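
I.e. (sketch):

	  unsigned long cur = __atomic_load_n (&array[i], MEMMODEL_ACQUIRE);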

> +	  if (thisv < cur)

Why does this work?  Say, we wait for (2,6), and post has stored the
value (2+1,3+1) (the +1 is the flattened++; in post()).  This will see
2<3 and then break out of the loop, after which it will try the loop
again, and will never get to comparing the second element.

> +	    {
> +	      i = doacross->ncounts;
> +	      break;
> +	    }
> +	  if (thisv > cur)
> +	    break;

Can't you use just doacross_spin here and wait until this element has
the expected value? (With added acquire MO fence after it.)  This should
be better than just calling cpu_relax at the end of the loop.
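
Roughly like this (untested sketch; it assumes the per-element loads
are made acquire MO as suggested above):

  do
    {
      va_start (ap, first);
      for (i = 0; i < doacross->ncounts; i++)
	{
	  unsigned long thisv
	    = (unsigned long) (i ? va_arg (ap, long) : first) + 1;
	  unsigned long cur = __atomic_load_n (&array[i], MEMMODEL_ACQUIRE);
	  if (thisv < cur)
	    {
	      i = doacross->ncounts;
	      break;
	    }
	  if (thisv > cur)
	    {
	      /* Spin on just this element until it reaches THISV, then
		 retake the snapshot from the start.  */
	      doacross_spin (&array[i], thisv - 1, cur);
	      break;
	    }
	}
      va_end (ap);
    }
  while (i != doacross->ncounts);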

> +	}
> +      va_end (ap);
> +      if (i == doacross->ncounts)
> +	break;
> +      cpu_relax ();
> +    }
> +  while (1);
> +  __sync_synchronize ();

What is this full barrier for?  Isn't acquire MO on the loads above
sufficient?

> --- libgomp/libgomp_g.h.jj	2015-09-17 09:25:23.000000000 +0200
> +++ libgomp/libgomp_g.h	2015-09-24 13:33:32.726324481 +0200
> @@ -78,6 +78,36 @@ enum gomp_schedule_type
>    GFS_AUTO
>  };
>  
> +struct gomp_doacross_work_share
> +{
> +  union {
> +    /* chunk_size copy, as ws->chunk_size is multiplied by incr for
> +       GFS_DYNAMIC.  */
> +    long chunk_size;
> +    /* For schedule(static,0) this is the number
> +       of iterations assigned to the last thread, i.e. number of
> +       iterations / number of threads.  */
> +    long q;
> +  };
> +  /* Size of each array entry (padded to cache line size).  */
> +  unsigned long elt_sz;
> +  /* Number of dimensions in sink vectors.  */
> +  unsigned int ncounts;
> +  /* True if the iterations can be flattened.  */
> +  bool flattened;
> +  /* Actual array (of elt_sz sized units), aligned to cache line size.
> +     This is indexed by team_id for GFS_STATIC and outermost iteration
> +     / chunk_size for other schedules.  */
> +  unsigned char *array;
> +  /* These two are only used for schedule(static,0).  */
> +  /* This one is number of iterations % number of threads.  */
> +  long t;
> +  /* And this one is cached t * (q + 1).  */
> +  long boundary;
> +  /* Array of shift counts for each dimension if they can be flattened.  */
> +  unsigned int shift_counts[];
> +};

As mentioned above, I'd quickly document which of these must be accessed
with atomic operations to achieve data-race freedom.
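
For example (wording mine, just to show the kind of comment I mean):

  /* Entries in ARRAY are written concurrently by GOMP_doacross_post and
     read by GOMP_doacross_wait, so they must only be accessed with
     __atomic_* operations.  The remaining fields are written by
     gomp_doacross_init before the threads start using the work share
     and are read-only afterwards.  */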

> --- libgomp/config/posix/doacross.h.jj	2015-09-23 12:17:53.217834221 +0200
> +++ libgomp/config/posix/doacross.h	2015-09-24 10:51:18.310081801 +0200
> @@ -0,0 +1,62 @@
> +/* Copyright (C) 2015 Free Software Foundation, Inc.
> +   Contributed by Jakub Jelinek <jakub@redhat.com>.
> +
> +   This file is part of the GNU Offloading and Multi Processing Library
> +   (libgomp).
> +
> +   Libgomp is free software; you can redistribute it and/or modify it
> +   under the terms of the GNU General Public License as published by
> +   the Free Software Foundation; either version 3, or (at your option)
> +   any later version.
> +
> +   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
> +   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
> +   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> +   more details.
> +
> +   Under Section 7 of GPL version 3, you are granted additional
> +   permissions described in the GCC Runtime Library Exception, version
> +   3.1, as published by the Free Software Foundation.
> +
> +   You should have received a copy of the GNU General Public License and
> +   a copy of the GCC Runtime Library Exception along with this program;
> +   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +/* This is a generic implementation of doacross spinning.  */
> +
> +#ifndef GOMP_DOACROSS_H
> +#define GOMP_DOACROSS_H 1
> +
> +#include "libgomp.h"
> +#include <errno.h>
> +
> +#ifdef HAVE_ATTRIBUTE_VISIBILITY
> +# pragma GCC visibility push(hidden)
> +#endif
> +
> +static inline void
> +cpu_relax (void)
> +{
> +  __asm volatile ("" : : : "memory");
> +}


IIRC, cpu_relax() is something like the pause instruction on x86, meant
to tell the CPU that busy-waiting is happening.  The compiler barrier in
the asm above doesn't do that.  It's also not necessary if using atomic
operations for the spinning (as opposed to plain memory accesses),
because the atomics (even with relaxed MO) already prevent the compiler
from optimizing them away.  (Specifically, the standards require that
threads eventually observe the most recent store to a value; if the
compiler optimized away a spin-loop with relaxed MO atomics, it would
violate this forward progress requirement.  This is different if plain
accesses are used, because the compiler is allowed to assume that there
are no endless loops without IO or synchronization in a program.)
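
(For comparison, an arch-specific override would presumably use the
usual x86 idiom, i.e. something like

  static inline void
  cpu_relax (void)
  {
    __asm volatile ("rep; nop" : : : "memory");
  }

in a config/linux/x86/ version of this header; that's just a sketch,
not something from this patch.)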

> +
> +static inline void doacross_spin (unsigned long *addr, unsigned long expected,
> +				  unsigned long cur)
> +{
> +  /* FIXME: back off depending on how large expected - cur is.  */
> +  do
> +    {
> +      cpu_relax ();
> +      cur = __atomic_load_n (addr, MEMMODEL_RELAXED);
> +      if (expected < cur)
> +	return;
> +    }
> +  while (1);
> +}
> +
> +#ifdef HAVE_ATTRIBUTE_VISIBILITY
> +# pragma GCC visibility pop
> +#endif
> +
> +#endif /* GOMP_DOACROSS_H */



* Re: [gomp4.1] Doacross library implementation
  2015-10-08 12:48 ` [gomp4.1] Doacross library implementation Torvald Riegel
@ 2015-10-08 14:26   ` Aldy Hernandez
  0 siblings, 0 replies; 5+ messages in thread
From: Aldy Hernandez @ 2015-10-08 14:26 UTC (permalink / raw)
  To: Torvald Riegel, Jakub Jelinek; +Cc: gcc-patches, Richard Henderson

On 10/08/2015 05:48 AM, Torvald Riegel wrote:
> On Thu, 2015-09-24 at 20:32 +0200, Jakub Jelinek wrote:
>> Torvald, can you please have a look at it, if I got all the atomics / memory
>> models right?
>
> More detailed comments below, but in general, I'd really suggest adding
> more code comments for the synchronization parts.  In the end, the level
> of detail of libgomp's documentation is your decision, but, for
> example, the lack of comments in the synchronization code in glibc has
> made maintaining this code and fixing issues in it very costly.  It has
> also been hard for many to understand.
>
> My suggestion would be both to (1) document the high-level, abstract
> synchronization scheme and (2) how that scheme is implemented.  The
> first point is important in my experience because typically, the
> high-level scheme and the actual thinking behind it (or, IOW, the intent
> of the original author) is much harder to reconstruct in case of
> concurrent code than it is for sequential code; you can't just simply
> follow the program along line by line, but have to consider
> interleavings.

I couldn't agree more.  After having spent the last month trying to make 
sense of libgomp/task.c, I can honestly say that we need better internal 
documentation.  I know this isn't Jakub's fault, as Richard started the 
non-documenting party, but clearly defined descriptions, functions, and 
implementation go a long way.  APIs and abstractions also make things a 
_lot_ easier to follow.

It could also be that I'm very new to runtime work, specifically 
parallel runtime work, but it was hard to understand.  I think I finally 
have a firm grasp on it (I hope), but it did take me until early this 
week.  Consequently, I took it upon myself to document big pieces of 
task.c this week.  I assume anyone not jakub/rth coming after me will 
benefit from it.  So yeah, my upcoming patch will have some variables 
renamed, many more functions with better descriptions (or descriptions 
at all, etc), and a clearly defined API.

Maybe my brain is small; but this stuff is hard.  Every little bit helps :).

p.s. Ironically, it seems that the longer I spend looking at this code, 
the less I feel I need to comment because things are now "obvious", 
which perhaps is an indication that either putting newbies on the 
projects is a good thing, or documenting things early is good practice.

Aldy

