From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Mon, 22 Feb 2021 14:49:43 +0100
From: Jakub Jelinek
To: Kwok Cheung Yeung
Cc: GCC Patches
Subject: Re: [WIP] Re: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738]
Message-ID: <20210222134943.GA1837485@tucnak>
References: <20210129150317.GN4020736@tucnak> <0aca6daf-356a-7b03-c007-e33c35114356@codesourcery.com>
In-Reply-To: <0aca6daf-356a-7b03-c007-e33c35114356@codesourcery.com>
List-Id: Gcc-patches mailing list

On Fri, Feb 19, 2021 at 07:12:42PM +0000, Kwok Cheung Yeung wrote:
> I have opted for a union of completion_sem (for tasks that are undeferred)
> and a struct gomp_team *detach_team (for deferred tasks) that holds the
> team if the completion event has not yet been fulfilled, or NULL if it
> has.  I don't see the point of having an indirection to the union here
> since the union is just the size of a pointer, so it might as well be
> inlined.

I see three issues with the union of completion_sem and detach_team done
that way.
1) While linux --enable-futex and accel gomp_sem_t is small (an int), rtems
and especially posix gomp_sem_t is large; so while it might be a good idea
to inline gomp_sem_t on config/{linux,accel} into the union, for the rest
it might be better to use an indirection.  If it is only for the undeferred
tasks, it could just use an automatic variable and put the address of that
into the struct.  That could be done either always, or we could define some
macro in config/{linux,accel}/sem.h saying that gomp_sem_t is small and
decide on the indirection based on that macro.

2) kind == GOMP_TASK_UNDEFERRED is true also for the deferred tasks while
running the cpyfn callback; guess this could be dealt with by making sure
the detach handling is done only after

      thr->task = task;
      if (cpyfn)
	{
	  cpyfn (arg, data);
	  task->copy_ctors_done = true;
	}
      else
	memcpy (arg, data, arg_size);
      thr->task = parent;
      task->kind = GOMP_TASK_WAITING;
      task->fn = fn;
      task->fn_data = arg;
      task->final_task = (flags & GOMP_TASK_FLAG_FINAL) >> 1;

I see you've instead removed the GOMP_TASK_UNDEFERRED handling, but the
rationale for that is that the copy constructors are being run
synchronously.

3) kind is not constant; for the deferred tasks it can change over the
lifetime of the task, and as you've said in the comments, what matters is
kind == GOMP_TASK_UNDEFERRED vs. other values.  While the changes of
task->kind are done while holding the task lock, omp_fulfill_event reads
it before locking that lock, so I think the read needs to be done using

  if (__atomic_load_n (&task->kind, MEMMODEL_RELAXED) == GOMP_TASK_UNDEFERRED)

Pedantically, the stores to task->kind also need to be done with
__atomic_store_n MEMMODEL_RELAXED.
Now, similarly to 3) on task->kind, task->detach_team is a similar case:
again, some other omp_fulfill_event can clear it (under lock, but it is
still read outside of the lock), so it probably should be read with

  struct gomp_team *team
    = __atomic_load_n (&task->detach_team, MEMMODEL_RELAXED);

And again, pedantically the detach_team stores should be atomic relaxed
stores too.

> > Do you agree with this, or see some reason why this can't work?
>
> The main problem I see is this code in gomp_barrier_handle_tasks:
>
>   if (--team->task_count == 0
>       && gomp_team_barrier_waiting_for_tasks (&team->barrier))
>     {
>       gomp_team_barrier_done (&team->barrier, state);
>
> We do not have access to state from within omp_fulfill_event, so how
> should this be handled?

Sure, omp_fulfill_event shouldn't do any waiting; it just needs to wake up
anything that could have been waiting.

> @@ -688,8 +697,7 @@ struct gomp_team
>    int work_share_cancelled;
>    int team_cancelled;
>
> -  /* Tasks waiting for their completion event to be fulfilled.  */
> -  struct priority_queue task_detach_queue;
> +  /* Number of tasks waiting for their completion event to be fulfilled.  */
>    unsigned int task_detach_count;

Do we need task_detach_count?  Currently it is only initialized and
incremented/decremented, but never tested for anything.  Though see below.

> +      gomp_debug (0, "omp_fulfill_event: %p event fulfilled for finished task\n",
> +		  task);
> +      size_t new_tasks = gomp_task_run_post_handle_depend (task, team);
> +      gomp_task_run_post_remove_parent (task);
> +      gomp_clear_parent (&task->children_queue);
> +      gomp_task_run_post_remove_taskgroup (task);
> +      team->task_count--;
> +      team->task_detach_count--;
> +
> +      /* Wake up any threads that may be waiting for the detached task
> +	 to complete.
> +	 */
> +      struct gomp_task *parent = task->parent;
> +
> +      if (parent && parent->taskwait)
> +	{
> +	  if (parent->taskwait->in_taskwait)
> +	    {
> +	      parent->taskwait->in_taskwait = false;
> +	      gomp_sem_post (&parent->taskwait->taskwait_sem);
> +	    }
> +	  else if (parent->taskwait->in_depend_wait)
> +	    {
> +	      parent->taskwait->in_depend_wait = false;
> +	      gomp_sem_post (&parent->taskwait->taskwait_sem);
> +	    }
> +	}

Looking at gomp_task_run_post_remove_parent, doesn't that function already
handle the in_taskwait and in_depend_wait gomp_sem_posts?

> +      if (task->taskgroup && task->taskgroup->in_taskgroup_wait)
> +	{
> +	  task->taskgroup->in_taskgroup_wait = false;
> +	  gomp_sem_post (&task->taskgroup->taskgroup_sem);
> +	}

And in gomp_task_run_post_remove_taskgroup, doesn't that already handle
the in_taskgroup_wait gomp_sem_post?

> +
> +      int do_wake = 0;
> +      if (new_tasks > 1)
> +	{
> +	  do_wake = team->nthreads - team->task_running_count;
> +	  if (do_wake > new_tasks)
> +	    do_wake = new_tasks;
> +	}
> +
> +      gomp_mutex_unlock (&team->task_lock);
> +      if (do_wake)
> +	gomp_team_barrier_wake (&team->barrier, do_wake);

I think for the barrier case we need to distinguish the
team == gomp_thread ()->ts.team case (I guess the more usual one).  There
the fact that some thread (the one calling omp_fulfill_event) is provably
doing something other than sitting on a barrier waiting for tasks makes it
much simpler: gomp_team_barrier_set_task_pending (&team->barrier) is
needed only when some tasks dependent on completion of the current one
were added to the queues, but gomp_task_run_post_handle_depend ->
gomp_task_run_post_handle_dependers takes care of that call, and the above
do_wake handles that too.  Though the exact details I think need fixing -
in gomp_barrier_handle_tasks the reason for if (new_tasks > 1) is that if
there is a single dependent task, the current thread has just finished
handling one task and so can take that single task itself, so there is no
need to wake anyone up.
While in the omp_fulfill_event case, even if there is just one new task,
we need to schedule it to some thread, and so it is desirable to wake some
thread.  All we know (if team == gomp_thread ()->ts.team) is that at least
one thread is doing something else, but that one could be busy for quite
some time.

And the other case is the omp_fulfill_event call from an unshackeled
thread, i.e. team != gomp_thread ()->ts.team.  Here, e.g. what
gomp_target_task_completion talks about applies:

  /* I'm afraid this can't be done after releasing team->task_lock,
     as gomp_target_task_completion is run from unrelated thread and
     therefore in between gomp_mutex_unlock and gomp_team_barrier_wake
     the team could be gone already.  */

Even there, there are 2 different cases.  One is where
team->task_running_count > 0; at that point we know at least one task is
running, and so the only thing that is unsafe is doing
gomp_team_barrier_wake (&team->barrier, do_wake); after
gomp_mutex_unlock (&team->task_lock); - there is a possibility that in
between the two calls the thread running omp_fulfill_event gets
interrupted or just delayed and the team finishes the barrier and is freed
too.  So the gomp_team_barrier_wake needs to be done before the unlock in
that case.

And then there is the case where all tasks finish on a barrier but some
haven't been fulfilled yet.  In that case, when the last thread calls
gomp_team_barrier_wait_end, we run in there:

  if (__builtin_expect (state & BAR_WAS_LAST, 0))
    ...
    if (__builtin_expect (team->task_count, 0))
      {
	gomp_barrier_handle_tasks (state);
	state &= ~BAR_WAS_LAST;
      }

and gomp_barrier_handle_tasks will hit:

  gomp_mutex_lock (&team->task_lock);
  if (gomp_barrier_last_thread (state))
    {
      if (team->task_count == 0)
	{
	  gomp_team_barrier_done (&team->barrier, state);
	  gomp_mutex_unlock (&team->task_lock);
	  gomp_team_barrier_wake (&team->barrier, 0);
	  return;
	}
      gomp_team_barrier_set_waiting_for_tasks (&team->barrier);
    }

but team->task_count is not 0 (because of the one or more non-fulfilled
tasks), priority_queue_empty_p is true, and so after unlocking the
function will just return.  And then it will jump to do_wait and on
wake-ups do gomp_barrier_handle_tasks again.

So, I think for the

  team != gomp_thread ()->ts.team
  && !do_wake
  && gomp_team_barrier_waiting_for_tasks (&team->barrier)
  && team->task_detach_count == 0

case, we need to wake up 1 thread anyway and arrange for it to do:

  gomp_team_barrier_done (&team->barrier, state);
  gomp_mutex_unlock (&team->task_lock);
  gomp_team_barrier_wake (&team->barrier, 0);

Possibly, after the

  if (!priority_queue_empty_p (&team->task_queue, MEMMODEL_RELAXED))

block we could add

  else if (team->task_count == 0
	   && gomp_team_barrier_waiting_for_tasks (&team->barrier))
    {
      gomp_team_barrier_done (&team->barrier, state);
      gomp_mutex_unlock (&team->task_lock);
      gomp_team_barrier_wake (&team->barrier, 0);
      if (to_free)
	{
	  gomp_finish_task (to_free);
	  free (to_free);
	}
      return;
    }

but then the:

  if (--team->task_count == 0
      && gomp_team_barrier_waiting_for_tasks (&team->barrier))
    {
      gomp_team_barrier_done (&team->barrier, state);
      gomp_mutex_unlock (&team->task_lock);
      gomp_team_barrier_wake (&team->barrier, 0);
      gomp_mutex_lock (&team->task_lock);
    }

in that case would be incorrect - we don't want to do that twice.  So,
either that second if would need to do the to_free handling and return
instead of taking the lock again and looping, or perhaps we can just do
--team->task_count; there instead and let the above added else if handle
that?

	Jakub