On 22/02/2021 1:49 pm, Jakub Jelinek wrote:
> I see three issues with the union of completion_sem and detach_team done
> that way.
>
> 1) while linux --enable-futex and accel gomp_sem_t is small (int), rtems
> and especially posix gomp_sem_t is large; so while it might be a good
> idea to inline gomp_sem_t on config/{linux,accel} into the union, for
> the rest it might be better to use indirection; if it is only for the
> undeferred tasks, it could be just using an automatic variable and
> put into the struct the address of that; could be done either always,
> or define some macro in config/{linux,accel}/sem.h that gomp_sem_t is
> small and decide on the indirection based on that macro

I think a pointer to an automatic variable would be simplest.

> 2) kind == GOMP_TASK_UNDEFERRED is true also for the deferred tasks while
> running the cpyfn callback; guess this could be dealt with by making sure
> the detach handling is done only after
>       thr->task = task;
>       if (cpyfn)
>         {
>           cpyfn (arg, data);
>           task->copy_ctors_done = true;
>         }
>       else
>         memcpy (arg, data, arg_size);
>       thr->task = parent;
>       task->kind = GOMP_TASK_WAITING;
>       task->fn = fn;
>       task->fn_data = arg;
>       task->final_task = (flags & GOMP_TASK_FLAG_FINAL) >> 1;
>
> I see you've instead removed the GOMP_TASK_UNDEFERRED but the rationale
> for that is that the copy constructors are being run synchronously

Can anything in cpyfn make use of the fact that kind == GOMP_TASK_UNDEFERRED
while executing it?

Anyway, if we want to keep this, then I suppose we could just add an extra
field deferred_p that does not change for the lifetime of the task to
indicate that the task is 'really' a deferred task.

> 3) kind is not constant, for the deferred tasks it can change over the
> lifetime of the task, as you've said in the comments, it is kind ==
> GOMP_TASK_UNDEFERRED vs.
> other values; while the changes of task->kind
> are done while holding the task lock, omp_fulfill_event reads it before
> locking that lock, so I think it needs to be done using
>   if (__atomic_load_n (&task->kind, MEMMODEL_RELAXED) == GOMP_TASK_UNDEFERRED)
> Pedantically the stores to task->kind also need to be done
> with __atomic_store_n MEMMODEL_RELAXED.

If we check task->deferred_p instead (which never changes for a task after
instantiation), is that still necessary?

> Now, similarly to 3) on task->kind, task->detach_team is a similar case;
> again, some other omp_fulfill_event can clear it (under lock, but still
> read outside of the lock), so it probably should be read with
>   struct gomp_team *team
>     = __atomic_load_n (&task->detach_team, MEMMODEL_RELAXED);
> And again, pedantically the detach_team stores should be atomic relaxed
> stores too.

Done.

> Looking at gomp_task_run_post_remove_parent, doesn't that function
> already handle the in_taskwait and in_depend_wait gomp_sem_posts?
> And in gomp_task_run_post_remove_taskgroup, doesn't that already
> handle the in_taskgroup_wait gomp_sem_post?

The extra code has been removed.

> - in gomp_barrier_handle_tasks the reason for if (new_tasks > 1)
> is that if there is a single dependent task, the current thread
> just finished handling one task and so can take that single task and so no
> need to wake up.  While in the omp_fulfill_event case, even if there
> is just one new task, we need to schedule it to some thread and so
> it is desirable to wake some thread.

In that case, we could just do 'if (new_tasks > 0)' instead?

> All we know
> (if team == gomp_thread ()->ts.team) is that at least one thread is doing
> something else but that one could be busy for quite some time.

Well, it should still get around to the new task eventually, so there is no
problem in terms of correctness here.
I suppose we could always wake up one more thread than strictly necessary,
but that might have knock-on effects on performance elsewhere?

> And the other case is the omp_fulfill_event call from an unshackled
> thread, i.e. team != gomp_thread ()->ts.team.
> Here, e.g. what gomp_target_task_completion talks about applies:
>   /* I'm afraid this can't be done after releasing team->task_lock,
>      as gomp_target_task_completion is run from unrelated thread and
>      therefore in between gomp_mutex_unlock and gomp_team_barrier_wake
>      the team could be gone already.  */
> Even there, there are 2 different cases.
> One is where team->task_running_count > 0; at that point we know
> at least one task is running, and so the only thing that is unsafe is
>   gomp_team_barrier_wake (&team->barrier, do_wake);
> after gomp_mutex_unlock (&team->task_lock); - there is a possibility
> that in between the two calls the thread running omp_fulfill_event
> gets interrupted or just delayed, and the team finishes the barrier and
> is freed too.  So the gomp_team_barrier_wake needs to be done before
> the unlock in that case.

The lock is now released after the call to gomp_team_barrier_wake for
unshackled threads, and before it otherwise.

> And then there is the case where all tasks finish on a barrier but some
> haven't been fulfilled yet.
> In that case, when the last thread calls ...
> So, I think for the
>   team != gomp_thread ()->ts.team
>   && !do_wake
>   && gomp_team_barrier_waiting_for_tasks (&team->barrier)
>   && team->task_detach_count == 0
> case, we need to wake up 1 thread anyway and arrange for it to do:
>   gomp_team_barrier_done (&team->barrier, state);
>   gomp_mutex_unlock (&team->task_lock);
>   gomp_team_barrier_wake (&team->barrier, 0);
> Possibly in
>   if (!priority_queue_empty_p (&team->task_queue, MEMMODEL_RELAXED))
> add
>   else if (team->task_count == 0
>            && gomp_team_barrier_waiting_for_tasks (&team->barrier))
>     {
>       gomp_team_barrier_done (&team->barrier, state);
>       gomp_mutex_unlock (&team->task_lock);
>       gomp_team_barrier_wake (&team->barrier, 0);
>       if (to_free)
>         {
>           gomp_finish_task (to_free);
>           free (to_free);
>         }
>       return;
>     }
> but the:
>   if (--team->task_count == 0
>       && gomp_team_barrier_waiting_for_tasks (&team->barrier))
>     {
>       gomp_team_barrier_done (&team->barrier, state);
>       gomp_mutex_unlock (&team->task_lock);
>       gomp_team_barrier_wake (&team->barrier, 0);
>       gomp_mutex_lock (&team->task_lock);
>     }
> in that case would then be incorrect, we don't want to do that twice.
> So, either that second if would need to do the to_free handling
> and return instead of taking the lock again and looping, or
> perhaps we can just do
>   --team->task_count;
> there instead and let the above added else if handle that?

I have applied your patch to move the gomp_team_barrier_done, and in
omp_fulfill_event, I ensure that a single thread is woken up so that
gomp_barrier_handle_tasks can signal for the barrier to finish.

I'm having some trouble coming up with a testcase to test this scenario
though.
I tried having a testcase like this to have threads in separate teams:

  #pragma omp teams num_teams (2) shared (event, started)
  #pragma omp parallel num_threads (1)
  if (omp_get_team_num () == 0)
    {
      #pragma omp task detach (event)
        started = 1;
    }
  else
    {
      // Wait for started to become 1
      omp_fulfill_event (event);
    }

but it does not work because GOMP_teams_reg launches the enclosed block
sequentially:

  for (gomp_team_num = 0; gomp_team_num < num_teams; gomp_team_num++)
    fn (data);

When the first team launches, it blocks waiting for the detach event in
GOMP_parallel_end -> gomp_team_end -> gomp_team_barrier_wait_end, and never
gets around to launching the second team.

If I omit the 'omp parallel' (to try to get an undeferred task), GCC
refuses to compile it (only 'distribute', 'parallel' or 'loop' regions are
allowed to be strictly nested inside a 'teams' region).  And you can't nest
'omp teams' inside an 'omp parallel' either.

Is there any way of doing this within OpenMP, or do we have to resort to
creating threads outside of OpenMP?

Thanks

Kwok