public inbox for gcc@gcc.gnu.org
*  Re: [GSoC'19, libgomp work-stealing] Task parallelism runtime
       [not found] <e2a9f7c55311795785d0f2c47f70acbd@cweb001.nm.nfra.io>
@ 2019-06-24 19:55 ` 김규래
  2019-06-24 20:13   ` Jakub Jelinek
  2019-07-13  7:46   ` John Pinkerton
  2019-07-09 12:56 ` 김규래
  1 sibling, 2 replies; 20+ messages in thread
From: 김규래 @ 2019-06-24 19:55 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: gcc

Hi,
I'm not very familiar with the gomp plugin system.
However, looking at 'GOMP_PLUGIN_target_task_completion', it seems like tasks have to go in and out of the runtime.
In that case, is it right that the tasks have to know which queue they came from?
I think I'll have to add the id of each task's corresponding queue to the gomp_task structure.
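For illustration, roughly what I have in mind (a sketch only; the queue_id
field is my addition, not existing libgomp code):

  struct gomp_task
  {
    /* ... existing members ...  */
    unsigned queue_id;  /* index of the per-thread queue this task was
                           pushed to, so completion callbacks can locate
                           it again (hypothetical field).  */
  };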
 
Ray Kim

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re:  Re: [GSoC'19, libgomp work-stealing] Task parallelism runtime
  2019-06-24 19:55 ` [GSoC'19, libgomp work-stealing] Task parallelism runtime 김규래
@ 2019-06-24 20:13   ` Jakub Jelinek
  2019-07-13  7:46   ` John Pinkerton
  1 sibling, 0 replies; 20+ messages in thread
From: Jakub Jelinek @ 2019-06-24 20:13 UTC (permalink / raw)
  To: 김규래; +Cc: gcc

On Tue, Jun 25, 2019 at 04:55:17AM +0900, 김규래 wrote:
> I'm not very familiar with the gomp plugin system.
> However, looking at 'GOMP_PLUGIN_target_task_completion', it seems like tasks have to go in and out of the runtime.
> In that case, is it right that the tasks have to know which queue they came from?
> I think I'll have to add the id of each task's corresponding queue to the gomp_task structure.

While libgomp has a plugin system, the only supported plugins are those in
the tree, i.e. libgomp/plugin/plugin-{hsa,nvptx}.c and
liboffloadmic/plugin/*.
The nvptx plugin doesn't have async support ATM, so it is just hsa and xeonphi
offloading that can call it when an asynchronous target region execution is
over.  No matter which task queue the task is in, gomp_target_task_completion
needs to ensure that if something is already waiting on it (taskwait, taskgroup
end, barrier, dependency wait), it is awakened.
And, as for other parts of task.c, there needs to be a design for which lock
is used to protect any code that needs to be guarded.
The current code, as you know, uses team->task_lock as a big lock.  I think
that with the separate workqueues + work stealing you need a per-implicit-task
lock plus the per-team one (team->task_lock).  Design the locking such that
there is no ABBA deadlock possibility and that you use the team task lock only
when really necessary (not sure, but perhaps one example where I really don't
see much way to avoid the per-team lock is task dependencies, the hash table
for that etc.).
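To illustrate the ordering discipline (just a sketch with a hypothetical
per-implicit-task queue_lock and need_team_state flag, not actual libgomp
code):

  /* Fixed order: the per-implicit-task queue lock always before the
     team lock, so two threads can never each hold the lock the other
     one wants (no ABBA).  */
  gomp_mutex_lock (&itask->queue_lock);
  /* ... push/pop on the per-implicit-task queue ...  */
  if (need_team_state)  /* e.g. touching counters or the depend hash */
    {
      gomp_mutex_lock (&team->task_lock);
      /* ... team-wide state ...  */
      gomp_mutex_unlock (&team->task_lock);
    }
  gomp_mutex_unlock (&itask->queue_lock);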

	Jakub

^ permalink raw reply	[flat|nested] 20+ messages in thread

*  Re: [GSoC'19, libgomp work-stealing] Task parallelism runtime
       [not found] <e2a9f7c55311795785d0f2c47f70acbd@cweb001.nm.nfra.io>
  2019-06-24 19:55 ` [GSoC'19, libgomp work-stealing] Task parallelism runtime 김규래
@ 2019-07-09 12:56 ` 김규래
  2019-07-13  6:28   ` Jakub Jelinek
  1 sibling, 1 reply; 20+ messages in thread
From: 김규래 @ 2019-07-09 12:56 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: gcc

Hi, 
This is a status update.
I've been working on unifying the three queues into a single queue.
I'm almost finished and have passed all the tests except for the dependency handling part.
 
Ray Kim

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re:  Re: [GSoC'19, libgomp work-stealing] Task parallelism runtime
  2019-07-09 12:56 ` 김규래
@ 2019-07-13  6:28   ` Jakub Jelinek
  2019-07-21  7:46     ` 김규래
  0 siblings, 1 reply; 20+ messages in thread
From: Jakub Jelinek @ 2019-07-13  6:28 UTC (permalink / raw)
  To: 김규래; +Cc: gcc

On Tue, Jul 09, 2019 at 09:56:00PM +0900, 김규래 wrote:
> Hi, 
> This is a status update.
> I've been working on unifying the three queues into a single queue.
> I'm almost finished and have passed all the tests except for the dependency handling part.

For dependencies, I can imagine taking a lock on the parent task rather than
the team lock when dealing with the dependency data structures, and outside of
the lock perhaps doing a quick check for whether there are any dependencies,
using an atomic load.  I can't imagine how one could get away without that,
though.  And while that can scale well if you have many tasks that spawn many
other tasks, it will still act as a team lock if, say, all tasks are spawned
from the same parent task.
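As a sketch (parent->depend_count and parent->depend_lock are hypothetical
fields here, not the current code):

  /* Cheap pre-check outside any lock; depend_count is assumed to be
     maintained with atomic operations.  */
  if (__atomic_load_n (&parent->depend_count, MEMMODEL_ACQUIRE) == 0)
    return;
  gomp_mutex_lock (&parent->depend_lock);
  /* ... look up the addresses in the parent's dependency hash table,
     record/satisfy dependencies ...  */
  gomp_mutex_unlock (&parent->depend_lock);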

	Jakub

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re:  Re: [GSoC'19, libgomp work-stealing] Task parallelism runtime
  2019-06-24 19:55 ` [GSoC'19, libgomp work-stealing] Task parallelism runtime 김규래
  2019-06-24 20:13   ` Jakub Jelinek
@ 2019-07-13  7:46   ` John Pinkerton
  1 sibling, 0 replies; 20+ messages in thread
From: John Pinkerton @ 2019-07-13  7:46 UTC (permalink / raw)
  To: gcc

unsubscribe

On Mon, Jun 24, 2019, at 3:55 PM, 김규래 wrote:
> Hi,
> I'm not very familiar with the gomp plugin system.
> However, looking at 'GOMP_PLUGIN_target_task_completion', it seems like 
> tasks have to go in and out of the runtime.
> In that case, is it right that the tasks have to know which queue 
> they came from?
> I think I'll have to add the id of each task's corresponding queue 
> to the gomp_task structure.
>  
> Ray Kim
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Re: [GSoC'19, libgomp work-stealing] Task parallelism runtime
  2019-07-13  6:28   ` Jakub Jelinek
@ 2019-07-21  7:46     ` 김규래
  2019-07-22 18:54       ` Jakub Jelinek
  0 siblings, 1 reply; 20+ messages in thread
From: 김규래 @ 2019-07-21  7:46 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: gcc

Hi Jakub,
About the snippet below, 
 
  if (gomp_barrier_last_thread (state))
    {
      if (team->task_count == 0)
        {
          gomp_team_barrier_done (&team->barrier, state);
          gomp_mutex_unlock (&team->task_lock);
          gomp_team_barrier_wake (&team->barrier, 0);
          return;
        }
      gomp_team_barrier_set_waiting_for_tasks (&team->barrier);
    }

Am I safe to assume that gomp_barrier_last_thread is thread-safe?

Ray Kim

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Re: [GSoC'19, libgomp work-stealing] Task parallelism runtime
  2019-07-21  7:46     ` 김규래
@ 2019-07-22 18:54       ` Jakub Jelinek
  2019-07-22 19:00         ` 김규래
  0 siblings, 1 reply; 20+ messages in thread
From: Jakub Jelinek @ 2019-07-22 18:54 UTC (permalink / raw)
  To: 김규래; +Cc: gcc

On Sun, Jul 21, 2019 at 04:46:33PM +0900, 김규래 wrote:
> About the snippet below, 
>  
>   if (gomp_barrier_last_thread (state))
>     {
>       if (team->task_count == 0)
>         {
>           gomp_team_barrier_done (&team->barrier, state);
>           gomp_mutex_unlock (&team->task_lock);
>           gomp_team_barrier_wake (&team->barrier, 0);
>           return;
>         }
>       gomp_team_barrier_set_waiting_for_tasks (&team->barrier);
>     }
> 
> Am I safe to assume that gomp_barrier_last_thread is thread-safe?

Yes, you can look up the definition.
gomp_barrier_last_thread just tests a bit in the state bitmask passed to the
routine; the bit is set for the last thread that encounters the barrier, which
is figured out by doing an atomic subtraction on the counter.
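Schematically (simplified, not the exact code; see config/linux/bar.h):

  /* Every thread atomically decrements the arrival counter; the one
     that brings it to zero gets the BAR_WAS_LAST bit in its state.  */
  if (__atomic_add_fetch (&bar->awaited, -1, MEMMODEL_ACQ_REL) == 0)
    state |= BAR_WAS_LAST;

  static inline bool
  gomp_barrier_last_thread (gomp_barrier_state_t state)
  {
    return state & BAR_WAS_LAST;
  }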

	Jakub

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Re: [GSoC'19, libgomp work-stealing] Task parallelism runtime
  2019-07-22 18:54       ` Jakub Jelinek
@ 2019-07-22 19:00         ` 김규래
  2019-08-03  9:12           ` 김규래
  0 siblings, 1 reply; 20+ messages in thread
From: 김규래 @ 2019-07-22 19:00 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: gcc

> Yes, you can look up the definition.
> gomp_barrier_last_thread just tests a bit in the state bitmask passed to the
> routine; the bit is set for the last thread that encounters the barrier, which
> is figured out by doing an atomic subtraction on the counter.

I saw the implementation; I just wanted to be sure that's the general case.
Thanks.
 
Ray Kim
 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Re: [GSoC'19, libgomp work-stealing] Task parallelism runtime
  2019-07-22 19:00         ` 김규래
@ 2019-08-03  9:12           ` 김규래
  2019-08-05 10:32             ` Jakub Jelinek
  0 siblings, 1 reply; 20+ messages in thread
From: 김규래 @ 2019-08-03  9:12 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: gcc

Hi,
I'm currently having trouble implementing the thread sleeping mechanism for when the queue is out of tasks.
The problem is that it's hard to maintain consistency between the thread sleeping routine and the queues.
See the pseudocode below:
 
1. check queue is empty
2. go to sleep
 
If we go lock-free, the consistency between 1 and 2 cannot be maintained.
Moreover, if we go to a multi-queue setting, simply checking whether all the queues are empty is not consistent either.
I would really like to have the thread sleeping mechanism, but I'm not sure how we could do it.
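To make the race concrete, the interleaving I'm worried about is the classic
lost wakeup:

  worker thread                    producer thread
  1. sees the queue is empty
                                   pushes a task
                                   wakes sleepers (none yet)
  2. goes to sleep                 -> the task sits in the queue until
                                      some other thread picks it up

So the emptiness check and the act of sleeping have to be made atomic with
respect to the push in some way.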
 
Ray Kim

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Re: [GSoC'19, libgomp work-stealing] Task parallelism runtime
  2019-08-03  9:12           ` 김규래
@ 2019-08-05 10:32             ` Jakub Jelinek
  2019-08-05 11:01               ` 김규래
  0 siblings, 1 reply; 20+ messages in thread
From: Jakub Jelinek @ 2019-08-05 10:32 UTC (permalink / raw)
  To: 김규래; +Cc: gcc

On Sat, Aug 03, 2019 at 06:11:58PM +0900, 김규래 wrote:
> I'm currently having trouble implementing the thread sleeping mechanism for when the queue is out of tasks.
> The problem is that it's hard to maintain consistency between the thread sleeping routine and the queues.
> See the pseudocode below:
>
> 1. check queue is empty
> 2. go to sleep
>
> If we go lock-free, the consistency between 1 and 2 cannot be maintained.

I thought we don't want to go fully lock-free; the queue operations aren't
easily implementable lock-free.  Instead, use a lock for each of the queues,
so in the multi-queue setting the locks live on the implicit tasks that hold
those queues.  What can and should be done without a lock is perhaps a
preliminary check of whether a queue is empty, which can be done through an
__atomic_load.
And generally, going to sleep is done outside of the critical section; inside
the critical section we only decide whether to go to sleep or not.  The sleep
itself is then done either (on Linux) using futexes, or otherwise using
semaphores; both have the property that one can post to them before some
other thread sleeps on them, in which case the other thread doesn't actually
go to sleep.  The wake-up (a post on the semaphore, or updating the memory +
a later futex wake) is sometimes done inside of a critical section: the
memory update when it is not an atomic increase/decrease, while the futex
wake can be deferred until after the critical section if we remember from
the atomic operation whether the wake-up is needed.

Given say:
      ++team->task_count;
      ++team->task_queued_count;
      gomp_team_barrier_set_task_pending (&team->barrier);
      do_wake = team->task_running_count + !parent->in_tied_task
                < team->nthreads;
      gomp_mutex_unlock (&team->task_lock);
      if (do_wake)
        gomp_team_barrier_wake (&team->barrier, 1);
you can see the wake up is done outside of the critical section.
If team->task_lock isn't used, there will of course be problems: say,
team->task_count and team->task_queued_count need to be bumped atomically,
ditto operations on team->barrier.  The question is what to do with the
team->task_running_count check; if that one is updated atomically too, maybe
an __atomic_load might be good enough.  Worst case, though, it might
mean we don't in some cases wake anybody, so there will be threads idling
instead of doing useful work, but at least one thread probably should handle
it later.
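A sketch of the sleeping side (q->lock, q->head, q->nsleepers and q->sem are
hypothetical names; the real primitives are gomp_sem_t / gomp_sem_wait /
gomp_sem_post, or the futex helpers in config/linux/):

  /* Decide to sleep inside the critical section, sleep outside it.
     A semaphore can be posted before anybody waits on it, so a wake-up
     racing with this sequence is not lost.  */
  gomp_mutex_lock (&q->lock);
  bool do_sleep = (q->head == NULL);
  if (do_sleep)
    q->nsleepers++;             /* producers then know to post */
  gomp_mutex_unlock (&q->lock);
  if (do_sleep)
    gomp_sem_wait (&q->sem);    /* returns at once if already posted */

The producer correspondingly decrements nsleepers under q->lock after
enqueueing, and calls gomp_sem_post outside of it.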

	Jakub

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Re: [GSoC'19, libgomp work-stealing] Task parallelism runtime
  2019-08-05 10:32             ` Jakub Jelinek
@ 2019-08-05 11:01               ` 김규래
       [not found]                 ` <20190819061020.GA27842@laptop.zalov.cz>
  0 siblings, 1 reply; 20+ messages in thread
From: 김규래 @ 2019-08-05 11:01 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: gcc general

> I thought we don't want to go fully lock-free; the queue operations aren't
> easily implementable lock-free.  Instead, use a lock for each of the queues,

Hi,
By lock-free I meant using locks only for the queues,
but my terminology was indeed confusing, sorry about that.

> mean we don't in some cases wake anybody, so there will be threads idling
> instead of doing useful work, but at least one thread probably should handle
> it later.

I was personally worried about this case,
since it could result in huge inefficiencies, but maybe it'll be fine.
I'll first try to implement it.
Thanks.
Ray Kim

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Re: [GSoC'19, libgomp work-stealing] Task parallelism runtime
       [not found]                 ` <20190819061020.GA27842@laptop.zalov.cz>
@ 2019-08-25  7:49                   ` 김규래
  0 siblings, 0 replies; 20+ messages in thread
From: 김규래 @ 2019-08-25  7:49 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: gcc general

Hi Jakub,
I think the current semaphore sleep system ought to be improved.
I'm not sure how, but since the GSoC deadline is approaching, I'll just post the results without the semaphores.
Instead of sleeping on a per-task basis (for example, there are depend waits, task waits, taskgroup waits, etc.),
I think we should simply put the threads to sleep when the queue is empty and wake them up whenever a task finishes executing or
a new task is added to the queue.
This shouldn't be too difficult to implement using semaphores.
However, since the current gomp semaphores are not always the most performant, I'm not absolutely certain how to do this.
I'll defer this to after GSoC.
Let me know if you have an idea.
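Roughly (a sketch only, with hypothetical team->nsleepers / team->idle_sem
fields):

  /* After finishing a task or enqueueing a new one:  */
  if (__atomic_load_n (&team->nsleepers, MEMMODEL_RELAXED) > 0)
    gomp_sem_post (&team->idle_sem);    /* wake one idle worker */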
 
Ray Kim

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [GSoC'19, libgomp work-stealing] Task parallelism runtime
  2019-06-05 20:36         ` Janne Blomqvist
@ 2019-06-06 17:54           ` 김규래
  0 siblings, 0 replies; 20+ messages in thread
From: 김규래 @ 2019-06-06 17:54 UTC (permalink / raw)
  To: Janne Blomqvist; +Cc: gcc mailing list, Jakub Jelinek

> Another option, which I guess starts to go out of scope of your gsoc, is
> parallel depth first (PDF) search (Blelloch 1999) as an alternative to work
> stealing. Here's a presentation about some recent work in this area,
> although for Julia and not OpenMP (no idea if PDF would fit with OpenMP at
> all): https://www.youtube.com/watch?v=YdiZa0Y3F3c

I am actually aware of PDF and the work ongoing on the Julia side.
Also, I think it does not go out of the scope of GSoC,
since the essential goal is to implement a more advanced task-parallel scheduler anyway.

> Better cache locality.
Despite previous research results showing that PDF is better in terms of locality,
recently developed advanced work-stealing (WS) schemes have improved a lot in terms of data locality.
I think an up-to-date quantitative comparison with SOTA algorithms from both sides is required.

Personally, I think the WS framework is more flexible, and perhaps more popular right now.
I'd like to hear the opinion of others on the subject.

Ray Kim

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [GSoC'19, libgomp work-stealing] Task parallelism runtime
  2019-06-05 19:42       ` 김규래
@ 2019-06-05 20:36         ` Janne Blomqvist
  2019-06-06 17:54           ` 김규래
  0 siblings, 1 reply; 20+ messages in thread
From: Janne Blomqvist @ 2019-06-05 20:36 UTC (permalink / raw)
  To: 김규래; +Cc: gcc mailing list, Jakub Jelinek

On Wed, Jun 5, 2019 at 10:42 PM 김규래 <msca8h@naver.com> wrote:

> > On Wed, Jun 5, 2019 at 9:25 PM 김규래 <msca8h@naver.com> wrote:
>
> >
> > > Hi, thanks for the detailed explanation.
> > > I think I now get the picture.
> > > Judging from my current understanding, the task-parallelism currently
> works as follows:
> > > 1. Tasks are placed in a global shared queue.
> > 2. Workers consume the tasks by bashing the queue in a while loop,
> > just as in self-scheduling (dynamic scheduling).
> > >
> > > Then the improvements including work-stealing must be done by:
> > > 1. Each worker holds a dedicated task queue reducing the resource
> contention.
> > > 2. The tasks are distributed in a round-robin fashion
>
> > For nested task submission (does OpenMP support that?) you probably
> > want to submit to the local queue rather than round-robin, no?
>
>
>
> Hi Janne,
>
> I believe you're talking about spawning child tasks within an already
> running task, which I believe is supported by OpenMP.
>
> In this case, pushing to the local queue could introduce a deadlock if the
> master thread waits on the spawned tasks.
>
> A short one, though, since work-stealing can resolve it.
>
> A better way to handle this is to round-robin the child tasks to the
> queues excluding the queue of the thread consuming the current task.
>
> Then waiting on the tasks should never cause a deadlock.
>
> Or did I misunderstand your question?
>
> Maybe there is a specific reason for avoiding round-robin that I
> missed?
>
Better cache locality.

Another option, which I guess starts to go out of scope of your gsoc, is
parallel depth first (PDF) search (Blelloch 1999) as an alternative to work
stealing. Here's a presentation about some recent work in this area,
although for Julia and not OpenMP (no idea if PDF would fit with OpenMP at
all): https://www.youtube.com/watch?v=YdiZa0Y3F3c

-- 
Janne Blomqvist

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [GSoC'19, libgomp work-stealing] Task parallelism runtime
  2019-06-05 19:06     ` Janne Blomqvist
@ 2019-06-05 19:42       ` 김규래
  2019-06-05 20:36         ` Janne Blomqvist
  0 siblings, 1 reply; 20+ messages in thread
From: 김규래 @ 2019-06-05 19:42 UTC (permalink / raw)
  To: Janne Blomqvist; +Cc: gcc mailing list, Jakub Jelinek

> On Wed, Jun 5, 2019 at 9:25 PM 김규래 <msca8h@naver.com> wrote: 
>
> > Hi, thanks for the detailed explanation.
> > I think I now get the picture.
> > Judging from my current understanding, the task-parallelism currently works as follows:
> > 1. Tasks are placed in a global shared queue.
> > 2. Workers consume the tasks by bashing the queue in a while loop, just as in self-scheduling (dynamic scheduling).
> >
> > Then the improvements including work-stealing must be done by:
> > 1. Each worker holds a dedicated task queue reducing the resource contention.
> > 2. The tasks are distributed in a round-robin fashion

> For nested task submission (does OpenMP support that?) you probably
> want to submit to the local queue rather than round-robin, no? 
 
Hi Janne,
I believe you're talking about spawning child tasks within an already running task, which I believe is supported by OpenMP.
In this case, pushing to the local queue could introduce a deadlock if the master thread waits on the spawned tasks.
A short one, though, since work-stealing can resolve it.
A better way to handle this is to round-robin the child tasks to the queues excluding the queue of the thread consuming the current task.
Then waiting on the tasks should never cause a deadlock.
Or did I misunderstand your question?
Maybe there is a specific reason for avoiding round-robin that I missed?
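In other words, something like this sketch (thr->rr_counter, thr->id,
team->queues and queue_push are all hypothetical names):

  /* Round-robin over all queues except our own, so a parent waiting on
     its children never ends up with all of them stuck behind itself.  */
  unsigned next = thr->rr_counter++ % (team->nthreads - 1);
  unsigned target = next < thr->id ? next : next + 1;  /* skip self */
  queue_push (&team->queues[target], child_task);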
 
Ray Kim

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [GSoC'19, libgomp work-stealing] Task parallelism runtime
  2019-06-05 18:25   ` 김규래
  2019-06-05 18:52     ` Jakub Jelinek
@ 2019-06-05 19:06     ` Janne Blomqvist
  2019-06-05 19:42       ` 김규래
  1 sibling, 1 reply; 20+ messages in thread
From: Janne Blomqvist @ 2019-06-05 19:06 UTC (permalink / raw)
  To: 김규래; +Cc: gcc mailing list, Jakub Jelinek

On Wed, Jun 5, 2019 at 9:25 PM 김규래 <msca8h@naver.com> wrote:
>
> Hi, thanks for the detailed explanation.
> I think I now get the picture.
> Judging from my current understanding, the task-parallelism currently works as follows:
> 1. Tasks are placed in a global shared queue.
> 2. Workers consume the tasks by bashing the queue in a while loop, just as in self-scheduling (dynamic scheduling).
>
> Then the improvements including work-stealing must be done by:
> 1. Each worker holds a dedicated task queue reducing the resource contention.
> 2. The tasks are distributed in a round-robin fashion

For nested task submission (does OpenMP support that?) you probably
want to submit to the local queue rather than round-robin, no?

> 3. work-stealing will resolve the load imbalance.
>
> If the above statements are correct, I guess the task priority should be given some special treatment?
>
> Ray Kim

-- 
Janne Blomqvist

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [GSoC'19, libgomp work-stealing] Task parallelism runtime
  2019-06-05 18:25   ` 김규래
@ 2019-06-05 18:52     ` Jakub Jelinek
  2019-06-05 19:06     ` Janne Blomqvist
  1 sibling, 0 replies; 20+ messages in thread
From: Jakub Jelinek @ 2019-06-05 18:52 UTC (permalink / raw)
  To: 김규래; +Cc: gcc

On Thu, Jun 06, 2019 at 03:25:24AM +0900, 김규래 wrote:
> Hi, thanks for the detailed explanation.
> I think I now get the picture.
> Judging from my current understanding, the task-parallelism currently works as follows: 
> 1. Tasks are placed in a global shared queue.

It isn't a global shared queue, but a per-team shared queue, in fact 3
different ones, guarded by the same per-team mutex team->task_lock though:
team->task_queue	used on barriers
task->children_queue	used for #pragma omp taskwait
taskgroup->taskgroup_queue	used at the end of #pragma omp taskgroup

> 2. Workers consume the tasks by bashing the queue in a while loop, just as in self-scheduling (dynamic scheduling).
>  
> Then the improvements including work-stealing must be done by:
> 1. Each worker holds a dedicated task queue reducing the resource contention.
> 2. The tasks are distributed in a round-robin fashion
> 3. work-stealing will resolve the load imbalance.
>  
> If the above statements are correct, I guess the task priority should be given some special treatment?

Yes, one thing to consider is task priority; another is what to do on those
#pragma omp taskwait or #pragma omp taskgroup waits.  While one can schedule
most of the tasks there (the OpenMP specification has restrictions on what
can be scheduled, but the restrictions are mostly relevant to untied tasks,
which we don't support anyway (well, we ignore untied and schedule them like
tied tasks)), it might be more beneficial (and it is what is currently
implemented) to queue only the tasks that will help satisfy what we are
waiting on.  Both priority and scheduling only task or taskgroup children
instead of all tasks are just hints; the implementation may choose other
tasks.
There is another thing though: tasks with dependencies, where
gomp_task_handle_depend, which needs to look up addresses in hash tables
for the dependencies, also uses the team->task_lock mutex.

	Jakub

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [GSoC'19, libgomp work-stealing] Task parallelism runtime
  2019-06-03 18:21 ` Jakub Jelinek
@ 2019-06-05 18:25   ` 김규래
  2019-06-05 18:52     ` Jakub Jelinek
  2019-06-05 19:06     ` Janne Blomqvist
  0 siblings, 2 replies; 20+ messages in thread
From: 김규래 @ 2019-06-05 18:25 UTC (permalink / raw)
  To: gcc; +Cc: Jakub Jelinek

Hi, thanks for the detailed explanation.
I think I now get the picture.
Judging from my current understanding, the task-parallelism currently works as follows: 
1. Tasks are placed in a global shared queue.
2. Workers consume the tasks by bashing the queue in a while loop, just as in self-scheduling (dynamic scheduling).
 
Then the improvements including work-stealing must be done by:
1. Each worker holds a dedicated task queue reducing the resource contention.
2. The tasks are distributed in a round-robin fashion
3. work-stealing will resolve the load imbalance.
 
If the above statements are correct, I guess the task priority should be given some special treatment?
 
Ray Kim
 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [GSoC'19, libgomp work-stealing] Task parallelism runtime
  2019-06-03 18:01 김규래
@ 2019-06-03 18:21 ` Jakub Jelinek
  2019-06-05 18:25   ` 김규래
  0 siblings, 1 reply; 20+ messages in thread
From: Jakub Jelinek @ 2019-06-03 18:21 UTC (permalink / raw)
  To: 김규래; +Cc: gcc

On Tue, Jun 04, 2019 at 03:01:13AM +0900, 김규래 wrote:
> Hi,
> I've been studying the libgomp task parallelism system.
> I have a few questions.
> First, tracing the events shows that only the main thread calls GOMP_task.

No, any thread can call GOMP_task, in particular the thread that encountered
the #pragma omp task construct.

The GOMP_task function then decides based on the clauses of the construct
(passed in various ways through the arguments of that function) whether it
will be included (executed by the encountering thread), or queued for
later execution.  In the latter case, it will be scheduled during a barrier
(implicit or explicit), see gomp_barrier_handle_tasks called from the
bar.[ch] code, or at other spots, e.g. during taskwait construct
(GOMP_taskwait) or at the end of taskgroup (GOMP_taskgroup_end).

> How do the other worker threads enter the libgomp runtime?

If you never encounter a parallel, teams or target construct, then there is
just one thread that does everything (well, the library is written such that
if you call pthread_create by hand, each such thread acts as a separate
initial thread from the OpenMP POV).
Threads are created e.g. during the parallel construct (GOMP_parallel), where,
for non-nested parallelism as the standard requires, it reuses existing
threads if possible or spawns new ones; see mainly team.c (gomp_team_start)
for the function that spawns new threads or wakes the ones waiting for
work, and gomp_thread_start in the same file for the function actually run by
the threads the libgomp library creates.

> I can't find the entry point of the worker threads from the event tracing and the assembly dump. 
> Second, how is the task priority set?

By the user, through the priority clause, which is passed to GOMP_task and
then taken into account when handling tasks in the various queues.
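For example, at the source level (process () and the loop bound are just
placeholders; note the priority is only a hint, and values are clamped to the
max-task-priority ICV, i.e. the OMP_MAX_TASK_PRIORITY environment variable):

  #pragma omp parallel
  #pragma omp single
  for (int i = 0; i < 64; i++)
    #pragma omp task priority(i % 8)  /* higher value = preferred */
    process (i);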

	Jakub

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [GSoC'19, libgomp work-stealing] Task parallelism runtime
@ 2019-06-03 18:01 김규래
  2019-06-03 18:21 ` Jakub Jelinek
  0 siblings, 1 reply; 20+ messages in thread
From: 김규래 @ 2019-06-03 18:01 UTC (permalink / raw)
  To: gcc; +Cc: jakub

Hi,
I've been studying the libgomp task parallelism system.
I have a few questions.
First, tracing the events shows that only the main thread calls GOMP_task.
How do the other worker threads enter the libgomp runtime?
I can't find the entry point of the worker threads from the event tracing and the assembly dump. 
Second, how is the task priority set?
 
Ray Kim

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2019-08-25  7:49 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <e2a9f7c55311795785d0f2c47f70acbd@cweb001.nm.nfra.io>
2019-06-24 19:55 ` [GSoC'19, libgomp work-stealing] Task parallelism runtime 김규래
2019-06-24 20:13   ` Jakub Jelinek
2019-07-13  7:46   ` John Pinkerton
2019-07-09 12:56 ` 김규래
2019-07-13  6:28   ` Jakub Jelinek
2019-07-21  7:46     ` 김규래
2019-07-22 18:54       ` Jakub Jelinek
2019-07-22 19:00         ` 김규래
2019-08-03  9:12           ` 김규래
2019-08-05 10:32             ` Jakub Jelinek
2019-08-05 11:01               ` 김규래
     [not found]                 ` <20190819061020.GA27842@laptop.zalov.cz>
2019-08-25  7:49                   ` 김규래
2019-06-03 18:01 김규래
2019-06-03 18:21 ` Jakub Jelinek
2019-06-05 18:25   ` 김규래
2019-06-05 18:52     ` Jakub Jelinek
2019-06-05 19:06     ` Janne Blomqvist
2019-06-05 19:42       ` 김규래
2019-06-05 20:36         ` Janne Blomqvist
2019-06-06 17:54           ` 김규래
