Date: Wed, 24 Feb 2021 20:46:25 +0100
From: Jakub Jelinek
To: Kwok Cheung Yeung
Cc: GCC Patches
Subject: Re: [WIP] Re: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738]
Message-ID: <20210224194625.GC4020736@tucnak>
In-Reply-To: <253a1682-5e95-c8a1-4274-ad1d9db65c44@codesourcery.com>
References: <20210129150317.GN4020736@tucnak> <0aca6daf-356a-7b03-c007-e33c35114356@codesourcery.com> <20210222134943.GA1837485@tucnak> <253a1682-5e95-c8a1-4274-ad1d9db65c44@codesourcery.com>

On Wed, Feb 24, 2021 at 06:17:01PM +0000, Kwok Cheung Yeung wrote:
> > 1) while linux --enable-futex and accel gomp_sem_t is small (int), rtems
> > and especially posix gomp_sem_t is large; so while it might be a good
> > idea to inline gomp_sem_t on config/{linux,accel} into the union, for
> > the rest it might be better to use indirection; if it is only for the
> > undeferred tasks, it could be just using an automatic variable and
> > put into the struct address of that; could be done either always,
> > or define some macro in config/{linux,accel}/sem.h that gomp_sem_t is
> > small and decide on the indirection based on that macro
>
> I think a pointer to an automatic variable would be simplest.

Agreed.
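For the undeferred case that could look roughly like the following (only a
sketch against the existing GOMP_task code, assuming completion_sem becomes
a pointer field as in the struct hunk quoted below; the event handle setup
and the unrelated parts of the function are elided):

      /* Sketch: undeferred path in GOMP_task.  The semaphore is an
         automatic variable of the encountering thread; the task only
         stores its address, so sizeof (struct gomp_task) no longer
         depends on sizeof (gomp_sem_t).  */
      struct gomp_task task;
      gomp_sem_t completion_sem;

      gomp_init_task (&task, thr->task, gomp_icv (false));
      task.kind = GOMP_TASK_UNDEFERRED;
      if ((flags & GOMP_TASK_FLAG_DETACH) != 0)
        {
          gomp_sem_init (&completion_sem, 0);
          task.completion_sem = &completion_sem;
          /* ... publish the event handle through *(void **) detach ...  */
        }
      /* ... run cpyfn/fn as before ...  */
      if ((flags & GOMP_TASK_FLAG_DETACH) != 0)
        {
          gomp_sem_wait (&completion_sem);
          gomp_sem_destroy (&completion_sem);
        }
      gomp_end_task ();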
> Can anything in cpyfn make use of the fact that kind==GOMP_TASK_UNDEFERRED
> while executing it? Anyway, if we want to keep this, then I suppose we
> could just add an extra field deferred_p that does not change for the
> lifetime of the task to indicate that the task is 'really' a deferred
> task.

Adding a bool is fine, but see below.

> > 3) kind is not constant, for the deferred tasks it can change over the
> > lifetime of the task, as you've said in the comments, it is kind ==
> > GOMP_TASK_UNDEFERRED vs. other values; while the changes of task->kind
> > are done while holding the task lock, omp_fulfill_event reads it before
> > locking that lock, so I think it needs to be done using
> > if (__atomic_load_n (&task->kind, MEMMODEL_RELAXED) == GOMP_TASK_UNDEFERRED)
> > Pedantically the stores to task->kind also need to be done
> > with __atomic_store_n MEMMODEL_RELAXED.
>
> If we check task->deferred_p instead (which never changes for a task
> after instantiation), is that still necessary?

Not for kind or the new field.

> > - in gomp_barrier_handle_tasks the reason for if (new_tasks > 1)
> > is that if there is a single dependent task, the current thread
> > just finished handling one task and so can take that single task and so
> > no need to wake up.  While in the omp_fulfill_event case, even if there
> > is just one new task, we need to schedule it to some thread and so
> > is desirable to wake some thread.
>
> In that case, we could just do 'if (new_tasks > 0)' instead?

Yes.

> > All we know
> > (if team == gomp_thread ()->ts.team) is that at least one thread is
> > doing something else but that one could be busy for quite some time.
>
> Well, it should still get around to the new task eventually, so there is
> no problem in terms of correctness here. I suppose we could always wake
> up one more thread than strictly necessary, but that might have knock-on
> effects on performance elsewhere?

Yeah, waking something unnecessarily is always going to cause performance
problems.

> I have applied your patch to move the gomp_team_barrier_done, and in
> omp_fulfill_event, I ensure that a single thread is woken up so that
> gomp_barrier_handle_tasks can signal for the barrier to finish.
>
> I'm having some trouble coming up with a testcase to test this scenario
> though. I tried having a testcase like this to have threads in separate
> teams:

The unshackled thread testcase would probably need a pthread_create call
and restricting the testcase to POSIX threads targets.  The teams in host
teams (or target) don't have, at least in OpenMP, a way to guarantee they
aren't serialized, e.g. it can always be implemented like we do ATM.  But
I guess that testcase can be done incrementally.
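Such a testcase could look roughly like this (untested sketch only; the
sleep-based synchronization is crude and it would still need the usual dg-
directives restricting it to POSIX threads targets):

/* Sketch of the unshackled thread scenario: a plain pthread, not part of
   any OpenMP team, fulfills the detach event while the team is already
   waiting in the implicit barrier.  */
#include <omp.h>
#include <pthread.h>
#include <unistd.h>

static omp_event_handle_t ev;

static void *
fulfill (void *arg)
{
  (void) arg;
  sleep (1);	/* Crude way to let the team reach the barrier first.  */
  omp_fulfill_event (ev);
  return NULL;
}

int
main (void)
{
  pthread_t thr;
  #pragma omp parallel
  #pragma omp single
  {
    #pragma omp task detach (ev)
    { }
    pthread_create (&thr, NULL, fulfill, NULL);
  }	/* The implicit barrier must not hang on the detached task.  */
  pthread_join (thr, NULL);
  return 0;
}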
> @@ -545,8 +548,15 @@ struct gomp_task
>       entries and the gomp_task in which they reside.  */
>    struct priority_node pnode[3];
>
> -  bool detach;
> -  gomp_sem_t completion_sem;
> +  union {
> +    /* Valid only if deferred_p is false.  */
> +    gomp_sem_t *completion_sem;
> +    /* Valid only if deferred_p is true.  Set to the team that executes the
> +       task if the task is detached and the completion event has yet to be
> +       fulfilled.  */
> +    struct gomp_team *detach_team;
> +  };
> +  bool deferred_p;
>
>    struct gomp_task_icv icv;
>    void (*fn) (void *);

What I don't like is that this creates too much wasteful padding in a
struct that should be as small as possible.  At least on 64-bit hosts,
which we care about most, pahole shows with your patch:

struct gomp_task {
	struct gomp_task *            parent;               /*     0     8 */
	struct priority_queue         children_queue;        /*     8    32 */
	struct gomp_taskgroup *       taskgroup;             /*    40     8 */
	struct gomp_dependers_vec *   dependers;             /*    48     8 */
	struct htab *                 depend_hash;           /*    56     8 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	struct gomp_taskwait *        taskwait;              /*    64     8 */
	size_t                        depend_count;          /*    72     8 */
	size_t                        num_dependees;         /*    80     8 */
	int                           priority;              /*    88     4 */

	/* XXX 4 bytes hole, try to pack */

	struct priority_node          pnode[3];              /*    96    48 */
	/* --- cacheline 2 boundary (128 bytes) was 16 bytes ago --- */
	union {
		gomp_sem_t *          completion_sem;         /*   144     8 */
		struct gomp_team *    detach_team;            /*   144     8 */
	};                                                    /*   144     8 */
	_Bool                         deferred_p;             /*   152     1 */

	/* XXX 7 bytes hole, try to pack */

	struct gomp_task_icv          icv;                    /*   160    40 */
	/* --- cacheline 3 boundary (192 bytes) was 8 bytes ago --- */
	void                          (*fn)(void *);          /*   200     8 */
	void *                        fn_data;                /*   208     8 */
	enum gomp_task_kind           kind;                   /*   216     4 */
	_Bool                         in_tied_task;           /*   220     1 */
	_Bool                         final_task;             /*   221     1 */
	_Bool                         copy_ctors_done;        /*   222     1 */
	_Bool                         parent_depends_on;      /*   223     1 */
	struct gomp_task_depend_entry depend[];               /*   224     0 */

	/* size: 224, cachelines: 4, members: 21 */
	/* sum members: 213, holes: 2, sum holes: 11 */
	/* last cacheline: 32 bytes */
};

So perhaps it might be better to put the new fields before the
  int priority;
field, in the order
  union { ... };
  bool deferred_p;
That way, there will be just a 3 byte hole in the whole struct, not the
4 + 7 byte holes above.

> -      if (task.detach && !task_fulfilled_p (&task))
> -	gomp_sem_wait (&task.completion_sem);
> +      if ((flags & GOMP_TASK_FLAG_DETACH) != 0 && detach)
> +	gomp_sem_wait (&completion_sem);

I think a gomp_sem_destroy is missing here (in the conditional, if the
semaphore was only initialized conditionally).  Furthermore, I don't
understand the && detach: the earlier code assumes that if
(flags & GOMP_TASK_FLAG_DETACH) != 0 then it can dereference
*(void **) detach, so the && detach seems to be unnecessary.

> @@ -484,15 +483,16 @@ GOMP_task (void (*fn) (void *), void *data, void (*cpyfn) (void *, void *),
>        task->kind = GOMP_TASK_UNDEFERRED;
>        task->in_tied_task = parent->in_tied_task;
>        task->taskgroup = taskgroup;
> +      task->deferred_p = true;
>        if ((flags & GOMP_TASK_FLAG_DETACH) != 0)
>  	{
> -	  task->detach = true;
> -	  gomp_sem_init (&task->completion_sem, 0);
> -	  *(void **) detach = &task->completion_sem;

I think you can move the task->deferred_p store into the if stmt.

> +      if (!shackled_thread_p
> +	  && !do_wake
> +	  && gomp_team_barrier_waiting_for_tasks (&team->barrier)
> +	  && team->task_detach_count == 0)

&& team->task_detach_count == 0 is cheaper than
&& gomp_team_barrier_waiting_for_tasks (&team->barrier), so please swap
those two.

> +	{
> +	  /* Ensure that at least one thread is woken up to signal that the
> +	     barrier can finish.  */
> +	  do_wake = 1;
> +	}

Please drop the {}s around the single do_wake = 1; stmt.

Otherwise LGTM.

	Jakub
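P.S. Not compiled, just to illustrate the suggested field order; the
offsets in the comments are my own arithmetic for the same 64-bit host as
the pahole dump above and assume struct priority_node keeps its 8 byte
alignment:

struct gomp_task
{
  ...					/* earlier fields unchanged */
  size_t num_dependees;			/* offset  80, size 8 */
  union {
    /* Valid only if deferred_p is false.  */
    gomp_sem_t *completion_sem;		/* offset  88, size 8 */
    /* Valid only if deferred_p is true.  */
    struct gomp_team *detach_team;
  };
  bool deferred_p;			/* offset  96, size 1 */
					/* 3 byte hole */
  int priority;				/* offset 100, size 4 */
  struct priority_node pnode[3];	/* offset 104, size 48 */
  ...					/* rest unchanged; total size
					   roughly 216 instead of 224 */
};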