From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from us-smtp-delivery-74.mimecast.com (us-smtp-delivery-74.mimecast.com [170.10.133.74]) by sourceware.org (Postfix) with ESMTPS id C82E8385735B for ; Thu, 5 May 2022 08:30:30 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org C82E8385735B Received: from mimecast-mx02.redhat.com (mx3-rdu2.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-662-UP8PcwXBOjGxh_6SJT5eWA-1; Thu, 05 May 2022 04:30:27 -0400 X-MC-Unique: UP8PcwXBOjGxh_6SJT5eWA-1 Received: from smtp.corp.redhat.com (int-mx01.intmail.prod.int.rdu2.redhat.com [10.11.54.1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id A0599398CA77; Thu, 5 May 2022 08:30:26 +0000 (UTC) Received: from tucnak.zalov.cz (unknown [10.39.192.16]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 400E740CFD37; Thu, 5 May 2022 08:30:26 +0000 (UTC) Received: from tucnak.zalov.cz (localhost [127.0.0.1]) by tucnak.zalov.cz (8.16.1/8.16.1) with ESMTPS id 2458UM641705822 (version=TLSv1.3 cipher=TLS_AES_256_GCM_SHA384 bits=256 verify=NOT); Thu, 5 May 2022 10:30:23 +0200 Received: (from jakub@localhost) by tucnak.zalov.cz (8.16.1/8.16.1/Submit) id 2458UL661705821; Thu, 5 May 2022 10:30:21 +0200 Date: Thu, 5 May 2022 10:30:21 +0200 From: Jakub Jelinek To: Marcel Vollweiler Cc: gcc-patches@gcc.gnu.org, fortran@gcc.gnu.org Subject: Re: [PATCH] OpenMP, libgomp: Add new runtime routines omp_target_memcpy_async and omp_target_memcpy_rect_async Message-ID: Reply-To: Jakub Jelinek References: MIME-Version: 1.0 In-Reply-To: X-Scanned-By: MIMEDefang 2.84 on 10.11.54.1 X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=us-ascii Content-Disposition: inline X-Spam-Status: No, score=-4.2 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_NONE, TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: fortran@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Fortran mailing list List-Unsubscribe: , List-Archive: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 05 May 2022 08:30:33 -0000 On Mon, Feb 21, 2022 at 12:19:20PM +0100, Marcel Vollweiler wrote: > gcc/ChangeLog: > > * omp-low.cc (omp_runtime_api_call): Added target_memcpy_async and > target_memcpy_rect_async to omp_runtime_apis array. > > libgomp/ChangeLog: > > * libgomp.map: Added omp_target_memcpy_async and > omp_target_memcpy_rect_async. > * libgomp.texi: Both functions are now supported. > * omp.h.in: Added omp_target_memcpy_async and > omp_target_memcpy_rect_async. > * omp_lib.f90.in: Added interfaces for both new functions. > * omp_lib.h.in: Likewise. > * target.c (omp_target_memcpy): Restructured into check and copy part. > (omp_target_memcpy_check): New helper function for omp_target_memcpy and > omp_target_memcpy_async that checks requirements. > (omp_target_memcpy_copy): New helper function for omp_target_memcpy and > omp_target_memcpy_async that performs the memcpy. > (omp_target_memcpy_async_helper): New helper function that is used in > omp_target_memcpy_async for the asynchronous task. > (omp_target_memcpy_async): Added. > (omp_target_memcpy_rect): Restructured into check and copy part. > (omp_target_memcpy_rect_check): New helper function for > omp_target_memcpy_rect and omp_target_memcpy_rect_async that checks > requirements. > (omp_target_memcpy_rect_copy): New helper function for > omp_target_memcpy_rect and omp_target_memcpy_rect_async that performs > the memcpy. > (omp_target_memcpy_rect_async_helper): New helper function that is used > in omp_target_memcpy_rect_async for the asynchronous task. > (omp_target_memcpy_rect_async): Added. > * testsuite/libgomp.c-c++-common/target-memcpy-async-1.c: New test. > * testsuite/libgomp.c-c++-common/target-memcpy-async-2.c: New test. > * testsuite/libgomp.c-c++-common/target-memcpy-rect-async-1.c: New test. > * testsuite/libgomp.c-c++-common/target-memcpy-rect-async-2.c: New test. > * testsuite/libgomp.fortran/target-memcpy-async-1.f90: New test. > * testsuite/libgomp.fortran/target-memcpy-async-2.f90: New test. > * testsuite/libgomp.fortran/target-memcpy-rect-async-1.f90: New test. > * testsuite/libgomp.fortran/target-memcpy-rect-async-2.f90: New test. > > --- a/libgomp/libgomp.map > +++ b/libgomp/libgomp.map > @@ -224,6 +224,8 @@ OMP_5.1 { > omp_set_teams_thread_limit_8_; > omp_get_teams_thread_limit; > omp_get_teams_thread_limit_; > + omp_target_memcpy_async; > + omp_target_memcpy_rect_async; > } OMP_5.0.2; These should be added to OMP_5.1.1, not here. > --- a/libgomp/omp.h.in > +++ b/libgomp/omp.h.in > @@ -272,6 +272,10 @@ extern int omp_target_is_present (const void *, int) __GOMP_NOTHROW; > extern int omp_target_memcpy (void *, const void *, __SIZE_TYPE__, > __SIZE_TYPE__, __SIZE_TYPE__, int, int) > __GOMP_NOTHROW; > +extern int omp_target_memcpy_async (void *, const void *, __SIZE_TYPE__, > + __SIZE_TYPE__, __SIZE_TYPE__, int, int, > + int, omp_depend_t*) Formatting, space before *. > + __GOMP_NOTHROW; > extern int omp_target_memcpy_rect (void *, const void *, __SIZE_TYPE__, int, > const __SIZE_TYPE__ *, > const __SIZE_TYPE__ *, > @@ -279,6 +283,14 @@ extern int omp_target_memcpy_rect (void *, const void *, __SIZE_TYPE__, int, > const __SIZE_TYPE__ *, > const __SIZE_TYPE__ *, int, int) > __GOMP_NOTHROW; > +extern int omp_target_memcpy_rect_async (void *, const void *, __SIZE_TYPE__, > + int, const __SIZE_TYPE__ *, > + const __SIZE_TYPE__ *, > + const __SIZE_TYPE__ *, > + const __SIZE_TYPE__ *, > + const __SIZE_TYPE__ *, int, int, int, > + omp_depend_t*) Likewise. > -int > -omp_target_memcpy (void *dst, const void *src, size_t length, > - size_t dst_offset, size_t src_offset, int dst_device_num, > - int src_device_num) > +static int > +omp_target_memcpy_check (void *dst, const void *src, int dst_device_num, > + int src_device_num, > + struct gomp_device_descr **dst_devicep, > + struct gomp_device_descr **src_devicep) > { Why does omp_target_memcpy_check need the dst and src arguments? From what I can see, they aren't used by it. > +typedef struct > +{ > + void *dst; > + const void *src; > + size_t length; > + size_t dst_offset; > + size_t src_offset; > + struct gomp_device_descr *dst_devicep; > + struct gomp_device_descr *src_devicep; > +} memcpy_t; Please come up with some less generic name, struct omp_target_memcpy_data or something similar. Even the *_t suffix is problematic, as *_t is reserved for the implementation. > + > +void > +omp_target_memcpy_async_helper (void *args) This should be static. > +{ > + memcpy_t *a = args; > + int ret = omp_target_memcpy_copy (a->dst, a->src, a->length, a->dst_offset, > + a->src_offset, a->dst_devicep, > + a->src_devicep); > + if (ret) > + gomp_fatal ("asynchronous memcpy failed"); I'm not really sure killing the whole program if the copying failed is the best action. Has it been discussed on omp-lang? Perhaps the APIs should have a way how to propagate the result to the caller when it completes somehow? Even if we do that, the ret variable seems to be superfluos, just do if (omp_target_memcpy_copy (...)) gomp_fatal (...); > +{ > + struct gomp_device_descr *dst_devicep = NULL, *src_devicep = NULL; > + > + int check = omp_target_memcpy_check (dst, src, dst_device_num, src_device_num, > + &dst_devicep, &src_devicep); > + if (check) > + return check; > + > + void (*fn) (void *) = &omp_target_memcpy_async_helper; > + void *data = NULL; > + void (*cpyfn) (void *, void *) = NULL; > + long arg_size = 0; > + long arg_align = 0; > + bool if_clause = false; > + unsigned flags = 0; > + int priority_arg = 0; > + void *detach = NULL; > + > + memcpy_t s = { > + .dst = dst, > + .src = src, > + .length = length, > + .dst_offset = dst_offset, > + .src_offset = src_offset, > + .dst_devicep = dst_devicep, > + .src_devicep = src_devicep > + }; I think we in libgomp try to use C89 and so declare vars first before other statements. > + data = &s; > + > + void *depend[depobj_count+5]; Spaces around + , i.e. depobj_count + 5 > + depend[0] = 0; > + depend[1] = (void*) ((uintptr_t) depobj_count); Space before *. The ()s around (uintptr_t) depobj_count are superfluous. > + depend[2] = depend[3] = depend[4] = 0; > + for (int i = 0; i < depobj_count; ++i) > + depend[i+5] = &depobj_list[i]; i + 5 > + > + if (depobj_count > 0 && depobj_list != NULL) > + flags |= GOMP_TASK_FLAG_DEPEND; > + > + GOMP_task (fn, data, cpyfn, arg_size, arg_align, if_clause, flags, depend, > + priority_arg, detach); We need to make sure that GOMP_task doesn't go through PLT. So, I think this needs to be ialias_call and task.c needs to add ialias for GOMP_task. Also, I must say I don't like very much using variables that you initialize to constants and just pass to the call, either pass the constants directly to the call, or use /*priority_arg=*/0, /*detach=*/NULL style. More importantly, I have no idea how this can work when you pass arg_size 0 and arg_align 0. The s variable is in the current function frame, with arg_size 0 nothing is really copied to the generated task. arg_size should be sizeof (memcpy_t) and arg_align __alignof__ (memcpy_t) (well, struct omp_target_memcpy_data). Also, it would be nice to avoid GOMP_task for the depobj_count == 0 case at least sometimes (but perhaps that can be done incrementally) and instead use some CUDA etc. asynchronous copy APIs. We don't really need to wait for anything in that case, and from OpenMP POV all we need to make sure is that barrier/taskwait/taskgroup end will know about these "tasks" and wait for them. So, it can be implemented more like #pragma omp target nowait instead of #pragma omp task that calls the synchronous omp_target_memcpy. Though, maybe that is how it should be implemented always, something like gomp_create_target_task and its caller. We already use that single routine for multiple purposes (target nowait as well as target enter/exit data nowait), so just telling it somehow that it shouldn't do mapping/unmapping and perhaps target execution and instead copying would be nice. E.g. if one uses omp_target_memcpy_async outside of any explicit parallel or host teams, would be nice if it still was asynchronous and not synchronous. But even in explicit parallel, would be nice if we didn't waste one of the threads waiting for it when it can do useful work on the host. It is true that for target nowait we have one unshackeled thread usually that polls the device. Though that is mainly because we need to do some unmapping at the end of target nowait, including taking the lock etc. For the async copying maybe we don't need to take any lock and could just arrange for check if already completed or sleep until completed if possible, at least in the future. For now at least handling it like target nowait would be an improvement. > +} memcpy_rect_t; Again, please use better type name. > + > +void And this should be static. > +omp_target_memcpy_rect_async_helper (void *args) > +{ > + memcpy_rect_t *a = args; > + int ret = omp_target_memcpy_rect_copy (a->dst, a->src, a->element_size, > + a->num_dims, a->volume, a->dst_offsets, > + a->src_offsets, a->dst_dimensions, > + a->src_dimensions, a->dst_devicep, > + a->src_devicep); > + if (ret) > + gomp_fatal ("asynchronous memcpy failed"); See above. Jakub