Hi Jakub, Am 17.05.2022 um 20:08 schrieb Jakub Jelinek: > On Tue, May 17, 2022 at 11:57:02AM +0200, Marcel Vollweiler wrote: >>> More importantly, I have no idea how this can work when you pass arg_size 0 >>> and arg_align 0. The s variable is in the current function frame, with >>> arg_size 0 nothing is really copied to the generated task. >>> arg_size should be sizeof (memcpy_t) and arg_align __alignof__ (memcpy_t) >>> (well, struct omp_target_memcpy_data). >> >> The copy function of GOMP_task ("cpyfn") is not used here (set to NULL) and thus >> also arg_size and arg_align are set to 0 since they are related to cpyfn if I >> understand it correctly. > > No, arg_size and arg_align are for all (explicit) tasks the size and > alignment of the arguments. For an included task (one executed by the > encountering thread) we indeed use data directly instead of allocating > arg_size arg_align aligned bytes and copying data to it. But when we create > a deferred task (that is the only thing that actually can be asynchronous), we > allocate struct gomp_task together with memory for the data (arg_size bytes > aligned to arg_align). If cpyfn, we invoke that copy function (from source > data to the destination buffer), otherwise memcpy. cpyfn is a callback that > will do memcpy for parts that need bitwise copy and copy construction / > whatever else is needed for other data. > Looking at your patch, you call GOMP_task always with if_clause = false, > that means it is always included task (like with #pragma omp task if(0)), > but that also means calling GOMP_task doesn't bring any advantages and it is > not asynchronous. > If you called it with if_clause = true, like what #pragma omp task would do, > then the arg_size = 0 and arg_align = 0 would make it not work at all, > so after fixing if_clause, you need to supply sizeof (s) and __alignof__ (s). Good explanation, thanks. Changed accordingly. > >>> Also, it would be nice to avoid GOMP_task for the depobj_count == 0 case >>> at least sometimes (but perhaps that can be done incrementally) and instead >>> use some CUDA etc. asynchronous copy APIs. We don't really need to wait >>> for anything in that case, and from OpenMP POV all we need to make sure is >>> that barrier/taskwait/taskgroup end will know about these "tasks" and >>> wait for them. So, it can be implemented more like #pragma omp target nowait >>> instead of #pragma omp task that calls the synchronous omp_target_memcpy. >>> Though, maybe that is how it should be implemented always, something like >>> gomp_create_target_task and its caller. We already use that single routine >>> for multiple purposes (target nowait as well as target enter/exit data >>> nowait), so just telling it somehow that it shouldn't do mapping/unmapping >>> and perhaps target execution and instead copying would be nice. >> >> I dont't see/understand the advantage using gomp_create_target_task over >> GOMP_task. Whether the task waits for dependencies >> ("gomp_task_maybe_wait_for_dependencies") depends on GOMP_TASK_FLAG_DEPEND which >> is only set if depobj_count > 0 and depobj_list != NULL. Thus, there shouldn't >> be any waiting in case of depobj_count == 0? Additionally, in both functions a >> new thread is created - independently of dependencies. > > GOMP_task never creates a new thread. > gomp_create_target_task can create (but just once) an unshackeled thread > that runs on the side, doesn't do normal OpenMP user work and just polls the > offloading device and performs unmapping or whatever is needed to finish a > nowait offloaded task. > > The disadvantage of GOMP_task is: > 1) if you call say omp_target_memcpy_async from outside of parallel, it will > not be actually asynchronous even if you call GOMP_task with if_clause = true > 2) if you call it from inside of parallel, it might be scheduled only when > some host thread is ready for work (e.g. when reaching #pragma omp barrier, > implicit barrier, #pragma omp taskwait etc.), so even when the offloading > device is unused but host has lots of work to do, it might take quite a > while before starting the work, and then one of the OpenMP host threads > will be blocked waiting for the copying to be done > > gomp_create_target_task doesn't have these disadvantages, it can fire off the > copying right away and then just needs to be able to figure out when it > finished (either the unshackeled thread polls the device, or some other way > how to find out that it finished; but OpenMP certainly needs to know that, > because user code can say #pragma omp taskwait for it, or it should be > complete at the end of a taskgroup, or at the end of #pragma omp barrier > or implicit barrier etc.). > > Anyway, I guess it is ok to use GOMP_task in the initial patch and change it > later, but if_clause = false and 0, 0 for arg_{size,align} are definitely > wrong. Agreed. Thanks for the details. > >> +int >> +omp_target_memcpy (void *dst, const void *src, size_t length, size_t dst_offset, >> + size_t src_offset, int dst_device_num, int src_device_num) >> +{ >> + struct gomp_device_descr *dst_devicep = NULL, *src_devicep = NULL; >> + int ret; >> + >> + ret = omp_target_memcpy_check (dst_device_num, src_device_num, &dst_devicep, >> + &src_devicep); > > You can just use > int ret = omp_target_memcpy_check (dst_device_num, src_device_num, > &dst_devicep, &src_devicep); Changed. > >> +int >> +omp_target_memcpy_async (void *dst, const void *src, size_t length, >> + size_t dst_offset, size_t src_offset, >> + int dst_device_num, int src_device_num, >> + int depobj_count, omp_depend_t *depobj_list) >> +{ >> + struct gomp_device_descr *dst_devicep = NULL, *src_devicep = NULL; >> + void (*fn) (void *) = &omp_target_memcpy_async_helper; > > No need for the fn variable, just pass /*fn=*/omp_target_memcpy_async_helper > as the first argument to GOMP_task. Changed. > >> + unsigned int flags = 0; >> + void *data; > > No need for the data variable. > >> + void *depend[depobj_count + 5]; >> + int i; >> + int check = omp_target_memcpy_check (dst_device_num, src_device_num, >> + &dst_devicep, &src_devicep); >> + >> + omp_target_memcpy_data s = { >> + .dst = dst, >> + .src = src, >> + .length = length, >> + .dst_offset = dst_offset, >> + .src_offset = src_offset, >> + .dst_devicep = dst_devicep, >> + .src_devicep = src_devicep >> + }; >> + data = &s; > > And the above stmt, just pass &s as the second argument. Changed. > >> + >> + if (check) >> + return check; >> + >> + depend[0] = 0; >> + depend[1] = (void *) (uintptr_t) depobj_count; >> + depend[2] = depend[3] = depend[4] = 0; >> + for (i = 0; i < depobj_count; ++i) >> + depend[i + 5] = &depobj_list[i]; > > This doesn't need to be done if flags will not include > GOMP_TASK_FLAG_DEPEND, so maybe better: > >> + >> + if (depobj_count > 0 && depobj_list != NULL) >> + flags |= GOMP_TASK_FLAG_DEPEND; > > add here > else > { > depend[0] = 0; > ... > } Added the "depend" definition to the "if" branch (instead the "else" branch). > >> + >> + GOMP_task (fn, data, /*cpyfn=*/NULL, /*arg_size=*/0, /*arg_align=*/0, >> + /*if_clause=*/false, flags, depend, /*priority_arg=*/0, >> + /*detach=*/NULL); > > Ditto for the other function. Also changed. An updated patch is attached (and tested again on x86_64-linux with nvptx and amdgcn offloading without regression). Marcel ----------------- Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955