To: Jakub Jelinek, gcc-patches, fortran, Andre Vehreschild, Paul Richard Thomas, Catherine Moore
From: Tobias Burnus
Subject: [RFC] Fortran: OpenMP (Coarray?) – handling transfer/mapping of allocatable components, esp. polymorphic ones
Date: Thu, 18 Mar 2021 13:28:20 +0100

Fortran itself: The suggestion is to add a new entry to the vtable (a
breaking change); thus, please also comment if you are not interested
in OpenMP (or coarrays).

For OpenMP: When mapping a derived type to a non-shared-memory
(accelerator/GPU) device, it gets complicated with (polymorphic)
allocatable components, as OpenMP requires a deep copy of _allocatable_
components.
[Side note: 'virtual calls' on the device are also permitted, i.e. the
vtable also has to be mapped properly.]

For coarrays: I thought there was the same issue with CO_REDUCE
(arbitrary type w/ user-defined reduction function), but I now think
that I either missed a constraint or that J3/WG5 forgot to add one. See
the thread starting with my just-written email to J3 (no reply so far):
https://mailman.j3-fortran.org/pipermail/j3/2021-March/012965.html

[C++: Side note: OpenMP 5.1 now also permits virtual calls, but the
deep-copy problem does not seem to exist (except for the next item?).]

For OpenMP, I think there is a relation between this issue and how
MAPPER might be implemented. However, I have not looked at mappers;
hence, it could turn out to be a completely separate implementation or
not.

                             * * *

(A) EXAMPLES AS PRELIMINARY REMARK

  type recursive_t
    type(recursive_t), allocatable :: A  ! recursive types; OpenMP: valid since 5.1
  end type

  type t
  end type t

  type, extends(t) :: t2
    integer, allocatable :: A(:)  ! allocatable component
  end type t2

  type t3
    class(t), allocatable :: C    ! allocatable polymorphic component
  end type t3

  type(recursive_t) :: rt, rtc[*]
  class(t), allocatable :: B
  type(t3) :: C[*]
  ...
  !$omp target enter data map(to:B, rt, C)
  ...
  call CO_REDUCE (rtc, my_reduct_proc, result_image=1)
  call CO_REDUCE (C, my_reduct_proc3, result_image=1)

And for OpenMP also the following (virtual call on device):

  class(*), intent(in) :: dummy_class
  !$omp target map(to:dummy_class)
  select type (dummy_class)
  class is (my_cmplx_class)
    call dummy_class%type_bound_proc(5)  ! TBP / virtual call
  end select
  !$omp end target


(B) DESCRIPTION OF THE PROBLEM

Coarrays: While there are some restrictions regarding the use of
coarrays, data has to be accessed on the remote image, especially with
user-defined reductions, while only limited information about the
remote image is available on this_image().

OpenMP: While OpenMP 4.5 mostly avoided all pitfalls, 5.0 permitted a
lot more and 5.1 removed additional restrictions. For unified shared
memory or when not using 'target' constructs, there is no issue beyond
the normal Fortran issues (e.g. data-sharing firstprivate with
polymorphic variables). However, when the memory is not shared, it
becomes harder.

In any case, the information is distributed over several places:

* Run-time library
  libgomp: knows how to transfer the data between the host and the
           device and how to update pointers.
  libcaf:  knows how to access remote memory. I think pointer mapping
           (like remote vs. local vtable) is not required, but it looks
           as if the vtab->hash value has to be obtainable for
           same_type_as(var[i], var[j]).

* Type and associated vtable:
  At the location of the type declaration and vtable generation, all
  details about the type are known (except for array bounds and the
  depth of the recursive types, which are both only known at run time).

* Code location which calls into the library (OpenMP construct,
  CO_REDUCE call etc.):
  Here, both the need for the data transfer and the declared type are
  known, including which parts have to be handled in loop form (for
  A%B(:)%recursive, the compiler can generate an outer loop over A%B(i)
  and then an inner loop over
  A%B(i)%recursive%recursive%...%recursive).
  For the used data ref itself, the compiler can also add code to
  handle the dynamic type of the last partref, that is, both mapping
  the vtable and obtaining vtab->size or similar.
  But if the last partref is of polymorphic type, neither its
  allocatable nonpolymorphic nor its allocatable polymorphic components
  are known at the code location.


(B) CURRENT LIB SUPPORT

(1) For OpenMP

The current code generation does not permit run-time dependent mapping,
as everything is folded into a single libgomp mapping call:
map(A) may become something like:

  map(to:a.p [len: 64]) map(to:*a.p.data [len: D.3953 * 4])
    map(always_pointer: a.p.data [pointer assign, bias: 0])
  ...

which then calls

  __builtin_GOMP_target_enter_exit_data (-1, 1, &.omp_data_arr.4,
      &.omp_data_sizes.5, &.omp_data_kinds.6, 0, 0B);

Taking the example of the recursive type (valid since OpenMP 5.1), we
would need something like:

  __builtin_GOMP_target_enter_exit_data_begin
  → map 'A'
  prev = &A;
  for (ptr = A.rt; ptr != NULL; ptr = ptr->rt, prev = prev->rt)
    {
      map (ptr)
      map (alwaysptr: ptr → prev%rt)
    }
  __builtin_GOMP_target_enter_exit_data_end

(2) Likewise for coarrays, where the library call also only gets as
arguments:

  _gfortran_caf_co_reduce (gfc_descriptor_t *a, ..., int a_len)

which also does not help with allocatable components.

While for simple cases a loop would do (cf. above), in the general case
it would not.

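To make the begin/loop/end idea a bit more concrete, here is a small
self-contained C sketch. Everything in it is an assumption for
illustration only: map_begin, map_to, attach_ptr and map_end are
hypothetical names and not libgomp entry points, and the "device" is
simply mocked by heap copies on the host. It only shows the shape of
the calls for the recursive type: one transfer per node, followed by
one pointer attach into the previously mapped node.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Mirrors 'type recursive_t': a payload plus one allocatable component
   of the same type (named 'rt' here, as in the pseudocode).  */
struct recursive_t
{
  int value;
  struct recursive_t *rt;
};

/* HYPOTHETICAL staged-mapping session; not part of libgomp.  */
struct map_session
{
  int n_mapped;    /* number of transferred chunks */
  int n_attached;  /* number of pointer attachments */
};

static void
map_begin (struct map_session *s)
{
  s->n_mapped = 0;
  s->n_attached = 0;
}

/* "Transfer" one contiguous chunk; the mock device copy is a heap copy.  */
static struct recursive_t *
map_to (struct map_session *s, const struct recursive_t *host)
{
  struct recursive_t *dev = malloc (sizeof *dev);
  if (dev == NULL)
    abort ();
  memcpy (dev, host, sizeof *dev);
  s->n_mapped++;
  return dev;
}

/* Let the pointer slot inside a device copy refer to another device copy.  */
static void
attach_ptr (struct map_session *s, struct recursive_t **dev_slot,
            struct recursive_t *dev_target)
{
  *dev_slot = dev_target;
  s->n_attached++;
}

static void
map_end (struct map_session *s)
{
  (void) s;  /* A real runtime would finalize the mapping here.  */
}

/* Deep-map a chain whose depth is only known at run time.  */
static struct recursive_t *
map_chain (struct map_session *s, const struct recursive_t *a)
{
  assert (a != NULL);
  map_begin (s);
  struct recursive_t *dev_head = map_to (s, a);
  struct recursive_t *dev_prev = dev_head;
  for (const struct recursive_t *ptr = a->rt; ptr != NULL; ptr = ptr->rt)
    {
      struct recursive_t *dev_node = map_to (s, ptr);
      attach_ptr (s, &dev_prev->rt, dev_node);
      dev_prev = dev_node;
    }
  map_end (s);
  return dev_head;
}
```

Calling map_chain on a host chain of three nodes performs three
transfers and two attaches. Note that this only works because the loop
body is fully known from the declared type; as soon as a component is
polymorphic, the call site no longer knows what to put into the loop.
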
Hence:

(C) PROPOSED SOLUTION

I think we need callbacks: If doing on the user side

  call co_reduce(A, ...)

we would call:

  caf_co_reduce (A, ..., A->_vptr->callbackftn)

and in caf_co_reduce, we call

  callbackftn (A, mytoken, transfer_fn, attach_fn)

and in the compiler-generated part of the user code (vtable), i.e. in
A->_vptr->callbackftn:

  for (rt = A->rt; rt->next != NULL; rt = rt->next)
    {
      transfer_fn (mytoken, rt->next, size, NULL, NULL);
      attach_fn (mytoken, rt->next, &rt->next);
    }

Or another example case:

  // B%a%{_vptr, _data} already handled as it is inside 'B';
  // size() = array size
  transfer_fn (mytoken, B%a%_data%data, B%a%_vptr->size * size(B%a),
               B%a%_data, B%a%_vptr->callbackftn);
  attach_fn (mytoken, &B%a%_data%data, B%a%_data%data);
  attach_fn (mytoken, &B%a%_vptr, B%a%_vptr);

Or some implementation like that.

BACKWARD COMPATIBILITY ISSUE: For this to work, we need a new entry in
the VTABLE, which is a breaking change. (Unless we want to restrict it
to -fcoarray=lib / -fopenmp, but that is bound to break when mixing
code compiled with different flags.)

SIDE REMARK: When breaking code, we could also do the following change:

* Instead of always generating the VTABLE when a derived type is
  declared, we should only generate it (as weak symbol) when actually
  using it in a polymorphic context, e.g.
    class(t) or class(*) ... :: c
    allocate(t :: c)
    c = t()
  or when needed for nonpolymorphic types for the purposes discussed
  in this text.
  If a vtable has been generated, this can be noted in the MODULE file
  (and in gsym for the translation unit) to avoid streaming out the
  vtable multiple times.
* Bumping the module version is probably sufficient to mark the
  incompatibility.

SIDE REMARK 2: If we break backward compatibility, we could consider
doing some other cleanup. Some random thoughts:

* I think we mishandle the decl with dimension(..), especially when
  there is a coarray token at the end.
* Removing functions from libgfortran that are by now unused?
* Some fixes for the array descriptor/C-class array descriptor/its
  convert aux functions (e.g. moving to the FE [can be done w/o
  breaking] + removing it from libgfortran)
* Other fixes (which?)


(D) WHAT NEEDS TO BE SUPPORTED

* Association of the vptr to the virtual table for polymorphic
  components
* Association of 'pointers' (C/internal impl sense), i.e.
  allocated/copied actual data → 'data' comp of the array descriptor
* Recursive allocatable types
* Array-valued components (by construction: contiguous), which again
  have components, which might be allocatable (and polymorphic)

RFC:

(a) Future extension? 'maybe_attach' for data/procedure pointer
    components to update the pointer address, if available?
    (Note: the pointer might be 'undefined' and point to 0xDEADBEEF
    besides NULL or a proper value.)
    If we think it might become useful, we should reserve space for a
    flag or a function pointer.
(b) Other flags? For instance, to walk backwards when freeing memory?
    Probably not needed (the library, which does the tracking, could
    do some reverse handling itself)?
(c) The current way of the implementation (example above or
    description below) assumes that first all data is transferred and
    then the pointers are attached; alternatively, this could be
    combined in the transfer call (extra argument), albeit the vptr
    would probably remain a different call, which could come before or
    after the transfer.
(d) Should vtabs be tagged in a special way in the call to the library?


(F) IMPLEMENTATION

(See also the RFC above.)

Compiler side:

* New copy-callback function in the vtable of type 't'
  Arguments:
  - 'void* token' to be used by the caller
  - 'type(t) this' pointer
  - Function to be called for the data transfer
  - Function to be called for pointer assignment
    (vtab or just transferred allocatable or also data pointer/proc
    pointer?)
* Will transfer the data for all allocatable components, passing
  either NULL/NULL or a new callback function if the component is a
  CLASS or a derived type with allocatable components
* Recursive types could either be handled as in the previous item
  (calling the callback recursively) or in a loop (as in the example
  above).

Library side:

* Provide a function to handle data transfers:
  - token (provided by the library, e.g. storing whether a specific
    allocator should be used to distinguish 'map(to:' from
    'firstprivate')
  - pointer to the data to be handled
  - its size (contiguous memory chunk)
  If required by the data type:
  - another callback function
  - the data argument for that one
  If not required: the last two arguments are NULL

                             * * *

Thoughts? Remarks?

Tobias

-----------------
Mentor Graphics (Deutschland) GmbH, Arnulfstrasse 201, 80634 München
Registergericht München HRB 106955, Geschäftsführer: Thomas Heurung,
Frank Thürauf