From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 7340 invoked by alias); 26 Aug 2013 11:59:43 -0000 Mailing-List: contact gcc-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-owner@gcc.gnu.org Received: (qmail 7328 invoked by uid 89); 26 Aug 2013 11:59:43 -0000 Received: from mail-pa0-f46.google.com (HELO mail-pa0-f46.google.com) (209.85.220.46) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with (AES128-SHA encrypted) ESMTPS; Mon, 26 Aug 2013 11:59:43 +0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-2.7 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00,FREEMAIL_FROM autolearn=ham version=3.3.2 X-HELO: mail-pa0-f46.google.com Received: by mail-pa0-f46.google.com with SMTP id fa1so3339273pad.19 for ; Mon, 26 Aug 2013 04:59:40 -0700 (PDT) X-Received: by 10.68.113.99 with SMTP id ix3mr2386636pbb.180.1377518380481; Mon, 26 Aug 2013 04:59:40 -0700 (PDT) Received: from msticlxl57.ims.intel.com ([192.55.54.40]) by mx.google.com with ESMTPSA id om2sm17601647pbc.30.1969.12.31.16.00.00 (version=TLSv1 cipher=RC4-SHA bits=128/128); Mon, 26 Aug 2013 04:59:39 -0700 (PDT) Date: Tue, 27 Aug 2013 00:36:00 -0000 From: "Michael V. Zolotukhin" To: Jakub Jelinek Cc: Kirill Yukhin , Richard Henderson , gcc@gcc.gnu.org, triegel@redhat.com Subject: Re: [RFC] Offloading Support in libgomp Message-ID: <20130826115911.GA40923@msticlxl57.ims.intel.com> References: <20130822140810.GA27868@msticlxl57.ims.intel.com> <20130822142814.GB1814@tucnak.redhat.com> <20130823092810.GA36483@msticlxl57.ims.intel.com> <20130823095250.GJ1814@tucnak.redhat.com> <20130823153052.GA2974@msticlxl57.ims.intel.com> <20130823161631.GO1814@tucnak.redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130823161631.GO1814@tucnak.redhat.com> User-Agent: Mutt/1.5.21 (2010-09-15) X-SW-Source: 2013-08/txt/msg00302.txt.bz2 > No need for the device and handler IMHO, each vector would correspond to > one function call (GOMP_target, GOMP_target_data or GOMP_target_update) > and all those calls would be called with device id. Probably yes. > Let's talk about some concrete example (though, I see the gimplifier > doesn't handle it right and with #if 0 changed into #if 1 we ICE in the C > FE, ++todo). That's a great idea, I wanted to go into an example too, but I'd take a simpler case for the beginning. Here is a modified version of your test - I just removed all pragmas, that don't deal with offloading. They aren't new in OpenMP4, and GCC supports them well, so we could concentrate on the others - like 'pragma target' and 'pragma target data'. So, here is the original code: #pragma omp declare target int v = 6; int tgt () { #pragma omp atomic update v++; return 0; } #pragma omp end declare target float bar (int x, int y, int z) { float b[1024], c[1024], s = 0; int i, j; baz (b, c, x); #pragma omp target data map(to: b) { #pragma omp target map(tofrom: c) map(from:s) for (i = 0; i < 1024; i++) tgt (), s += b[i] * c[i]; #pragma omp target update from(b, v) } return s; } Let's write what we want this to be expanded to. For now let's ignore obvious problems leading to ICEs that you mentioned - they are certainly need to be addressed, but I don't think they affect the overall design which we're discussing here. As I currently see it, the given code would be expanded to something like this: // Create two versions of V: for host and for target int v; int v_target __attribute(target); // The same for TGT function int tgt () { .. update v .. } int tgt_target () __attribute(target) { .. update v_target .. } float bar (int x, int y, int z) { float b[1024], c[1024], s = 0; int i, j; baz (b, c, x); // #pragma omp target data map(to: b) vec data_desc; data_desc.push ({&b, 1024*sizeof(float), TO}); GOMP_target_data (&data_desc); { // #pragma omp target map(tofrom: c) map(from:s) data_desc.push ({&c, 1024*sizeof(float), TOFROM}); data_desc.push ({&s, sizeof(float), FROM}); GOMP_target_data (&data_desc); // Add mapping for S and C variables, // mapping for B shouldn't change GOMP_target (foo1, "foo1", &data_desc); // Call either FOO1 or offloaded // FOO1_TARGET with arguments // from vector DATA_DESC // #pragma omp target update from(b, v) vec data_desc_update; // target update pragma require a // separate vector data_desc_update.push ({&b, 1024*sizeof(float), FROM}); data_desc_update.push ({&v, sizeof(int), FROM}); GOMP_target_data (&data_desc_update); } return s; } void foo1 (vec data_desc) { float b = *data_desc[0].host_address; float c = *data_desc[1].host_address; float s = 0; int i; for (i = 0; i < 1024; i++) tgt (), s += b[i] * c[i]; *data_desc[2].host_address = s; } void foo1_target (int n, void **arguments) __attribute(target) { float b = *arguments[0]; float c = *arguments[1]; float s = 0; int i; for (i = 0; i < 1024; i++) tgt_target (), s += b[i] * c[i]; *arguments[2] = s; } That's how I think the code should look like after omp-expanding. I.e. all variables and functions marked with 'target declare' are cloned so that we have host and target versions available (if we have N different targets, then we need N+1 versions). All regions, corresponding to 'pragma omp target' are outlined and, also, cloned to N+1 versions. 'pragma target data' are replaced with generating of vector of data-descriptors and call to GOMP_target_data which performs actual data mapping and marshalling. We also could call GOMP_target_data to invoke additional data-mapping/marshalling - e.g. we need this in 'pragma target' where we add clause MAP(TOFROM:C). 'pragma omp target update' needs a separate vector, as its mapping could differ from mapping of the embracing 'pragma target data'. But all that could be quite meaningless unless internals of GOMP_target{_data} are discussed. So, let's proceed to libgomp part and discuss what these functions should do. From my POV, thesy would perform the following: 1) GOMP_target_data: for each element of the input vector check whether it's already had a mapping and if not, create a new one. Also, necessary marshalling is triggered from here (but probably, it's better to move data transferring to a separate routine). Result of this function work would be a consistent data structure containing all mapped memory entries as well as their handlers, representing target-side addresses. 2) GOMP_target: First of all, this function would call gomp_choose_device_for_offload that would check all available targets and choose a one for offloading. Host could also be chosen here. If host is chosen, we just call host-version of the routine (the function address is passed via the first argument) and pass data_descriptor vector to it. If target-device is chosen we do the following: Create vector of of handlers corrseponding to data descriptors from the input vector. Pass the routine name as well as the vector of handlers to function gomp_run_offloaded_function from the target plugin. That routine perform the actual offloading, waits for the end and returns. Does this overall scheme sounds ok to you? > Now, for GOMP_target we want omplower to replace the var references > like b or c with something like .omp_target_data->b, .omp_target_data->c > etc., where the structure will contain the target addresses of the > variables. So, GOMP_target would again receive vector of the > { mapkind, hostaddr, length }, do the lookups, allocations / copying > like for GOMP_target_data, but also prepare a vector of the corresponding > target addresses that it would pass to the target function. Agreed. I tried to describe and rephrase that a bit above to make sure we both mean the same here. > Automatic variables defined in the scope of #pragma omp target body > don't need any special treatment (but I hope gimplifier doesn't do anything > for them), they will be just automatic variables inside of the target > outlined body. I hope that too. > But let's start with non-optimized code, > everything is passed as target address of the allocated spot. Agreed. > As specs are target specific, I'm afraid you'll need to be looking for > the gcc driver for the target, not lto1 binary. I think I didn't get it. Could you explain this point? What are the specs here? > Configure could record the names, or you could scan a directory with the > plugins and dlopen all shared libraries in there, ... I'd prefer recording at configure step - to me it looks more robust, in general I'm ok with both options. --- Thanks, Michael On 23 Aug 18:16, Jakub Jelinek wrote: > On Fri, Aug 23, 2013 at 07:30:52PM +0400, Michael V. Zolotukhin wrote: > > That makes sense. We could maintain a vector of descriptors for each > > encountered MAP clause and push to and pop from it when needed (when > > e.g. new mapping is encountered inside 'pragma omp target data'). The > > desciptor should contain address in the host memory, size of the mapped > > block, type of mapping, related device, and handler, which would be > > returned for this mapping by runtime. Having vector of such > > descriptors, we could pass it as an argument for outlined functions - in > > them we need to extract needed addresses from the vector before > > executing the body. Did I get it right? > > No need for the device and handler IMHO, each vector would correspond to > one function call (GOMP_target, GOMP_target_data or GOMP_target_update) > and all those calls would be called with device id. > > > Also, a bit unclear point here is how should we generate these > > extractions in target-version of the outlined function - seemingly we > > won't pass this entire vector to it, so it's unclear out of what should > > we extract the data. What do you think on this? > > Let's talk about some concrete example (though, I see the gimplifier > doesn't handle it right and with #if 0 changed into #if 1 we ICE in the C > FE, ++todo). > > void baz (float *, float *, int); > > #pragma omp declare target > int v = 6; > int tgt () > { > #pragma omp atomic update > v++; > return 0; > } > #pragma omp end declare target > > float > bar (int x, int y, int z) > { > float b[1024], c[1024], s = 0; > int i, j; > baz (b, c, x); > #pragma omp target data map(to: b) > { > #pragma omp target map(tofrom: c) > #if 0 > #pragma omp teams num_teams(y) thread_limit(z) reduction(+:s) > #pragma omp distribute dist_schedule(static, 4) collapse(1) > for (j=0; j < x; j += y) > #else > j = 0; > #endif > #pragma omp parallel for reduction(+:s) > for (i = j; i < j + y; i++) > tgt (), s += b[i] * c[i]; > #pragma omp target update from(b, v) > } > return s; > } > > float > foo (int x) > { > float b[1024], c[1024], s = 0; > int i; > baz (b, c, x); > #pragma omp target map(to: b, c) > #pragma omp parallel for reduction(+:s) > for (i = 0; i < x; i++) > tgt (), s += b[i] * c[i]; > return s; > } > > This ICEs during ompexp right now otherwise and obviously even omplower > doesn't DTRT. > > So we have something like: > > #pragma omp target data map(to:b) > #pragma omp target map(tofrom:j) > j = 0; > #pragma omp parallel reduction(+:s) shared(j) shared(c) shared(b) shared(y) [child fn: _Z3bariii._omp_fn.0 (???)] > #pragma omp for nowait private(i) > for (i = j; i < D.2235; i = i + 1) > { > tgt (); > D.2236 = b[i]; > D.2237 = c[i]; > D.2238 = D.2236 * D.2237; > s = D.2238 + s; > } > #pragma omp target update from(v) from(b) > > On #pragma omp target it clearly is missing many other map clauses, > like map(tofrom:s), map(tofrom:c), map(tofrom:y) at least, will need to > debug later on why they disappeared or weren't added. > > In any case, the only thing GOMP_target_data can do is take the vector > of the map clauses { mapkind, hostaddr, length } and look them up > one by one in the mapping of the device and if not present there, allocate > and/or copy and remember. > > Now, for GOMP_target we want omplower to replace the var references > like b or c with something like .omp_target_data->b, .omp_target_data->c > etc., where the structure will contain the target addresses of the > variables. So, GOMP_target would again receive vector of the > { mapkind, hostaddr, length }, do the lookups, allocations / copying > like for GOMP_target_data, but also prepare a vector of the corresponding > target addresses that it would pass to the target function. > > Automatic variables defined in the scope of #pragma omp target body > don't need any special treatment (but I hope gimplifier doesn't do anything > for them), they will be just automatic variables inside of the target > outlined body. Other automatic variables in the function containing #pragma omp > target could have some optimization for them, if there aren't any #pragma > omp target data directives referencing them around the #pragma omp target > that references them, such variables are guaranteed not to be mapped > in the target device upon GOMP_target call, thus such vars could be e.g. > allocated in a flexible array at the end of the .omp_target_data > structure. Also for non-addressable variables supposedly we could consider > promoting them into a temporary variable (at the start of GOMP_target > body load them from .omp_target_data->something, at the end store them back > (well, depending on map kind)). But let's start with non-optimized code, > everything is passed as target address of the allocated spot. > > Also, GOMP_target{_data,} could just lookup addresses from the whole vector > and remember what succeeded and what failed (i.e. what has been already > mapped and thus noop and what needs mapping and depending on mapkind > copying) and sum up the amount of memory that needs allocation for the > latter ones, then just allocate in the device everything at once and just > partition it for the individual vars. > > > > I meant just a single plugin that would handle all of them, or as richi > > > said, perhaps teach LTO plugin to do that. > > > For options, my vision was something like: > > > -ftarget=mic -ftarget=hsail='-mfoobaz=4 -mbazbaz' > > > which would mean: > > > 1) compile LTO IL from the accelerator section for mic with > > > the originally recorded gcc command line options with the Target options > > > removed and no extra options added > > > 2) compile LTO IL also for hsail target, with originally recorded gcc > > > command line options but Target options and -mfoobaz=4 -mbazbaz > > > options added > > > 3) don't compile for ptx > > > The thing is if you originally compile with > > > -O3 -ftree-vectorize -march=corei7-avx -minline-all-stringops > > > the -m* options might not apply to the target compiler at all. > > > So you'd construct the command line from the original command line sans > > > CL_TARGET options, append to that the link time override for the > > > accelerator. Then another thing is how to find out the corresponding > > > compiler (including its driver) for the target from the plugin. > > Could we set some correspondance between '-ftarget' option value and > > corresponding compiler? E.g. for '-ftarget=xyz' we would look for > > xyz-cc1. I haven't looked in details at how the compiler plugins work, > > so maybe I said something unfeasable:) > > As specs are target specific, I'm afraid you'll need to be looking for > the gcc driver for the target, not lto1 binary. > > > > libgomp would start by trying to dlopen all available plugins, > > > and for each of them call some routine in them that would query the hw > > > for available devices, then libgomp would assign device ids to them (0 and > > > up) and then for target specific parts just dispatch again to the plugin > > > corresponding to the chosen device. > > In libgomp we have a similar problem - there we would need to find out > > plugins names from somewhere. The difference is that libgomp would > > always iterate through all plugins independently on compiler options, > > but even with this I currently have no idea of how to populate list of > > plugins names (I suppose, this should be done somewhere at > > configure/make step of libgomp building process?). > > Configure could record the names, or you could scan a directory with the > plugins and dlopen all shared libraries in there, ... > > Jakub