From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 2054 invoked by alias); 23 Aug 2013 16:16:39 -0000 Mailing-List: contact gcc-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-owner@gcc.gnu.org Received: (qmail 2033 invoked by uid 89); 23 Aug 2013 16:16:37 -0000 X-Spam-SWARE-Status: No, score=-7.6 required=5.0 tests=AWL,BAYES_00,RCVD_IN_HOSTKARMA_W,RCVD_IN_HOSTKARMA_WL,RP_MATCHES_RCVD,SPF_HELO_PASS,SPF_PASS autolearn=ham version=3.3.2 Received: from mx1.redhat.com (HELO mx1.redhat.com) (209.132.183.28) by sourceware.org (qpsmtpd/0.84/v0.84-167-ge50287c) with ESMTP; Fri, 23 Aug 2013 16:16:36 +0000 Received: from int-mx02.intmail.prod.int.phx2.redhat.com (int-mx02.intmail.prod.int.phx2.redhat.com [10.5.11.12]) by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id r7NGGYQU006195 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Fri, 23 Aug 2013 12:16:34 -0400 Received: from zalov.cz (vpn1-4-247.ams2.redhat.com [10.36.4.247]) by int-mx02.intmail.prod.int.phx2.redhat.com (8.13.8/8.13.8) with ESMTP id r7NGGWha000975 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Fri, 23 Aug 2013 12:16:34 -0400 Received: from zalov.cz (localhost [127.0.0.1]) by zalov.cz (8.14.5/8.14.5) with ESMTP id r7NGGWGL006437; Fri, 23 Aug 2013 18:16:32 +0200 Received: (from jakub@localhost) by zalov.cz (8.14.5/8.14.5/Submit) id r7NGGVEj006436; Fri, 23 Aug 2013 18:16:31 +0200 Date: Sun, 25 Aug 2013 16:24:00 -0000 From: Jakub Jelinek To: "Michael V. Zolotukhin" Cc: Kirill Yukhin , Richard Henderson , gcc@gcc.gnu.org, triegel@redhat.com Subject: Re: [RFC] Offloading Support in libgomp Message-ID: <20130823161631.GO1814@tucnak.redhat.com> Reply-To: Jakub Jelinek References: <20130822140810.GA27868@msticlxl57.ims.intel.com> <20130822142814.GB1814@tucnak.redhat.com> <20130823092810.GA36483@msticlxl57.ims.intel.com> <20130823095250.GJ1814@tucnak.redhat.com> <20130823153052.GA2974@msticlxl57.ims.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130823153052.GA2974@msticlxl57.ims.intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) X-SW-Source: 2013-08/txt/msg00287.txt.bz2 On Fri, Aug 23, 2013 at 07:30:52PM +0400, Michael V. Zolotukhin wrote: > That makes sense. We could maintain a vector of descriptors for each > encountered MAP clause and push to and pop from it when needed (when > e.g. new mapping is encountered inside 'pragma omp target data'). The > desciptor should contain address in the host memory, size of the mapped > block, type of mapping, related device, and handler, which would be > returned for this mapping by runtime. Having vector of such > descriptors, we could pass it as an argument for outlined functions - in > them we need to extract needed addresses from the vector before > executing the body. Did I get it right? No need for the device and handler IMHO, each vector would correspond to one function call (GOMP_target, GOMP_target_data or GOMP_target_update) and all those calls would be called with device id. > Also, a bit unclear point here is how should we generate these > extractions in target-version of the outlined function - seemingly we > won't pass this entire vector to it, so it's unclear out of what should > we extract the data. What do you think on this? Let's talk about some concrete example (though, I see the gimplifier doesn't handle it right and with #if 0 changed into #if 1 we ICE in the C FE, ++todo). void baz (float *, float *, int); #pragma omp declare target int v = 6; int tgt () { #pragma omp atomic update v++; return 0; } #pragma omp end declare target float bar (int x, int y, int z) { float b[1024], c[1024], s = 0; int i, j; baz (b, c, x); #pragma omp target data map(to: b) { #pragma omp target map(tofrom: c) #if 0 #pragma omp teams num_teams(y) thread_limit(z) reduction(+:s) #pragma omp distribute dist_schedule(static, 4) collapse(1) for (j=0; j < x; j += y) #else j = 0; #endif #pragma omp parallel for reduction(+:s) for (i = j; i < j + y; i++) tgt (), s += b[i] * c[i]; #pragma omp target update from(b, v) } return s; } float foo (int x) { float b[1024], c[1024], s = 0; int i; baz (b, c, x); #pragma omp target map(to: b, c) #pragma omp parallel for reduction(+:s) for (i = 0; i < x; i++) tgt (), s += b[i] * c[i]; return s; } This ICEs during ompexp right now otherwise and obviously even omplower doesn't DTRT. So we have something like: #pragma omp target data map(to:b) #pragma omp target map(tofrom:j) j = 0; #pragma omp parallel reduction(+:s) shared(j) shared(c) shared(b) shared(y) [child fn: _Z3bariii._omp_fn.0 (???)] #pragma omp for nowait private(i) for (i = j; i < D.2235; i = i + 1) { tgt (); D.2236 = b[i]; D.2237 = c[i]; D.2238 = D.2236 * D.2237; s = D.2238 + s; } #pragma omp target update from(v) from(b) On #pragma omp target it clearly is missing many other map clauses, like map(tofrom:s), map(tofrom:c), map(tofrom:y) at least, will need to debug later on why they disappeared or weren't added. In any case, the only thing GOMP_target_data can do is take the vector of the map clauses { mapkind, hostaddr, length } and look them up one by one in the mapping of the device and if not present there, allocate and/or copy and remember. Now, for GOMP_target we want omplower to replace the var references like b or c with something like .omp_target_data->b, .omp_target_data->c etc., where the structure will contain the target addresses of the variables. So, GOMP_target would again receive vector of the { mapkind, hostaddr, length }, do the lookups, allocations / copying like for GOMP_target_data, but also prepare a vector of the corresponding target addresses that it would pass to the target function. Automatic variables defined in the scope of #pragma omp target body don't need any special treatment (but I hope gimplifier doesn't do anything for them), they will be just automatic variables inside of the target outlined body. Other automatic variables in the function containing #pragma omp target could have some optimization for them, if there aren't any #pragma omp target data directives referencing them around the #pragma omp target that references them, such variables are guaranteed not to be mapped in the target device upon GOMP_target call, thus such vars could be e.g. allocated in a flexible array at the end of the .omp_target_data structure. Also for non-addressable variables supposedly we could consider promoting them into a temporary variable (at the start of GOMP_target body load them from .omp_target_data->something, at the end store them back (well, depending on map kind)). But let's start with non-optimized code, everything is passed as target address of the allocated spot. Also, GOMP_target{_data,} could just lookup addresses from the whole vector and remember what succeeded and what failed (i.e. what has been already mapped and thus noop and what needs mapping and depending on mapkind copying) and sum up the amount of memory that needs allocation for the latter ones, then just allocate in the device everything at once and just partition it for the individual vars. > > I meant just a single plugin that would handle all of them, or as richi > > said, perhaps teach LTO plugin to do that. > > For options, my vision was something like: > > -ftarget=mic -ftarget=hsail='-mfoobaz=4 -mbazbaz' > > which would mean: > > 1) compile LTO IL from the accelerator section for mic with > > the originally recorded gcc command line options with the Target options > > removed and no extra options added > > 2) compile LTO IL also for hsail target, with originally recorded gcc > > command line options but Target options and -mfoobaz=4 -mbazbaz > > options added > > 3) don't compile for ptx > > The thing is if you originally compile with > > -O3 -ftree-vectorize -march=corei7-avx -minline-all-stringops > > the -m* options might not apply to the target compiler at all. > > So you'd construct the command line from the original command line sans > > CL_TARGET options, append to that the link time override for the > > accelerator. Then another thing is how to find out the corresponding > > compiler (including its driver) for the target from the plugin. > Could we set some correspondance between '-ftarget' option value and > corresponding compiler? E.g. for '-ftarget=xyz' we would look for > xyz-cc1. I haven't looked in details at how the compiler plugins work, > so maybe I said something unfeasable:) As specs are target specific, I'm afraid you'll need to be looking for the gcc driver for the target, not lto1 binary. > > libgomp would start by trying to dlopen all available plugins, > > and for each of them call some routine in them that would query the hw > > for available devices, then libgomp would assign device ids to them (0 and > > up) and then for target specific parts just dispatch again to the plugin > > corresponding to the chosen device. > In libgomp we have a similar problem - there we would need to find out > plugins names from somewhere. The difference is that libgomp would > always iterate through all plugins independently on compiler options, > but even with this I currently have no idea of how to populate list of > plugins names (I suppose, this should be done somewhere at > configure/make step of libgomp building process?). Configure could record the names, or you could scan a directory with the plugins and dlopen all shared libraries in there, ... Jakub