From: Richard Biener
To: "Michael V. Zolotukhin"
Cc: Jakub Jelinek, Kirill Yukhin, Richard Henderson, GCC Development, Torvald Riegel
Date: Wed, 28 Aug 2013 12:56:00 -0000
Subject: Re: [RFC] Offloading Support in libgomp

On Mon, Aug 26, 2013 at 1:59 PM, Michael V. Zolotukhin wrote:
>> No need for the device and handler IMHO, each vector would correspond to
>> one function call (GOMP_target, GOMP_target_data or GOMP_target_update)
>> and all those calls would be called with a device id.
> Probably yes.
>
>> Let's talk about some concrete example (though, I see the gimplifier
>> doesn't handle it right and with #if 0 changed into #if 1 we ICE in the
>> C FE, ++todo).
> That's a great idea, I wanted to go into an example too, but I'd take a
> simpler case to begin with.

(jumping into the middle of the discussion where there is a small example)

> Here is a modified version of your test - I just removed all pragmas that
> don't deal with offloading.  They aren't new in OpenMP 4, and GCC
> supports them well, so we could concentrate on the others - like 'pragma
> target' and 'pragma target data'.
>
> So, here is the original code:
>
> #pragma omp declare target
> int v = 6;
> int tgt ()
> {
>   #pragma omp atomic update
>   v++;
>   return 0;
> }
> #pragma omp end declare target
>
> float
> bar (int x, int y, int z)
> {
>   float b[1024], c[1024], s = 0;
>   int i, j;
>   baz (b, c, x);
>   #pragma omp target data map(to: b)
>   {
>     #pragma omp target map(tofrom: c) map(from: s)
>     for (i = 0; i < 1024; i++)
>       tgt (), s += b[i] * c[i];
>     #pragma omp target update from(b, v)
>   }
>   return s;
> }

You could even simplify this more by removing tgt and its use?  That is,
s += b[i] * c[i] would still be executed on the accelerator?

What's this omp target map stuff?
Just guessing from the names, shouldn't it be

> {
>   #pragma omp target map(from: c) map(from: b) map(tofrom: s)
>   for (i = 0; i < 1024; i++)
>     s += b[i] * c[i];
>   #pragma omp target update from(b, v)
> }

that is, b and c are read, s is read and written.  Not sure what the last
pragma should even do ... (sync and wait, so the following code could read
from b and v?)

Coming from the HSA side, I'd like to see that we can easily
auto-accelerate (as we auto-vectorize) regular OMP code like

  #pragma omp parallel for
  for (i = 0; i < 1024; i++)
    s += b[i] * c[i];

but as we lower this stuff very early, we'd have to lower it as omp
target?  Or can we make the libgomp interfacing for the workers the same,
so we can easily modify them later?

With HSA we don't have to bother about accelerator memory handling,
because of the unified address space and the (apparent) cache coherency.
So for HSA it would have been enough to have the omp parallel for get a
descriptor instead of a function pointer, where possible accelerator
implementations are queued up for use by the OMP scheduler.  That is,
AFAIK, all the omp target stuff is decoupled from scheduling "regular"
CPU OMP tasks?  And omp target implies a parallel region following, so
it's not just additional hints?

From the accelerator BOF video I gather we agreed on using the GOMP
representation as the unified middle end.  What I didn't get is whether we
agreed on libgomp being the unified single runtime (that eventually
dispatches to accelerator-specific runtimes, opened via dlopen)?

Thanks,
Richard.
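(For reference on the map kinds guessed at above: in OpenMP 4.0, map(to:)
copies host to device on entry to the target region, map(from:) copies
device to host on exit, map(tofrom:) does both, and map(alloc:) only
reserves device storage; 'target update from(b, v)' forces a
device-to-host copy of b and v at that point inside the enclosing data
region.  A minimal stand-alone illustration in C - not code from this
thread:)

  /* b and c are only read on the device; s is read and written.  */
  float
  dot (const float *b, const float *c, int n)
  {
    float s = 0;
    int i;
  #pragma omp target map(to: b[0:n], c[0:n]) map(tofrom: s)
  #pragma omp parallel for reduction(+: s)
    for (i = 0; i < n; i++)
      s += b[i] * c[i];
    return s;
  }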
> Let's write what we want this to be expanded to.  For now let's ignore
> the obvious problems leading to ICEs that you mentioned - they certainly
> need to be addressed, but I don't think they affect the overall design
> which we're discussing here.
>
> As I currently see it, the given code would be expanded to something
> like this:
>
> // Create two versions of V: one for the host and one for the target
> int v;
> int v_target __attribute(target);
>
> // The same for the TGT function
> int tgt ()
> {
>   .. update v ..
> }
> int tgt_target () __attribute(target)
> {
>   .. update v_target ..
> }
>
> float
> bar (int x, int y, int z)
> {
>   float b[1024], c[1024], s = 0;
>   int i, j;
>   baz (b, c, x);
>
>   // #pragma omp target data map(to: b)
>   vec<data_descriptor> data_desc;
>   data_desc.push ({&b, 1024*sizeof(float), TO});
>   GOMP_target_data (&data_desc);
>   {
>     // #pragma omp target map(tofrom: c) map(from: s)
>     data_desc.push ({&c, 1024*sizeof(float), TOFROM});
>     data_desc.push ({&s, sizeof(float), FROM});
>     GOMP_target_data (&data_desc);  // Add mappings for S and C; the
>                                     // mapping for B shouldn't change
>     GOMP_target (foo1, "foo1", &data_desc);  // Call either FOO1 or the
>                                              // offloaded FOO1_TARGET
>                                              // with arguments taken
>                                              // from the vector DATA_DESC
>
>     // #pragma omp target update from(b, v)
>     vec<data_descriptor> data_desc_update;  // the target update pragma
>                                             // requires a separate vector
>     data_desc_update.push ({&b, 1024*sizeof(float), FROM});
>     data_desc_update.push ({&v, sizeof(int), FROM});
>     GOMP_target_data (&data_desc_update);
>   }
>   return s;
> }
>
> void
> foo1 (vec<data_descriptor> data_desc)
> {
>   float *b = data_desc[0].host_address;
>   float *c = data_desc[1].host_address;
>   float s = 0;
>   int i;
>   for (i = 0; i < 1024; i++)
>     tgt (), s += b[i] * c[i];
>   *data_desc[2].host_address = s;
> }
>
> void
> foo1_target (int n, void **arguments) __attribute(target)
> {
>   float *b = arguments[0];
>   float *c = arguments[1];
>   float s = 0;
>   int i;
>   for (i = 0; i < 1024; i++)
>     tgt_target (), s += b[i] * c[i];
>   *arguments[2] = s;
> }
>
> That's how I think the code should look after omp-expanding.  I.e., all
> variables and functions marked with 'declare target' are cloned, so that
> we have host and target versions available (if we have N different
> targets, then we need N+1 versions).  All regions corresponding to
> 'pragma omp target' are outlined and also cloned into N+1 versions.
> 'pragma target data' is replaced with the generation of a vector of data
> descriptors and a call to GOMP_target_data, which performs the actual
> data mapping and marshalling.  We could also call GOMP_target_data to
> invoke additional data mapping/marshalling - e.g. we need this for
> 'pragma target' where we add the clause MAP(TOFROM:C).  'pragma omp
> target update' needs a separate vector, as its mapping could differ from
> the mapping of the embracing 'pragma target data'.
>
> But all that could be quite meaningless unless the internals of
> GOMP_target{_data} are discussed.  So, let's proceed to the libgomp part
> and discuss what these functions should do.  From my POV, they would
> perform the following:
> 1) GOMP_target_data:
>    For each element of the input vector, check whether it already has a
>    mapping, and if not, create a new one.  Also, the necessary
>    marshalling is triggered from here (though it's probably better to
>    move the data transfer to a separate routine).
>    The result of this function's work would be a consistent data
>    structure containing all mapped memory entries as well as their
>    handlers, representing target-side addresses.
> 2) GOMP_target:
>    First of all, this function would call gomp_choose_device_for_offload,
>    which would check all available targets and choose one for
>    offloading.  The host could also be chosen here.
>    If the host is chosen, we just call the host version of the routine
>    (the function address is passed via the first argument) and pass the
>    data-descriptor vector to it.
>    If a target device is chosen, we do the following: create a vector of
>    handlers corresponding to the data descriptors from the input vector,
>    and pass the routine name as well as the vector of handlers to the
>    function gomp_run_offloaded_function from the target plugin.  That
>    routine performs the actual offloading, waits for it to finish, and
>    returns.
>
> Does this overall scheme sound OK to you?
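(To make the GOMP_target_data behaviour described above concrete, here is
a minimal illustrative sketch in C.  The descriptor layout and the
device_* helper routines are assumptions invented for the sketch, not the
actual libgomp interface:)

  #include <stddef.h>

  /* Illustrative descriptor: one entry per map clause.  */
  enum map_kind { MAP_TO, MAP_FROM, MAP_TOFROM, MAP_ALLOC };

  struct map_descr
  {
    void *host_addr;      /* host address of the mapped object */
    size_t length;        /* size of the mapped block in bytes */
    enum map_kind kind;
  };

  /* Hypothetical device-plugin hooks.  */
  extern void *device_lookup (int device, void *host_addr);
  extern void *device_alloc (int device, size_t length);
  extern void device_remember (int device, void *host_addr,
                               void *tgt_addr, size_t length);
  extern void device_copy_to (int device, void *tgt_addr,
                              void *host_addr, size_t length);

  /* For each descriptor, reuse an existing mapping on DEVICE or create
     a new one, copying the data in when the map kind requires it.  */
  void
  GOMP_target_data (int device, struct map_descr *descs, size_t n)
  {
    size_t i;
    for (i = 0; i < n; i++)
      {
        void *tgt = device_lookup (device, descs[i].host_addr);
        if (tgt == NULL)
          {
            tgt = device_alloc (device, descs[i].length);
            device_remember (device, descs[i].host_addr, tgt,
                             descs[i].length);
            if (descs[i].kind == MAP_TO || descs[i].kind == MAP_TOFROM)
              device_copy_to (device, tgt, descs[i].host_addr,
                              descs[i].length);
          }
      }
  }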
>> Now, for GOMP_target we want omplower to replace the var references
>> like b or c with something like .omp_target_data->b, .omp_target_data->c
>> etc., where the structure will contain the target addresses of the
>> variables.  So, GOMP_target would again receive a vector of the
>> { mapkind, hostaddr, length }, do the lookups, allocations / copying
>> like for GOMP_target_data, but also prepare a vector of the
>> corresponding target addresses that it would pass to the target
>> function.
> Agreed.  I tried to describe and rephrase that a bit above to make sure
> we both mean the same here.
>
>> Automatic variables defined in the scope of the #pragma omp target body
>> don't need any special treatment (but I hope the gimplifier doesn't do
>> anything for them), they will just be automatic variables inside of the
>> outlined target body.
> I hope that too.
>
>> But let's start with non-optimized code,
>> everything is passed as the target address of the allocated spot.
> Agreed.
>
>> As specs are target specific, I'm afraid you'll need to be looking for
>> the gcc driver for the target, not the lto1 binary.
> I think I didn't get it.  Could you explain this point?  What are the
> specs here?
>
>> Configure could record the names, or you could scan a directory with
>> the plugins and dlopen all shared libraries in there, ...
> I'd prefer recording at the configure step - to me it looks more robust;
> in general I'm OK with both options.
>
> ---
> Thanks, Michael
>
> On 23 Aug 18:16, Jakub Jelinek wrote:
>> On Fri, Aug 23, 2013 at 07:30:52PM +0400, Michael V. Zolotukhin wrote:
>> > That makes sense.  We could maintain a vector of descriptors for each
>> > encountered MAP clause and push to and pop from it when needed (when
>> > e.g. a new mapping is encountered inside 'pragma omp target data').
>> > The descriptor should contain the address in host memory, the size of
>> > the mapped block, the type of mapping, the related device, and the
>> > handler which would be returned for this mapping by the runtime.
>> > Having a vector of such descriptors, we could pass it as an argument
>> > to the outlined functions - in them we need to extract the needed
>> > addresses from the vector before executing the body.  Did I get it
>> > right?
>>
>> No need for the device and handler IMHO, each vector would correspond to
>> one function call (GOMP_target, GOMP_target_data or GOMP_target_update)
>> and all those calls would be called with a device id.
>>
>> > Also, a bit unclear point here is how we should generate these
>> > extractions in the target version of the outlined function -
>> > seemingly we won't pass this entire vector to it, so it's unclear out
>> > of what we should extract the data.  What do you think on this?
>>
>> Let's talk about some concrete example (though, I see the gimplifier
>> doesn't handle it right and with #if 0 changed into #if 1 we ICE in the
>> C FE, ++todo).
>>
>> void baz (float *, float *, int);
>>
>> #pragma omp declare target
>> int v = 6;
>> int tgt ()
>> {
>>   #pragma omp atomic update
>>   v++;
>>   return 0;
>> }
>> #pragma omp end declare target
>>
>> float
>> bar (int x, int y, int z)
>> {
>>   float b[1024], c[1024], s = 0;
>>   int i, j;
>>   baz (b, c, x);
>>   #pragma omp target data map(to: b)
>>   {
>>     #pragma omp target map(tofrom: c)
>> #if 0
>>     #pragma omp teams num_teams(y) thread_limit(z) reduction(+:s)
>>     #pragma omp distribute dist_schedule(static, 4) collapse(1)
>>     for (j = 0; j < x; j += y)
>> #else
>>     j = 0;
>> #endif
>>     #pragma omp parallel for reduction(+:s)
>>     for (i = j; i < j + y; i++)
>>       tgt (), s += b[i] * c[i];
>>     #pragma omp target update from(b, v)
>>   }
>>   return s;
>> }
>>
>> float
>> foo (int x)
>> {
>>   float b[1024], c[1024], s = 0;
>>   int i;
>>   baz (b, c, x);
>>   #pragma omp target map(to: b, c)
>>   #pragma omp parallel for reduction(+:s)
>>   for (i = 0; i < x; i++)
>>     tgt (), s += b[i] * c[i];
>>   return s;
>> }
>>
>> This ICEs during ompexp right now otherwise and obviously even omplower
>> doesn't DTRT.
>>
>> So we have something like:
>>
>> #pragma omp target data map(to:b)
>> #pragma omp target map(tofrom:j)
>> j = 0;
>> #pragma omp parallel reduction(+:s) shared(j) shared(c) shared(b) shared(y) [child fn: _Z3bariii._omp_fn.0 (???)]
>> #pragma omp for nowait private(i)
>> for (i = j; i < D.2235; i = i + 1)
>>   {
>>     tgt ();
>>     D.2236 = b[i];
>>     D.2237 = c[i];
>>     D.2238 = D.2236 * D.2237;
>>     s = D.2238 + s;
>>   }
>> #pragma omp target update from(v) from(b)
>>
>> On #pragma omp target it clearly is missing many other map clauses,
>> like map(tofrom:s), map(tofrom:c), map(tofrom:y) at least, will need to
>> debug later on why they disappeared or weren't added.
>>
>> In any case, the only thing GOMP_target_data can do is take the vector
>> of the map clauses { mapkind, hostaddr, length } and look them up
>> one by one in the mapping of the device and if not present there,
>> allocate and/or copy and remember.
>>
>> Now, for GOMP_target we want omplower to replace the var references
>> like b or c with something like .omp_target_data->b, .omp_target_data->c
>> etc., where the structure will contain the target addresses of the
>> variables.  So, GOMP_target would again receive a vector of the
>> { mapkind, hostaddr, length }, do the lookups, allocations / copying
>> like for GOMP_target_data, but also prepare a vector of the
>> corresponding target addresses that it would pass to the target
>> function.
>>
>> Automatic variables defined in the scope of the #pragma omp target body
>> don't need any special treatment (but I hope the gimplifier doesn't do
>> anything for them), they will just be automatic variables inside of the
>> outlined target body.  Other automatic variables in the function
>> containing #pragma omp target could have some optimization for them: if
>> there aren't any #pragma omp target data directives referencing them
>> around the #pragma omp target that references them, such variables are
>> guaranteed not to be mapped in the target device upon the GOMP_target
>> call, thus such vars could be e.g. allocated in a flexible array at the
>> end of the .omp_target_data structure.  Also, for non-addressable
>> variables we could supposedly consider promoting them into a temporary
>> variable (at the start of the GOMP_target body load them from
>> .omp_target_data->something, at the end store them back (well,
>> depending on the map kind)).  But let's start with non-optimized code,
>> everything is passed as the target address of the allocated spot.
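(Continuing the illustrative sketch from earlier, a possible GOMP_target
built on the same hypothetical descriptor and device_* helpers; the
host-fallback convention and device_run are likewise invented for the
sketch:)

  /* Hypothetical plugin hook launching the named function on DEVICE.  */
  extern void device_run (int device, const char *fn_name,
                          void **tgt_addrs, size_t n);

  #define HOST_DEVICE_ID -1   /* illustrative "run on the host" marker */

  /* Ensure all mappings exist, then collect the corresponding target
     addresses and launch the outlined function on the chosen device.  */
  void
  GOMP_target (int device, void (*host_fn) (void **), const char *fn_name,
               struct map_descr *descs, size_t n)
  {
    size_t i;
    void *args[n];

    if (device == HOST_DEVICE_ID)
      {
        /* Host fallback: pass the host addresses directly.  */
        for (i = 0; i < n; i++)
          args[i] = descs[i].host_addr;
        host_fn (args);
        return;
      }

    GOMP_target_data (device, descs, n);  /* lookups / allocs / copies */
    for (i = 0; i < n; i++)
      args[i] = device_lookup (device, descs[i].host_addr);
    device_run (device, fn_name, args, n);
  }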
>>
>> Also, GOMP_target{_data,} could just look up addresses for the whole
>> vector and remember what succeeded and what failed (i.e. what has
>> already been mapped and is thus a noop, and what needs mapping and,
>> depending on the mapkind, copying), sum up the amount of memory that
>> needs allocation for the latter ones, then just allocate everything in
>> the device at once and partition it for the individual vars.
>>
>> > > I meant just a single plugin that would handle all of them, or as
>> > > richi said, perhaps teach the LTO plugin to do that.
>> > > For options, my vision was something like:
>> > >   -ftarget=mic -ftarget=hsail='-mfoobaz=4 -mbazbaz'
>> > > which would mean:
>> > > 1) compile the LTO IL from the accelerator section for mic with the
>> > >    originally recorded gcc command-line options, with the Target
>> > >    options removed and no extra options added
>> > > 2) compile the LTO IL also for the hsail target, with the originally
>> > >    recorded gcc command-line options, with the Target options
>> > >    removed and the -mfoobaz=4 -mbazbaz options added
>> > > 3) don't compile for ptx
>> > > The thing is, if you originally compile with
>> > >   -O3 -ftree-vectorize -march=corei7-avx -minline-all-stringops
>> > > the -m* options might not apply to the target compiler at all.
>> > > So you'd construct the command line from the original command line
>> > > sans CL_TARGET options, and append to that the link-time override
>> > > for the accelerator.  Then another thing is how to find the
>> > > corresponding compiler (including its driver) for the target from
>> > > the plugin.
>> > Could we set up some correspondence between the '-ftarget' option
>> > value and the corresponding compiler?  E.g. for '-ftarget=xyz' we
>> > would look for xyz-cc1.  I haven't looked in detail at how the
>> > compiler plugins work, so maybe I said something infeasible :)
>>
>> As specs are target specific, I'm afraid you'll need to be looking for
>> the gcc driver for the target, not the lto1 binary.
>>
>> > > libgomp would start by trying to dlopen all available plugins, and
>> > > for each of them call some routine in them that would query the hw
>> > > for available devices; then libgomp would assign device ids to them
>> > > (0 and up) and then, for target-specific parts, just dispatch again
>> > > to the plugin corresponding to the chosen device.
>> > In libgomp we have a similar problem - there we would need to find
>> > out the plugin names from somewhere.  The difference is that libgomp
>> > would always iterate through all plugins, independently of compiler
>> > options, but even so I currently have no idea how to populate the
>> > list of plugin names (I suppose this should be done somewhere at the
>> > configure/make step of the libgomp build process?).
>>
>> Configure could record the names, or you could scan a directory with
>> the plugins and dlopen all shared libraries in there, ...
>>
>> Jakub
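(Finally, a minimal sketch of the second discovery option Jakub mentions
last - scanning a plugin directory and dlopening every shared object
found there.  The directory layout and the entry point a plugin would
export are assumptions for the illustration:)

  #include <dirent.h>
  #include <dlfcn.h>
  #include <stdio.h>
  #include <string.h>

  /* Scan PLUGIN_DIR, dlopen each shared object, and ask every plugin
     that exports the (hypothetical) query hook how many devices it
     offers; libgomp would then assign device ids 0, 1, ... to them.  */
  static void
  gomp_scan_plugins (const char *plugin_dir)
  {
    DIR *dir = opendir (plugin_dir);
    struct dirent *e;
    if (dir == NULL)
      return;
    while ((e = readdir (dir)) != NULL)
      {
        char path[1024];
        void *h;
        int (*get_num_devices) (void);
        if (strstr (e->d_name, ".so") == NULL)
          continue;
        snprintf (path, sizeof path, "%s/%s", plugin_dir, e->d_name);
        h = dlopen (path, RTLD_LAZY);
        if (h == NULL)
          continue;
        get_num_devices
          = (int (*) (void)) dlsym (h, "gomp_plugin_get_num_devices");
        if (get_num_devices == NULL)
          {
            dlclose (h);
            continue;
          }
        printf ("%s: %d device(s)\n", e->d_name, get_num_devices ());
      }
    closedir (dir);
  }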