public inbox for gcc@gcc.gnu.org
* [RFC] Offloading Support in libgomp
@ 2013-08-22 22:37 Michael V. Zolotukhin
  2013-08-23  0:22 ` Jakub Jelinek
  0 siblings, 1 reply; 56+ messages in thread
From: Michael V. Zolotukhin @ 2013-08-22 22:37 UTC (permalink / raw)
  To: Kirill Yukhin, Richard Henderson, Jakub Jelinek; +Cc: gcc

Hi,
We're working on the design for offloading support in GCC (part of OpenMP4),
and I have a question regarding the libgomp part.

Suppose we expand '#pragma omp target' like we expand '#pragma omp parallel',
i.e. the compiler expands the following code:
  #pragma omp target
  {
    body;
  }
to this:
  void subfunction (void *data)
  {
    use data;
    body;
  }

  setup data;
  function_name = "subfunction";
  GOMP_offload (subfunction, &data, function_name);

GOMP_offload is a call to libgomp, which will be implemented somehow like this:
  void GOMP_offload (void (*fn)(void*), void *data, const char *fname)
  {
    if (gomp_offload_available ())
      {
	handler = gomp_upload_data (data);
	gomp_offload_call (fname, handler);
	gomp_download_data (&data, handler);
      }
    else
      {
	fn (data);
      }
  }

Routines gomp_upload_data, gomp_offload_call and similar could, for example,
use COI functions (see
http://download-software.intel.com/sites/default/files/article/334766/intel-xeon-phi-systemssoftwaredevelopersguide_0.pdf)
to perform the actual data marshalling and to call routines on the target
side.

Does this generic scheme sound OK to you?

We'd probably want to be able to use the same compiler for different
offload targets, so it's important to decide how we would invoke different
implementations of these routines with the same compiler.  One way to do it
is to use dlopen - i.e. we try to load, say, "libtargetiface.so"; if that
fails, we use some default (dummy) implementations, otherwise we use the
versions from the library.  In this approach, along with libgomp.so we'll
need a libtargetiface.so for each target we want to offload to.  Is this
approach viable, or should it be done in some other way?
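
To illustrate, a minimal sketch of the dlopen-based fallback (the library
name follows the example above; everything else is hypothetical):

  #include <dlfcn.h>
  #include <stddef.h>

  /* Default (dummy) implementation used when no target interface
     library can be loaded: everything stays on the host.  */
  static void *
  dummy_upload_data (void *data, size_t size)
  {
    (void) size;
    return data;
  }

  static void *(*gomp_upload_data_fn) (void *, size_t) = dummy_upload_data;

  static void
  gomp_load_target_iface (void)
  {
    void *handle = dlopen ("libtargetiface.so", RTLD_LAZY);
    void *sym = handle ? dlsym (handle, "gomp_upload_data") : NULL;
    if (sym != NULL)
      gomp_upload_data_fn = (void *(*) (void *, size_t)) sym;
    /* Otherwise we silently keep the dummy host implementation.  */
  }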

--
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.


* Re: [RFC] Offloading Support in libgomp
  2013-08-22 22:37 [RFC] Offloading Support in libgomp Michael V. Zolotukhin
@ 2013-08-23  0:22 ` Jakub Jelinek
  2013-08-23 12:16   ` Michael V. Zolotukhin
  0 siblings, 1 reply; 56+ messages in thread
From: Jakub Jelinek @ 2013-08-23  0:22 UTC (permalink / raw)
  To: Michael V. Zolotukhin; +Cc: Kirill Yukhin, Richard Henderson, gcc, triegel

On Thu, Aug 22, 2013 at 06:08:10PM +0400, Michael V. Zolotukhin wrote:
> We're working on the design for offloading support in GCC (part of OpenMP4),
> and I have a question regarding the libgomp part.
> 
> Suppose we expand '#pragma omp target' like we expand '#pragma omp parallel',
> i.e. the compiler expands the following code:
>   #pragma omp target
>   {
>     body;
>   }
> to this:
>   void subfunction (void *data)
>   {
>     use data;
>     body;
>   }
> 
>   setup data;
>   function_name = "subfunction";
>   GOMP_offload (subfunction, &data, function_name);

Roughly.  We have 3 directives here,
#pragma omp target
#pragma omp target data
#pragma omp target update
and all of them have various clauses, some that are allowed at most once
(e.g. the device clause, if clause) and others that can be used many times
(the data movement clauses).
I'd prefer GOMP_target instead of GOMP_offload for the function name, to
make it clearly related to the corresponding directive.
The question is whether we want to emit multiple calls for a single
directive, say one for each data movement clause (where for each one we need
the address, length, direction and some way to propagate the transformed
address to the accelerator code), or whether we build an array of the data
movement structures and just pass that down to a single routine.  Because of
the device clause, which should probably be passed just as an integer with -1
meaning the default, perhaps a single routine might be better.
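
Something along these lines, say (layout purely illustrative, the exact
ABI is up for discussion):

  /* One entry per data movement (map) clause.  */
  struct gomp_map_entry
  {
    void *hostaddr;       /* host address of the mapped object */
    size_t length;        /* size of the object in bytes */
    unsigned char kind;   /* alloc / to / from / tofrom */
  };

  /* DEVICE_ID of -1 stands for the default device.  */
  void GOMP_target (int device_id, void (*fn) (void *), const char *fnname,
                    struct gomp_map_entry *entries, size_t num_entries);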

> GOMP_offload is a call to libgomp, which will be implemented somehow like this:
>   void GOMP_offload (void (*fn)(void*), void *data, const char *fname)
>   {
>     if (gomp_offload_available ())

This really isn't just a check whether an accelerator is available; we need
to query all accelerators in the system (and cache that somehow in the
library), assign device numbers to individual devices (say, you could have
two Intel MIC cards, one AMD HSAIL capable GPGPU and 4 Nvidia PTX capable
GPGPUs or similar), ensure that already assigned device numbers aren't
reused when discovering new ones, and then just check what device the user
requested (if not available, fall back to the host), next check to see if we
have corresponding code for that accelerator (again, fall back to the host
otherwise), optionally compile the code if not compiled yet (HSAIL/PTX code
only), then finally do the name lookup and spawn it.
Stuff specific to the HW should be in libgomp plugins IMHO, so we have one
dlopenable module for each of the 3 variants, where one fn in the plugin
would be about checking what HW is available, another about trying to run
the code, etc.
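
For illustration, each plugin might export a table of entry points along
these lines (all names hypothetical):

  /* Functions a dlopenable libgomp plugin would provide.  */
  struct gomp_device_iface
  {
    /* How many devices of this kind are present in the system.  */
    int (*get_num_devices) (void);
    /* Device memory management and data transfers.  */
    void *(*alloc) (int device_num, size_t size);
    void (*free) (int device_num, void *devaddr);
    void (*host2dev) (int device_num, void *dev, const void *host, size_t n);
    void (*dev2host) (int device_num, void *host, const void *dev, size_t n);
    /* Look FNNAME up in the device image and run it.  */
    void (*run) (int device_num, const char *fnname, void *vars);
  };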

	Jakub


* Re: [RFC] Offloading Support in libgomp
  2013-08-23  0:22 ` Jakub Jelinek
@ 2013-08-23 12:16   ` Michael V. Zolotukhin
  2013-08-23 12:37     ` Jakub Jelinek
  0 siblings, 1 reply; 56+ messages in thread
From: Michael V. Zolotukhin @ 2013-08-23 12:16 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Kirill Yukhin, Richard Henderson, gcc, triegel

> Roughly.  We have 3 directives here,
> #pragma omp target
> #pragma omp target data
> #pragma omp target update
> and all of them have various clauses, some that are allowed at most once
> (e.g. the device clause, if clause) and others that can be used many times
> (the data movement clauses).
> The question is whether we want to emit multiple calls for a single
> directive, say one for each data movement clause (where for each one we need
> the address, length, direction and some way to propagate the transformed
> address to the accelerator code), or whether we build an array of the data
> movement structures and just pass that down to a single routine.  Because of
> the device clause, which should probably be passed just as an integer with -1
> meaning the default, perhaps a single routine might be better.
Sure, I used '#pragma omp target' just for a simple example.  The
question about '#pragma omp target data' is still open.  As far as I
understand, all three of the pragmas could require data marshalling (but
not necessarily - an 'omp target', if it's located inside an 'omp target
data' which specifies all the variables needed for the 'omp target',
won't need any data marshalling - right?).  This data movement could be
done, as you noted, by a single call or a set of calls (one for each
clause) - and while a single call seems appealing, it could be better to
use separate calls in cases when, e.g., we have a 'target update' for
only a subset of all the variables described in 'target data'.  The
one-call approach has difficulties with specifying which subset of the
data we want to update.

> I'd prefer GOMP_target instead of GOMP_offload for the function name, to
> make it clearly related to the corresponding directive.
That makes sense - I just used the first name that came to mind here.

> > GOMP_offload is a call to libgomp, which will be implemented somehow like this:
> >   void GOMP_offload (void (*fn)(void*), void *data, const char *fname)
> >   {
> >     if (gomp_offload_available ())
> 
> This really isn't just a check whether an accelerator is available; we need
> to query all accelerators in the system (and cache that somehow in the
> library), assign device numbers to individual devices (say, you could have
> two Intel MIC cards, one AMD HSAIL capable GPGPU and 4 Nvidia PTX capable
> GPGPUs or similar), ensure that already assigned device numbers aren't
> reused when discovering new ones, and then just check what device the user
> requested (if not available, fall back to the host), next check to see if we
> have corresponding code for that accelerator (again, fall back to the host
> otherwise), optionally compile the code if not compiled yet (HSAIL/PTX code
> only), then finally do the name lookup and spawn it.
The multi-target option raises another bunch of questions :)  Could you
please check my vision of how GCC would handle multiple offload
targets?  Here it is:
We have GCC with a set of plugins for compiling code for each available
offloading target.  These plugins work similarly to lto-plugin, i.e.
they consume gimple as the input, but produce code for the specific
target.  Libgomp also has a similar set of plugins for HW-specific
implementations of the functions for remotely running code, transferring
data, getting device status etc.
For example, for Intel MIC, AMD HSAIL and Nvidia PTX we'll have a host
GCC with three plugins and a host libgomp, also with three plugins.
Invoking GCC with options like '-mmic', '-mhsail', '-mptx' triggers
usage of the corresponding plugins in the GCC driver.  As a result,
after the compilation we'd have four binaries: one for the host and
three for the possible targets.
Now, the libgomp part.  The host binary contains calls to libgomp.so,
which is target-independent (i.e. it's host-specific).  It should be
able to call plugins for all three targets, so in functions like
gomp_offload_available it probably would iterate through all available
plugins, asking for device status and code availability.  This iterating
would include dlopen of the corresponding plugin, calling a function
from it and moving on to the next plugin.
Is this correct?

---
Thanks, Michael


* Re: [RFC] Offloading Support in libgomp
  2013-08-23 12:16   ` Michael V. Zolotukhin
@ 2013-08-23 12:37     ` Jakub Jelinek
  2013-08-24  6:17       ` Michael V. Zolotukhin
  0 siblings, 1 reply; 56+ messages in thread
From: Jakub Jelinek @ 2013-08-23 12:37 UTC (permalink / raw)
  To: Michael V. Zolotukhin; +Cc: Kirill Yukhin, Richard Henderson, gcc, triegel

On Fri, Aug 23, 2013 at 01:28:10PM +0400, Michael V. Zolotukhin wrote:
> Sure, I used '#pragma omp target' just for a simple example.  The
> question about '#pragma omp target data' is still open.  As far as I
> understand, all three of the pragmas could require data marshalling (but
> not necessarily - an 'omp target', if it's located inside an 'omp target
> data' which specifies all the variables needed for the 'omp target',
> won't need any data marshalling - right?).  This data movement could be
> done, as you noted, by a single call or a set of calls (one for each
> clause) - and while a single call seems appealing, it could be better to
> use separate calls in cases when, e.g., we have a 'target update' for
> only a subset of all the variables described in 'target data'.  The
> one-call approach has difficulties with specifying which subset of the
> data we want to update.

The single call approach would just be passed an array of control structures
that would describe each of the MAP clauses, and you'd simply loop over
them; the standard requires that if some object is already mapped into the
device, then nothing is performed (well, target update is an exception).
So you'd have some data structure that maps [address,address+length)
intervals for each device into target addresses and start with doing a
lookup on that (note, if [addr,addr+6) -> devaddr is already in, then addr+4
should be just mapped to devaddr+4).  If not found, you'd allocate the
memory on the device, add it into the mapping data structure and record it
in a vector for the target/target data in question, so that on exit from
that it would be copied back if requested, and deallocated.  I don't see how
a single call vs. multiple calls would change anything on that.  Plus a
single call allows the device id to be looked up just once and all the
mappings to be treated as a batch (GOMP_target_data would probably need a
corresponding GOMP_target_data_end call, perhaps just with a device_id, and
would pop from the stack of pending target data regions for that device.)
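
In code, the lookup could be as simple as the following (linear scan just
for illustration; a real implementation would use a splay tree or similar):

  #include <stddef.h>
  #include <stdint.h>

  /* One host [host_start, host_start + len) interval mapped onto the
     device at tgt_start.  */
  struct tgt_mapping
  {
    uintptr_t host_start;
    size_t len;
    uintptr_t tgt_start;
  };

  /* Return the device address for HOSTADDR, or 0 if not mapped; e.g.
     with [addr, addr+6) -> devaddr recorded, addr+4 resolves to
     devaddr+4.  */
  static uintptr_t
  lookup_tgt_addr (struct tgt_mapping *maps, size_t n, uintptr_t hostaddr)
  {
    size_t i;
    for (i = 0; i < n; i++)
      if (hostaddr >= maps[i].host_start
          && hostaddr - maps[i].host_start < maps[i].len)
        return maps[i].tgt_start + (hostaddr - maps[i].host_start);
    return 0;
  }
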
> > I'd prefer GOMP_target instead of GOMP_offload for the function name, to
> > make it clearly related to the corresponding directive.
> That makes sense - I just used the first name that came to mind here.
> 
> > > GOMP_offload is a call to libgomp, which will be implemented somehow like this:
> > >   void GOMP_offload (void (*fn)(void*), void *data, const char *fname)
> > >   {
> > >     if (gomp_offload_available ())
> > 
> > This really isn't just a check whether an accelerator is available; we need
> > to query all accelerators in the system (and cache that somehow in the
> > library), assign device numbers to individual devices (say, you could have
> > two Intel MIC cards, one AMD HSAIL capable GPGPU and 4 Nvidia PTX capable
> > GPGPUs or similar), ensure that already assigned device numbers aren't
> > reused when discovering new ones, and then just check what device the user
> > requested (if not available, fall back to the host), next check to see if we
> > have corresponding code for that accelerator (again, fall back to the host
> > otherwise), optionally compile the code if not compiled yet (HSAIL/PTX code
> > only), then finally do the name lookup and spawn it.
> The multi-target option raises another bunch of questions :)  Could you
> please check my vision of how GCC would handle multiple offload
> targets?  Here it is:
> We have GCC with a set of plugins for compiling code for each available
> offloading target.  These plugins work similarly to lto-plugin, i.e.
> they consume gimple as the input, but produce code for the specific
> target.  Libgomp also has a similar set of plugins for HW-specific
> implementations of the functions for remotely running code, transferring
> data, getting device status etc.
> For example, for Intel MIC, AMD HSAIL and Nvidia PTX we'll have a host
> GCC with three plugins and a host libgomp, also with three plugins.
> Invoking GCC with options like '-mmic', '-mhsail', '-mptx' triggers
> usage of the corresponding plugins in the GCC driver.  As a result,
> after the compilation we'd have four binaries: one for the host and
> three for the possible targets.

I meant just a single plugin that would handle all of them, or as richi
said, perhaps teach LTO plugin to do that.
For options, my vision was something like:
-ftarget=mic -ftarget=hsail='-mfoobaz=4 -mbazbaz'
which would mean:
1) compile LTO IL from the accelerator section for mic with
   the originally recorded gcc command line options with the Target options
   removed and no extra options added
2) compile LTO IL also for hsail target, with originally recorded gcc
   command line options but Target options and -mfoobaz=4 -mbazbaz
   options added
3) don't compile for ptx
The thing is if you originally compile with
-O3 -ftree-vectorize -march=corei7-avx -minline-all-stringops
the -m* options might not apply to the target compiler at all.
So you'd construct the command line from the original command line sans
CL_TARGET options, append to that the link time override for the
accelerator.  Then another thing is how to find out the corresponding
compiler (including its driver) for the target from the plugin.

> Now, the libgomp part.  The host binary contains calls to libgomp.so,
> which is target-independent (i.e. it's host-specific).  It should be
> able to call plugins for all three targets, so in functions like
> gomp_offload_available it probably would iterate through all available
> plugins, asking for device status and code availability.  This iterating
> would include dlopen of the corresponding plugin, calling a function
> from it and moving on to the next plugin.
> Is this correct?

libgomp would start by trying to dlopen all available plugins,
and for each of them call some routine in them that would query the hw
for available devices, then libgomp would assign device ids to them (0 and
up) and then for target specific parts just dispatch again to the plugin
corresponding to the chosen device.
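
In other words, roughly (plugin names and entry points hypothetical):

  #include <dlfcn.h>
  #include <stddef.h>

  static void
  gomp_find_available_devices (void)
  {
    static const char *const plugins[]
      = { "libgomp-plugin-mic.so", "libgomp-plugin-hsail.so",
          "libgomp-plugin-ptx.so" };
    int next_device_id = 0;
    size_t i;
    for (i = 0; i < sizeof plugins / sizeof plugins[0]; i++)
      {
        void *h = dlopen (plugins[i], RTLD_LAZY);
        if (h == NULL)
          continue;
        int (*get_num_devices) (void)
          = (int (*) (void)) dlsym (h, "get_num_devices");
        int n = get_num_devices ? get_num_devices () : 0;
        int j;
        /* Assign global device ids 0 and up across all plugins.  */
        for (j = 0; j < n; j++)
          gomp_register_device (next_device_id++, h, j); /* hypothetical */
      }
  }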

	Jakub


* Re: [RFC] Offloading Support in libgomp
  2013-08-23 12:37     ` Jakub Jelinek
@ 2013-08-24  6:17       ` Michael V. Zolotukhin
  2013-08-25 16:24         ` Jakub Jelinek
  0 siblings, 1 reply; 56+ messages in thread
From: Michael V. Zolotukhin @ 2013-08-24  6:17 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Kirill Yukhin, Richard Henderson, gcc, triegel

> The single call approach would just be passed an array of control structures
> that would describe each of the MAP clauses, and you'd simply loop over
> them; the standard requires that if some object is already mapped into the
> device, then nothing is performed (well, target update is an exception).
> So you'd have some data structure that maps [address,address+length)
> intervals for each device into target addresses and start with doing a
> lookup on that (note, if [addr,addr+6) -> devaddr is already in, then addr+4
> should be just mapped to devaddr+4).  If not found, you'd allocate the
> memory on the device, add it into the mapping data structure and record it
> in a vector for the target/target data in question, so that on exit from
> that it would be copied back if requested, and deallocated.  I don't see how
> a single call vs. multiple calls would change anything on that.  Plus a
> single call allows the device id to be looked up just once and all the
> mappings to be treated as a batch (GOMP_target_data would probably need a
> corresponding GOMP_target_data_end call, perhaps just with a device_id, and
> would pop from the stack of pending target data regions for that device.)
That makes sense.  We could maintain a vector with a descriptor for each
encountered MAP clause and push to and pop from it when needed (when
e.g. a new mapping is encountered inside 'pragma omp target data').  The
descriptor should contain the address in host memory, the size of the
mapped block, the type of mapping, the related device, and a handler,
which would be returned for this mapping by the runtime.  Having a
vector of such descriptors, we could pass it as an argument to the
outlined functions - in them we need to extract the needed addresses
from the vector before executing the body.  Did I get it right?
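
I.e., something like this (illustrative only):

  /* Descriptor for one MAP clause, as described above.  */
  struct data_descriptor
  {
    void *host_address;   /* address of the block in host memory */
    size_t size;          /* size of the mapped block */
    int map_kind;         /* type of mapping: to / from / tofrom */
    int device_id;        /* related device */
    void *handler;        /* handle returned by the runtime */
  };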

Also, a bit of an unclear point here is how we should generate these
extractions in the target version of the outlined function - seemingly we
won't pass this entire vector to it, so it's unclear what we should
extract the data from.  What do you think about this?

> I meant just a single plugin that would handle all of them, or as richi
> said, perhaps teach LTO plugin to do that.
> For options, my vision was something like:
> -ftarget=mic -ftarget=hsail='-mfoobaz=4 -mbazbaz'
> which would mean:
> 1) compile LTO IL from the accelerator section for mic with
>    the originally recorded gcc command line options with the Target options
>    removed and no extra options added
> 2) compile LTO IL also for hsail target, with originally recorded gcc
>    command line options but Target options and -mfoobaz=4 -mbazbaz
>    options added
> 3) don't compile for ptx
> The thing is if you originally compile with
> -O3 -ftree-vectorize -march=corei7-avx -minline-all-stringops
> the -m* options might not apply to the target compiler at all.
> So you'd construct the command line from the original command line sans
> CL_TARGET options, append to that the link time override for the
> accelerator.  Then another thing is how to find out the corresponding
> compiler (including its driver) for the target from the plugin.
Could we set up some correspondence between the '-ftarget' option value
and the corresponding compiler?  E.g. for '-ftarget=xyz' we would look for
xyz-cc1.  I haven't looked in detail at how the compiler plugins work,
so maybe I said something unfeasible :)

> libgomp would start by trying to dlopen all available plugins,
> and for each of them call some routine in them that would query the hw
> for available devices, then libgomp would assign device ids to them (0 and
> up) and then for target specific parts just dispatch again to the plugin
> corresponding to the chosen device.
In libgomp we have a similar problem - there we would need to find out
the plugin names from somewhere.  The difference is that libgomp would
always iterate through all plugins independently of compiler options,
but even with this I currently have no idea how to populate the list of
plugin names (I suppose this should be done somewhere at the
configure/make step of the libgomp build process?).



* Re: [RFC] Offloading Support in libgomp
  2013-08-24  6:17       ` Michael V. Zolotukhin
@ 2013-08-25 16:24         ` Jakub Jelinek
  2013-08-27  0:36           ` Michael V. Zolotukhin
  0 siblings, 1 reply; 56+ messages in thread
From: Jakub Jelinek @ 2013-08-25 16:24 UTC (permalink / raw)
  To: Michael V. Zolotukhin; +Cc: Kirill Yukhin, Richard Henderson, gcc, triegel

On Fri, Aug 23, 2013 at 07:30:52PM +0400, Michael V. Zolotukhin wrote:
> That makes sense.  We could maintain a vector with a descriptor for each
> encountered MAP clause and push to and pop from it when needed (when
> e.g. a new mapping is encountered inside 'pragma omp target data').  The
> descriptor should contain the address in host memory, the size of the
> mapped block, the type of mapping, the related device, and a handler,
> which would be returned for this mapping by the runtime.  Having a
> vector of such descriptors, we could pass it as an argument to the
> outlined functions - in them we need to extract the needed addresses
> from the vector before executing the body.  Did I get it right?

No need for the device and handler IMHO, each vector would correspond to
one function call (GOMP_target, GOMP_target_data or GOMP_target_update)
and all those calls would be called with device id.

> Also, a bit of an unclear point here is how we should generate these
> extractions in the target version of the outlined function - seemingly we
> won't pass this entire vector to it, so it's unclear what we should
> extract the data from.  What do you think about this?

Let's talk about some concrete example (though, I see the gimplifier
doesn't handle it right and with #if 0 changed into #if 1 we ICE in the C
FE, ++todo).

void baz (float *, float *, int);

#pragma omp declare target
int v = 6;
int tgt ()
{
  #pragma omp atomic update
    v++;
  return 0;
}
#pragma omp end declare target

float
bar (int x, int y, int z)
{
  float b[1024], c[1024], s = 0;
  int i, j;
  baz (b, c, x);
  #pragma omp target data map(to: b)
  {
    #pragma omp target map(tofrom: c)
#if 0
      #pragma omp teams num_teams(y) thread_limit(z) reduction(+:s)
        #pragma omp distribute dist_schedule(static, 4) collapse(1)
          for (j=0; j < x; j += y)
#else
	  j = 0;
#endif
            #pragma omp parallel for reduction(+:s)
              for (i = j; i < j + y; i++)
                tgt (), s += b[i] * c[i];
    #pragma omp target update from(b, v)
  }
  return s;
}

float
foo (int x)
{
  float b[1024], c[1024], s = 0;
  int i;
  baz (b, c, x);        
  #pragma omp target map(to: b, c)
    #pragma omp parallel for reduction(+:s)
      for (i = 0; i < x; i++)
        tgt (), s += b[i] * c[i];
  return s;
}

This ICEs during ompexp right now otherwise and obviously even omplower
doesn't DTRT.

So we have something like:

  #pragma omp target data map(to:b)
  #pragma omp target map(tofrom:j)
  j = 0;
  #pragma omp parallel reduction(+:s) shared(j) shared(c) shared(b) shared(y) [child fn: _Z3bariii._omp_fn.0 (???)]
  #pragma omp for nowait private(i)
  for (i = j; i < D.2235; i = i + 1)
    {
      tgt ();
      D.2236 = b[i];
      D.2237 = c[i];
      D.2238 = D.2236 * D.2237;
      s = D.2238 + s;
    }
  #pragma omp target update from(v) from(b)

On #pragma omp target it clearly is missing many other map clauses,
like map(tofrom:s), map(tofrom:c), map(tofrom:y) at least, will need to
debug later on why they disappeared or weren't added.

In any case, the only thing GOMP_target_data can do is take the vector
of the map clauses { mapkind, hostaddr, length } and look them up
one by one in the mapping of the device and if not present there, allocate
and/or copy and remember.

Now, for GOMP_target we want omplower to replace the var references
like b or c with something like .omp_target_data->b, .omp_target_data->c
etc., where the structure will contain the target addresses of the
variables.  So, GOMP_target would again receive vector of the
{ mapkind, hostaddr, length }, do the lookups, allocations / copying
like for GOMP_target_data, but also prepare a vector of the corresponding
target addresses that it would pass to the target function.
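
For the example above, the synthesized structure could be something like
(a sketch only):

  /* Target addresses of the mapped variables of bar.  */
  struct omp_target_data
  {
    float *b;
    float *c;
    float *s;
  };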

Automatic variables defined in the scope of #pragma omp target body
don't need any special treatment (but I hope gimplifier doesn't do anything
for them), they will be just automatic variables inside of the target
outlined body.  Other automatic variables in the function containing #pragma
omp target could have some optimization applied to them: if there aren't any
#pragma omp target data directives referencing them around the #pragma omp
target that references them, such variables are guaranteed not to be mapped
in the target device upon the GOMP_target call, thus such vars could be e.g.
allocated in a flexible array at the end of the .omp_target_data
structure.  Also for non-addressable variables we could supposedly consider
promoting them into a temporary variable (at the start of the GOMP_target
body load them from .omp_target_data->something, at the end store them back
(well, depending on the map kind)).  But let's start with non-optimized code,
where everything is passed as the target address of the allocated spot.

Also, GOMP_target{_data,} could just look up addresses for the whole vector
and remember what succeeded and what failed (i.e. what has already been
mapped and is thus a noop, and what needs mapping and, depending on the
mapkind, copying), sum up the amount of memory that needs allocating for the
latter ones, then just allocate everything in the device at once and
partition it among the individual vars.
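
A sketch of just that batching (gomp_mapped_p, gomp_device_alloc and
gomp_record_mapping are hypothetical helpers; the entry layout is as
sketched earlier in the thread):

  static void
  gomp_map_batch (int device_id, struct gomp_map_entry *entries,
                  size_t num_entries)
  {
    /* First pass: sum up the sizes of the entries not mapped yet.  */
    size_t total = 0, i;
    for (i = 0; i < num_entries; i++)
      if (!gomp_mapped_p (device_id, entries[i].hostaddr))
        total += entries[i].length;

    /* One device allocation, then partition it among those entries.  */
    uintptr_t block = gomp_device_alloc (device_id, total);
    for (i = 0; i < num_entries; i++)
      if (!gomp_mapped_p (device_id, entries[i].hostaddr))
        {
          gomp_record_mapping (device_id, entries[i].hostaddr,
                               entries[i].length, block);
          block += entries[i].length;
        }
  }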

> > I meant just a single plugin that would handle all of them, or as richi
> > said, perhaps teach LTO plugin to do that.
> > For options, my vision was something like:
> > -ftarget=mic -ftarget=hsail='-mfoobaz=4 -mbazbaz'
> > which would mean:
> > 1) compile LTO IL from the accelerator section for mic with
> >    the originally recorded gcc command line options with the Target options
> >    removed and no extra options added
> > 2) compile LTO IL also for hsail target, with originally recorded gcc
> >    command line options but Target options and -mfoobaz=4 -mbazbaz
> >    options added
> > 3) don't compile for ptx
> > The thing is if you originally compile with
> > -O3 -ftree-vectorize -march=corei7-avx -minline-all-stringops
> > the -m* options might not apply to the target compiler at all.
> > So you'd construct the command line from the original command line sans
> > CL_TARGET options, append to that the link time override for the
> > accelerator.  Then another thing is how to find out the corresponding
> > compiler (including its driver) for the target from the plugin.
> Could we set up some correspondence between the '-ftarget' option value
> and the corresponding compiler?  E.g. for '-ftarget=xyz' we would look for
> xyz-cc1.  I haven't looked in detail at how the compiler plugins work,
> so maybe I said something unfeasible :)

As specs are target specific, I'm afraid you'll need to be looking for
the gcc driver for the target, not lto1 binary.

> > libgomp would start by trying to dlopen all available plugins,
> > and for each of them call some routine in them that would query the hw
> > for available devices, then libgomp would assign device ids to them (0 and
> > up) and then for target specific parts just dispatch again to the plugin
> > corresponding to the chosen device.
> In libgomp we have a similar problem - there we would need to find out
> the plugin names from somewhere.  The difference is that libgomp would
> always iterate through all plugins independently of compiler options,
> but even with this I currently have no idea how to populate the list of
> plugin names (I suppose this should be done somewhere at the
> configure/make step of the libgomp build process?).

Configure could record the names, or you could scan a directory with the
plugins and dlopen all shared libraries in there, ...

	Jakub


* Re: [RFC] Offloading Support in libgomp
  2013-08-25 16:24         ` Jakub Jelinek
@ 2013-08-27  0:36           ` Michael V. Zolotukhin
  2013-08-27  0:38             ` Jakub Jelinek
  2013-08-28 12:56             ` Richard Biener
  0 siblings, 2 replies; 56+ messages in thread
From: Michael V. Zolotukhin @ 2013-08-27  0:36 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Kirill Yukhin, Richard Henderson, gcc, triegel

> No need for the device and handler IMHO, each vector would correspond to
> one function call (GOMP_target, GOMP_target_data or GOMP_target_update)
> and all those calls would be called with device id.
Probably yes.

> Let's talk about some concrete example (though, I see the gimplifier
> doesn't handle it right and with #if 0 changed into #if 1 we ICE in the C
> FE, ++todo).
That's a great idea.  I wanted to go into an example too, but I'd take a
simpler case to begin with.

Here is a modified version of your test - I just removed all the pragmas
that don't deal with offloading.  They aren't new in OpenMP4, and GCC
supports them well, so we can concentrate on the others - like 'pragma
target' and 'pragma target data'.

So, here is the original code:

  #pragma omp declare target
  int v = 6;
  int tgt ()
  {
    #pragma omp atomic update
      v++;
    return 0;
  }
  #pragma omp end declare target

  float
  bar (int x, int y, int z)
  {
    float b[1024], c[1024], s = 0;
    int i, j;
    baz (b, c, x);
    #pragma omp target data map(to: b)
    {
      #pragma omp target map(tofrom: c) map(from:s)
      for (i = 0; i < 1024; i++)
	tgt (), s += b[i] * c[i];
      #pragma omp target update from(b, v)
    }
    return s;
  }
Let's write down what we want this to be expanded to.  For now let's ignore
the obvious problems leading to ICEs that you mentioned - they certainly
need to be addressed, but I don't think they affect the overall design which
we're discussing here.

As I currently see it, the given code would be expanded to something like
this:

  // Create two versions of V: for host and for target
  int v;
  int v_target __attribute(target);

  // The same for TGT function
  int tgt ()
  {
    .. update v ..
  }
  int tgt_target () __attribute(target)
  {
    .. update v_target ..
  }

  float
  bar (int x, int y, int z)
  {
    float b[1024], c[1024], s = 0;
    int i, j;
    baz (b, c, x);
    // #pragma omp target data map(to: b)
    vec<data_descriptor> data_desc;
    data_desc.push ({&b, 1024*sizeof(float), TO});
    GOMP_target_data (&data_desc);
    {
      // #pragma omp target map(tofrom: c) map(from:s)
      data_desc.push ({&c, 1024*sizeof(float), TOFROM});
      data_desc.push ({&s, sizeof(float), FROM});
      GOMP_target_data (&data_desc); // Add mapping for S and C variables,
				     // mapping for B shouldn't change
      GOMP_target (foo1, "foo1", &data_desc); // Call either FOO1 or offloaded
					      // FOO1_TARGET with arguments
					      // from vector DATA_DESC

      // #pragma omp target update from(b, v)
      vec<data_descriptor> data_desc_update; // target update pragma require a
					     // separate vector
      data_desc_update.push ({&b, 1024*sizeof(float), FROM});
      data_desc_update.push ({&v, sizeof(int), FROM});
      GOMP_target_data (&data_desc_update);
    }
    return s;
  }
  void
  foo1 (vec<data_descriptor> data_desc)
  {
    float *b = data_desc[0].host_address;
    float *c = data_desc[1].host_address;
    float s = 0;
    int i;
    for (i = 0; i < 1024; i++)
      tgt (), s += b[i] * c[i];
    *(float *) data_desc[2].host_address = s;
  }
  void
  foo1_target (int n, void **arguments) __attribute(target)
  {
    float *b = arguments[0];
    float *c = arguments[1];
    float s = 0;
    int i;
    for (i = 0; i < 1024; i++)
      tgt_target (), s += b[i] * c[i];
    *(float *) arguments[2] = s;
  }

That's how I think the code should look after omp expansion.  I.e. all
variables and functions marked with 'declare target' are cloned so that we
have host and target versions available (if we have N different targets,
then we need N+1 versions).  All regions corresponding to 'pragma omp
target' are outlined and also cloned into N+1 versions.  'pragma target
data' is replaced with the generation of a vector of data descriptors and a
call to GOMP_target_data, which performs the actual data mapping and
marshalling.  We could also call GOMP_target_data to invoke additional data
mapping/marshalling - e.g. we need this in 'pragma target' where we add the
clause MAP(TOFROM:C).  'pragma omp target update' needs a separate vector,
as its mapping could differ from the mapping of the embracing 'pragma
target data'.

But all that could be quite meaningless unless the internals of
GOMP_target{_data} are discussed.  So, let's proceed to the libgomp part and
discuss what these functions should do.  From my POV, they would perform the
following:
1) GOMP_target_data:
For each element of the input vector, check whether it already has a mapping
and if not, create a new one.  Also, the necessary marshalling is triggered
from here (but probably it's better to move the data transfers to a separate
routine).
The result of this function's work would be a consistent data structure
containing all mapped memory entries as well as their handlers, representing
target-side addresses.
2) GOMP_target:
First of all, this function would call gomp_choose_device_for_offload, which
would check all available targets and choose one for offloading.  The host
could also be chosen here.
If the host is chosen, we just call the host version of the routine (the
function address is passed via the first argument) and pass the
data_descriptor vector to it.
If a target device is chosen, we do the following: create a vector of
handlers corresponding to the data descriptors from the input vector, and
pass the routine name as well as the vector of handlers to the function
gomp_run_offloaded_function from the target plugin.  That routine performs
the actual offloading, waits for completion and returns.
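
In pseudo-C, GOMP_target would then be roughly (a sketch of the scheme
above; the helper for translating descriptors into handlers is made up):

  void
  GOMP_target (void (*hostfn) (void *), const char *fname,
               struct data_descriptor *descs, int n)
  {
    int dev = gomp_choose_device_for_offload ();
    if (dev == GOMP_DEVICE_HOST)  /* hypothetical constant */
      {
        /* Fall back: just call the host version of the routine.  */
        hostfn (descs);
        return;
      }
    /* Translate the descriptors into target-side handlers and let the
       plugin run the offloaded function.  */
    void **handlers = gomp_handlers_for_descs (dev, descs, n);
    gomp_run_offloaded_function (dev, fname, handlers, n);
  }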

Does this overall scheme sound OK to you?

> Now, for GOMP_target we want omplower to replace the var references
> like b or c with something like .omp_target_data->b, .omp_target_data->c
> etc., where the structure will contain the target addresses of the
> variables.  So, GOMP_target would again receive vector of the
> { mapkind, hostaddr, length }, do the lookups, allocations / copying
> like for GOMP_target_data, but also prepare a vector of the corresponding
> target addresses that it would pass to the target function.
Agreed.  I tried to describe and rephrase that a bit above to make sure we both
mean the same here.

> Automatic variables defined in the scope of #pragma omp target body
> don't need any special treatment (but I hope gimplifier doesn't do anything
> for them), they will be just automatic variables inside of the target
> outlined body.
I hope that too.

> But let's start with non-optimized code,
> where everything is passed as the target address of the allocated spot.
Agreed.

> As specs are target specific, I'm afraid you'll need to be looking for
> the gcc driver for the target, not lto1 binary.
I think I didn't get it.  Could you explain this point?  What are the specs
here?

> Configure could record the names, or you could scan a directory with the
> plugins and dlopen all shared libraries in there, ...
I'd prefer recording at the configure step - to me it looks more robust;
in general I'm OK with both options.

---
Thanks, Michael

* Re: [RFC] Offloading Support in libgomp
  2013-08-27  0:36           ` Michael V. Zolotukhin
@ 2013-08-27  0:38             ` Jakub Jelinek
  2013-08-27  6:16               ` Michael V. Zolotukhin
  2013-08-28 12:56             ` Richard Biener
  1 sibling, 1 reply; 56+ messages in thread
From: Jakub Jelinek @ 2013-08-27  0:38 UTC (permalink / raw)
  To: Michael V. Zolotukhin; +Cc: Kirill Yukhin, Richard Henderson, gcc, triegel

On Mon, Aug 26, 2013 at 03:59:11PM +0400, Michael V. Zolotukhin wrote:
> As I currently see it, the given code would be expanded to something like
> this:
> 
>   // Create two versions of V: for host and for target
>   int v;
>   int v_target __attribute(target);
> 
>   // The same for TGT function
>   int tgt ()
>   {
>     .. update v ..
>   }
>   int tgt_target () __attribute(target)
>   {
>     .. update v_target ..
>   }

Actually, not two versions of those during the compilation, you have
just one v and one tgt, both have __attribute__(("omp declare target"))
on them (note, you can't specify that attribute manually).
And just when streaming into .gnu.target_lto_* sections you only stream
everything that has those attributes and types used by it, but nothing else.
> 
>   float
>   bar (int x, int y, int z)
>   {
>     float b[1024], c[1024], s = 0;
>     int i, j;
>     baz (b, c, x);
>     // #pragma omp target data map(to: b)
>     vec<data_descriptor> data_desc;
>     data_desc.push ({&b, 1024*sizeof(float), TO});
>     GOMP_target_data (&data_desc);

Nope.  It would be:
  struct data_descriptor data_desc1[1] = { { &b, 1024*sizeof(float), TO } };
  GOMP_target_data (-1, data_desc1, 1);
or so.  The compiler always knows how many vector elements it needs, there
is no point in making the vector dynamic, and vec<> is a compiler data
structure, while you want to emit runtime code.  The -1 in there stands
for missing device(device-id) clause, otherwise it would be the provided
device-id expression.  For the if clause, the question is if we want to pass
it down to the runtime library too (as bool, defaulting to true if missing),
or do something else.

>     {
>       // #pragma omp target map(tofrom: c) map(from:s)
>       data_desc.push ({&c, 1024*sizeof(float), TOFROM});
>       data_desc.push ({&s, sizeof(float), FROM});
>       GOMP_target_data (&data_desc); // Add mapping for S and C variables,
> 				     // mapping for B shouldn't change

Nope, there is only one target data pragma, so you would use here just:

>       GOMP_target (foo1, "foo1", &data_desc); // Call either FOO1 or offloaded
> 					      // FOO1_TARGET with arguments
> 					      // from vector DATA_DESC

  struct data_descriptor data_desc2[2] = { ... };
  GOMP_target (-1, bar.omp_fn.1, "bar.omp_fn.1", data_desc2, 2);

> 
>       // #pragma omp target update from(b, v)
>       vec<data_descriptor> data_desc_update; // target update pragma require a
> 					     // separate vector
>       data_desc_update.push ({&b, 1024*sizeof(float), FROM});
>       data_desc_update.push ({&v, sizeof(int), FROM});
>       GOMP_target_data (&data_desc_update);

Similarly here.

>     }
>     return s;
>   }
>   void
>   foo1 (vec<data_descriptor> data_desc)
>   {
>     float *b = data_desc[0].host_address;
>     float *c = data_desc[1].host_address;
>     float s = 0;
>     int i;
>     for (i = 0; i < 1024; i++)
>       tgt (), s += b[i] * c[i];
>     *(float *) data_desc[2].host_address = s;

No, I didn't mean you'd do this.  omp-lower.c would simply create
a type here that would have the same layout as what the runtime
library would pass to it.
So it would be:

void
bar.omp_fn.1 (struct omp_target_data *.omp_data_in)
{
  int i;
  *.omp_data_in->s = 0;
  for (i = 0; i < 1024; i++)
    tgt (), *.omp_data_in->s += .omp_data_in->b[i] * .omp_data_in->c[i];
}

Just look what omplower pass does for normal OpenMP code, say
#pragma omp parallel, task etc.

	Jakub


* Re: [RFC] Offloading Support in libgomp
  2013-08-27  0:38             ` Jakub Jelinek
@ 2013-08-27  6:16               ` Michael V. Zolotukhin
  2013-08-27  8:06                 ` Jakub Jelinek
  0 siblings, 1 reply; 56+ messages in thread
From: Michael V. Zolotukhin @ 2013-08-27  6:16 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Kirill Yukhin, Richard Henderson, gcc, triegel

> Actually, not two versions of those during the compilation, you have
> just one v and one tgt, both have __attribute__(("omp declare target"))
> on them (note, you can't specify that attribute manually).
> And just when streaming into .gnu.target_lto_* sections you only stream
> everything that has those attributes and types used by it, but nothing else.
Agreed.  The point was that in the bytecode we would have two versions.

> Nope.  It would be:
>   struct data_descriptor data_desc1[1] = { { &b, 1024*sizeof(float), TO } };
>   GOMP_target_data (-1, data_desc1, 1);
> or so.  The compiler always knows how many vector elements it needs, there
> is no point in making the vector dynamic ...
Yes, that's a good point.  We really don't need a dynamic type here.

> ... and vec<> is a compiler data
> structure, while you want to emit runtime code...
Yep, I know that - I just probably chose poor notation here.  I meant
that data_desc would be some vector storing 'data_descriptor'
structures.  And now I see that there is even no need for a vector -
an array would be sufficient.

> ...  For the if clause, the question is if we want to pass
> it down to the runtime library too (as bool, defaulting to true if missing),
> or do something else.
I think we should do that in the same way as it's done in 'pragma
parallel'.

> Nope, there is only one target data pragma, so you would use here just:
> 
>   struct data_descriptor data_desc2[2] = { ... };
>   GOMP_target (-1, bar.omp_fn.1, "bar.omp_fn.1", data_desc2, 2);
This 'pragma target' is placed inside a 'pragma target data' - so all
variables for 'pragma target data' should be available for the 'pragma
target'.  So we need to pass to GOMP_target an array that contains the
united set of mapped variables from both pragmas - in our example these
would be variables B, C, and S.  So as I see it, we need to use the same
array of descriptors both in the outer 'pragma target data' and in the inner
'pragma target'.  Is that correct?  If data_desc2 contains descriptors of
only C and S, how would B be passed to bar.omp_fn.1?

> No, I didn't mean you'd do this.  omp-lower.c would simply create
> a type here that would have the same layout as what the runtime
> library would pass to it.
> So it would be:
> 
> void
> bar.omp_fn.1 (struct omp_target_data *.omp_data_in)
> {
>   int i;
>   *.omp_data_in->s = 0;
>   for (i = 0; i < 1024; i++)
>     tgt (), *.omp_data_in->s += .omp_data_in->b[i] * .omp_data_in->c[i];
> }
> 
> Just look what omplower pass does for normal OpenMP code, say
> #pragma omp parallel, task etc.
Actually, I meant the same (but probably used a poor notation for this
as well) - I like the idea of having similar approaches in 'pragma
target' and 'pragma parallel/etc.'.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2013-08-27  6:16               ` Michael V. Zolotukhin
@ 2013-08-27  8:06                 ` Jakub Jelinek
  2013-08-27 15:47                   ` Michael V. Zolotukhin
  0 siblings, 1 reply; 56+ messages in thread
From: Jakub Jelinek @ 2013-08-27  8:06 UTC (permalink / raw)
  To: Michael V. Zolotukhin; +Cc: Kirill Yukhin, Richard Henderson, gcc, triegel

On Mon, Aug 26, 2013 at 05:29:36PM +0400, Michael V. Zolotukhin wrote:
> > Nope, there is only one target data pragma, so you would use here just:
> > 
> >   struct data_descriptor data_desc2[2] = { ... };
> >   GOMP_target (-1, bar.omp_fn.1, "bar.omp_fn.1", data_desc2, 2);
> This 'pragma target' is placed inside a 'pragma target data' - so all
> variables for 'pragma target data' should be available for the 'pragma
> target'.  So we need to pass to GOMP_target an array that contains the
> united set of mapped variables from both pragmas - in our example these
> would be variables B, C, and S.  So as I see it, we need to use the same
> array of descriptors both in the outer 'pragma target data' and in the inner
> 'pragma target'.  Is that correct?  If data_desc2 contains descriptors of
> only C and S, how would B be passed to bar.omp_fn.1?

Actually no, that should be the responsibility of the runtime library.
Note, the #pragma omp target data directive doesn't have to be in the same
function as #pragma omp target.  And I'm sorry for confusing this by
hacking a premature optimization into the gimplifier.  It is true
that if the target data region is around the target directive, the target will
always be able to just look stuff up and will not need to allocate anything,
but 1) the gimplifier doesn't verify it is the same device between those two,
and 2) as discussed earlier we need it also for the mapping in GOMP_target.
So, the way it should actually work IMHO is that GOMP_target_data
is passed a descriptor for b, and so is GOMP_target.
In a more complicated testcase where you have a pointer based array section:
void
foo (int *p)
{
  #pragma omp target data map (tofrom: p[:1024])
  {
    #pragma omp target
    for (int i = 0; i < 1024; i++)
      p[i] += 2;
  }
}
GOMP_target_data does two mappings - one where it maps the
(char *) p+0 ... (char *) p+1024*sizeof(int)-1
region (tofrom) and one where it maps the
&p ... (char *)(&p+1)-1
region with a pointer-assign kind (i.e. the target copy of p is initialized
to the target address of the mapped array section).
Then for GOMP_target, during gimplification the compiler determines that p
is used, but not explicitly mapped, so it is added automatically
to #pragma omp target as an implicit map(tofrom:p).  That doesn't
do anything with the corresponding array section, and while it is tofrom,
it will actually always be ignored, since the region is already mapped.
Perhaps as an optimization the compiler could hint the runtime library that
it can just look it up and doesn't need to allocate/copy anything.
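
Spelled out with the same illustrative descriptor type as before, the two
mappings GOMP_target_data would receive are roughly (a sketch; type and kind
names are placeholders, not real libgomp code):

  struct data_descriptor descs[2] = {
    { p,  1024 * sizeof (int), TOFROM },  /* the array section itself */
    { &p, sizeof (int *), POINTER }       /* the pointer; initialized on the
                                             target to the section's target
                                             address */
  };
  GOMP_target_data (-1, descs, 2);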

Anyway, the GOMP_target_data implementation and part of GOMP_target would
be something along the lines of the following pseudocode:

device_data = lookup_device_id (device_id);
if (device_data == NULL)
  do host fallback;
else
  {
    size_t i, length_sum = 0;
    target_data_env = create_target_data_env ();
    void *target_addrs[num_device_descs]; // VLA or alloca etc.
    char *target_addr = NULL;
    memset (target_addrs, 0, sizeof (target_addrs));
    vec_safe_push (device_data->target_data_envs, target_data_env);
    for (i = 0; i < num_device_descs; i++)
      {
	target_addrs[i] = lookup_in_target_address_tree (device_data->addr_tree, device_desc[i].host_addr, device_desc[i].length);
        if (target_addrs[i] == NULL)
	  length_sum += device_desc[i].length;
      }
    if (length_sum)
      {
        target_addr = target_malloc (device_data, length_sum);
	length_sum = 0;
	for (i = 0; i < num_device_descs; i++)
	  if (target_addrs[i] == NULL)
	    {
	      target_addrs[i] = target_addr + length_sum;
	      length_sum += device_desc[i].length;
	      switch (device_desc[i].kind)
		{
		case ALLOC: case FROM: /* nothing */ break;
		case TO:
		case TOFROM: target_copy_todevice (device_data, device_desc[i].host_addr, target_addrs[i], device_desc[i].length); break;
		case POINTER: lookup + copy to; break;
		}
	      ptr = add_to_target_address_tree (device_data->addr_tree, device_desc[i].host_addr, device_desc[i].length, target_addrs[i], device_desc[i].kind);
	      vec_safe_push (target_data_env->vec, ptr);
	    }
      }
    if (GOMP_target call)
      {
        target_call (fn_name, target_addrs);
        FOR_EACH_SAFE_VEC (target_data_env->vec, ptr)
          {
            switch (ptr->kind)
              {
              case FROM: case TOFROM: target_copy_fromdevice (device_data, ...); break;
              }
            remove_from_target_address_tree (device_data->addr_tree, ptr);
          }
        vec_pop_and_free (device_data->target_data_envs);
      }
  }

and for GOMP_target_data_end it would do pretty much the stuff in between
FOR_EACH_SAFE_VEC and vec_pop_and_free.
All names above subject to change for something better, I just wanted to
make the picture clear.  There needs to be some address -> target address
data structure (addr_tree), probably some tree (AVL, whatever), in any case
a lookup doesn't need to be exact, you can e.g. look up part of an existing
mapping.  Trying to map something that overlaps an existing mapping, but is
larger than that, is a user bug.
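
To illustrate the non-exact lookup, here is a minimal sketch (the tree
helper, the struct layout and the error handling are all hypothetical):

#include <stdint.h>
#include <stdlib.h>

struct tgt_mapping
{
  uintptr_t host_start;  /* start of the mapped host region */
  size_t length;         /* length of the mapped region */
  uintptr_t tgt_start;   /* corresponding target address */
  char kind;             /* ALLOC/TO/FROM/TOFROM/POINTER */
};

struct addr_tree;  /* opaque; some balanced tree keyed by host_start */

/* Assumed helper: find the mapping whose host region contains ADDR.  */
extern struct tgt_mapping *find_containing (struct addr_tree *, uintptr_t);

/* Return the target address for HOST_ADDR/LENGTH, or NULL if unmapped.
   A block fully contained in an existing mapping is a hit.  */
void *
lookup_in_target_address_tree (struct addr_tree *tree, void *host_addr,
                               size_t length)
{
  uintptr_t addr = (uintptr_t) host_addr;
  struct tgt_mapping *m = find_containing (tree, addr);
  if (m == NULL)
    return NULL;
  if (addr + length > m->host_start + m->length)
    abort ();  /* overlaps an existing mapping but is larger: user bug */
  return (void *) (m->tgt_start + (addr - m->host_start));
}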

	Jakub

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2013-08-27  8:06                 ` Jakub Jelinek
@ 2013-08-27 15:47                   ` Michael V. Zolotukhin
  2013-08-27 16:22                     ` Jakub Jelinek
  0 siblings, 1 reply; 56+ messages in thread
From: Michael V. Zolotukhin @ 2013-08-27 15:47 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Kirill Yukhin, Richard Henderson, gcc, triegel

Hi Jakub,

> Anyway, the GOMP_target_data implementation and part of GOMP_target would
> be something along the lines of the following pseudocode:
> 
> device_data = lookup_device_id (device_id);
> ...
Thanks, I see it similarly.  But the problem of passing
arguments to the target is still open.  Let me try to explain what the
problem is.

Remember what we did for 'pragma parallel':
  struct .omp_data_s.0 .omp_data_o.2;
  .omp_data_o.2.s = 0.0;
  .omp_data_o.2.b = &b;
  .omp_data_o.2.c = &c;
  .omp_data_o.2.y = y_7(D);
  .omp_data_o.2.j = j_9(D);
  __builtin_GOMP_parallel (bar._omp_fn.0, &.omp_data_o.2, 0, 0);
  s_12 = .omp_data_o.2.s;
  y_13 = .omp_data_o.2.y;
  j_14 = .omp_data_o.2.j;

I.e. the compiler prepares a structure with all arguments and passes it to the
runtime.  The runtime passes this structure as-is to the callee (i.e. to
bar._omp_fn.0).

In bar._omp_fn.0 the compiler just emits code that extracts the
corresponding fields from the given struct and thus initializes all
needed local vars:
  bar._omp_fn.0 (struct .omp_data_s.0 * .omp_data_i)
  {
    int _12;
    int _13;
    ...
    _12 = .omp_data_i_11(D)->y;
    _13 = .omp_data_i_11(D)->j;
    ...
  }

That scheme would work perfectly for implementing the host fallback, but as
I see it, it can't be applied as-is to target offloading, for the
following reasons:
*) The compiler doesn't know runtime info, i.e. it doesn't know target
addresses, so it can't fill in the structure to pass to the target version
of the routine.
*) The runtime doesn't know the structure layout - the runtime should first
translate the addresses and only then pass the structure to the callee, but
it doesn't know which addresses to translate, because it doesn't know which
variables are used by the callee.

Currently, I see two possible solutions for this:
1) Add fields describing the size of each field to the structure with
arguments.  Then GOMP_target parses this struct, replaces every found
address with the corresponding target address, and only then calls
target_call.
2) Lift mapping/allocation stuff from runtime to compile time, i.e.
allow the compiler to generate calls like this:
  .omp_data_o.2.s = 0.0;
  .omp_data_o.2.b = &b;
  .omp_data_o.2.c = &c;
  .omp_data_o.2.y = y_7(D);
  .omp_data_o.2.j = j_9(D);
  .omp_data_o.target.2.s = GOMP_translate_target_address (0.0);
  .omp_data_o.target.2.b = GOMP_translate_target_address (&b);
  .omp_data_o.target.2.c = GOMP_translate_target_address (&c);
  .omp_data_o.target.2.y = GOMP_translate_target_address (y_7(D));
  .omp_data_o.target.2.j = GOMP_translate_target_address (j_9(D));
  GOMP_target (bar._omp_fn.0, &.omp_data_o.2, &.omp_data_o.target.2, 0, 0);
Thus the runtime would have two versions of the structure with arguments and
would be able to pass them as-is to the target callee.  But we'd probably need
a version of that struct for each target, and that would look very ugly.

What do you think about that?  Maybe I'm missing or overcomplicating
something, but for now I can't see how all this stuff could work
together without answers to these questions.

Referring to your code, the question could be rephrased as follows:
given this line
>       target_call (fn_name, target_addrs);
how would FN_NAME, called on the target, figure out where to find its
arguments (as TARGET_ADDRS contains all the addresses mapped at runtime)?
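
To make option 1 above a bit more concrete, one possible shape of such a
self-describing argument block (all names hypothetical):

  /* Each slot carries its size, so the runtime can translate host
     addresses without knowing the actual struct layout.  */
  struct omp_arg_slot
  {
    void *host_addr;  /* host address of the argument */
    size_t length;    /* size of the pointed-to block */
  };

  /* GOMP_target would walk an array of these, replacing each host_addr
     with its target counterpart before invoking target_call.  */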

---
Thanks, Michael

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2013-08-27 15:47                   ` Michael V. Zolotukhin
@ 2013-08-27 16:22                     ` Jakub Jelinek
  2013-08-27 19:54                       ` Michael V. Zolotukhin
  0 siblings, 1 reply; 56+ messages in thread
From: Jakub Jelinek @ 2013-08-27 16:22 UTC (permalink / raw)
  To: Michael V. Zolotukhin; +Cc: Kirill Yukhin, Richard Henderson, gcc, triegel

What I meant was just that if you call GOMP_target with
num_descs N, then the structure will look like:
struct .omp_target_data
{
  sometype0 *var0;
  sometype1 *var1;
  ...
  sometypeNminus1 *varNminus1;
};
so pretty much the runtime will call the target routine with the address of
an array of N pointers, and the compiler-generated target routine will
just use a struct to access it, to make it more debuggable.  As there won't
be any padding in the structure, I'd hope the structure layout will be
exactly the same as the array.
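
To illustrate with the earlier b/c/s example (identifiers simplified, since
the compiler-generated name .omp_target_data is not valid user C; the whole
snippet is a sketch of the layout equivalence, not real libgomp code):

/* Pointers only, hence no padding - same layout as void *[3].  */
struct omp_target_data
{
  float *b;  /* var0: target address of b */
  float *c;  /* var1: target address of c */
  float *s;  /* var2: target address of s */
};

/* Runtime side: build an array of N translated pointers and let the
   target routine view the same bytes through the struct.  */
void
call_target (void (*target_fn) (struct omp_target_data *),
             void *b_tgt, void *c_tgt, void *s_tgt)
{
  void *target_addrs[3] = { b_tgt, c_tgt, s_tgt };
  target_fn ((struct omp_target_data *) target_addrs);
}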

	Jakub

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2013-08-27 16:22                     ` Jakub Jelinek
@ 2013-08-27 19:54                       ` Michael V. Zolotukhin
  2013-08-28 11:21                         ` Jakub Jelinek
  2013-09-13  9:35                         ` Michael Zolotukhin
  0 siblings, 2 replies; 56+ messages in thread
From: Michael V. Zolotukhin @ 2013-08-27 19:54 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Kirill Yukhin, Richard Henderson, gcc, triegel

> What I meant was just that if you call GOMP_target with
> num_descs N, then the structure will look like:
> struct .omp_target_data
> {
>   sometype0 *var0;
>   sometype1 *var1;
>   ...
>   sometypeNminus1 *varNminus1;
> };
> so pretty much the runtime will call the target routine with the address of
> an array of N pointers, and the compiler-generated target routine will
> just use a struct to access it, to make it more debuggable.  As there won't
> be any padding in the structure, I'd hope the structure layout will be
> exactly the same as the array.
Ok, such an assumption about the struct layout seems to be enough for the
runtime to figure out what to translate.
I'll try to put in order everything we've discussed and prepare a
description of the overall scheme.  After that, if you are ok with it,
we could start implementing it - that's when we'll face the most
interesting stuff, I guess :)

Thanks, Michael

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2013-08-27 19:54                       ` Michael V. Zolotukhin
@ 2013-08-28 11:21                         ` Jakub Jelinek
  2013-08-29 10:44                           ` Michael V. Zolotukhin
  2013-09-10 15:02                           ` Michael V. Zolotukhin
  2013-09-13  9:35                         ` Michael Zolotukhin
  1 sibling, 2 replies; 56+ messages in thread
From: Jakub Jelinek @ 2013-08-28 11:21 UTC (permalink / raw)
  To: Michael V. Zolotukhin; +Cc: Kirill Yukhin, Richard Henderson, gcc, triegel

On Tue, Aug 27, 2013 at 03:55:38PM +0400, Michael V. Zolotukhin wrote:
> > What I meant was just that if you call GOMP_target with
> > num_descs N, then the structure will look like:
> > struct .omp_target_data
> > {
> >   sometype0 *var0;
> >   sometype1 *var1;
> >   ...
> >   sometypeNminus1 *varNminus1;
> > };
> > so pretty much the runtime will call the target routine with the address of
> > an array of N pointers, and the compiler-generated target routine will
> > just use a struct to access it, to make it more debuggable.  As there won't
> > be any padding in the structure, I'd hope the structure layout will be
> > exactly the same as the array.
> Ok, such an assumption about the struct layout seems to be enough for the
> runtime to figure out what to translate.
> I'll try to put in order everything we've discussed and prepare a
> description of the overall scheme.  After that, if you are ok with it,
> we could start implementing it - that's when we'll face the most
> interesting stuff, I guess :)

Perhaps instead of passing an array of { void *hostaddr; size_t length; char kind; }
plus its length we could pass 3 separate arrays and a single length (the same
for all of them).  I can see 2 advantages of doing that:
1) the sizes are often constant and the kinds are always constant, so
we could often allocate the last 2 arrays, or just the last one, in .rodata
and wouldn't need to initialize them dynamically
2) for the host fallback, we could just pass the first array unmodified as
the .omp_target_data structure, no need to copy the host addresses
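
For the earlier b/c/s example the call might then look like this (the
GOMP_target signature and the mangled function name are illustrative only):

  extern void GOMP_target (int, void (*) (void *), const char *,
                           size_t, void **, const size_t *, const char *);
  extern void bar_omp_fn_1 (void *);

  void *hostaddrs[3] = { b, c, &s };  /* filled in at runtime */
  static const size_t sizes[3]
    = { 1024 * sizeof (float), 1024 * sizeof (float), sizeof (float) };
  static const char kinds[3] = { TO, TOFROM, FROM };  /* can live in .rodata */

  GOMP_target (-1, bar_omp_fn_1, "bar.omp_fn.1", 3, hostaddrs, sizes, kinds);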

	Jakub

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2013-08-27  0:36           ` Michael V. Zolotukhin
  2013-08-27  0:38             ` Jakub Jelinek
@ 2013-08-28 12:56             ` Richard Biener
  2013-08-28 15:26               ` Jakub Jelinek
  1 sibling, 1 reply; 56+ messages in thread
From: Richard Biener @ 2013-08-28 12:56 UTC (permalink / raw)
  To: Michael V. Zolotukhin
  Cc: Jakub Jelinek, Kirill Yukhin, Richard Henderson, GCC Development,
	Torvald Riegel

On Mon, Aug 26, 2013 at 1:59 PM, Michael V. Zolotukhin
<michael.v.zolotukhin@gmail.com> wrote:
>> No need for the device and handler IMHO, each vector would correspond to
>> one function call (GOMP_target, GOMP_target_data or GOMP_target_update)
>> and all those calls would be called with device id.
> Probably yes.
>
>> Let's talk about some concrete example (though, I see the gimplifier
>> doesn't handle it right and with #if 0 changed into #if 1 we ICE in the C
>> FE, ++todo).
> That's a great idea, I wanted to go into an example too, but I'd take a simpler
> case for the beginning.

(jumping into the middle of the discussion where there is a small example)

> Here is a modified version of your test - I just removed all pragmas, that
> don't deal with offloading.  They aren't new in OpenMP4, and GCC supports them
> well, so we could concentrate on the others - like 'pragma target' and 'pragma
> target data'.
>
> So, here is the original code:
>
>   #pragma omp declare target
>   int v = 6;
>   int tgt ()
>   {
>     #pragma omp atomic update
>       v++;
>     return 0;
>   }
>   #pragma omp end declare target
>
>   float
>   bar (int x, int y, int z)
>   {
>     float b[1024], c[1024], s = 0;
>     int i, j;
>     baz (b, c, x);
>     #pragma omp target data map(to: b)
>     {
>       #pragma omp target map(tofrom: c) map(from:s)
>       for (i = 0; i < 1024; i++)
>         tgt (), s += b[i] * c[i];
>       #pragma omp target update from(b, v)
>     }
>     return s;
>   }

You could even simplify this more by removing tgt and its use?  That is,
s += b[i] * c[i] would still be executed on the accelerator?  What's
this omp target map stuff?  Just guessing from the names shouldn't it be

>     {
>       #pragma omp target map(from: c) map(from: b) map(tofrom:s)
>       for (i = 0; i < 1024; i++)
>         s += b[i] * c[i];
>       #pragma omp target update from(b, v)
>     }

that is, b and c are read, s is read and written.  Not sure what the
last pragma should even do ... (sync and wait so following code could
read from b and v?)

Coming from the HSA side I'd like to see that we can easily auto-accelerate
(as we auto-vectorize) regular OMP code like

   #pragma omp parallel for
    for (i = 0; i < 1024; i++)
       s += b[i] * c[i];

but as we lower this stuff very early we'd have to lower it as omp target?
Or can we make the libgomp interfacing for the workers the same so
we can easily modify them later?

With HSA we don't have to bother about accelerator memory handling
because of the unified address space and the (apparent) cache coherency.
So for HSA it would have been enough to have the omp parallel for
getting a descriptor instead of a function pointer where possible accelerator
implementations are queued up for use by the OMP scheduler.

That is, AFAIK all the omp target stuff is decoupled from scheduling "regular"
CPU OMP tasks?  And omp target implies a parallel region following, so
it's not just additional hints?

From the accelerator BOF video I gather we agreed on using the GOMP
representation as unified middle-end.  What I didn't get is whether we
agreed on libgomp being the unified single runtime (that eventually
dispatches to accelerator specific runtimes, opened via dlopen)?

Thanks,
Richard.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2013-08-28 12:56             ` Richard Biener
@ 2013-08-28 15:26               ` Jakub Jelinek
  2013-08-28 17:03                 ` Richard Biener
  2013-08-28 18:43                 ` Torvald Riegel
  0 siblings, 2 replies; 56+ messages in thread
From: Jakub Jelinek @ 2013-08-28 15:26 UTC (permalink / raw)
  To: Richard Biener
  Cc: Michael V. Zolotukhin, Kirill Yukhin, Richard Henderson,
	GCC Development, Torvald Riegel

On Wed, Aug 28, 2013 at 12:39:00PM +0200, Richard Biener wrote:
> > So, here is the original code:
> >
> >   #pragma omp declare target
> >   int v = 6;
> >   int tgt ()
> >   {
> >     #pragma omp atomic update
> >       v++;
> >     return 0;
> >   }
> >   #pragma omp end declare target
> >
> >   float
> >   bar (int x, int y, int z)
> >   {
> >     float b[1024], c[1024], s = 0;
> >     int i, j;
> >     baz (b, c, x);
> >     #pragma omp target data map(to: b)
> >     {
> >       #pragma omp target map(tofrom: c) map(from:s)
> >       for (i = 0; i < 1024; i++)
> >         tgt (), s += b[i] * c[i];
> >       #pragma omp target update from(b, v)
> >     }
> >     return s;
> >   }
> 
> You could even simplify this more by removing tgt and its use?  That is,
> s += b[i] * c[i] would still be executed on the accelerator?

Sure.  The intent of the testcase was to test various stuff from the OpenMP
4.0 accelerator support.

>  What's
> this omp target map stuff?  Just guessing from the names shouldn't it be

map clauses tell the compiler what objects should be mapped into the target
device address space and what kind of operation it is
(to/from/alloc/tofrom).  They are either explicit, or implicit (added during
gimplification, the implicit ones are always tofrom), and can be either
variables or array sections (special syntax to tell that you e.g. want to
map an array of some length from some pointer).  The runtime performs
the allocation and the corresponding copying (if any is requested) only
when the object isn't mapped already.
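
For example, an explicit array-section map in OpenMP 4.0 syntax looks like:

void
scale (float *a, int n)
{
  /* Map n floats starting at a[0]; without the clause only the pointer
     itself would get an implicit map(tofrom:a).  */
  #pragma omp target map(tofrom: a[0:n])
  for (int i = 0; i < n; i++)
    a[i] *= 2.0f;
}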
> 
> >     {
> >       #pragma omp target map(from: c) map(from: b) map(tofrom:s)
> >       for (i = 0; i < 1024; i++)
> >         s += b[i] * c[i];
> >       #pragma omp target update from(b, v)
> >     }
> 
> that is, b and c are read, s is read and written.  Not sure what the
> last pragma should even do ... (sync and wait so following code could
> read from b and v?)

to: is copy to the device, from: is copy back from device to host at the end
of the construct, tofrom: both, alloc: no copying.  #pragma omp target data
construct is just about doing the mappings and then executing some further
host code, and undoing that at the end, while #pragma omp target is
about doing the mappings, and then executing some code on the target,
then undoing that at the end.

#pragma omp target update is just explicit copying between target and host,
called from host code.  Sure, the above can be changed like you're
mentioning, the point of the testcase was to test all of the actions.

> Coming from the HSA side I'd like to see that we can easily auto-accelerate
> (as we auto-vectorize) regular OMP code like
> 
>    #pragma omp parallel for
>     for (i = 0; i < 1024; i++)
>        s += b[i] * c[i];
> 
> but as we lower this stuff very early we'd have to lower it as omp target?
> Or can we make the libgomp interfacing for the workers the same so
> we can easily modify them later?

The APIs to the runtime library will be there, so guess you can do something
like our auto-parallelization does now, or you could of course insert
OMP_TARGET etc. early on (during gimplification at latest though); the big
question is how would you find out if auto-acceleration is desirable or not.
In any case, my focus for the time being is on the explicit acceleration
(along with all the other OpenMP 4.0 stuff that still needs doing).

> With HSA we don't have to bother about accelerator memory handling
> because of the unified address space and the (apparent) cache coherency.
> So for HSA it would have been enough to have the omp parallel for
> getting a descriptor instead of a function pointer where possible accelerator
> implementations are queued up for use by the OMP scheduler.
> 
> That is, AFAIK all the omp target stuff is decoupled from scheduling "regular"
> CPU OMP tasks?  And omp target implies a parallel region following, so
> it's not just additional hints?

In OpenMP 4.0, omp target is separate from omp teams (I guess this is mainly
NVidia HW specific stuff, for others I guess we'll just use one team),
distribute (something like omp for to parallelize code across teams),
then parallel, then perhaps some worksharing inside of it and then possibly
simd.  So you can have even combined construct, like:
#pragma omp target teams distribute parallel for simd
for (i = 0; i < 10000000; i++)
  do_something (i);
which for non-NVidia would just have one team on the accelerator,
parallelized across all target device cores and vectorized, while for NVidia
would be parallelized over some set of teams, distributed across them,
parallelized across cores in each of the teams, workshared across that, and
vectorized.

If you have just:
#pragma omp target
do_something ();
then do_something () is only executed on one thread on the accelerator.
> 
> From the accelerator BOF video I gather we agreed on using the GOMP
> representation as unified middle-end.  What I didn't get is whether we
> agreed on libgomp being the unified single runtime (that eventually
> dispatches to accelerator specific runtimes, opened via dlopen)?

I guess that is up to discussions.  It can be e.g. that libgomp library
dlopens libgomp specific plugins, or that those plugins are written to be
usable by more libraries (libopenacc, etc.), or some code for those plugins
is shared.
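
As a sketch of the dlopen variant (the directory, the symbol name and the
device-query interface are placeholders, nothing here is an agreed ABI):

#include <dirent.h>
#include <dlfcn.h>
#include <stdio.h>
#include <string.h>

/* Scan a plugin directory and ask each shared object how many devices
   it can drive.  */
static void
load_plugins (const char *dir)
{
  DIR *d = opendir (dir);
  struct dirent *e;
  if (d == NULL)
    return;
  while ((e = readdir (d)) != NULL)
    if (strstr (e->d_name, ".so") != NULL)
      {
        char path[1024];
        snprintf (path, sizeof (path), "%s/%s", dir, e->d_name);
        void *h = dlopen (path, RTLD_LAZY);
        if (h == NULL)
          continue;
        /* Hypothetical entry point each plugin would export.  */
        int (*get_num_devices) (void)
          = (int (*) (void)) dlsym (h, "gomp_plugin_get_num_devices");
        if (get_num_devices != NULL)
          printf ("%s: %d device(s)\n", e->d_name, get_num_devices ());
        else
          dlclose (h);
      }
  closedir (d);
}
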
Important is also what target "libraries" we actually provide, e.g. OpenMP
4.0 says basically that from target code you can only call code declared
or defined in #pragma omp declare target ... #pragma omp end declare target
region, but it pretty much assumes that you can use various omp_* library
calls, various #pragma omp ... directives (which probably need some library
implementation) and stuff like printf and various math library functions.

In the Intel MIC case (the only thing I've looked briefly at for how the
offloading works - the COI library) you can load binaries and shared
libraries either from files or from host memory image, so e.g. you can
embed the libgomp library, some kind of libm and some kind of libc
(would that be glibc, newlib, something else?) compiled for the target
into some data section inside of the plugin or something
(or load it from files of course).  No idea how you do this in the
HSAIL case, or PTX.

	Jakub

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2013-08-28 15:26               ` Jakub Jelinek
@ 2013-08-28 17:03                 ` Richard Biener
  2013-08-28 17:15                   ` Jakub Jelinek
  2013-08-28 18:54                   ` Torvald Riegel
  2013-08-28 18:43                 ` Torvald Riegel
  1 sibling, 2 replies; 56+ messages in thread
From: Richard Biener @ 2013-08-28 17:03 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Michael V. Zolotukhin, Kirill Yukhin, Richard Henderson,
	GCC Development, Torvald Riegel

On Wed, Aug 28, 2013 at 1:06 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Wed, Aug 28, 2013 at 12:39:00PM +0200, Richard Biener wrote:
>> > So, here is the original code:
>> >
>> >   #pragma omp declare target
>> >   int v = 6;
>> >   int tgt ()
>> >   {
>> >     #pragma omp atomic update
>> >       v++;
>> >     return 0;
>> >   }
>> >   #pragma omp end declare target
>> >
>> >   float
>> >   bar (int x, int y, int z)
>> >   {
>> >     float b[1024], c[1024], s = 0;
>> >     int i, j;
>> >     baz (b, c, x);
>> >     #pragma omp target data map(to: b)
>> >     {
>> >       #pragma omp target map(tofrom: c) map(from:s)
>> >       for (i = 0; i < 1024; i++)
>> >         tgt (), s += b[i] * c[i];
>> >       #pragma omp target update from(b, v)
>> >     }
>> >     return s;
>> >   }
>>
>> You could even simplify this more by removing tgt and its use?  That is,
>> s += b[i] * c[i] would still be executed on the accelerator?
>
> Sure.  The intent of the testcase was to test various stuff from the OpenMP
> 4.0 accelerator support.
>
>>  What's
>> this omp target map stuff?  Just guessing from the names shouldn't it be
>
> map clauses tell the compiler what objects should be mapped into the target
> device address space and what kind of operation it is
> (to/from/alloc/tofrom).  They are either explicit, or implicit (added during
> gimplification, the implicit ones are always tofrom), and can be either
> variables or array sections (special syntax to tell that you e.g. want to
> map an array of some length from some pointer).  The runtime performs
> the allocation and the corresponding copying (if any is requested) only
> when the object isn't mapped already.
>>
>> >     {
>> >       #pragma omp target map(from: c) map(from: b) map(tofrom:s)
>> >       for (i = 0; i < 1024; i++)
>> >         s += b[i] * c[i];
>> >       #pragma omp target update from(b, v)
>> >     }
>>
>> that is, b and c are read, s is read and written.  Not sure what the
>> last pragma should even do ... (sync and wait so following code could
>> read from b and v?)
>
> to: is copy to the device, from: is copy back from device to host at the end
> of the construct, tofrom: both, alloc: no copying.  #pragma omp target data
> construct is just about doing the mappings and then executing some further
> host code, and undoing that at the end, while #pragma omp target is
> about doing the mappings, and then executing some code on the target,
> then undoing that at the end.
>
> #pragma omp target update is just explicit copying between target and host,
> called from host code.  Sure, the above can be changed like you're
> mentioning, the point of the testcase was to test all of the actions.
>
>> Coming from the HSA side I'd like to see that we can easily auto-accelerate
>> (as we auto-vectorize) regular OMP code like
>>
>>    #pragma omp parallel for
>>     for (i = 0; i < 1024; i++)
>>        s += b[i] * c[i];
>>
>> but as we lower this stuff very early we'd have to lower it as omp target?
>> Or can we make the libgomp interfacing for the workers the same so
>> we can easily modify them later?
>
> The APIs to the runtime library will be there, so guess you can do something
> like our auto-parallelization does now, or you could of course insert
> OMP_TARGET etc. early on (during gimplification at latest though); the big
> question is how would you find out if auto-acceleration is desirable or not.
> In any case, my focus for the time being is on the explicit acceleration
> (along with all the other OpenMP 4.0 stuff that still needs doing).
>
>> With HSA we don't have to bother about accelerator memory handling
>> because of the unified address space and the (apparent) cache coherency.
>> So for HSA it would have been enough to have the omp parallel for
>> getting a descriptor instead of a function pointer where possible accelerator
>> implementations are queued up for use by the OMP scheduler.
>>
>> That is, AFAIK all the omp target stuff is decoupled from scheduling "regular"
>> CPU OMP tasks?  And omp target implies a parallel region following, so
>> it's not just additional hints?
>
> In OpenMP 4.0, omp target is separate from omp teams (I guess this is mainly
> NVidia HW specific stuff, for others I guess we'll just use one team),
> distribute (something like omp for to parallelize code across teams),
> then parallel, then perhaps some worksharing inside of it and then possibly
> simd.  So you can have even combined construct, like:
> #pragma omp target teams distribute parallel for simd
> for (i = 0; i < 10000000; i++)
>   do_something (i);
> which for non-NVidia would just have one team on the accelerator,
> parallelized across all target device cores and vectorized, while for NVidia
> would be parallelized over some set of teams, distributed across them,
> parallelized across cores in each of the teams, workshared across that, and
> vectorized.
>
> If you have just:
> #pragma omp target
> do_something ();
> then do_something () is only executed on one thread on the accelerator.
>>
>> From the accelerator BOF video I gather we agreed on using the GOMP
>> representation as unified middle-end.  What I didn't get is whether we
>> agreed on libgomp being the unified single runtime (that eventually
>> dispatches to accelerator specific runtimes, opened via dlopen)?
>
> I guess that is up for discussion.  It can be e.g. that libgomp library
> dlopens libgomp specific plugins, or that those plugins are written to be
> usable by more libraries (libopenacc, etc.), or some code for those plugins
> is shared.
> Also important is what target "libraries" we actually provide, e.g. OpenMP
> 4.0 says basically that from target code you can only call code declared
> or defined in #pragma omp declare target ... #pragma omp end declare target
> region, but it pretty much assumes that you can use various omp_* library
> calls, various #pragma omp ... directives (which probably need some library
> implementation) and stuff like printf and various math library functions.

My thought was that we need to have control over scheduling and thus have
a single runtime to be able to execute the following in parallel on the
accelerator and the CPU:

#pragma omp parallel
{
#pragma omp target
   for (;;)
     ...
#pragma omp for
  for (;;)
     ...
}
#pragma omp wait

that is, the omp target dispatch may not block the CPU.  I can hardly
see how you can make multiple runtimes co-exist from the GCC code
generation side.  Of course, depending on the actual accelerator runtime,
doing that in the libgomp scheduling code may be equally hard (or
even impossible).  For HSA I envisioned simply adding a single
libgomp 'team' on top of the available CPU cores that ends up doing
the dispatch / wait with the HSA runtime.

So here I merely wonder how to make the interfacing to libgomp
generic enough to cover all bits (going to the extreme to eventually
allow the libgomp runtime to be replaced by one that uses the
accelerator runtime scheduling code if that turns out to be more
powerful - the HSA one at least looks like that, in theory).

> In the Intel MIC case (the only thing I've looked briefly at for how the
> offloading works - the COI library) you can load binaries and shared
> libraries either from files or from host memory image, so e.g. you can
> embed the libgomp library, some kind of libm and some kind of libc
> (would that be glibc, newlib, something else?) compiled for the target
> into some data section inside of the plugin or something
> (or load it from files of course).  No idea how you do this in the
> HSAIL case, or PTX.

For HSA you can do arbitrary calls to CPU code (that will then of course
execute on the CPU).

Richard.

>         Jakub

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2013-08-28 17:03                 ` Richard Biener
@ 2013-08-28 17:15                   ` Jakub Jelinek
  2013-08-29 21:09                     ` Richard Biener
  2013-08-28 18:54                   ` Torvald Riegel
  1 sibling, 1 reply; 56+ messages in thread
From: Jakub Jelinek @ 2013-08-28 17:15 UTC (permalink / raw)
  To: Richard Biener
  Cc: Michael V. Zolotukhin, Kirill Yukhin, Richard Henderson,
	GCC Development, Torvald Riegel

On Wed, Aug 28, 2013 at 01:21:53PM +0200, Richard Biener wrote:
> My thought was that we need to have control over scheduling and thus have
> a single runtime to be able to execute the following in parallel on the
> accelerator and the CPU:
> 
> #pragma omp parallel
> {
> #pragma omp target
>    for (;;)
>      ...
> #pragma omp for
>   for (;;)
>      ...
> }
> #pragma omp wait
> 
> that is, the omp target dispatch may not block the CPU.  I can hardly

OpenMP #pragma omp target blocks the host CPU until the accelerator code
finishes.  So if the goal is to spawn some accelerator code in parallel with
parallelized host code, you'd need to make the code more complicated.
I guess you could
#pragma omp parallel
{
#pragma omp single
#pragma omp target
{
#pragma omp parallel
...
}
#pragma omp for schedule(dynamic, N)
for (;;)
...
}
or similar, then only one of the host parallel threads would spawn the
target code, wait for it to be done and other threads in the mean time
would do the worksharing (and the dynamic schedule would make sure that
if the target region took long time, then no work or almost no work would be
scheduled for the thread executing the target region).

> > In the Intel MIC case (the only thing I've looked briefly at for how the
> > offloading works - the COI library) you can load binaries and shared
> > libraries either from files or from host memory image, so e.g. you can
> > embed the libgomp library, some kind of libm and some kind of libc
> > (would that be glibc, newlib, something else?) compiled for the target
> > into some data section inside of the plugin or something
> > (or load it from files of course).  No idea how you do this in the
> > HSAIL case, or PTX.
> 
> For HSA you can do arbitrary calls to CPU code (that will then of course
> execute on the CPU).

GCC compiles into assembly or bytecode for HSAIL, right, and that then is
further processed by some (right now proprietary?) blob.  The question is
whether this allows linking of multiple HSAIL bytecode objects/libraries, etc.
Say you have something providing (a subset of) C library, math library,
libgomp, then say for OpenMP one host shared library provides some
#pragma omp declare target
...
#pragma omp end declare target
routine, and another shared library uses #pragma omp target and calls that
routine from there.  So, I'd assume you have some HSAIL assembly/bytecode
in each of the shared libraries; can you link that together and tell the
runtime to execute some (named?) routine in there?

	Jakub

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2013-08-28 15:26               ` Jakub Jelinek
  2013-08-28 17:03                 ` Richard Biener
@ 2013-08-28 18:43                 ` Torvald Riegel
  1 sibling, 0 replies; 56+ messages in thread
From: Torvald Riegel @ 2013-08-28 18:43 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Richard Biener, Michael V. Zolotukhin, Kirill Yukhin,
	Richard Henderson, GCC Development

On Wed, 2013-08-28 at 13:06 +0200, Jakub Jelinek wrote:
> On Wed, Aug 28, 2013 at 12:39:00PM +0200, Richard Biener wrote:
> > From the accelerator BOF video I gather we agreed on using the GOMP
> > representation as unified middle-end.  What I didn't get is whether we
> > agreed on libgomp being the unified single runtime (that eventually
> > dispatches to accelerator specific runtimes, opened via dlopen)?
> 
> I guess that is up for discussion.

Yes.  We didn't have time to discuss this; also, my impression was that
we (meaning people in the room) weren't ready to discuss this yet
because there were too many open questions, including what the particular
platforms/archs that we would be targeting would actually look like
(e.g., on the Linux userspace side).

> In the Intel MIC case (the only thing I've looked briefly at for how the
> offloading works - the COI library) you can load binaries and shared
> libraries either from files or from host memory image, so e.g. you can
> embed the libgomp library, some kind of libm and some kind of libc
> (would that be glibc, newlib, something else?) compiled for the target
> into some data section inside of the plugin or something
> (or load it from files of course).

That's another interesting question: how do we deploy.  The "static
linking" into the plugin might be worthwhile if the number of "target
libraries" is small.  But if there are several, and/or if we want the
libraries to be decoupled from each other (e.g., to make updates easier),
then we'd need a mechanism to load them.

> No idea how you do this in the
> HSAIL case, or PTX.

Yes, this is something we should discuss at some point, for each
platform that we want accelerators to be supported on.  I'm not sure
whether GCC is the optimal place for this conversation (GCC would rather
be a user of whatever is built), but maybe it is in absence of
alternatives :-)
Looking at the Linux side of this is something that I'm interested in.

Torvald

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2013-08-28 17:03                 ` Richard Biener
  2013-08-28 17:15                   ` Jakub Jelinek
@ 2013-08-28 18:54                   ` Torvald Riegel
  1 sibling, 0 replies; 56+ messages in thread
From: Torvald Riegel @ 2013-08-28 18:54 UTC (permalink / raw)
  To: Richard Biener
  Cc: Jakub Jelinek, Michael V. Zolotukhin, Kirill Yukhin,
	Richard Henderson, GCC Development

On Wed, 2013-08-28 at 13:21 +0200, Richard Biener wrote:
> On Wed, Aug 28, 2013 at 1:06 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> > On Wed, Aug 28, 2013 at 12:39:00PM +0200, Richard Biener wrote:
> >> From the accelerator BOF video I gather we agreed on using the GOMP
> >> representation as unified middle-end.  What I didn't get is whether we
> >> agreed on libgomp being the unified single runtime (that eventually
> >> dispatches to accelerator specific runtimes, opened via dlopen)?
> >
> > I guess that is up for discussion.  It can be e.g. that libgomp library
> > dlopens libgomp specific plugins, or that those plugins are written to be
> > usable by more libraries (libopenacc, etc.), or some code for those plugins
> > is shared.
> > Also important is what target "libraries" we actually provide, e.g. OpenMP
> > 4.0 says basically that from target code you can only call code declared
> > or defined in #pragma omp declare target ... #pragma omp end declare target
> > region, but it pretty much assumes that you can use various omp_* library
> > calls, various #pragma omp ... directives (which probably need some library
> > implementation) and stuff like printf and various math library functions.
> 
> My thought was that we need to have control over scheduling and thus have
> a single runtime to be able to execute the following in parallel on the
> accelerator and the CPU:
> 
> #pragma omp parallel
> {
> #pragma omp target
>    for (;;)
>      ...
> #pragma omp for
>   for (;;)
>      ...
> }
> #pragma omp wait
> 
> that is, the omp target dispatch may not block the CPU.

And that's not the only combination we have to consider.  ISO C++ will
come up with something eventually (both for parallelism and likely as
well for concurrency), and ISO C has a study group (CPLEX) looking at
Cilk with some OpenMP mixed in.  So we will have different programming
abstractions (OpenMP, some Cilk-like, perhaps some kind of lightweight
threads for concurrency, ...) to support, and currently they all use
different schedulers.  There are conversations going on in the
respective ISO C++ and C study groups about how to tame the scheduler
side of this, but nothing tangible has emerged from that so far.

> I can hardly
> see how you can make multiple runtimes co-exist from the GCC code
> generation side.

Perhaps having several runtimes is not as much of a problem as
potentially having several runtimes that can't agree on the same
semantics of how to share resources, and on what parallel/concurrent tasks
look like.  IOW, we might have to do more "unification" work in the
intermediate representation too (i.e., current GOMP + something +
changes).

> > In the Intel MIC case (the only thing I've looked briefly at for how the
> > offloading works - the COI library) you can load binaries and shared
> > libraries either from files or from host memory image, so e.g. you can
> > embed the libgomp library, some kind of libm and some kind of libc
> > (would that be glibc, newlib, something else?) compiled for the target
> > into some data section inside of the plugin or something
> > (or load it from files of course).  No idea how you do this in the
> > HSAIL case, or PTX.
> 
> For HSA you can do arbitrary calls to CPU code (that will then of course
> execute on the CPU).

Right, which means that we would have dispatch in both directions,
controlled by some part of the HSA runtime.  This would probably also
mean that the parts executed on the CPU can run in parallel, so
the HSA runtime would probably want to do that by calling back into the
generic scheduler code responsible for all parallel/concurrent tasks.

Torvald

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2013-08-28 11:21                         ` Jakub Jelinek
@ 2013-08-29 10:44                           ` Michael V. Zolotukhin
  2013-09-10 15:02                           ` Michael V. Zolotukhin
  1 sibling, 0 replies; 56+ messages in thread
From: Michael V. Zolotukhin @ 2013-08-29 10:44 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Kirill Yukhin, Richard Henderson, gcc, triegel

> Perhaps instead of passing array of { void *hostaddr; size_t length; char kind; }
> and length we could pass 3 arrays and length (the same for all of them).
> I can see 2 advantages of doing that:
> 1) the sizes are often constant and the kinds are always constant, so
> we could often allocate those last 2 or just last array in .rodata, wouldn't
> need to initialize it dynamically
> 2) for the host fallback, we could just pass the first array unmodified as
> the .omp_target_data structure, no need to copy the host addresses
I agree with both points - very nice idea.
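
For illustration, the two calling conventions side by side (just a sketch;
the names are made up):

  /* 1) One array of structs - the sizes and kinds have to be built
     at run time:  */
  struct map_entry { void *hostaddr; size_t length; char kind; };
  void GOMP_target_data_v1 (struct map_entry *entries, size_t mapnum);

  /* 2) Three parallel arrays - the sizes/kinds arrays are often
     all-constant and can live in .rodata, and hostaddrs can double as
     the .omp_target_data structure for the host fallback:  */
  void GOMP_target_data_v2 (void **hostaddrs, size_t *sizes,
			    unsigned char *kinds, size_t mapnum);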

Michael
> 	Jakub

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2013-08-28 17:15                   ` Jakub Jelinek
@ 2013-08-29 21:09                     ` Richard Biener
  0 siblings, 0 replies; 56+ messages in thread
From: Richard Biener @ 2013-08-29 21:09 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Michael V. Zolotukhin, Kirill Yukhin, Richard Henderson,
	GCC Development, Torvald Riegel

On Wed, Aug 28, 2013 at 1:37 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Wed, Aug 28, 2013 at 01:21:53PM +0200, Richard Biener wrote:
>> My thought was that we need to have control over scheduling and thus have
>> a single runtime to be able to execute the following in parallel on the
>> accelerator and the CPU:
>>
>> #pragma omp parallel
>> {
>> #pragma omp target
>>    for (;;)
>>      ...
>> #pragma omp for
>>   for (;;)
>>      ...
>> }
>> #pragma omp wait
>>
>> that is, the omp target dispatch may not block the CPU.  I can hardly
>
> OpenMP #pragma omp target blocks the host CPU until the accelerator code
> finishes.  So if the goal is to spawn some accelerator code in parallel with
> parallelized host code, you'd need to make the code more complicated.
> I guess you could
> #pragma omp parallel
> {
> #pragma omp single
> #pragma omp target
> {
> #pragma omp parallel
> ...
> }
> #pragma omp for schedule(dynamic, N)
> for (;;)
> ...
> }
> or similar, then only one of the host parallel threads would spawn the
> target code, wait for it to be done and other threads in the mean time
> would do the worksharing (and the dynamic schedule would make sure that
> if the target region took long time, then no work or almost no work would be
> scheduled for the thread executing the target region).
>
>> > In the Intel MIC case (the only thing I've looked briefly at for how the
>> > offloading works - the COI library) you can load binaries and shared
>> > libraries either from files or from host memory image, so e.g. you can
>> > embed the libgomp library, some kind of libm and some kind of libc
>> > (would that be glibc, newlib, something else?) compiled for the target
>> > into some data section inside of the plugin or something
>> > (or load it from files of course).  No idea how you do this in the
>> > HSAIL case, or PTX.
>>
>> For HSA you can do arbitrary calls to CPU code (that will then of course
>> execute on the CPU).
>
> GCC compiles into assembly or bytecode for HSAIL, right, and that then is
> further processed by some (right now proprietary?) blob.  The question is
> whether this allows linking of multiple HSAIL bytecode objects/libraries, etc.
> Say you have something providing (a subset of) C library, math library,
> libgomp, then say for OpenMP one host shared library provides some
> #pragma omp declare target
> ...
> #pragma omp end declare target
> routine, and another shared library uses #pragma omp target and calls that
> routine from there.  So, I'd assume you have some HSAIL assembly/bytecode
> in each of the shared libraries; can you link that together and tell the
> runtime to execute some (named?) routine in there?

(Un)fortunately the HSA runtime spec doesn't talk about the whole relocation
business, so for now we end up passing all object addresses as arguments
to the HSAIL code and accessing everything indirectly.  For the above case it
means you have to glue everything together manually in some weird ways.
Eventually the HSA folks will need to think about this, of course.

Richard.

>         Jakub

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2013-08-28 11:21                         ` Jakub Jelinek
  2013-08-29 10:44                           ` Michael V. Zolotukhin
@ 2013-09-10 15:02                           ` Michael V. Zolotukhin
  2013-09-10 15:15                             ` Jakub Jelinek
  1 sibling, 1 reply; 56+ messages in thread
From: Michael V. Zolotukhin @ 2013-09-10 15:02 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Kirill Yukhin, Richard Henderson, gcc, triegel

Hi Jakub,
I continued playing with plugins for libgomp, and I have several questions
regarding that:

1) Would it be ok, at least for the beginning, if we'd look for plugins in a
folder specified by some environment variable?  A plugin would be considered
suitable if it's named "*.so" and if dlsym finds a certain set of functions
in it (e.g. "device_available", "offload_function" - names are subject to
change, of course).

2) We need to perform all libgomp initialization once at the first entry to
libgomp.  Should we add corresponding checks to all GOMP_* routines or should
the compiler add calls to GOMP_init (which also needs to be introduced) by
itself before all other calls to libgomp?

3) Also, would it be ok if we store libgomp status (already initialized or not)
in some static variable?  I haven't seen such examples in the existing code
base, so I'm not sure it's a good way to go.

4) We'll need to store some information about available devices:
  - a search tree with data about mapping
  - corresponding plugin handler
  - handlers for functions from the corresponding plugin
  - maybe some other info
I guess it's a bad idea to store all this data in some static-sized global
variables, and it's better to dynamically allocate memory for that.  But it
implies that we need to care about deallocation, which should be called at some
moment at the program end.  Shouldn't we introduce something like
GOMP_deinitialize and insert calls to it during the compilation?

5) We mentioned a tree-like data structure for storing info about
mapping.  Am I getting it correctly that currently there is no such
data structure at all and we need to design and implement it from scratch?

--
Thanks, Michael

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2013-09-10 15:02                           ` Michael V. Zolotukhin
@ 2013-09-10 15:15                             ` Jakub Jelinek
  2013-09-10 15:31                               ` Michael V. Zolotukhin
  0 siblings, 1 reply; 56+ messages in thread
From: Jakub Jelinek @ 2013-09-10 15:15 UTC (permalink / raw)
  To: Michael V. Zolotukhin; +Cc: Kirill Yukhin, Richard Henderson, gcc, triegel

On Tue, Sep 10, 2013 at 07:01:26PM +0400, Michael V. Zolotukhin wrote:
> I continued playing with plugins for libgomp, and I have several questions
> regarding that:
> 
> 1) Would it be ok, at least for the beginning, if we'd look for plugins in a
> folder specified by some environment variable?  A plugin would be considered
> suitable if it's named "*.so" and if dlsym finds a certain set of functions
> in it (e.g. "device_available", "offload_function" - names are subject to
> change, of course).

Trying to dlopen random libraries is bad, so when libgomp dlopens something,
it had better be a plugin and not something else.
I'd suggest that the name should match libgomp-plugin-*.so.1 or a
similar wildcard.

> 2) We need to perform all libgomp initialization once at the first entry to
> libgomp.  Should we add corresponding checks to all GOMP_* routines or should
> the compiler add calls to GOMP_init (which also needs to be introduced) by
> itself before all other calls to libgomp?

Why?  If this is the plugin stuff, then IMNSHO it should be initialized only
on the first call to GOMP_target{,_data,_update} or omp_get_num_devices.
Just use pthread_once to initialize it just once.

> 3) Also, would it be ok if we store libgomp status (already initialized or not)
> in some static variable?  I haven't seen such examples in the existing code
> base, so I'm not sure it's a good way to go.

Sure.

> 4) We'll need to store some information about available devices:
>   - a search tree with data about mapping

For the search tree, I was going to actually implement it myself, but got
interrupted this week with work on UDRs again.  I wanted to write just
temporarily a dummy device that would execute on the host, but remap all
memory to something allocated elsewhere in the same address space by malloc.
Sure, #pragma omp declare target vars wouldn't work that way, but otherwise
it could work fine.  Each device that would have a flag set that it doesn't
have shared address space between host and device (I believe HSAIL might have
shared address space, host fallback of course has shared address space,
the rest do not?) would have its own splay tree plus some host mutex to
guard accesses to the tree.

>   - corresponding plugin handler
>   - handlers for functions from the corresponding plugin
>   - maybe some other info

> I guess it's a bad idea to store all this data in some static-sized global
> variables, and it's better to dynamically allocate memory for that.  But it
> implies that we need to care about deallocation, which should be called at some
> moment at the program end.  Shouldn't we introduce something like
> GOMP_deinitialize and insert calls to it during the compilation?

We don't need to care about deallocation, if it is not per-host-thread
stuff, but per-device stuff.  If we wanted, we could add some magic function
for valgrind that could be called (like e.g. glibc has), but it is
definitely not very important and we don't do it right now for parallels
etc.

> 5) We mentioned a tree-like data structure for storing info about
> mapping.  Am I getting it correctly that currently there is no such
> data structure at all and we need to design and implement it from scratch?

See above.

	Jakub

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2013-09-10 15:15                             ` Jakub Jelinek
@ 2013-09-10 15:31                               ` Michael V. Zolotukhin
  2013-09-10 15:36                                 ` Jakub Jelinek
  0 siblings, 1 reply; 56+ messages in thread
From: Michael V. Zolotukhin @ 2013-09-10 15:31 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Kirill Yukhin, Richard Henderson, gcc, triegel

> Trying to dlopen random libraries is bad, so when libgomp dlopens something,
> it had better be a plugin and not something else.
> I'd suggest that the name should match libgomp-plugin-*.so.1 or a
> similar wildcard.
Ok, sounds reasonable.

> Why?  If this is the plugin stuff, then IMNSHO it should be initialized only
> on the first call to GOMP_target{,_data,_update} or omp_get_num_devices.
> Just use pthread_once to initialize it just once.
Ok, since we don't need to care about deallocation, that seems reasonable too.

> > 4) We'll need to store some information about available devices:
> >   - a search tree with data about mapping
> 
> For the search tree, I was going to actually implement it myself, but got
> interrupted this week with work on UDRs again.  I wanted to write just
> temporarily a dummy device that would execute on the host, but remap all
> memory to something allocated elsewhere in the same address space by malloc.
> Sure, #pragma omp declare target vars wouldn't work that way, but otherwise
> it could work fine.  Each device that would have a flag set that it doesn't
> have shared address space between host and device (I believe HSAIL might have
> shared address space, host fallback of course has shared address space,
> the rest do not?) would have its own splay tree plus some host mutex to
> guard accesses to the tree.
Ok.  Do you need all the plugin infrastructure ready for that, or could you
experiment with a dummy device without plugins?

Michael
> 	Jakub

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2013-09-10 15:31                               ` Michael V. Zolotukhin
@ 2013-09-10 15:36                                 ` Jakub Jelinek
  2013-09-10 15:38                                   ` Michael V. Zolotukhin
  0 siblings, 1 reply; 56+ messages in thread
From: Jakub Jelinek @ 2013-09-10 15:36 UTC (permalink / raw)
  To: Michael V. Zolotukhin; +Cc: Kirill Yukhin, Richard Henderson, gcc, triegel

On Tue, Sep 10, 2013 at 07:30:53PM +0400, Michael V. Zolotukhin wrote:
> > > 4) We'll need to store some information about available devices:
> > >   - a search tree with data about mapping
> > 
> > For the search tree, I was going to actually implement it myself, but got
> > interrupted this week with work on UDRs again.  I wanted to write just
> > temporarily a dummy device that would execute on the host, but remap all
> > memory to something allocated elsewhere in the same address space by malloc.
> > Sure, #pragma omp declare target vars wouldn't work that way, but otherwise
> > it could work fine.  Each device that would have a flag set that it doesn't
> > have shared address space between host and device (I believe HSAIL might have
> > shared address space, host fallback of course has shared address space,
> > the rest do not?) would have its own splay tree plus some host mutex to
> > guard accesses to the tree.
> Ok.  Do you need all the plugin infrastructure ready for that, or could you
> experiment with a dummy device without plugins?

I don't need that infrastructure for that, I meant just a hack such that, say,
for OMP_DEFAULT_DEVICE=257 I'd use this hackish device, and store the splay
tree root and lock in a global var with a comment that in the future it will
belong in the per-device structure.

	Jakub

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2013-09-10 15:36                                 ` Jakub Jelinek
@ 2013-09-10 15:38                                   ` Michael V. Zolotukhin
  2013-09-13 11:30                                     ` Michael V. Zolotukhin
  0 siblings, 1 reply; 56+ messages in thread
From: Michael V. Zolotukhin @ 2013-09-10 15:38 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Kirill Yukhin, Richard Henderson, gcc, triegel

> I don't need that infrastructure for that, I meant just a hack such that, say,
> for OMP_DEFAULT_DEVICE=257 I'd use this hackish device, and store the splay
> tree root and lock in a global var with a comment that in the future it will
> belong in the per-device structure.
Okay, hopefully I'll have something committable on the infrastructure side
soon as well.

Michael
> 	Jakub

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2013-08-27 19:54                       ` Michael V. Zolotukhin
  2013-08-28 11:21                         ` Jakub Jelinek
@ 2013-09-13  9:35                         ` Michael Zolotukhin
  2013-09-13 10:52                           ` Kirill Yukhin
                                             ` (2 more replies)
  1 sibling, 3 replies; 56+ messages in thread
From: Michael Zolotukhin @ 2013-09-13  9:35 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Kirill Yukhin, Richard Henderson, GCC Development, triegel

Hi Jakub et al.,
We prepared a draft design document for offloading support in GCC - could
you please take a look?  It is intended to give a common understanding of what
is going on in this part.

We might publish it on the GCC wiki, if that is ok.  And later we could fill it
in with more details if needed.

Here it is:
------------------------------
CONTENTS

1.  High level view on the compilation process with OpenMP plugins
1.1.  Compilation
1.2.  Linking
1.3.  Execution
2.  Linker plugins infrastructure
2.1.  Overview
2.2.  Multi-target support
3.  OpenMP pragma target handling in middle-end
4.  Runtime support in libGOMP
4.1.  General interface for offloading
4.2.  Maintaining info about mapped regions
4.3.  Preparing arguments for offloaded calls
4.4.  Plugins, usage of device-specific interfaces

1. HIGH LEVEL VIEW ON THE COMPILATION PROCESS WITH OPENMP PLUGINS

1.1.  Compilation

When the host version of GCC compiles a file, the following stages happen:
  * After OpenMP pragma lowering and expansion, a new outlined function with a
'target' attribute emerges - it will later be compiled by both the host and
target GCC to produce two versions (or N+1 versions in the case of N different
targets).
  * Expansion replaces pragmas with the corresponding calls to the runtime
library (libgomp).  These calls are preceded by initialization of special
structures containing arguments for the outlined routines - this is done
similarly to 'pragma omp parallel' processing.
  * Gimple for routines with the 'target' attribute is streamed into a special
section of the assembly output (similar to LTO sections).
  * Usual compilation continues, producing host-side assembly.
  * The assembler generates a FAT object, containing host-side code and Gimple
IR for the outlined functions (those marked with the 'target' attribute).

TODO: add something about routines and variables inside 'pragma declare target'.

1.2.  Linking

When all source files are compiled, the linker is invoked.  The linker is
passed a special option to invoke the OpenMP plugin.  The plugin is responsible
for producing target-side executables - for each target it calls the
corresponding target compiler and linker.
The target-side GCC is invoked to load Gimple IR from the .gnu.target_lto
sections of the FAT object and compile it to target-side objects which will
later be used by the target-side linker.

The host-side linker needs libgomp alongside standard libraries like libc/libm
to successfully resolve the symbols generated by the host compiler.  The
target-side linker needs CRT.O, containing the main routine for the
target-side executable, and target-specific versions of the standard
libraries.

As a result of this work, the plugin produces N target executables and exits,
allowing the host linker to continue its work and produce the host-side
executable.

TBD: Should the main routine always contain a message-waiting loop (like in the
COI implementation) or are other options also possible?
TBD: probably, it's better to have a separate plugin for each target than a
single OpenMP plugin.

1.3.  Execution

The host-side executable contains calls to the libgomp library, which mediates
all interactions with the target devices.
On loading, the executable calls GOMP_target_init from libgomp.so, which will
load the target executables onto the target devices and start them.  From this
moment, the devices are ready to execute the requested code and interact with
the main host process.

When a host-side program calls libgomp functions related to offloading,
libgomp decides whether it's profitable to offload, and which device to choose
for that.  In order to do that, libgomp calls the available plugins and checks
which devices are ready to execute offloaded code.  The available plugins
should be located in a specified folder and should implement a certain
interface.

Another important function of libgomp is host-target memory mapping and keeping
information about mapped regions and their types.

TBD: probably, it's better to call GOMP_target_init on the first attempt to
offload something to the given device.
TBD: probably, it's better to 'hard-code' the available plugins during the
build of libgomp (e.g., at the configure step).


2.  LINKER PLUGINS INFRASTRUCTURE

2.1.  Overview

When the -flto or -fopenmp option is given to the GCC driver, linker plugin
invocation is triggered.  The plugin claims the input files containing
.gnu.lto* or .gnu.target_lto* sections for further processing and creates a
resolution file.
After this preliminary work, LTO-wrapper is called.  It is responsible for the
sequential invocations of GCC.

The first call is needed to run WPA, which performs the usual LTO partitioning
as well as partitioning of the OpenMP-target sections.  WPA reads bytecode of:
  1) all functions and variables with "omp declare target" attribute;
  2) the outlined bodies of #pragma omp target turned into '*.ompfn' functions;
  3) all the types, symtab etc. needed for that;
from .gnu.target_lto* sections and stores them into an extra partition.

The second call invokes GCC on the partitioned ltrans files and produces the
LTO-optimized host-side executable.

The third call invokes the target-side GCC (which in turn would call a linker,
which could start LTO for the target-side code) and produces the target-side
executable.  The target GCC uses the lto1 frontend to read bytecode from the
OpenMP-target partition produced during the WPA stage.  It then generates
optimized code for the target and links it with the target ld.
This target executable is added to the host linker's input files and is placed
into a .rodata section of the host-side executable.

2.2.  Multi-target support

If several different targets are used for offloading, the .gnu.target_lto code
must be compiled for each of them.  In order to do that, several target-side
compilers need to be called.
LTO-wrapper scans a specified folder and runs every version of GCC located
there, assuming that these are the target-side compilers.

TBD: This scheme might need to be reconsidered.

3.  OPENMP PRAGMA TARGET HANDLING IN MIDDLE-END

Middle-end work is done in two OMP passes.  Specifically, the omp-lower pass:
  * Creates an outlined function with no body
  * Adds #pragma omp return at the end of the region
  * Creates an empty struct args_data
  * For each var referenced in clauses (e.g. int i):
    -  Adds an entry to the data_arr, data_sizes and data_kinds arrays
       describing this variable, its size and mapping type
    -  Adds an assignment before the call to the outlined function:
       args_data.i = &i
    -  Replaces uses of i with uses of args_data->i inside the region

Then, the omp-expand pass:
  * Moves the region to the outlined function
  * Adds a call to libGOMP to maybe offload this function:
    GOMP_target (condition /* evaluated expression from IF clause */,
		 device_no /* a number from DEVICE clause */,
		 foo, .foo.,
		 data_arr, data_sizes, data_kinds, 1 /* size of arrays */);

Finally, the Gimple of the outlined function and the needed parts of the symtab
are streamed to .gnu.target_lto* sections.
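
As a rough illustration of the above (pseudo-C: foo, the TOFROM constant and
the condition/device_no values are placeholders), a region like

  #pragma omp target map(tofrom: i)
    i++;

could be expanded into something like:

  /* Outlined function; its Gimple is also streamed for the target GCC.  */
  void foo (void *args)
  {
    struct { int *i; } *args_data = args;
    (*args_data->i)++;
    /* #pragma omp return */
  }

  /* At the call site: */
  void *data_arr[1]           = { &i };          /* args_data.i = &i */
  size_t data_sizes[1]        = { sizeof (i) };
  unsigned char data_kinds[1] = { TOFROM };
  GOMP_target (condition, device_no, foo, ".foo.",
	       data_arr, data_sizes, data_kinds, 1);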

4.  RUNTIME SUPPORT IN LIBGOMP

4.1.  General interface for offloading

LibGOMP implements generic routines, such as GOMP_target, GOMP_target_data,
GOMP_target_data_end, GOMP_target_update and others.  The compiler replaces
'pragma target' with calls to these routines, surrounded by code that prepares
the arguments.

All of these routines expect three arrays as parameters; these arrays
contain info about variable mapping.  The first array contains the host
addresses of the variables, the second - the sizes of the mapped regions, and
the third - the types of mapping (TO, TOFROM, FROM, ALLOC).  We prefer three
separate arrays to one array of corresponding structures for the following
reasons:
  * The first array can be used as an argument list for the outlined calls
  * The second and third arrays are expected to often contain only constants
and thus might be stored in a .rodata section and not be allocated dynamically.

The GOMP_target routine takes additional arguments:
  * The address of the host version of the outlined function.  It is used when
the runtime decides to perform host fallback instead of offloading to an
accelerator.
  * The name of the target version of the outlined function.  This is used when
the runtime decides to offload.  The runtime cannot directly call a function on
a target device, so it calls the corresponding plugin and gives it the function
name to invoke.

The GOMP_target, GOMP_target_data, GOMP_target_data_end and GOMP_target_update
routines maintain a global structure describing the current mapping, which will
be covered in the next section, and perform the actual data marshalling:
  * GOMP_target copies regions with kind TO or TOFROM to the device before
offloading and copies regions with kind FROM or TOFROM from the device when the
offloading is finished.  In case of host fallback no copying is performed.
  * GOMP_target_data copies regions with kind TO or TOFROM to the device.
  * GOMP_target_data_end copies regions with kind FROM or TOFROM from the
device.
  * GOMP_target_update copies all regions according to their kinds to and
from the device.
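
As a sketch, the copy-in part common to GOMP_target and GOMP_target_data could
look roughly like this (gomp_map_or_alloc and gomp_copy_to_device are
hypothetical helpers; the encoding of the kinds is TBD):

  size_t i;
  for (i = 0; i < mapnum; i++)
    {
      /* Allocate and register the region unless it is mapped already.  */
      void *tgt = gomp_map_or_alloc (hostaddrs[i], sizes[i], kinds[i]);
      if (kinds[i] == TO || kinds[i] == TOFROM)
	gomp_copy_to_device (tgt, hostaddrs[i], sizes[i]);
    }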

4.2.  Maintaining info about mapped regions

Every libGOMP routine dealing with mapped memory regions is responsible for
accurately maintaining a global data structure describing this mapping.  This
data structure is a binary search tree containing structures
struct
  {
    void *host_address;
    void *target_address;
    size_t region_size;
    enum {TO, TOFROM, FROM, ALLOC} region_type;
  }
with the host addresses used as keys.

The data structure allows checking whether a given host address is mapped or
not.  In order to do that, on every request it needs to find out whether the
requested interval is covered by already-mapped ones and to check whether all
of them have a corresponding mapping type.

TBD: We could consider using some other data structure here.
TBD: Describe when we need to emit errors, and when we might remap a region
(e.g. if we request to map a region with another type of mapping).
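
For illustration, the coverage check and the host-to-target address
translation could look roughly like this (a sketch: the struct tag and the
function name are hypothetical, and the tree search itself is omitted):

  /* ENTRY is the node found by searching the tree with HOST_ADDR as the key,
     or NULL if no mapped region starts at or before HOST_ADDR.  */
  void *
  gomp_map_translate (struct map_entry *entry, void *host_addr, size_t size)
  {
    uintptr_t off;
    if (entry == NULL)
      return NULL;  /* Not mapped at all.  */
    off = (uintptr_t) host_addr - (uintptr_t) entry->host_address;
    if (off + size > entry->region_size)
      return NULL;  /* The requested interval is not fully covered.  */
    return (char *) entry->target_address + off;
  }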

4.3.  Preparing arguments for offloaded calls

GOMP_target reuses the functionality of GOMP_parallel for outlining and for
preparing the arguments for calls to the host-side version of the function.
In fact, the host fallback of GOMP_target behaves almost exactly like
GOMP_parallel, except that GOMP_target does not create any threads for its
execution.
The other part of GOMP_target, which is responsible for performing the
offloading, looks quite different.
First, the needed regions must be mapped/allocated - as described in 4.1.
Once the mapping is finished and the data is copied to the device, the argument
list for the offloaded routine must be processed.  The goal of this processing
is to translate the host addresses to the corresponding target addresses.
As the argument structure is always actually an array of pointers, it is enough
to know only its size (i.e. the number of arguments) to correctly translate
all the contained addresses.  The translated addresses overwrite the host
addresses in the argument structure, making it ready to be passed to the
target-side version of the offloaded routine.
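
A sketch of this translation step (gomp_translate_addr is a hypothetical
wrapper around the mapping lookup from 4.2):

  /* Rewrite each host address in the argument array with the corresponding
     target address; NARGS is the known number of arguments.  */
  size_t i;
  for (i = 0; i < nargs; i++)
    args[i] = gomp_translate_addr (args[i]);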

4.4.  Plugins, usage of device-specific interfaces

LibGOMP is designed to be independent of the device types it works with.  To
make this possible, plugins are used: libGOMP itself contains only a generic
interface and callbacks into the plugins for invoking the target-dependent
functionality.
Plugins are shared objects located in a specified folder that implement a
certain set of routines.
TODO: list this set of necessary routines (a possible sketch is given at the
end of this section).
TBD: should we use all .so files located in the plugins folder, or should we
use a predefined set of plugins (e.g. defined during the libgomp build)?

When required, libGOMP searches the plugins folder and, for each .so file
found, checks whether it is a proper plugin and, if so, invokes the needed
routine.  For instance, when libGOMP chooses a device to offload to, it could
iterate through every .so in the plugins folder, calling the
gomp_is_device_available routine from each.

The plugins use target-dependent libraries and perform the low-level
interaction with the device.  For example, a plugin for Intel MIC could use the
COI interface to implement the libgomp callbacks.
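
As a starting point for the TODO above, the required set of plugin routines
could look roughly like this (all names are placeholders):

  bool device_available (void);
  void *device_alloc (size_t size);
  void device_free (void *tgt_ptr);
  void device_copy_to (void *tgt_dst, const void *host_src, size_t size);
  void device_copy_from (void *host_dst, const void *tgt_src, size_t size);
  void device_run (const char *fnname, void *args);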

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2013-09-13  9:35                         ` Michael Zolotukhin
@ 2013-09-13 10:52                           ` Kirill Yukhin
  2013-09-13 11:04                           ` Nathan Sidwell
  2013-09-16  9:35                           ` Jakub Jelinek
  2 siblings, 0 replies; 56+ messages in thread
From: Kirill Yukhin @ 2013-09-13 10:52 UTC (permalink / raw)
  To: Michael Zolotukhin, Richard Biener
  Cc: Jakub Jelinek, Richard Henderson, GCC Development, triegel

Hello,
Adding Richard who might want to take a look at LTO stuff.

--
Thanks, K

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2013-09-13  9:35                         ` Michael Zolotukhin
  2013-09-13 10:52                           ` Kirill Yukhin
@ 2013-09-13 11:04                           ` Nathan Sidwell
  2013-09-13 11:21                             ` Michael V. Zolotukhin
  2013-09-16  9:35                           ` Jakub Jelinek
  2 siblings, 1 reply; 56+ messages in thread
From: Nathan Sidwell @ 2013-09-13 11:04 UTC (permalink / raw)
  To: Michael Zolotukhin
  Cc: Jakub Jelinek, Kirill Yukhin, Richard Henderson, GCC Development,
	triegel

On 09/13/13 10:34, Michael Zolotukhin wrote:
> Hi Jakub et al.,
> We prepared a draft design document for offloading support in GCC - could
> you please take a look?  It is intended to give a common understanding of what
> is going on in this part.

This is an interesting design.  It appears similar to how we'd envisioned
implementing OpenACC support -- namely, leverage the LTO machinery to
communicate from the host compiler to the device compiler.  Your design looks
more detailed, which is good.

Are you envisioning the device compilers to be standalone compilers, built
separately?  Or are you envisioning extending the configuration machinery by
adding something like --enable-accelerator=<list> so that:
   .../configure --target=x86_64-linux --enable-accelerator=foo,baz
causes
* a build of an x86_64 compiler aware of the foo and baz accelerators
* build of an appropriate runtime support library
* a build of a foo lto accelerator backend, assembler (and linker?)
* (if needed) build of a foo support library
* a build of a baz lto accelerator backend
* (if needed) build of a baz support library, assembler (and linker?)

or are you expecting something more like 3 separate configures & builds?
   .../configure --target=x86_64-linux --enable-accelerator=foo,baz
   .../configure --target=foo --enable-languages=lto-accelerator
   .../configure --target=baz --enable-languages=lto-accelerator

I'd been imagining the former scheme, as it provides for a more integrated 
build, fwiw.

nathan

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2013-09-13 11:04                           ` Nathan Sidwell
@ 2013-09-13 11:21                             ` Michael V. Zolotukhin
  0 siblings, 0 replies; 56+ messages in thread
From: Michael V. Zolotukhin @ 2013-09-13 11:21 UTC (permalink / raw)
  To: Nathan Sidwell
  Cc: Jakub Jelinek, Kirill Yukhin, Richard Henderson, GCC Development,
	triegel

Hi Nathan,

> This is an interesting design.  It appears similar to how we'd
> envisioned implementing OpenACC support -- namely, leverage the LTO
> machinery to communicate from the host compiler to the device
> compiler.  Your design looks more detailed, which is good.
Thanks, do you have a similar description of your design? It would be pretty
interesting to take a look at it.

> Are you envisioning the device compilers to be standalone
> compilers, built separately?  Or are you envisioning extending the
> configuration machinery by adding something like
> --enable-accelerator=<list> so that:
>   .../configure --target=x86_64-linux --enable-accelerator=foo,baz
> causes
> * a build of an x86_64 compiler aware of the foo and baz accelerators
> * build of an appropriate runtime support library
> * a build of a foo lto accelerator backend, assembler (and linker?)
> * (if needed) build of a foo support library
> * a build of a baz lto accelerator backend
> * (if needed) build of a baz support library, assembler (and linker?)
> 
> or are you expecting something more like 3 separate configures & builds?
>   .../configure --target=x86_64-linux --enable-accelerator=foo,baz
>   .../configure --target=foo --enable-languages=lto-accelerator
>   .../configure --target=baz --enable-languages=lto-accelerator
> 
> I'd been imagining the former scheme, as it provides for a more
> integrated build, fwiw.
That's an open question, and we'd like to clarify it too.  We'd appreciate any
inputs on this.

Personally, I actually see one more option.  Similarly to how libgomp figures
out which runtimes are available (by looking for the corresponding plugins), we
could look for available target compilers at compile time and produce as many
target images as we have compilers.  Thus, we wouldn't need to rebuild the
host compiler to support more targets - we'd just need to place the
corresponding target compiler somewhere.  That looks more like your second
option, but differs a bit from it in that we don't need to specify the enabled
accelerators.

Michael
> nathan

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2013-09-10 15:38                                   ` Michael V. Zolotukhin
@ 2013-09-13 11:30                                     ` Michael V. Zolotukhin
  2013-09-13 12:36                                       ` Jakub Jelinek
  2014-07-17  7:52                                       ` Thomas Schwinge
  0 siblings, 2 replies; 56+ messages in thread
From: Michael V. Zolotukhin @ 2013-09-13 11:30 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Kirill Yukhin, Richard Henderson, gcc, triegel

Hi Jakub,
Here is the first patch adding plugin support in libgomp - could you please
take a look at it?

I changed configure.ac to add the dl library, but I am not sure whether I
regenerated all the configure-related files properly.  I'd appreciate your help
here if I did it wrong.

Any feedback is welcome too.

Thanks, Michael

---
 libgomp/configure    |  46 +++++++++++++++
 libgomp/configure.ac |   2 +
 libgomp/target.c     | 155 +++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 203 insertions(+)

diff --git a/libgomp/configure b/libgomp/configure
index 238b1af..2086fdb 100755
--- a/libgomp/configure
+++ b/libgomp/configure
@@ -15046,6 +15046,52 @@ fi
 rm -f core conftest.err conftest.$ac_objext \
     conftest$ac_exeext conftest.$ac_ext
 
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for dlsym in -ldl" >&5
+$as_echo_n "checking for dlsym in -ldl... " >&6; }
+if test "${ac_cv_lib_dl_dlsym+set}" = set; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_check_lib_save_LIBS=$LIBS
+LIBS="-ldl  $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+/* Override any GCC internal prototype to avoid an error.
+   Use char because int might match the return type of a GCC
+   builtin and then its argument prototype would still apply.  */
+#ifdef __cplusplus
+extern "C"
+#endif
+char dlsym ();
+int
+main ()
+{
+return dlsym ();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  ac_cv_lib_dl_dlsym=yes
+else
+  ac_cv_lib_dl_dlsym=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_dl_dlsym" >&5
+$as_echo "$ac_cv_lib_dl_dlsym" >&6; }
+if test "x$ac_cv_lib_dl_dlsym" = x""yes; then :
+  cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBDL 1
+_ACEOF
+
+  LIBS="-ldl $LIBS"
+
+fi
+
+
 # Check for functions needed.
 for ac_func in getloadavg clock_gettime strtoull
 do :
diff --git a/libgomp/configure.ac b/libgomp/configure.ac
index d87ed29..1c78239 100644
--- a/libgomp/configure.ac
+++ b/libgomp/configure.ac
@@ -193,6 +193,8 @@ AC_LINK_IFELSE(
    [],
    [AC_MSG_ERROR([Pthreads are required to build libgomp])])])
 
+AC_CHECK_LIB(dl, dlsym)
+
 # Check for functions needed.
 AC_CHECK_FUNCS(getloadavg clock_gettime strtoull)
 
diff --git a/libgomp/target.c b/libgomp/target.c
index 0a874d4..73f656c 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -28,6 +28,45 @@
 #include "libgomp.h"
 #include <stdlib.h>
 #include <string.h>
+#include <dirent.h>
+
+#ifdef HAVE_DLFCN_H
+# include <dlfcn.h>
+#endif
+
+static void gomp_target_init (void);
+
+/* This structure describes an accelerator device.
+   It contains the name of the corresponding libgomp plugin, function handlers
+   for interaction with the device, the ID number of the device, and
+   information about mapped memory.  */
+struct gomp_device_descr
+{
+  /* This is the ID number of the device.  It can be specified in the DEVICE
+     clause of the TARGET construct.  */
+  int id;
+
+  /* Plugin file name.  */
+  char plugin_name[PATH_MAX];
+
+  /* Plugin file handler.  */
+  void *plugin_handle;
+
+  /* Function handlers.  */
+  bool (*device_available_func) (void);
+
+  /* Information about mapping.  Not implemented yet.  */
+  /* SplayTree map_info;  */
+};
+
+/* Array of descriptors of all available devices.  */
+static struct gomp_device_descr *devices;
+
+/* Total number of available devices.  */
+static int num_devices;
+
+static pthread_once_t gomp_is_initialized = PTHREAD_ONCE_INIT;
+
 
 static int
 resolve_device (int device)
@@ -49,6 +88,7 @@ GOMP_target (int device, void (*fn) (void *), const char *fnname,
 	     size_t mapnum, void **hostaddrs, size_t *sizes,
 	     unsigned char *kinds)
 {
+  (void) pthread_once (&gomp_is_initialized, gomp_target_init);
   if (resolve_device (device) == -1)
     {
       fn (hostaddrs);
@@ -60,6 +100,7 @@ void
 GOMP_target_data (int device, size_t mapnum, void **hostaddrs, size_t *sizes,
 		  unsigned char *kinds)
 {
+  (void) pthread_once (&gomp_is_initialized, gomp_target_init);
   if (resolve_device (device) == -1)
     return;
 }
@@ -73,6 +114,7 @@ void
 GOMP_target_update (int device, size_t mapnum, void **hostaddrs, size_t *sizes,
 		    unsigned char *kinds)
 {
+  (void) pthread_once (&gomp_is_initialized, gomp_target_init);
   if (resolve_device (device) == -1)
     return;
 }
@@ -81,3 +123,116 @@ void
 GOMP_teams (unsigned int num_teams, unsigned int thread_limit)
 {
 }
+
+
+#ifdef HAVE_DLFCN_H
+/* This function checks if the given string FNAME matches
+   "libgomp-plugin-*.so.1".  */
+static bool
+gomp_check_plugin_file_name (const char *fname)
+{
+  const char *prefix = "libgomp-plugin-";
+  const char *suffix = ".so.1";
+  if (!fname)
+    return false;
+  if (strncmp (fname, prefix, strlen (prefix)) != 0)
+    return false;
+  if (strncmp (fname + strnlen (fname, NAME_MAX + 1) - strlen (suffix),
+	       suffix,
+	       strlen (suffix)) != 0)
+    return false;
+  return true;
+}
+
+/* This function tries to load the plugin for DEVICE.  The name of the plugin
+   should be stored in the PLUGIN_NAME field.
+   The plugin handle and the handles of the found functions are stored in the
+   corresponding fields of DEVICE.
+   The function returns TRUE on success and FALSE otherwise.  */
+static bool
+gomp_load_plugin_for_device (struct gomp_device_descr *device)
+{
+  if (!device || !device->plugin_name)
+    return false;
+
+  device->plugin_handle = dlopen (device->plugin_name, RTLD_LAZY);
+  if (!device->plugin_handle)
+    return false;
+
+  /* Clear any existing error.  */
+  dlerror ();
+
+  /* Check if all required functions are available in the plugin and store
+     their handlers.
+     TODO: check for other routines as well.  */
+  *(void **) (&device->device_available_func) = dlsym (device->plugin_handle,
+						       "device_available");
+  if (dlerror () != NULL)
+    {
+      dlclose (device->plugin_handle);
+      return false;
+    }
+
+  return true;
+}
+
+/* This function scans the folder specified in the environment variable
+   LIBGOMP_PLUGIN_PATH and loads all suitable libgomp plugins from this folder.
+   For a plugin to be suitable, its name should be "libgomp-plugin-*.so.1" and
+   it should implement a certain set of functions.
+   The result of this function is a properly initialized variable NUM_DEVICES
+   and array DEVICES, containing all plugins and their callback handles.  */
+static void
+gomp_find_available_plugins (void)
+{
+  char *plugin_path = NULL;
+  DIR *dir = NULL;
+  struct dirent *ent;
+
+  num_devices = 0;
+  devices = NULL;
+
+  plugin_path = getenv ("LIBGOMP_PLUGIN_PATH");
+  if (!plugin_path)
+    return;
+
+  dir = opendir (plugin_path);
+  if (!dir)
+    return;
+
+  while ((ent = readdir (dir)) != NULL)
+    {
+      struct gomp_device_descr current_device;
+      if (!gomp_check_plugin_file_name (ent->d_name))
+	continue;
+      strncpy (current_device.plugin_name, plugin_path, PATH_MAX);
+      strcat (current_device.plugin_name, "/");
+      strcat (current_device.plugin_name, ent->d_name);
+      if (!gomp_load_plugin_for_device (&current_device))
+	continue;
+      devices = realloc (devices, (num_devices + 1)
+				  * sizeof (struct gomp_device_descr));
+
+      devices[num_devices] = current_device;
+      devices[num_devices].id = num_devices + 1;
+      num_devices++;
+    }
+  closedir (dir);
+}
+
+/* This function initializes runtime needed for offloading.
+   It loads plugins, sets up a connection with devices, etc.  */
+static void
+gomp_target_init (void)
+{
+  gomp_find_available_plugins ();
+}
+
+#else /* HAVE_DLFCN_H */
+/* If dlfcn.h is unavailable we always fallback to host execution.
+   GOMP_target* routines are just stubs for this case.  */
+static void
+gomp_target_init (void)
+{
+}
+#endif /* HAVE_DLFCN_H */
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2013-09-13 11:30                                     ` Michael V. Zolotukhin
@ 2013-09-13 12:36                                       ` Jakub Jelinek
  2013-09-13 13:11                                         ` Michael V. Zolotukhin
  2013-09-13 15:34                                         ` Jakub Jelinek
  2014-07-17  7:52                                       ` Thomas Schwinge
  1 sibling, 2 replies; 56+ messages in thread
From: Jakub Jelinek @ 2013-09-13 12:36 UTC (permalink / raw)
  To: Michael V. Zolotukhin; +Cc: Kirill Yukhin, Richard Henderson, gcc, triegel

On Fri, Sep 13, 2013 at 03:29:30PM +0400, Michael V. Zolotukhin wrote:
> Here is the first patch for adding plugins support in libgomp - could you please
> take a look at it?
> 
> I changed configure.ac to add dl-library, but I am not sure if I regenerated all
> related to configure files properly.  I'd appreciate your help here, if I did
> it wrong.

The configure stuff looks reasonable.

> --- a/libgomp/target.c
> +++ b/libgomp/target.c
> @@ -28,6 +28,45 @@
>  #include "libgomp.h"
>  #include <stdlib.h>
>  #include <string.h>
> +#include <dirent.h>

But I doubt dirent.h is portable to all targets we support, so I believe it
needs another configure test, and perhaps we want to define some macro saying
whether we actually support offloading at all.  HAVE_DLFCN_H would be one
precondition, HAVE_DIRENT_H (with opendir etc.) another (for this the
question is whether we are building libgomp with LFS flags, i.e. opendir64
etc. if available); another requirement we have is that sizeof (void *) ==
sizeof (uintptr_t), etc.
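
Something along these lines, perhaps (just a sketch; HAVE_DIRENT_H would come
from the new configure test, and LIBGOMP_OFFLOAD_SUPPORTED is a placeholder
name, nothing final):

  #if defined (HAVE_DLFCN_H) && defined (HAVE_DIRENT_H)
  /* The sizeof (void *) == sizeof (uintptr_t) requirement would need a
     compile-time check on top of this, e.g. a static assertion.  */
  # define LIBGOMP_OFFLOAD_SUPPORTED 1
  #endif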

>  static int
>  resolve_device (int device)
> @@ -49,6 +88,7 @@ GOMP_target (int device, void (*fn) (void *), const char *fnname,
>  	     size_t mapnum, void **hostaddrs, size_t *sizes,
>  	     unsigned char *kinds)
>  {
> +  (void) pthread_once (&gomp_is_initialized, gomp_target_init);

resolve_device should be changed to return struct gomp_device_descr *
(or NULL for host fallback), and this pthread_once done inside of
resolve_device, not in all the callers.
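
I.e. roughly the following (an untested sketch, reusing the devices /
num_devices arrays and the gomp_is_initialized pthread_once_t from your
patch):

  static struct gomp_device_descr *
  resolve_device (int device)
  {
    (void) pthread_once (&gomp_is_initialized, gomp_target_init);
    if (device == -1)
      {
        struct gomp_task_icv *icv = gomp_icv (false);
        device = icv->default_device_var;
      }
    if (device < 0 || device >= num_devices)
      return NULL;  /* Host fallback.  */
    return &devices[device];
  }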

> +static bool
> +gomp_check_plugin_file_name (const char *fname)
> +{
> +  const char *prefix = "libgomp-plugin-";
> +  const char *suffix = ".so.1";
> +  if (!fname)
> +    return false;
> +  if (strncmp (fname, prefix, strlen (prefix)) != 0)
> +    return false;
> +  if (strncmp (fname + strnlen (fname, NAME_MAX + 1) - strlen (suffix),

I'm afraid strnlen isn't sufficiently portable.  Why don't you just use
strlen?

> +  /* Check if all required functions are available in the plugin and store
> +     their handlers.
> +     TODO: check for other routines as well.  */
> +  *(void **) (&device->device_available_func) = dlsym (device->plugin_handle,
> +						       "device_available");

Aliasing violation, don't do that.
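
You can go through a union instead, e.g. (a sketch; the entry point's real
type may differ):

  union
  {
    void *ptr;
    bool (*device_available) (void);
  } u;
  u.ptr = dlsym (device->plugin_handle, "device_available");
  if (dlerror () != NULL)
    {
      dlclose (device->plugin_handle);
      return false;
    }
  device->device_available_func = u.device_available;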

FYI, I'm attaching a WIP patch with the splay tree stuff, debugging
target-1.c with OMP_DEFAULT_DEVICE=257 right now (with all tgtv related
stuff removed), but hitting some error regarding OMP_CLAUSE_MAP_POINTER
reallocation, supposedly a bug on the compiler side.  But e.g. fn2 and fn3
already seem to pass with that, only fn4 is problematic.

There are various FIXMEs in the patch, the routines that create
target_mem_desc should actually get an extra struct gomp_device_descr *
argument, store it into *tgt and then the spots where I'm using
gomp_malloc/free/memcpy for device allocation/deallocation/to/from
data transfer should be adjusted to use callbacks from the plugin.
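
I.e. the device descriptor would grow callbacks along these lines (names and
signatures are placeholders, nothing final):

  struct gomp_device_descr
  {
    int id;
    /* ... plugin handle and entry points as in your patch ...  */
    void *(*device_alloc_func) (size_t size, size_t align);
    void (*device_free_func) (void *tgt_ptr);
    void (*host2dev_func) (void *tgt_ptr, const void *host_ptr, size_t size);
    void (*dev2host_func) (void *host_ptr, const void *tgt_ptr, size_t size);
  };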

After the change of the fname argument of GOMP_target to a void *
(__OPENMP_OFFLOAD__ or whatever), I think we need to pass the same
argument to GOMP_target_data and GOMP_target_update too, and pass it
through to resolve_device.  That will then also need to find out
whether the selected target has the corresponding offload support
compiled in, upload the DSO to the target if that hasn't been done
already, and register all the static vars into the splay tree
(and if any of that fails, return NULL for host fallback).

--- libgomp/target.c.jj	2013-09-09 17:41:02.290429613 +0200
+++ libgomp/target.c	2013-09-13 13:17:31.703502392 +0200
@@ -1,4 +1,4 @@
-/* Copyright (C) 2013 Free Software Foundation, Inc.
+/* Copyright (C) 1998-2013 Free Software Foundation, Inc.
    Contributed by Jakub Jelinek <jakub@redhat.com>.
 
    This file is part of the GNU OpenMP Library (libgomp).
@@ -26,15 +26,567 @@
    creation and termination.  */
 
 #include "libgomp.h"
+#include <stdbool.h>
 #include <stdlib.h>
 #include <string.h>
 
+/* The splay tree code copied from include/splay-tree.h and adjusted,
+   so that all the data lives directly in splay_tree_node_s structure
+   and no extra allocations are needed.  */
+
+/* For an easily readable description of splay-trees, see:
+
+     Lewis, Harry R. and Denenberg, Larry.  Data Structures and Their
+     Algorithms.  Harper-Collins, Inc.  1991.  
+
+   The major feature of splay trees is that all basic tree operations
+   are amortized O(log n) time for a tree with n nodes.  */
+
+/* Forward declaration for a node in the tree.  */
+typedef struct splay_tree_node_s *splay_tree_node;
+typedef splay_tree_node *splay_tree;
+
+struct target_mem_desc {
+  /* Reference count.  */
+  uintptr_t refcount;
+  /* All the splay nodes allocated together.  */
+  splay_tree_node array;
+  /* Start of the target region.  */
+  uintptr_t tgt_start;
+  /* End of the target region.  */
+  uintptr_t tgt_end;
+  /* Handle to free.  */
+  void *to_free;
+  /* Previous target_mem_desc.  */
+  struct target_mem_desc *prev;
+  /* Number of items in following list.  */
+  size_t list_count;
+  /* List of splay nodes to remove (or decrease refcount)
+     at the end of region.  */
+  splay_tree_node list[];
+};
+
+/* The nodes in the splay tree.  */
+struct splay_tree_node_s {
+  /* Address of the host object.  */
+  uintptr_t host_start;
+  /* Address immediately after the host object.  */
+  uintptr_t host_end;
+  /* Descriptor of the target memory.  */
+  struct target_mem_desc *tgt;
+  /* Offset from tgt->tgt_start to the start of the target object.  */
+  uintptr_t tgt_offset;
+  /* Reference count.  */
+  uintptr_t refcount;
+  /* True if data should be copied from device to host at the end.  */
+  bool copy_from;
+  /* The left and right children, respectively.  */
+  splay_tree_node left;
+  splay_tree_node right;
+};
+
+/* Rotate the edge joining the left child N with its parent P.  PP is the
+   grandparents' pointer to P.  */
+
+static inline void
+rotate_left (splay_tree_node *pp, splay_tree_node p, splay_tree_node n)
+{
+  splay_tree_node tmp;
+  tmp = n->right;
+  n->right = p;
+  p->left = tmp;
+  *pp = n;
+}
+
+/* Rotate the edge joining the right child N with its parent P.  PP is the
+   grandparents' pointer to P.  */
+
+static inline void
+rotate_right (splay_tree_node *pp, splay_tree_node p, splay_tree_node n)
+{
+  splay_tree_node tmp;
+  tmp = n->left;
+  n->left = p;
+  p->right = tmp;
+  *pp = n;
+}
+
+static int
+splay_compare (splay_tree_node x, splay_tree_node y)
+{
+  if (x->host_start == x->host_end
+      && y->host_start == y->host_end)
+    return 0;
+  if (x->host_end <= y->host_start)
+    return -1;
+  if (x->host_start >= y->host_end)
+    return 1;
+  return 0;
+}
+
+/* Bottom up splay of NODE.  */
+
+static void
+splay_tree_splay (splay_tree sp, splay_tree_node node)
+{
+  if (*sp == NULL)
+    return;
+
+  do {
+    int cmp1, cmp2;
+    splay_tree_node n, c;
+
+    n = *sp;
+    cmp1 = splay_compare (node, n);
+
+    /* Found.  */
+    if (cmp1 == 0)
+      return;
+
+    /* Left or right?  If no child, then we're done.  */
+    if (cmp1 < 0)
+      c = n->left;
+    else
+      c = n->right;
+    if (!c)
+      return;
+
+    /* Next one left or right?  If found or no child, we're done
+       after one rotation.  */
+    cmp2 = splay_compare (node, c);
+    if (cmp2 == 0
+        || (cmp2 < 0 && !c->left)
+        || (cmp2 > 0 && !c->right))
+      {
+	if (cmp1 < 0)
+	  rotate_left (sp, n, c);
+	else
+	  rotate_right (sp, n, c);
+        return;
+      }
+
+    /* Now we have the four cases of double-rotation.  */
+    if (cmp1 < 0 && cmp2 < 0)
+      {
+	rotate_left (&n->left, c, c->left);
+	rotate_left (sp, n, n->left);
+      }
+    else if (cmp1 > 0 && cmp2 > 0)
+      {
+	rotate_right (&n->right, c, c->right);
+	rotate_right (sp, n, n->right);
+      }
+    else if (cmp1 < 0 && cmp2 > 0)
+      {
+	rotate_right (&n->left, c, c->right);
+	rotate_left (sp, n, n->left);
+      }
+    else if (cmp1 > 0 && cmp2 < 0)
+      {
+	rotate_left (&n->right, c, c->left);
+	rotate_right (sp, n, n->right);
+      }
+  } while (1);
+}
+
+/* Insert a new NODE into SP.  The NODE shouldn't exist in the tree.  */
+
+void
+splay_tree_insert (splay_tree sp, splay_tree_node node)
+{
+  int comparison = 0;
+
+  splay_tree_splay (sp, node);
+
+  if (*sp)
+    comparison = splay_compare (*sp, node);
+
+  if (*sp && comparison == 0)
+    abort ();
+  else 
+    {
+      /* Insert it at the root.  */
+      if (*sp == NULL)
+	node->left = node->right = NULL;
+      else if (comparison < 0)
+	{
+	  node->left = *sp;
+	  node->right = node->left->right;
+	  node->left->right = NULL;
+	}
+      else
+	{
+	  node->right = *sp;
+	  node->left = node->right->left;
+	  node->right->left = NULL;
+	}
+
+      *sp = node;
+    }
+}
+
+/* Remove NODE from SP.  It is not an error if it did not exist.  */
+
+void
+splay_tree_remove (splay_tree sp, splay_tree_node node)
+{
+  splay_tree_splay (sp, node);
+
+  if (*sp && splay_compare (*sp, node) == 0)
+    {
+      splay_tree_node left, right;
+
+      left = (*sp)->left;
+      right = (*sp)->right;
+
+      /* One of the children is now the root.  Doesn't matter much
+	 which, so long as we preserve the properties of the tree.  */
+      if (left)
+	{
+	  *sp = left;
+
+	  /* If there was a right child as well, hang it off the 
+	     right-most leaf of the left child.  */
+	  if (right)
+	    {
+	      while (left->right)
+		left = left->right;
+	      left->right = right;
+	    }
+	}
+      else
+	*sp = right;
+    }
+}
+
+/* Look up NODE in SP, returning the matching node if present, and NULL
+   otherwise.  */
+
+splay_tree_node
+splay_tree_lookup (splay_tree sp, splay_tree_node node)
+{
+  splay_tree_splay (sp, node);
+
+  if (*sp && splay_compare (*sp, node) == 0)
+    return *sp;
+  else
+    return NULL;
+}
+
+attribute_hidden int
+gomp_get_num_devices (void)
+{
+  /* FIXME: Scan supported accelerators when called the first time.  */
+  return 0;
+}
+
 static int
 resolve_device (int device)
 {
+  if (device == -1)
+    {
+      struct gomp_task_icv *icv = gomp_icv (false);
+      device = icv->default_device_var;
+    }
+  /* FIXME: Temporary hack for testing non-shared address spaces on host.  */
+  if (device == 257)
+    return 257;
+  if (device >= gomp_get_num_devices ())
+    return -1;
   return -1;
 }
 
+/* These variables would be per-accelerator (one that doesn't have a shared
+   address space with the host).  */
+static splay_tree_node dev_splay_tree;
+static gomp_mutex_t dev_env_lock;
+
+/* Handle the case where splay_tree_lookup found oldn for newn.
+   Helper function of gomp_map_vars.  */
+
+static inline void
+gomp_map_vars_existing (splay_tree_node oldn, splay_tree_node newn,
+			unsigned char kind)
+{
+  if (oldn->host_start > newn->host_start
+      || oldn->host_end < newn->host_end)
+    gomp_fatal ("Trying to map into device [%p..%p) object when "
+		"[%p..%p) is already mapped",
+		(void *) newn->host_start, (void *) newn->host_end,
+		(void *) oldn->host_start, (void *) oldn->host_end);
+  if (((kind & 7) == 2 || (kind & 7) == 3)
+      && !oldn->copy_from
+      && oldn->host_start == newn->host_start
+      && oldn->host_end == newn->host_end)
+    oldn->copy_from = true;
+  oldn->refcount++;
+}
+
+static struct target_mem_desc *
+gomp_map_vars (size_t mapnum, void **hostaddrs, size_t *sizes,
+	       unsigned char *kinds, bool is_target)
+{
+  size_t i, tgt_align, tgt_size, not_found_cnt = 0;
+  struct splay_tree_node_s cur_node;
+  struct target_mem_desc *tgt
+    = gomp_malloc (sizeof (*tgt) + sizeof (tgt->list[0]) * mapnum);
+  tgt->list_count = mapnum;
+  tgt->refcount = 1;
+
+  if (mapnum == 0)
+    return tgt;
+
+  tgt_align = sizeof (void *);
+  tgt_size = 0;
+  if (is_target)
+    {
+      size_t align = 4 * sizeof (void *);
+      tgt_align = align;
+      tgt_size = mapnum * sizeof (void *);
+    }
+
+  gomp_mutex_lock (&dev_env_lock);
+  for (i = 0; i < mapnum; i++)
+    {
+      cur_node.host_start = (uintptr_t) hostaddrs[i];
+      if ((kinds[i] & 7) != 4)
+	cur_node.host_end = cur_node.host_start + sizes[i];
+      else
+	cur_node.host_end = cur_node.host_start + sizeof (void *);
+      splay_tree_node n = splay_tree_lookup (&dev_splay_tree, &cur_node);
+      if (n)
+	{
+	  tgt->list[i] = n;
+	  gomp_map_vars_existing (n, &cur_node, kinds[i]);
+	}
+      else
+	{
+	  size_t align = (size_t) 1 << (kinds[i] >> 3);
+	  tgt->list[i] = NULL;
+	  not_found_cnt++;
+	  if (tgt_align < align)
+	    tgt_align = align;
+	  tgt_size = (tgt_size + align - 1) & ~(align - 1);
+	  tgt_size += cur_node.host_end - cur_node.host_start;
+	}
+    }
+
+  if (not_found_cnt || is_target)
+    {
+      /* FIXME: This would be accelerator memory allocation, not
+	 host, and should allocate tgt_align aligned tgt_size block
+	 of memory.  */
+      tgt->to_free = gomp_malloc (tgt_size + tgt_align - 1);
+      tgt->tgt_start = (uintptr_t) tgt->to_free;
+      tgt->tgt_start = (tgt->tgt_start + tgt_align - 1) & ~(tgt_align - 1);
+      tgt->tgt_end = tgt->tgt_start + tgt_size;
+    }
+
+  tgt_size = 0;
+  if (is_target)
+    tgt_size = mapnum * sizeof (void *);
+
+  if (not_found_cnt)
+    {
+      tgt->array = gomp_malloc (not_found_cnt * sizeof (*tgt->array));
+      splay_tree_node array = tgt->array;
+
+      for (i = 0; i < mapnum; i++)
+	if (tgt->list[i] == NULL)
+	  {
+	    array->host_start = (uintptr_t) hostaddrs[i];
+	    if ((kinds[i] & 7) != 4)
+	      array->host_end = array->host_start + sizes[i];
+	    else
+	      array->host_end = array->host_start + sizeof (void *);
+	    splay_tree_node n = splay_tree_lookup (&dev_splay_tree, array);
+	    if (n)
+	      {
+		tgt->list[i] = n;
+		gomp_map_vars_existing (n, array, kinds[i]);
+	      }
+	    else
+	      {
+		size_t align = (size_t) 1 << (kinds[i] >> 3);
+		tgt->list[i] = array;
+		tgt_size = (tgt_size + align - 1) & ~(align - 1);
+		array->tgt = tgt;
+		array->tgt_offset = tgt_size;
+		tgt_size += array->host_end - array->host_start;
+		if ((kinds[i] & 7) == 2 || (kinds[i] & 7) == 3)
+		  array->copy_from = true;
+		array->refcount = 1;
+		tgt->refcount++;
+		array->left = NULL;
+		array->right = NULL;
+		splay_tree_insert (&dev_splay_tree, array);
+		switch (kinds[i] & 7)
+		  {
+		  case 0: /* ALLOC */
+		  case 2: /* FROM */
+		    break;
+		  case 1: /* TO */
+		  case 3: /* TOFROM */
+		    /* FIXME: This is supposed to be copy from host to device
+		       memory.  Perhaps add some smarts, like if copying
+		       several adjacent fields from host to target, use some
+		       host buffer to avoid sending each var individually.  */
+		    memcpy ((void *) (tgt->tgt_start + array->tgt_offset),
+			    (void *) array->host_start,
+			    array->host_end - array->host_start);
+		    break;
+		  case 4: /* POINTER */
+		    cur_node.host_start
+		      = (uintptr_t) *(void **) array->host_start;
+		    /* Add bias to the pointer value.  */
+		    cur_node.host_start += sizes[i];
+		    cur_node.host_end = cur_node.host_start + 1;
+		    n = splay_tree_lookup (&dev_splay_tree, &cur_node);
+		    if (n == NULL)
+		      {
+			/* Could be possibly zero size array section.  */
+			cur_node.host_end--;
+			n = splay_tree_lookup (&dev_splay_tree, &cur_node);
+			if (n == NULL)
+			  {
+			    cur_node.host_start--;
+			    n = splay_tree_lookup (&dev_splay_tree, &cur_node);
+			    cur_node.host_start++;
+			  }
+		      }
+		    if (n == NULL)
+		      gomp_fatal ("Pointer target of array section "
+				  "wasn't mapped");
+		    cur_node.host_start -= n->host_start;
+		    cur_node.tgt_offset = n->tgt->tgt_start + n->tgt_offset
+					  + cur_node.host_start;
+		    /* At this point tgt_offset is target address of the
+		       array section.  Now subtract bias to get what we want
+		       to initialize the pointer with.  */
+		    cur_node.tgt_offset -= sizes[i];
+		    /* FIXME: host to device copy, see above FIXME comment.  */
+		    memcpy ((void *) (tgt->tgt_start + array->tgt_offset),
+			    (void *) &cur_node.tgt_offset,
+			    sizeof (void *));
+		    break;
+		  }
+		array++;
+	      }
+	  }
+    }
+  if (is_target)
+    {
+      for (i = 0; i < mapnum; i++)
+	{
+	  cur_node.tgt_offset = tgt->list[i]->tgt->tgt_start
+				+ tgt->list[i]->tgt_offset;
+	  /* FIXME: host to device copy, see above FIXME comment.  */
+	  memcpy ((void *) (tgt->tgt_start + i * sizeof (void *)),
+		  (void *) &cur_node.tgt_offset,
+		  sizeof (void *));
+	}
+    }
+
+  gomp_mutex_unlock (&dev_env_lock);
+  return tgt;
+}
+
+static void
+gomp_unmap_tgt (struct target_mem_desc *tgt)
+{
+  /* FIXME: Deallocate on target the tgt->tgt_start .. tgt->tgt_end
+     region.  */
+  if (tgt->tgt_end)
+    free (tgt->to_free);
+
+  free (tgt->array);
+  free (tgt);
+}
+
+static void
+gomp_unmap_vars (struct target_mem_desc *tgt)
+{
+  if (tgt->list_count == 0)
+    {
+      free (tgt);
+      return;
+    }
+
+  size_t i;
+  gomp_mutex_lock (&dev_env_lock);
+  for (i = 0; i < tgt->list_count; i++)
+    if (tgt->list[i]->refcount > 1)
+      tgt->list[i]->refcount--;
+    else
+      {
+	splay_tree_node n = tgt->list[i];
+	if (n->copy_from)
+	  /* FIXME: device to host copy.  */
+	  memcpy ((void *) n->host_start,
+		  (void *) (n->tgt->tgt_start + n->tgt_offset),
+		  n->host_end - n->host_start);
+	splay_tree_remove (&dev_splay_tree, n);
+	if (n->tgt->refcount > 1)
+	  n->tgt->refcount--;
+	else
+	  gomp_unmap_tgt (n->tgt);
+      }
+
+  if (tgt->refcount > 1)
+    tgt->refcount--;
+  else
+    gomp_unmap_tgt (tgt);
+  gomp_mutex_unlock (&dev_env_lock);
+}
+
+static void
+gomp_update (size_t mapnum, void **hostaddrs, size_t *sizes,
+	     unsigned char *kinds)
+{
+  size_t i;
+  struct splay_tree_node_s cur_node;
+
+  if (mapnum == 0)
+    return;
+
+  gomp_mutex_lock (&dev_env_lock);
+  for (i = 0; i < mapnum; i++)
+    if (sizes[i])
+      {
+	cur_node.host_start = (uintptr_t) hostaddrs[i];
+	cur_node.host_end = cur_node.host_start + sizes[i];
+	splay_tree_node n = splay_tree_lookup (&dev_splay_tree, &cur_node);
+	if (n)
+	  {
+	    if (n->host_start > cur_node.host_start
+		|| n->host_end < cur_node.host_end)
+	      gomp_fatal ("Trying to update [%p..%p) object when "
+			  "only [%p..%p) is mapped",
+			  (void *) cur_node.host_start,
+			  (void *) cur_node.host_end,
+			  (void *) n->host_start,
+			  (void *) n->host_end);
+	    if ((kinds[i] & 7) == 1)
+	      /* FIXME: host to device copy.  */
+	      memcpy ((void *) (n->tgt->tgt_start + n->tgt_offset
+				+ cur_node.host_start - n->host_start),
+		      (void *) cur_node.host_start,
+		      cur_node.host_end - cur_node.host_start);
+	    else if ((kinds[i] & 7) == 2)
+	      /* FIXME: device to host copy.  */
+	      memcpy ((void *) cur_node.host_start,
+		      (void *) (n->tgt->tgt_start + n->tgt_offset
+				+ cur_node.host_start - n->host_start),
+		      cur_node.host_end - cur_node.host_start);
+	  }
+	else
+	  gomp_fatal ("Trying to update [%p..%p) object that is not mapped",
+		      (void *) cur_node.host_start,
+		      (void *) cur_node.host_end);
+      }
+  gomp_mutex_unlock (&dev_env_lock);
+}
+
 /* Called when encountering a target directive.  If DEVICE
    is -1, it means use device-var ICV.  If it is -2 (or any other value
    larger than the last available hw device), use host fallback.
@@ -49,32 +601,77 @@ GOMP_target (int device, void (*fn) (voi
 	     size_t mapnum, void **hostaddrs, size_t *sizes,
 	     unsigned char *kinds)
 {
-  if (resolve_device (device) == -1)
+  device = resolve_device (device);
+  if (device == -1)
     {
+      /* Host fallback.  */
       fn (hostaddrs);
       return;
     }
+  if (device == 257)
+    {
+      struct target_mem_desc *tgt
+	= gomp_map_vars (mapnum, hostaddrs, sizes, kinds, true);
+      fn ((void *) tgt->tgt_start);
+      gomp_unmap_vars (tgt);
+    }
 }
 
 void
 GOMP_target_data (int device, size_t mapnum, void **hostaddrs, size_t *sizes,
 		  unsigned char *kinds)
 {
-  if (resolve_device (device) == -1)
-    return;
+  device = resolve_device (device);
+  if (device == -1)
+    {
+      /* Host fallback.  */
+      struct gomp_task_icv *icv = gomp_icv (false);
+      if (icv->target_data)
+	{
+	  /* Even when doing a host fallback, if there are any active
+	     #pragma omp target data constructs, need to remember the
+	     new #pragma omp target data, otherwise GOMP_target_end_data
+	     would get out of sync.  */
+	  struct target_mem_desc *tgt
+	    = gomp_map_vars (0, NULL, NULL, NULL, false);
+	  tgt->prev = icv->target_data;
+	  icv->target_data = tgt;
+	}
+      return;
+    }
+
+  if (device == 257)
+    {
+      struct target_mem_desc *tgt
+	= gomp_map_vars (mapnum, hostaddrs, sizes, kinds, false);
+      struct gomp_task_icv *icv = gomp_icv (true);
+      tgt->prev = icv->target_data;
+      icv->target_data = tgt;
+    }
 }
 
 void
 GOMP_target_end_data (void)
 {
+  struct gomp_task_icv *icv = gomp_icv (false);
+  if (icv->target_data)
+    {
+      struct target_mem_desc *tgt = icv->target_data;
+      icv->target_data = tgt->prev;
+      gomp_unmap_vars (tgt);
+    }
 }
 
 void
 GOMP_target_update (int device, size_t mapnum, void **hostaddrs, size_t *sizes,
 		    unsigned char *kinds)
 {
-  if (resolve_device (device) == -1)
+  device = resolve_device (device);
+  if (device == -1)
     return;
+
+  if (device == 257)
+    gomp_update (mapnum, hostaddrs, sizes, kinds);
 }
 
 void
--- libgomp/libgomp.h.jj	2013-09-09 17:41:02.388429108 +0200
+++ libgomp/libgomp.h	2013-09-13 12:19:13.489052710 +0200
@@ -214,18 +214,23 @@ struct gomp_team_state
   unsigned long static_trip;
 };
 
-/* These are the OpenMP 3.0 Internal Control Variables described in
+struct target_mem_desc;
+
+/* These are the OpenMP 4.0 Internal Control Variables described in
    section 2.3.1.  Those described as having one copy per task are
    stored within the structure; those described as having one copy
    for the whole program are (naturally) global variables.  */
-
+
 struct gomp_task_icv
 {
   unsigned long nthreads_var;
   enum gomp_schedule_type run_sched_var;
   int run_sched_modifier;
+  int default_device_var;
   bool dyn_var;
   bool nest_var;
+  /* Internal ICV.  */
+  struct target_mem_desc *target_data;
 };
 
 extern struct gomp_task_icv gomp_global_icv;
@@ -496,6 +501,10 @@ extern void gomp_team_start (void (*) (v
 			     struct gomp_team *);
 extern void gomp_team_end (void);
 
+/* target.c */
+
+extern int gomp_get_num_devices (void);
+
 /* work.c */
 
 extern void gomp_init_work_share (struct gomp_work_share *, bool, unsigned);
--- libgomp/env.c.jj	2013-09-09 17:41:02.335429381 +0200
+++ libgomp/env.c	2013-09-12 17:39:42.435446713 +0200
@@ -56,6 +56,7 @@ struct gomp_task_icv gomp_global_icv = {
   .nthreads_var = 1,
   .run_sched_var = GFS_DYNAMIC,
   .run_sched_modifier = 1,
+  .default_device_var = 0,
   .dyn_var = false,
   .nest_var = false
 };
@@ -188,6 +189,24 @@ parse_unsigned_long (const char *name, u
   return false;
 }
 
+/* Parse a positive int environment variable.  Return true if one was
+   present and it was successfully parsed.  */
+
+static bool
+parse_int (const char *name, int *pvalue, bool allow_zero)
+{
+  unsigned long value;
+  if (!parse_unsigned_long (name, &value, allow_zero))
+    return false;
+  if (value > INT_MAX)
+    {
+      gomp_error ("Invalid value for environment variable %s", name);
+      return false;
+    }
+  *pvalue = (int) value;
+  return true;
+}
+
 /* Parse an unsigned long list environment variable.  Return true if one was
    present and it was successfully parsed.  */
 
@@ -658,8 +677,9 @@ handle_omp_display_env (bool proc_bind,
 
 /* FIXME: Unimplemented OpenMP 4.0 environment variables.
   fprintf (stderr, "  OMP_PLACES = ''\n");
-  fprintf (stderr, "  OMP_CANCELLATION = ''\n");
-  fprintf (stderr, "  OMP_DEFAULT_DEVICE = ''\n"); */
+  fprintf (stderr, "  OMP_CANCELLATION = ''\n"); */
+  fprintf (stderr, "  OMP_DEFAULT_DEVICE = '%d'\n",
+	   gomp_global_icv.default_device_var);
 
   if (verbose)
     {
@@ -699,6 +719,7 @@ initialize_env (void)
   parse_boolean ("OMP_DYNAMIC", &gomp_global_icv.dyn_var);
   parse_boolean ("OMP_NESTED", &gomp_global_icv.nest_var);
   parse_boolean ("OMP_PROC_BIND", &bind_var);
+  parse_int ("OMP_DEFAULT_DEVICE", &gomp_global_icv.default_device_var, true);
   parse_unsigned_long ("OMP_MAX_ACTIVE_LEVELS", &gomp_max_active_levels_var,
 		       true);
   parse_unsigned_long ("OMP_THREAD_LIMIT", &gomp_thread_limit_var, false);
@@ -881,36 +902,41 @@ omp_get_proc_bind (void)
 void
 omp_set_default_device (int device_num)
 {
-  (void) device_num;
+  struct gomp_task_icv *icv = gomp_icv (true);
+  icv->default_device_var = device_num >= 0 ? device_num : 0;
 }
 
 int
 omp_get_default_device (void)
 {
-  return 0;
+  struct gomp_task_icv *icv = gomp_icv (false);
+  return icv->default_device_var;
 }
 
 int
 omp_get_num_devices (void)
 {
-  return 0;
+  return gomp_get_num_devices ();
 }
 
 int
 omp_get_num_teams (void)
 {
+  /* Hardcoded to 1 on host, MIC, HSAIL?  Maybe variable on PTX.  */
   return 1;
 }
 
 int
 omp_get_team_num (void)
 {
+  /* Hardcoded to 0 on host, MIC, HSAIL?  Maybe variable on PTX.  */
   return 0;
 }
 
 int
 omp_is_initial_device (void)
 {
+  /* Hardcoded to 1 on host, should be 0 on MIC, HSAIL, PTX.  */
   return 1;
 }
 

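For reference, the kinds encoding the patch assumes: the low 3 bits select
the map kind and the remaining bits hold the log2 of the alignment.  A
decoding sketch (illustration only, not part of the patch):

  enum map_kind { MAP_ALLOC = 0, MAP_TO = 1, MAP_FROM = 2,
                  MAP_TOFROM = 3, MAP_POINTER = 4 };

  static inline enum map_kind
  kind_of (unsigned char kind)
  {
    return (enum map_kind) (kind & 7);
  }

  static inline size_t
  align_of (unsigned char kind)
  {
    return (size_t) 1 << (kind >> 3);
  }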

	Jakub

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2013-09-13 12:36                                       ` Jakub Jelinek
@ 2013-09-13 13:11                                         ` Michael V. Zolotukhin
  2013-09-13 13:16                                           ` Jakub Jelinek
  2013-09-13 15:34                                         ` Jakub Jelinek
  1 sibling, 1 reply; 56+ messages in thread
From: Michael V. Zolotukhin @ 2013-09-13 13:11 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Kirill Yukhin, Richard Henderson, gcc, triegel

> But I doubt dirent.h is portable to all targets we support, so I believe it
> needs another configure test, and perhaps we want to define some macro saying
> whether we actually support offloading at all.  HAVE_DLFCN_H would be one
> precondition, HAVE_DIRENT_H (with opendir etc.) another (for this the
> question is whether we are building libgomp with LFS flags, i.e. opendir64
> etc. if available); another requirement we have is that sizeof (void *) ==
> sizeof (uintptr_t), etc.
That sounds reasonable, I'll do it.

> I'm afraid strnlen isn't sufficiently portable.  Why don't you just use
> strlen?
strnlen was used because it's safer than strlen on strings that might not be
properly terminated.  However, maybe that's not so critical in this place.
And anyway, we could use strnlen when it's available and fall back to strlen
otherwise.
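
E.g. something like this (a sketch, assuming a HAVE_STRNLEN configure check):

  #ifdef HAVE_STRNLEN
  # define gomp_strnlen(S, N) strnlen ((S), (N))
  #else
  static size_t
  gomp_strnlen (const char *s, size_t maxlen)
  {
    size_t len = 0;
    while (len < maxlen && s[len] != '\0')
      len++;
    return len;
  }
  #endif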

> resolve_device should be changed to return struct gomp_device_descr *
> (or NULL for host fallback), and this pthread_once done inside of
> resolve_device, not in all the callers.

> Aliasing violation, don't do that.
Will fix, thanks.

> FYI, I'm attaching a WIP patch with the splay tree stuff.
Thanks, I'll take a look.  By the way, isn't it better to move splay-tree
implementation to a separate file?


Michael
> 	Jakub

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2013-09-13 13:11                                         ` Michael V. Zolotukhin
@ 2013-09-13 13:16                                           ` Jakub Jelinek
  2013-09-13 15:09                                             ` Ilya Tocar
  0 siblings, 1 reply; 56+ messages in thread
From: Jakub Jelinek @ 2013-09-13 13:16 UTC (permalink / raw)
  To: Michael V. Zolotukhin; +Cc: Kirill Yukhin, Richard Henderson, gcc, triegel

On Fri, Sep 13, 2013 at 05:11:09PM +0400, Michael V. Zolotukhin wrote:
> > FYI, I'm attaching a WIP patch with the splay tree stuff.
> Thanks, I'll take a look.  By the way, isn't it better to move splay-tree
> implementation to a separate file?

As it is just a few routines, heavily modified from include/splay-tree.h
(e.g. the data structures contain all the target.c specific stuff), and will be
used just in target.c, I think it is fine to keep it in target.c.

	Jakub

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2013-09-13 13:16                                           ` Jakub Jelinek
@ 2013-09-13 15:09                                             ` Ilya Tocar
  0 siblings, 0 replies; 56+ messages in thread
From: Ilya Tocar @ 2013-09-13 15:09 UTC (permalink / raw)
  To: Jakub Jelinek, richard.guenther
  Cc: Michael V. Zolotukhin, Kirill Yukhin, Richard Henderson, gcc

[-- Attachment #1: Type: text/plain, Size: 213 bytes --]

Hi,

I'm working on dumping the gimple of "#pragma omp target" regions into
.gnu.target_lto_ sections.
I've tried to reuse the current LTO infrastructure as much as possible.

Could you please take a look at attached patch?

[-- Attachment #2: gomp_out.patch --]
[-- Type: application/octet-stream, Size: 27085 bytes --]

diff --git a/gcc/ipa-inline-analysis.c b/gcc/ipa-inline-analysis.c
index 806b219..b044d0d 100644
--- a/gcc/ipa-inline-analysis.c
+++ b/gcc/ipa-inline-analysis.c
@@ -3997,7 +3997,7 @@ inline_write_summary (void)
 	}
     }
   streamer_write_char_stream (ob->main_stream, 0);
-  produce_asm (ob, NULL);
+  produce_asm (ob, NULL, false);
   destroy_output_block (ob);
 
   if (optimize && !flag_ipa_cp)
diff --git a/gcc/ipa-prop.c b/gcc/ipa-prop.c
index 9074a63..3d82b54 100644
--- a/gcc/ipa-prop.c
+++ b/gcc/ipa-prop.c
@@ -4004,7 +4004,7 @@ ipa_prop_write_jump_functions (void)
         ipa_write_node_info (ob, node);
     }
   streamer_write_char_stream (ob->main_stream, 0);
-  produce_asm (ob, NULL);
+  produce_asm (ob, NULL, false);
   destroy_output_block (ob);
 }
 
@@ -4179,7 +4179,7 @@ ipa_prop_write_all_agg_replacement (void)
 	write_agg_replacement_chain (ob, node);
     }
   streamer_write_char_stream (ob->main_stream, 0);
-  produce_asm (ob, NULL);
+  produce_asm (ob, NULL, false);
   destroy_output_block (ob);
 }
 
diff --git a/gcc/ipa-pure-const.c b/gcc/ipa-pure-const.c
index 7a29365..87082e7 100644
--- a/gcc/ipa-pure-const.c
+++ b/gcc/ipa-pure-const.c
@@ -987,7 +987,7 @@ pure_const_write_summary (void)
 	}
     }
 
-  lto_destroy_simple_output_block (ob);
+  lto_destroy_simple_output_block (ob, false);
 }
 
 
diff --git a/gcc/ipa-reference.c b/gcc/ipa-reference.c
index 3742474..b11d4bc 100644
--- a/gcc/ipa-reference.c
+++ b/gcc/ipa-reference.c
@@ -1022,7 +1022,7 @@ ipa_reference_write_optimization_summary (void)
 	  }
       }
   BITMAP_FREE (ltrans_statics);
-  lto_destroy_simple_output_block (ob);
+  lto_destroy_simple_output_block (ob, false);
   splay_tree_delete (reference_vars_to_consider);
 }
 
diff --git a/gcc/ipa.c b/gcc/ipa.c
index 778a88f..8b97338 100644
--- a/gcc/ipa.c
+++ b/gcc/ipa.c
@@ -1349,7 +1349,7 @@ ipa_profile_write_summary (void)
       streamer_write_uhwi_stream (ob->main_stream, histogram[i]->time);
       streamer_write_uhwi_stream (ob->main_stream, histogram[i]->size);
     }
-  lto_destroy_simple_output_block (ob);
+  lto_destroy_simple_output_block (ob, false);
 }
 
 /* Deserialize the ipa info for lto.  */
diff --git a/gcc/lto-cgraph.c b/gcc/lto-cgraph.c
index fcba1b9..fbb41a7 100644
--- a/gcc/lto-cgraph.c
+++ b/gcc/lto-cgraph.c
@@ -689,7 +689,7 @@ output_outgoing_cgraph_edges (struct cgraph_edge *edge,
 /* Output the part of the cgraph in SET.  */
 
 static void
-output_refs (lto_symtab_encoder_t encoder)
+output_refs (lto_symtab_encoder_t encoder, bool is_omp)
 {
   lto_symtab_encoder_iterator lsei;
   struct lto_simple_output_block *ob;
@@ -718,7 +718,7 @@ output_refs (lto_symtab_encoder_t encoder)
 
   streamer_write_uhwi_stream (ob->main_stream, 0);
 
-  lto_destroy_simple_output_block (ob);
+  lto_destroy_simple_output_block (ob, is_omp);
 }
 
 /* Add NODE into encoder as well as nodes it is cloned from.
@@ -845,7 +845,7 @@ compute_ltrans_boundary (lto_symtab_encoder_t in_encoder)
 /* Output the part of the symtab in SET and VSET.  */
 
 void
-output_symtab (void)
+output_symtab (bool is_omp)
 {
   struct cgraph_node *node;
   struct lto_simple_output_block *ob;
@@ -874,9 +874,15 @@ output_symtab (void)
     {
       symtab_node node = lto_symtab_encoder_deref (encoder, i);
       if (cgraph_node *cnode = dyn_cast <cgraph_node> (node))
-        lto_output_node (ob, cnode, encoder);
+	{
+	  if (!is_omp || lookup_attribute ("omp declare target",
+					  DECL_ATTRIBUTES (node->symbol.decl)))
+	    lto_output_node (ob, cnode, encoder);
+	}
       else
-        lto_output_varpool_node (ob, varpool (node), encoder);
+	  if (!is_omp || lookup_attribute ("omp declare target",
+					  DECL_ATTRIBUTES (node->symbol.decl)))
+	    lto_output_varpool_node (ob, varpool (node), encoder);
 	
     }
 
@@ -891,7 +897,7 @@ output_symtab (void)
 
   streamer_write_uhwi_stream (ob->main_stream, 0);
 
-  lto_destroy_simple_output_block (ob);
+  lto_destroy_simple_output_block (ob, is_omp);
 
   /* Emit toplevel asms.
      When doing WPA we must output every asm just once.  Since we do not partition asm
@@ -903,7 +909,7 @@ output_symtab (void)
       lto_output_toplevel_asms ();
     }
 
-  output_refs (encoder);
+  output_refs (encoder, is_omp);
 }
 
 /* Overwrite the information in NODE based on FILE_DATA, TAG, FLAGS,
@@ -1658,7 +1664,7 @@ output_cgraph_opt_summary (void)
 	  output_node_opt_summary (ob, cnode, encoder);
 	}
     }
-  produce_asm (ob, NULL);
+  produce_asm (ob, NULL, false);
   destroy_output_block (ob);
 }
 
diff --git a/gcc/lto-opts.c b/gcc/lto-opts.c
index 4d9cdfd..b524b19 100644
--- a/gcc/lto-opts.c
+++ b/gcc/lto-opts.c
@@ -70,7 +70,7 @@ lto_write_options (void)
   char *args;
   bool first_p = true;
 
-  section_name = lto_get_section_name (LTO_section_opts, NULL, NULL);
+  section_name = lto_get_section_name (LTO_section_opts, NULL, NULL, false);
   lto_begin_section (section_name, false);
   memset (&stream, 0, sizeof (stream));
 
diff --git a/gcc/lto-section-out.c b/gcc/lto-section-out.c
index 8145ec3..282950f 100644
--- a/gcc/lto-section-out.c
+++ b/gcc/lto-section-out.c
@@ -49,6 +49,8 @@ static vec<lto_out_decl_state_ptr> decl_state_stack;
 
 vec<lto_out_decl_state_ptr> lto_function_decl_states;
 
+vec<lto_out_decl_state_ptr> omp_function_decl_states;
+
 
 /*****************************************************************************
    Output routines shared by all of the serialization passes.
@@ -336,13 +338,13 @@ lto_create_simple_output_block (enum lto_section_type section_type)
 /* Produce a simple section for one of the ipa passes.  */
 
 void
-lto_destroy_simple_output_block (struct lto_simple_output_block *ob)
+lto_destroy_simple_output_block (struct lto_simple_output_block *ob, bool is_omp)
 {
   char *section_name;
   struct lto_simple_header header;
   struct lto_output_stream *header_stream;
 
-  section_name = lto_get_section_name (ob->section_type, NULL, NULL);
+  section_name = lto_get_section_name (ob->section_type, NULL, NULL, is_omp);
   lto_begin_section (section_name, !flag_wpa);
   free (section_name);
 
@@ -431,7 +433,8 @@ lto_pop_out_decl_state (void)
 
 void
 lto_record_function_out_decl_state (tree fn_decl,
-				    struct lto_out_decl_state *state)
+				    struct lto_out_decl_state *state,
+				    bool is_omp)
 {
   int i;
 
@@ -443,5 +446,8 @@ lto_record_function_out_decl_state (tree fn_decl,
 	state->streams[i].tree_hash_table = NULL;
       }
   state->fn_decl = fn_decl;
-  lto_function_decl_states.safe_push (state);
+  if (is_omp)
+    omp_function_decl_states.safe_push (state);
+  else
+    lto_function_decl_states.safe_push (state);
 }
diff --git a/gcc/lto-streamer-out.c b/gcc/lto-streamer-out.c
index ea0ff17..9e4392b 100644
--- a/gcc/lto-streamer-out.c
+++ b/gcc/lto-streamer-out.c
@@ -1639,7 +1639,7 @@ output_cfg (struct output_block *ob, struct function *fn)
    a function, set FN to the decl for that function.  */
 
 void
-produce_asm (struct output_block *ob, tree fn)
+produce_asm (struct output_block *ob, tree fn, bool is_omp)
 {
   enum lto_section_type section_type = ob->section_type;
   struct lto_function_header header;
@@ -1649,10 +1649,10 @@ produce_asm (struct output_block *ob, tree fn)
   if (section_type == LTO_section_function_body)
     {
       const char *name = IDENTIFIER_POINTER (DECL_ASSEMBLER_NAME (fn));
-      section_name = lto_get_section_name (section_type, name, NULL);
+      section_name = lto_get_section_name (section_type, name, NULL, is_omp);
     }
   else
-    section_name = lto_get_section_name (section_type, NULL, NULL);
+    section_name = lto_get_section_name (section_type, NULL, NULL, false);
 
   lto_begin_section (section_name, !flag_wpa);
   free (section_name);
@@ -1736,7 +1736,7 @@ output_struct_function_base (struct output_block *ob, struct function *fn)
 /* Output the body of function NODE->DECL.  */
 
 static void
-output_function (struct cgraph_node *node)
+output_function (struct cgraph_node *node, bool is_omp)
 {
   tree function;
   struct function *fn;
@@ -1834,7 +1834,7 @@ output_function (struct cgraph_node *node)
     streamer_write_uhwi (ob, 0);
 
   /* Create a section to hold the pickled output of this function.   */
-  produce_asm (ob, function);
+  produce_asm (ob, function, is_omp);
 
   destroy_output_block (ob);
 }
@@ -1867,7 +1867,7 @@ lto_output_toplevel_asms (void)
 
   streamer_write_string_cst (ob, ob->main_stream, NULL_TREE);
 
-  section_name = lto_get_section_name (LTO_section_asm, NULL, NULL);
+  section_name = lto_get_section_name (LTO_section_asm, NULL, NULL, false);
   lto_begin_section (section_name, !flag_wpa);
   free (section_name);
 
@@ -1909,7 +1909,7 @@ copy_function (struct cgraph_node *node)
   size_t len;
   const char *name = IDENTIFIER_POINTER (DECL_ASSEMBLER_NAME (function));
   char *section_name =
-    lto_get_section_name (LTO_section_function_body, name, NULL);
+    lto_get_section_name (LTO_section_function_body, name, NULL, false);
   size_t i, j;
   struct lto_in_decl_state *in_state;
   struct lto_out_decl_state *out_state = lto_get_out_decl_state ();
@@ -1988,12 +1988,12 @@ lto_output (void)
 	  decl_state = lto_new_out_decl_state ();
 	  lto_push_out_decl_state (decl_state);
 	  if (gimple_has_body_p (node->symbol.decl) || !flag_wpa)
-	    output_function (node);
+	    output_function (node, false);
 	  else
 	    copy_function (node);
 	  gcc_assert (lto_get_out_decl_state () == decl_state);
 	  lto_pop_out_decl_state ();
-	  lto_record_function_out_decl_state (node->symbol.decl, decl_state);
+	  lto_record_function_out_decl_state (node->symbol.decl, decl_state, false);
 	}
     }
 
@@ -2001,13 +2001,98 @@ lto_output (void)
      be done now to make sure that all the statements in every function
      have been renumbered so that edges can be associated with call
      statements using the statement UIDs.  */
-  output_symtab ();
+  output_symtab (false);
 
 #ifdef ENABLE_CHECKING
   lto_bitmap_free (output);
 #endif
 }
 
+bool
+gate_omp_out (void)
+{
+  return flag_openmp;
+}
+
+static void
+omp_output (void)
+{
+  lto_streamer_hooks_init ();
+  struct lto_out_decl_state *decl_state;
+  int i, n_nodes;
+  lto_symtab_encoder_t encoder = lto_get_out_decl_state ()->symtab_node_encoder;
+
+  /* Initialize the streamer.  */
+  lto_streamer_init ();
+
+  n_nodes = lto_symtab_encoder_size (encoder);
+  /* Process only the functions with bodies.  */
+  for (i = 0; i < n_nodes; i++)
+    {
+      symtab_node snode = lto_symtab_encoder_deref (encoder, i);
+      cgraph_node *node = dyn_cast <cgraph_node> (snode);
+      if (node
+	  && lto_symtab_encoder_encode_body_p (encoder, node)
+	  && !node->symbol.alias
+	  && !node->thunk.thunk_p
+	  && gimple_has_body_p (node->symbol.decl)
+	  && lookup_attribute ("omp declare target",
+			       DECL_ATTRIBUTES (node->symbol.decl)))
+	{
+	  decl_state = lto_new_out_decl_state ();
+	  lto_push_out_decl_state (decl_state);
+	  output_function (node, true);
+	  lto_pop_out_decl_state ();
+	  lto_record_function_out_decl_state (node->symbol.decl, decl_state, true);
+	}
+    }
+
+  /* Emit the callgraph after emitting function bodies.  This needs to
+     be done now to make sure that all the statements in every function
+     have been renumbered so that edges can be associated with call
+     statements using the statement UIDs.  */
+  output_symtab (true);
+}
+
+namespace {
+
+const pass_data pass_data_ipa_omp_gimple_out =
+{
+  IPA_PASS, /* type */
+  "omp_gimple_out", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  true, /* has_gate */
+  false, /* has_execute */
+  TV_IPA_OMP_GIMPLE_OUT, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_ipa_omp_gimple_out : public ipa_opt_pass_d
+{
+public:
+  pass_ipa_omp_gimple_out(gcc::context *ctxt)
+    : ipa_opt_pass_d(pass_data_ipa_omp_gimple_out, ctxt,
+		     NULL, /* generate_summary */
+		     omp_output, /* write_summary */
+		     NULL, /* read_summary */
+		     omp_output, /* write_optimization_summary */
+		     NULL, /* read_optimization_summary */
+		     NULL, /* stmt_fixup */
+		     0, /* function_transform_todo_flags_start */
+		     NULL, /* function_transform */
+		     NULL) /* variable_transform */
+  {}
+
+  /* opt_pass methods: */
+  bool gate () { return gate_omp_out (); }
+
+}; // class pass_ipa_omp_gimple_out
+
+} // anon namespace
 namespace {
 
 const pass_data pass_data_ipa_lto_gimple_out =
@@ -2049,6 +2134,12 @@ public:
 } // anon namespace
 
 ipa_opt_pass_d *
+make_pass_ipa_omp_gimple_out (gcc::context *ctxt)
+{
+  return new pass_ipa_omp_gimple_out (ctxt);
+}
+
+ipa_opt_pass_d *
 make_pass_ipa_lto_gimple_out (gcc::context *ctxt)
 {
   return new pass_ipa_lto_gimple_out (ctxt);
@@ -2324,10 +2415,10 @@ output_symbol_p (symtab_node node)
    SET and VSET are cgraph/varpool node sets we are outputting.  */
 
 static void
-produce_symtab (struct output_block *ob)
+produce_symtab (struct output_block *ob, bool is_omp)
 {
   struct streamer_tree_cache_d *cache = ob->writer_cache;
-  char *section_name = lto_get_section_name (LTO_section_symtab, NULL, NULL);
+  char *section_name = lto_get_section_name (LTO_section_symtab, NULL, NULL, is_omp);
   struct pointer_set_t *seen;
   struct lto_output_stream stream;
   lto_symtab_encoder_t encoder = ob->decl_state->symtab_node_encoder;
@@ -2347,7 +2438,9 @@ produce_symtab (struct output_block *ob)
     {
       symtab_node node = lsei_node (lsei);
 
-      if (!output_symbol_p (node) || DECL_EXTERNAL (node->symbol.decl))
+      if (!output_symbol_p (node) || DECL_EXTERNAL (node->symbol.decl)
+	  || (is_omp && !lookup_attribute ("omp declare target",
+					   DECL_ATTRIBUTES (node->symbol.decl))))
 	continue;
       write_symbol (cache, &stream, node->symbol.decl, seen, false);
     }
@@ -2356,7 +2449,9 @@ produce_symtab (struct output_block *ob)
     {
       symtab_node node = lsei_node (lsei);
 
-      if (!output_symbol_p (node) || !DECL_EXTERNAL (node->symbol.decl))
+      if (!output_symbol_p (node) || !DECL_EXTERNAL (node->symbol.decl)
+	  || (is_omp && !lookup_attribute ("omp declare target",
+					   DECL_ATTRIBUTES (node->symbol.decl))))
 	continue;
       write_symbol (cache, &stream, node->symbol.decl, seen, false);
     }
@@ -2375,7 +2470,7 @@ produce_symtab (struct output_block *ob)
    recover these on other side.  */
 
 static void
-produce_asm_for_decls (void)
+produce_asm_for_decls (bool is_omp)
 {
   struct lto_out_decl_state *out_state;
   struct lto_out_decl_state *fn_out_state;
@@ -2387,12 +2482,17 @@ produce_asm_for_decls (void)
   size_t decl_state_size;
   int32_t num_decl_states;
 
+  vec<lto_out_decl_state_ptr> decl_states = is_omp
+    ? omp_function_decl_states
+    : lto_function_decl_states;
+
+
   ob = create_output_block (LTO_section_decls);
   ob->global = true;
 
   memset (&header, 0, sizeof (struct lto_decl_header));
 
-  section_name = lto_get_section_name (LTO_section_decls, NULL, NULL);
+  section_name = lto_get_section_name (LTO_section_decls, NULL, NULL, is_omp);
   lto_begin_section (section_name, !flag_wpa);
   free (section_name);
 
@@ -2403,12 +2503,12 @@ produce_asm_for_decls (void)
 
   /* Write the global symbols.  */
   out_state = lto_get_out_decl_state ();
-  num_fns = lto_function_decl_states.length ();
+  num_fns = decl_states.length ();
   lto_output_decl_state_streams (ob, out_state);
   for (idx = 0; idx < num_fns; idx++)
     {
       fn_out_state =
-	lto_function_decl_states[idx];
+	decl_states[idx];
       lto_output_decl_state_streams (ob, fn_out_state);
     }
 
@@ -2424,8 +2524,7 @@ produce_asm_for_decls (void)
   decl_state_size += lto_out_decl_state_written_size (out_state);
   for (idx = 0; idx < num_fns; idx++)
     {
-      fn_out_state =
-	lto_function_decl_states[idx];
+      fn_out_state = decl_states[idx];
       decl_state_size += lto_out_decl_state_written_size (fn_out_state);
     }
   header.decl_state_size = decl_state_size;
@@ -2447,8 +2546,7 @@ produce_asm_for_decls (void)
   lto_output_decl_state_refs (ob, decl_state_stream, out_state);
   for (idx = 0; idx < num_fns; idx++)
     {
-      fn_out_state =
-	lto_function_decl_states[idx];
+      fn_out_state = decl_states[idx];
       lto_output_decl_state_refs (ob, decl_state_stream, fn_out_state);
     }
   lto_write_stream (decl_state_stream);
@@ -2462,24 +2560,38 @@ produce_asm_for_decls (void)
   /* Write the symbol table.  It is used by linker to determine dependencies
      and thus we can skip it for WPA.  */
   if (!flag_wpa)
-    produce_symtab (ob);
+    produce_symtab (ob, is_omp);
 
   /* Write command line opts.  */
-  lto_write_options ();
+  if (!is_omp)
+    lto_write_options ();
 
   /* Deallocate memory and clean up.  */
   for (idx = 0; idx < num_fns; idx++)
     {
-      fn_out_state =
-	lto_function_decl_states[idx];
+      fn_out_state = decl_states[idx];
       lto_delete_out_decl_state (fn_out_state);
     }
-  lto_symtab_encoder_delete (ob->decl_state->symtab_node_encoder);
+  if (!is_omp || !flag_lto)
+    lto_symtab_encoder_delete (ob->decl_state->symtab_node_encoder);
+  omp_function_decl_states.release ();
   lto_function_decl_states.release ();
   destroy_output_block (ob);
 }
 
 
+static void
+produce_asm_for_decls_lto ()
+{
+  produce_asm_for_decls (false);
+}
+
+static void
+produce_asm_for_decls_omp ()
+{
+  produce_asm_for_decls (true);
+}
+
 namespace {
 
 const pass_data pass_data_ipa_lto_finish_out =
@@ -2503,9 +2615,9 @@ public:
   pass_ipa_lto_finish_out(gcc::context *ctxt)
     : ipa_opt_pass_d(pass_data_ipa_lto_finish_out, ctxt,
 		     NULL, /* generate_summary */
-		     produce_asm_for_decls, /* write_summary */
+		     produce_asm_for_decls_lto, /* write_summary */
 		     NULL, /* read_summary */
-		     produce_asm_for_decls, /* write_optimization_summary */
+		     produce_asm_for_decls_lto, /* write_optimization_summary */
 		     NULL, /* read_optimization_summary */
 		     NULL, /* stmt_fixup */
 		     0, /* function_transform_todo_flags_start */
@@ -2525,3 +2637,48 @@ make_pass_ipa_lto_finish_out (gcc::context *ctxt)
 {
   return new pass_ipa_lto_finish_out (ctxt);
 }
+namespace {
+
+const pass_data pass_data_ipa_omp_finish_out =
+{
+  IPA_PASS, /* type */
+  "omp_decls_out", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  true, /* has_gate */
+  false, /* has_execute */
+  TV_IPA_OMP_DECL_OUT, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_ipa_omp_finish_out : public ipa_opt_pass_d
+{
+public:
+  pass_ipa_omp_finish_out(gcc::context *ctxt)
+    : ipa_opt_pass_d(pass_data_ipa_omp_finish_out, ctxt,
+		     NULL, /* generate_summary */
+		     produce_asm_for_decls_omp, /* write_summary */
+		     NULL, /* read_summary */
+		     produce_asm_for_decls_omp, /* write_optimization_summary */
+		     NULL, /* read_optimization_summary */
+		     NULL, /* stmt_fixup */
+		     0, /* function_transform_todo_flags_start */
+		     NULL, /* function_transform */
+		     NULL) /* variable_transform */
+  {}
+
+  /* opt_pass methods: */
+  bool gate () { return gate_omp_out (); }
+
+}; // class pass_ipa_omp_finish_out
+
+} // anon namespace
+
+ipa_opt_pass_d *
+make_pass_ipa_omp_finish_out (gcc::context *ctxt)
+{
+  return new pass_ipa_omp_finish_out (ctxt);
+}
diff --git a/gcc/lto-streamer.c b/gcc/lto-streamer.c
index e7b66c1..0686658 100644
--- a/gcc/lto-streamer.c
+++ b/gcc/lto-streamer.c
@@ -140,7 +140,7 @@ lto_bitmap_free (bitmap b)
    to free the returned name.  */
 
 char *
-lto_get_section_name (int section_type, const char *name, struct lto_file_decl_data *f)
+lto_get_section_name (int section_type, const char *name, struct lto_file_decl_data *f, bool is_omp)
 {
   const char *add;
   char post[32];
@@ -173,7 +173,9 @@ lto_get_section_name (int section_type, const char *name, struct lto_file_decl_d
     sprintf (post, "." HOST_WIDE_INT_PRINT_HEX_PURE, f->id);
   else
     sprintf (post, "." HOST_WIDE_INT_PRINT_HEX_PURE, get_random_seed (false)); 
-  return concat (LTO_SECTION_NAME_PREFIX, sep, add, post, NULL);
+  return concat (is_omp ? OMP_SECTION_NAME_PREFIX
+		        : LTO_SECTION_NAME_PREFIX,
+			sep, add, post, NULL);
 }
 
 
diff --git a/gcc/lto-streamer.h b/gcc/lto-streamer.h
index e7c89f1..17f18dd 100644
--- a/gcc/lto-streamer.h
+++ b/gcc/lto-streamer.h
@@ -141,6 +141,7 @@ along with GCC; see the file COPYING3.  If not see
    name for the functions and static_initializers.  For other types of
    sections a '.' and the section type are appended.  */
 #define LTO_SECTION_NAME_PREFIX         ".gnu.lto_"
+#define OMP_SECTION_NAME_PREFIX         ".gnu.target_lto_"
 
 #define LTO_major_version 2
 #define LTO_minor_version 2
@@ -802,14 +803,15 @@ extern void lto_output_type_ref_index (struct lto_out_decl_state *,
 				struct lto_output_stream *, tree);
 extern struct lto_simple_output_block *lto_create_simple_output_block (
 				enum lto_section_type);
-extern void lto_destroy_simple_output_block (struct lto_simple_output_block *);
+extern void lto_destroy_simple_output_block (struct lto_simple_output_block *, bool is_omp);
 extern struct lto_out_decl_state *lto_new_out_decl_state (void);
 extern void lto_delete_out_decl_state (struct lto_out_decl_state *);
 extern struct lto_out_decl_state *lto_get_out_decl_state (void);
 extern void lto_push_out_decl_state (struct lto_out_decl_state *);
 extern struct lto_out_decl_state *lto_pop_out_decl_state (void);
 extern void lto_record_function_out_decl_state (tree,
-						struct lto_out_decl_state *);
+						struct lto_out_decl_state *,
+						bool is_omp);
 extern void lto_append_block (struct lto_output_stream *);
 
 
@@ -817,7 +819,7 @@ extern void lto_append_block (struct lto_output_stream *);
 extern const char *lto_tag_name (enum LTO_tags);
 extern bitmap lto_bitmap_alloc (void);
 extern void lto_bitmap_free (bitmap);
-extern char *lto_get_section_name (int, const char *, struct lto_file_decl_data *);
+extern char *lto_get_section_name (int, const char *, struct lto_file_decl_data *, bool is_omp);
 extern void print_lto_report (const char *);
 extern void lto_streamer_init (void);
 extern bool gate_lto_out (void);
@@ -860,7 +862,7 @@ extern struct output_block *create_output_block (enum lto_section_type);
 extern void destroy_output_block (struct output_block *);
 extern void lto_output_tree (struct output_block *, tree, bool, bool);
 extern void lto_output_toplevel_asms (void);
-extern void produce_asm (struct output_block *ob, tree fn);
+extern void produce_asm (struct output_block *ob, tree fn, bool is_omp);
 void lto_output_decl_state_streams (struct output_block *,
 				    struct lto_out_decl_state *);
 void lto_output_decl_state_refs (struct output_block *,
@@ -883,7 +885,7 @@ void lto_set_symtab_encoder_in_partition (lto_symtab_encoder_t,
 
 bool lto_symtab_encoder_encode_initializer_p (lto_symtab_encoder_t,
 					      struct varpool_node *);
-void output_symtab (void);
+void output_symtab (bool is_omp);
 void input_symtab (void);
 bool referenced_from_other_partition_p (struct ipa_ref_list *,
 				        lto_symtab_encoder_t);
@@ -916,6 +918,7 @@ extern const char *lto_section_name[];
 /* Holds all the out decl states of functions output so far in the
    current output file.  */
 extern vec<lto_out_decl_state_ptr> lto_function_decl_states;
+extern vec<lto_out_decl_state_ptr> omp_function_decl_states;
 
 /* Return true if LTO tag TAG corresponds to a tree code.  */
 static inline bool
diff --git a/gcc/lto/lto.c b/gcc/lto/lto.c
index c854589..829edfc 100644
--- a/gcc/lto/lto.c
+++ b/gcc/lto/lto.c
@@ -2925,7 +2925,7 @@ get_section_data (struct lto_file_decl_data *file_data,
   htab_t section_hash_table = file_data->section_hash_table;
   struct lto_section_slot *f_slot;
   struct lto_section_slot s_slot;
-  const char *section_name = lto_get_section_name (section_type, name, file_data);
+  const char *section_name = lto_get_section_name (section_type, name, file_data, false);
   char *data = NULL;
 
   *len = 0;
diff --git a/gcc/passes.c b/gcc/passes.c
index e3a7212..678755b 100644
--- a/gcc/passes.c
+++ b/gcc/passes.c
@@ -2312,9 +2312,8 @@ ipa_write_summaries (void)
   struct cgraph_node *node;
   struct cgraph_node **order;
 
-  if (!flag_generate_lto || seen_error ())
+  if (!(flag_generate_lto || flag_openmp) || seen_error ())
     return;
-
   encoder = lto_symtab_encoder_new (false);
 
   /* Create the callgraph set in the same order used in
diff --git a/gcc/passes.def b/gcc/passes.def
index b289713..5fc0588 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -102,6 +102,8 @@ along with GCC; see the file COPYING3.  If not see
   TERMINATE_PASS_LIST ()
 
   INSERT_PASSES_AFTER (all_regular_ipa_passes)
+  NEXT_PASS (pass_ipa_omp_gimple_out);
+  NEXT_PASS (pass_ipa_omp_finish_out);
   NEXT_PASS (pass_ipa_whole_program_visibility);
   NEXT_PASS (pass_ipa_profile);
   NEXT_PASS (pass_ipa_cp);
diff --git a/gcc/timevar.def b/gcc/timevar.def
index bd1ee76..3660e78 100644
--- a/gcc/timevar.def
+++ b/gcc/timevar.def
@@ -72,8 +72,10 @@ DEFTIMEVAR (TV_IPA_FNSPLIT           , "ipa function splitting")
 DEFTIMEVAR (TV_IPA_OPT		     , "ipa various optimizations")
 DEFTIMEVAR (TV_IPA_LTO_GIMPLE_IN     , "ipa lto gimple in")
 DEFTIMEVAR (TV_IPA_LTO_GIMPLE_OUT    , "ipa lto gimple out")
+DEFTIMEVAR (TV_IPA_OMP_GIMPLE_OUT    , "ipa omp gimple out")
 DEFTIMEVAR (TV_IPA_LTO_DECL_IN       , "ipa lto decl in")
 DEFTIMEVAR (TV_IPA_LTO_DECL_OUT      , "ipa lto decl out")
+DEFTIMEVAR (TV_IPA_OMP_DECL_OUT      , "ipa omp decl out")
 DEFTIMEVAR (TV_IPA_LTO_DECL_INIT_IO  , "ipa lto decl init I/O")
 DEFTIMEVAR (TV_IPA_LTO_CGRAPH_IO     , "ipa lto cgraph I/O")
 DEFTIMEVAR (TV_IPA_LTO_DECL_MERGE    , "ipa lto decl merge")
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 787a49b..ff89bec 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -451,6 +451,7 @@ extern simple_ipa_opt_pass *make_pass_early_local_passes (gcc::context *ctxt);
 
 extern ipa_opt_pass_d *make_pass_ipa_whole_program_visibility (gcc::context
 							       *ctxt);
+extern ipa_opt_pass_d *make_pass_ipa_omp_gimple_out (gcc::context *ctxt);
 extern ipa_opt_pass_d *make_pass_ipa_lto_gimple_out (gcc::context *ctxt);
 extern simple_ipa_opt_pass *make_pass_ipa_increase_alignment (gcc::context
 							      *ctxt);
@@ -463,6 +464,7 @@ extern ipa_opt_pass_d *make_pass_ipa_reference (gcc::context *ctxt);
 extern ipa_opt_pass_d *make_pass_ipa_pure_const (gcc::context *ctxt);
 extern simple_ipa_opt_pass *make_pass_ipa_pta (gcc::context *ctxt);
 extern ipa_opt_pass_d *make_pass_ipa_lto_finish_out (gcc::context *ctxt);
+extern ipa_opt_pass_d *make_pass_ipa_omp_finish_out (gcc::context *ctxt);
 extern simple_ipa_opt_pass *make_pass_ipa_tm (gcc::context *ctxt);
 extern ipa_opt_pass_d *make_pass_ipa_profile (gcc::context *ctxt);
 extern ipa_opt_pass_d *make_pass_ipa_cdtor_merge (gcc::context *ctxt);

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2013-09-13 12:36                                       ` Jakub Jelinek
  2013-09-13 13:11                                         ` Michael V. Zolotukhin
@ 2013-09-13 15:34                                         ` Jakub Jelinek
  1 sibling, 0 replies; 56+ messages in thread
From: Jakub Jelinek @ 2013-09-13 15:34 UTC (permalink / raw)
  To: Michael V. Zolotukhin; +Cc: Kirill Yukhin, Richard Henderson, gcc, triegel

On Fri, Sep 13, 2013 at 02:36:14PM +0200, Jakub Jelinek wrote:
> FYI, I'm attaching a WIP patch with the splay tree stuff, debugging
> target-1.c with OMP_DEFAULT_DEVICE=257 right now (with all tgtv related
> stuff removed), but hitting some error regarding OMP_CLAUSE_MAP_POINTER
> reallocation, supposedly a bug on the compiler side.  But e.g. fn2 and fn3
> already seem to pass with that, only fn4 is problematic.

Ok, found the bug, this should fix fn4 and the whole test passes
now with OMP_DEFAULT_DEVICE=257.

Will commit once the http://gcc.gnu.org/ml/gcc-patches/2013-09/msg01044.html
issue is resolved.

2013-09-13  Jakub Jelinek  <jakub@redhat.com>

	* omp-low.c (install_var_field): Use (mask & 4) to request double
	indirection.
	(scan_sharing_clauses): For OMP_CLAUSE_MAP_POINTER arrays pass
	7 instead of 3 to install_var_field.
	(lower_omp_target): For OMP_CLAUSE_MAP_POINTER arrays add extra
	indirection.

--- gcc/omp-low.c.jj	2013-09-12 13:55:34.000000000 +0200
+++ gcc/omp-low.c	2013-09-13 15:57:58.425272908 +0200
@@ -1017,7 +1017,12 @@ install_var_field (tree var, bool by_ref
 	      || !splay_tree_lookup (ctx->sfield_map, (splay_tree_key) var));
 
   type = TREE_TYPE (var);
-  if (by_ref)
+  if (mask & 4)
+    {
+      gcc_assert (TREE_CODE (type) == ARRAY_TYPE);
+      type = build_pointer_type (build_pointer_type (type));
+    }
+  else if (by_ref)
     type = build_pointer_type (type);
   else if ((mask & 3) == 1 && is_reference (var))
     type = TREE_TYPE (type);
@@ -1587,7 +1592,13 @@ scan_sharing_clauses (tree clauses, omp_
 		}
 	      else
 		{
-		  install_var_field (decl, true, 3, ctx);
+		  if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_MAP
+		      && OMP_CLAUSE_MAP_KIND (c) == OMP_CLAUSE_MAP_POINTER
+		      && !OMP_CLAUSE_MAP_ZERO_BIAS_ARRAY_SECTION (c)
+		      && TREE_CODE (TREE_TYPE (decl)) == ARRAY_TYPE)
+		    install_var_field (decl, true, 7, ctx);
+		  else
+		    install_var_field (decl, true, 3, ctx);
 		  if (gimple_omp_target_kind (ctx->stmt)
 		      == GF_OMP_TARGET_KIND_REGION)
 		    install_var_local (decl, ctx);
@@ -9331,6 +9342,11 @@ lower_omp_target (gimple_stmt_iterator *
 	  {
 	    x = build_receiver_ref (var, true, ctx);
 	    tree new_var = lookup_decl (var, ctx);
+	    if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_MAP
+		&& OMP_CLAUSE_MAP_KIND (c) == OMP_CLAUSE_MAP_POINTER
+		&& !OMP_CLAUSE_MAP_ZERO_BIAS_ARRAY_SECTION (c)
+		&& TREE_CODE (TREE_TYPE (var)) == ARRAY_TYPE)
+	      x = build_simple_mem_ref (x);
 	    SET_DECL_VALUE_EXPR (new_var, x);
 	    DECL_HAS_VALUE_EXPR_P (new_var) = 1;
 	  }
@@ -9435,7 +9451,20 @@ lower_omp_target (gimple_stmt_iterator *
 	      {
 		tree var = lookup_decl_in_outer_ctx (ovar, ctx);
 		tree x = build_sender_ref (ovar, ctx);
-		if (is_gimple_reg (var))
+		if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_MAP
+		    && OMP_CLAUSE_MAP_KIND (c) == OMP_CLAUSE_MAP_POINTER
+		    && !OMP_CLAUSE_MAP_ZERO_BIAS_ARRAY_SECTION (c)
+		    && TREE_CODE (TREE_TYPE (ovar)) == ARRAY_TYPE)
+		  {
+		    gcc_assert (kind == GF_OMP_TARGET_KIND_REGION);
+		    tree avar
+		      = create_tmp_var (TREE_TYPE (TREE_TYPE (x)), NULL);
+		    mark_addressable (avar);
+		    gimplify_assign (avar, build_fold_addr_expr (var), &ilist);
+		    avar = build_fold_addr_expr (avar);
+		    gimplify_assign (x, avar, &ilist);
+		  }
+		else if (is_gimple_reg (var))
 		  {
 		    gcc_assert (kind == GF_OMP_TARGET_KIND_REGION);
 		    tree avar = create_tmp_var (TREE_TYPE (var), NULL);
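
For reference, what this generates for an OMP_CLAUSE_MAP_POINTER array is
conceptually (pseudo-C, not the exact GIMPLE; the field and variable names
are made up):

  int arr[64];                  /* array mapped via a pointer clause */
  int (*avar)[64] = &arr;       /* fresh addressable temporary */
  .omp_data_o.arr = &avar;      /* sender struct gets double indirection */

and the receiver side strips the extra level again with a MEM_REF before
installing the DECL_VALUE_EXPR.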


	Jakub

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2013-09-13  9:35                         ` Michael Zolotukhin
  2013-09-13 10:52                           ` Kirill Yukhin
  2013-09-13 11:04                           ` Nathan Sidwell
@ 2013-09-16  9:35                           ` Jakub Jelinek
  2013-09-17 12:05                             ` Michael V. Zolotukhin
  2 siblings, 1 reply; 56+ messages in thread
From: Jakub Jelinek @ 2013-09-16  9:35 UTC (permalink / raw)
  To: Michael Zolotukhin
  Cc: Kirill Yukhin, Richard Henderson, GCC Development, triegel

On Fri, Sep 13, 2013 at 01:34:43PM +0400, Michael Zolotukhin wrote:
> 1.2.  Linking
> 
> When all source files are compiled, a linker is invoked.  The linker is passed
> a special option to invoke openmp-plugin.  The plugin is responsible for
> producing target-side executables - for each target it calls the corresponding
> target compiler and linker.

I thought the idea was to just use the LTO plugin for that, teach it to
handle the .gnu.target_lto* sections specially.

> The target-side GCC is invoked to load Gimple IR from .gnu.target_lto sections
> of the FAT-object and compile it to target-side objects which later will be
> used by target-side linker.
> 
> The host-side linker needs libgomp alongside standard libraries like
> libc/libm to successfully resolve symbols generated by the host compiler.  The
> target-side linker needs CRT.O, containing main-routine for target-side
> executable and target-specific versions of standard libraries.

I'd say main shouldn't be linked into the shared libraries (well, for MIC)
that you put into the binaries resp. shared libraries, but into special
section inside of libgomp.so from which the runtime would upload it into the
binary.  Because, if you have main in every shared library, what do you do
if you have multiple shared libraries with offloading code in it?
What the plugin will need to do is for each of the shared libraries in the
link, extract from the special sections the embedded target shared libraries
(or executables) into temporary files and pass that to the target linker, so
that if you have say
#pragma omp declare target
extern int foo;
extern void bar (void);
#pragma omp end declare target
in one shared library and the definitions thereof in a different one, you
can link that together.

> 1.3.  Execution
> 
> Host-side executable contains calls to libgomp library, which interfaces all
> interactions with target-devices.
> On loading, the executable calls GOMP_target_init from libgomp.so, which will

No.  The first call to omp_get_num_devices, GOMP_target, GOMP_target_data
or GOMP_target_update (using pthread_once) scans for the available target
devices, there is no GOMP_target_init, and the first GOMP_target,
GOMP_target_data or GOMP_target_update from a particular shared library or
binary (all of them will have __OPENMP_TARGET__ weak hidden symbol as one of
the arguments) offloads the embedded shared library into target (resp.
compiles HSAIL/PTX and uploads or whatever).
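
Schematically (just a sketch; gomp_find_available_devices and
gomp_num_devices are placeholder names, not the actual implementation):

  #include <pthread.h>

  extern void gomp_find_available_devices (void);  /* placeholder */
  extern int gomp_num_devices;                     /* placeholder */

  static pthread_once_t gomp_is_initialized = PTHREAD_ONCE_INIT;

  int
  omp_get_num_devices (void)
  {
    /* The first caller scans for available target devices; everybody
       else just waits for that scan to finish.  */
    pthread_once (&gomp_is_initialized, gomp_find_available_devices);
    return gomp_num_devices;
  }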

> load the target executables onto target-devices and start them.  Since this
> moment, the devices are ready to execute requested code and interact with the
> main host-process.
> 
> When a host-side program calls libgomp functions related to the offloading,
> libgomp decides, whether it's profitable to offload, and which device to choose
> for that.  In order to do that, libgomp calls available plugins and checks
> which devices are ready to execute offloaded code.  Available plugins should be
> located in a specified folder and should implement a certain interface.
> 
> Another important function of libgomp is host-target memory mapping and keeping
> information about mapped regions and their types.

The only "type" info needed is the copy_from flag, all the rest happens on
mapping memory to the device, so don't need to be tracked afterwards.

> TBD: probably, it's better to 'hard-code' available plugin during build of
> libgomp (e.g., at configure step).

That is certainly doable.

> 2.  LINKER PLUGINS INFRASTRUCTURE
> 
> 2.1.  Overview
> 
> When -flto or -fopenmp option is given to the GCC driver, linker plugin
> invocation is triggered.  The plugin claims the input files containing
> .gnu.lto* or .gnu.target_lto* sections for further processing and creates
> resolutions file.
> After this preliminary work, LTO-wrapper is called.  It is responsible for
> sequential calls of GCC.
> 
> The first call is needed to run WPA, which performs usual LTO partitioning as
> well as partitioning of OpenMP-target sections.  WPA reads bytecode of:
>   1) all functions and variables with "omp declare target" attribute;
>   2) the outlined bodies of #pragma omp target turned into '*.ompfn' functions;

1) and 2) are basically the same, because omp expansion adds the "omp declare
target" attribute to the outlined bodies too.

I don't see why you want a WPA phase, at least when not also using -flto.
IMNSHO you want to compile each .gnu.target_lto* set of input sections
individually, using one target compiler driver invocation that will
generate object files, and then just link them together.

> 3.  OPENMP PRAGMA TARGET HANDLING IN MIDDLE-END
> 
> Middle end work is done in two omp passes.  Specifically, omp-lower pass:
>   * Creates outlined function with no body
>   * Adds #pragma omp return in the end of the region
>   * Creates empty struct args_data
>   * For each var referenced in clauses  (e.g. int i):
>     -  Adds entry to data_arr, data_sizes and data_kind arrays describing this
>        variable, its size and mapping type
>     -  Adds assignment before call to outlined function : args_data.i = &i
>     -  Replace uses of i with uses of args_data->i inside the region
> 
> Then, omp-expand pass:
>   * Moves the region to the outlined function
>   * Adds a call to libGOMP to maybe offload this function:
>     GOMP_target (condition /* evaluated expression from IF clause */,
> 		 device_no /* a number from DEVICE clause */,
> 		 foo, .foo.,
> 		 data_arr, data_size, data_kinds, 1 /* size of arrays */);

Except for the #pragma omp declare target arrays creation all this is
implemented, the GOMP_target arguments have different order and some aren't
present, just the current fnname argument will be in fact __OPENMP_TARGET__,
i.e. either NULL, or address of a section which contains some info on what
offloading targets are supported in the current binary resp. shared library,
where to find them, where to find their mapping etc.
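
For reference, the entry point currently looks roughly like this (a sketch of
the present prototype, with fnname still in the place that __OPENMP_TARGET__
will take):

  void GOMP_target (int device, void (*fn) (void *), const char *fnname,
                    size_t mapnum, void **hostaddrs, size_t *sizes,
                    unsigned char *kinds);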

> GOMP_target routine takes additional arguments:
>   * Address of the host version of outlined function.  It is used when runtime
> decides to perform host fallback instead of offloading to an accelerator.
>   * Name of the target version of outlined function.  This is used when runtime
> decides to offload.  It cannot directly call a function on a target device, so
> it calls the corresponding plugin and gives it a function name to invoke.

See above and my earlier mail why a name is a bad idea.  You will look up
the { fnaddr, fnaddr + 1 } address range in the target mapping structure
instead.

> GOMP_target, GOMP_target_data, GOMP_target_data_end and GOMP_target_update
> maintain a global structure describing the current mapping, which will be
> covered in the next section, and perform the actual data marshalling:
>   * GOMP_target copies regions with kind TO or TOFROM to device before
> offloading and copies regions with kind FROM or TOFROM from the device when the
> offloading is finished.  In case of host fallback no copying is performed.
>   * GOMP_target_data copies regions with kind TO or TOFROM to the device.
>   * GOMP_target_data_end copies regions with kind FROM or TOFROM from the
> device.
>   * GOMP_target_data_update copies all regions according to their types to and
> from the device.

All this is implemented, just it will need to be changed to use device hooks
to perform the actual target allocation/deallocation/copyto/copyfrom
operations.
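
I.e. each device descriptor would carry a set of callbacks along these lines
(a sketch; the names are made up):

  struct gomp_device_descr
  {
    int id;                                     /* device number */
    void *(*device_alloc) (size_t size);
    void (*device_free) (void *tgt_ptr);
    void *(*host2dev) (void *tgt, const void *host, size_t size);
    void *(*dev2host) (void *host, const void *tgt, size_t size);
  };
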
> 
> 4.2.  Maintaining info about mapped regions
> 
> > Every libGOMP routine dealing with mapped memory regions is responsible for
> > accurately maintaining a global data structure describing this mapping.  This
> data structure is a binary search tree containing structures 
> struct
>   {
>     void *host_address;
>     void *target_address;
>     size_t region_size;
>     enum {TO, TOFROM, FROM, ALLOC} region_type;
>   }
> with host addresses used as a key.
> 
> > The data structure allows checking whether a given host address is mapped or
> > not.  In order to do that, on every request it needs to find out whether the
> requested interval is covered with already mapped ones and check if all of them
> have a corresponding type.

This is again already implemented, just the splay tree and lock will need
moving from a global variable into the device descriptor.

	Jakub

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2013-09-16  9:35                           ` Jakub Jelinek
@ 2013-09-17 12:05                             ` Michael V. Zolotukhin
  2013-09-17 12:30                               ` Jakub Jelinek
  0 siblings, 1 reply; 56+ messages in thread
From: Michael V. Zolotukhin @ 2013-09-17 12:05 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Kirill Yukhin, Richard Henderson, GCC Development, triegel

> > 1.2.  Linking
> > 
> > When all source files are compiled, a linker is invoked.  The linker is passed
> > a special option to invoke openmp-plugin.  The plugin is responsible for
> > producing target-side executables - for each target it calls the corresponding
> > target compiler and linker.
> 
> I thought the idea was to just use the LTO plugin for that, teach it to
> handle the .gnu.target_lto* sections specially.
Well, I think we don't have a decision here yet - we'll try one of the options
and see if it goes well.

> > The target-side GCC is invoked to load Gimple IR from .gnu.target_lto sections
> > of the FAT-object and compile it to target-side objects which later will be
> > used by target-side linker.
> > 
> > The host-side linker needs libgomp alongside standard libraries like
> > libc/libm to successfully resolve symbols generated by the host compiler.  The
> > target-side linker needs CRT.O, containing main-routine for target-side
> > executable and target-specific versions of standard libraries.
> 
> I'd say main shouldn't be linked into the shared libraries (well, for MIC)
> that you put into the binaries resp. shared libraries, but into special
> section inside of libgomp.so from which the runtime would upload it into the
> binary.  Because, if you have main in every shared library, what do you do
> if you have multiple shared libraries with offloading code in it?
> What the plugin will need to do is for each of the shared libraries in the
> link, extract from the special sections the embedded target shared libraries
> (or executables) into temporary files and pass that to the target linker, so
> that if you have say
> #pragma omp declare target
> extern int foo;
> extern void bar (void);
> #pragma omp end declare target
> in one shared library and the definitions thereof in a different one, you
> can link that together.
That's true.  However, for now I am not considering offloading code in shared
objects, though it is surely an important task for the future and should be
taken into account now.

> > 1.3.  Execution
> > 
> > Host-side executable contains calls to libgomp library, which interfaces all
> > interactions with target-devices.
> > On loading, the executable calls GOMP_target_init from libgomp.so, which will
> 
> No.  The first call to omp_get_num_devices, GOMP_target, GOMP_target_data
> or GOMP_target_update (using pthread_once) scans for the available target
> devices, there is no GOMP_target_init, 
That's true.  I wrote this when I wasn't aware of our current approach.

> ... and the first GOMP_target,
> GOMP_target_data or GOMP_target_update from a particular shared library or
> binary (all of them will have __OPENMP_TARGET__ weak hidden symbol as one of
> the arguments) offloads the embedded shared library into target (resp.
> compiles HSAIL/PTX and uploads or whatever).
What is that __OPENMP_TARGET__ argument?  Is it an address of section with
target code or something like that?  I am not sure I am completely clear with
this part.  (Please also find my other questions/comments below).

> > load the target executables onto target-devices and start them.  Since this
> > moment, the devices are ready to execute requested code and interact with the
> > main host-process.
> > 
> > When a host-side program calls libgomp functions related to the offloading,
> > libgomp decides, whether it's profitable to offload, and which device to choose
> > for that.  In order to do that, libgomp calls available plugins and checks
> > which devices are ready to execute offloaded code.  Available plugins should be
> > located in a specified folder and should implement a certain interface.
> > 
> > Another important function of libgomp is host-target memory mapping and keeping
> > information about mapped regions and their types.
> 
> The only "type" info needed is the copy_from flag, all the rest happens on
> mapping memory to the device, so don't need to be tracked afterwards.
Yes, that's what was meant here.

> > TBD: probably, it's better to 'hard-code' available plugin during build of
> > libgomp (e.g., at configure step).
> 
> That is certainly doable.
Yes, it is doable, but do we want to do this, or is scanning some folder for
suitable plugins sufficient for us?

> 
> > 2.  LINKER PLUGINS INFRASTRUCTURE
> > 
> > 2.1.  Overview
> > 
> > When -flto or -fopenmp option is given to the GCC driver, linker plugin
> > invocation is triggered.  The plugin claims the input files containing
> > .gnu.lto* or .gnu.target_lto* sections for further processing and creates
> > resolutions file.
> > After this preliminary work, LTO-wrapper is called.  It is responsible for
> > sequential calls of GCC.
> > 
> > The first call is needed to run WPA, which performs usual LTO partitioning as
> > well as partitioning of OpenMP-target sections.  WPA reads bytecode of:
> >   1) all functions and variables with "omp declare target" attribute;
> >   2) the outlined bodies of #pragma omp target turned into '*.ompfn' functions;
> 
> 1) and 2) are basically the same, because omp expansion adds the "omp declare
> target" attribute to the outlined bodies too.
> 
> I don't see why you want a WPA phase, at least when not also using -flto.
> IMNSHO you want to compile each .gnu.target_lto* set of input sections
> individually, using one target compiler driver invocation that will
> generate object files, and then just link them together.
Yes, you are right here.  WPA would only be invoked when '-flto' is given, and
it would work just as usual.  The target compilers would be called by
LTO-wrapper independently of WPA.

> 
> > 3.  OPENMP PRAGMA TARGET HANDLING IN MIDDLE-END
> > 
> > Middle end work is done in two omp passes.  Specifically, omp-lower pass:
> >   * Creates outlined function with no body
> >   * Adds #pragma omp return in the end of the region
> >   * Creates empty struct args_data
> >   * For each var referenced in clauses  (e.g. int i):
> >     -  Adds entry to data_arr, data_sizes and data_kind arrays describing this
> >        variable, its size and mapping type
> >     -  Adds assignment before call to outlined function : args_data.i = &i
> >     -  Replace uses of i with uses of args_data->i inside the region
> > 
> > Then, omp-expand pass:
> >   * Moves the region to the outlined function
> >   * Adds a call to libGOMP to maybe offload this function:
> >     GOMP_target (condition /* evaluated expression from IF clause */,
> > 		 device_no /* a number from DEVICE clause */,
> > 		 foo, .foo.,
> > 		 data_arr, data_size, data_kinds, 1 /* size of arrays */);
> 
> Except for the #pragma omp declare target arrays creation all this is
> implemented, the GOMP_target arguments have different order and some aren't
> present, just the current fnname argument will be in fact __OPENMP_TARGET__,
> i.e. either NULL, or address of a section which contains some info on what
> offloading targets are supported in the current binary resp. shared library,
> where to find them, where to find their mapping etc.
Yep, much of this is already implemented, and some things even look different
now, but this document was intended as documentation of what is done and how,
to make the life of future contributors easier.

> > GOMP_target routine takes additional arguments:
> >   * Address of the host version of outlined function.  It is used when runtime
> > decides to perform host fallback instead of offloading to an accelerator.
> >   * Name of the target version of outlined function.  This is used when runtime
> > decides to offload.  It cannot directly call a function on a target device, so
> > it calls the corresponding plugin and gives it a function name to invoke.
> 
> See above and my earlier mail why a name is a bad idea.  You will look up
> the { fnaddr, fnaddr + 1 } address range in the target mapping structure
> instead.
So, fnaddr is the host function address, right?  Then we look it up in
the splay tree and find the corresponding address on the target side, correct?
What do we map for the functions?
Also, AFAIK COI needs a name passed to it to run an offloaded function, so we
might want to keep the name anyway.

> > GOMP_target, GOMP_target_data, GOMP_target_data_end and GOMP_target_update
> > maintain a global structure describing the current mapping, which will be
> > covered in the next section, and perform the actual data marshalling:
> >   * GOMP_target copies regions with kind TO or TOFROM to device before
> > offloading and copies regions with kind FROM or TOFROM from the device when the
> > offloading is finished.  In case of host fallback no copying is performed.
> >   * GOMP_target_data copies regions with kind TO or TOFROM to the device.
> >   * GOMP_target_data_end copies regions with kind FROM or TOFROM from the
> > device.
> >   * GOMP_target_data_update copies all regions according to their types to and
> > from the device.
> 
> All this is implemented, just it will need to be changed to use device hooks
> to perform the actual target allocation/deallocation/copyto/copyfrom
> operations.
> > 
> > 4.2.  Maintaining info about mapped regions
> > 
> > Every libGOMP routine dealing with mapped memory regions is responsible for
> > accurately maintaining a global data structure describing this mapping.  This
> > data structure is a binary search tree containing structures 
> > struct
> >   {
> >     void *host_address;
> >     void *target_address;
> >     size_t region_size;
> >     enum {TO, TOFROM, FROM, ALLOC} region_type;
> >   }
> > with host addresses used as a key.
> > 
> > The data structure allows checking whether a given host address is mapped or
> > not.  In order to do that, on every request it needs to find out whether the
> > requested interval is covered with already mapped ones and check if all of them
> > have a corresponding type.
> 
> This is again already implemented, just the splay tree and lock will need
> moving from a global variable into the device descriptor.
Yep, but again, this document was intended to describe the decisions we've
made for the implementation of OpenMP 4.0 offloading support.

In general, what do you think: is it worth maintaining such a document (and
probably later uploading it as a wiki page), or do we not need it and should
it just be dropped?

Michael
> 
> 	Jakub

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2013-09-17 12:05                             ` Michael V. Zolotukhin
@ 2013-09-17 12:30                               ` Jakub Jelinek
  2013-10-28 10:43                                 ` Ilya Verbin
  0 siblings, 1 reply; 56+ messages in thread
From: Jakub Jelinek @ 2013-09-17 12:30 UTC (permalink / raw)
  To: Michael V. Zolotukhin
  Cc: Kirill Yukhin, Richard Henderson, GCC Development, triegel

On Tue, Sep 17, 2013 at 04:04:54PM +0400, Michael V. Zolotukhin wrote:
> > ... and the first GOMP_target,
> > GOMP_target_data or GOMP_target_update from a particular shared library or
> > binary (all of them will have __OPENMP_TARGET__ weak hidden symbol as one of
> > the arguments) offloads the embedded shared library into target (resp.
> > compiles HSAIL/PTX and uploads or whatever).
> What is that __OPENMP_TARGET__ argument?  Is it an address of section with
> target code or something like that?  I am not sure I am completely clear with
> this part.  (Please also find my other questions/comments below).

See http://gcc.gnu.org/ml/gcc-patches/2013-09/msg00276.html

> > See above and my earlier mail why a name is a bad idea.  You will look up
> > the { fnaddr, fnaddr + 1 } address range in the target mapping structure
> > instead.
> So, fnaddr is the host function address, right?  Then we look it up in
> the splay tree and find the corresponding address on the target side, correct?
> What do we map for the functions?
> Also, AFAIK COI needs a name passed to it to run an offloaded function, so we
> might want to keep the name anyway.

See above, names are just a bad idea.  You can just use some magic wrapper
name in the target binary (the one sitting in libgomp), to which you just
pass the pair of function address and its argument, and the named function
will just read the (target) function pointer and (target) pointer argument
from the misc data block and tail call that function.
Looking at COI source, the function is:
COINATIVELIBEXPORT
void Foo (uint32_t         in_BufferCount,
          void**           in_ppBufferPointers,
          uint64_t*        in_pBufferLengths,
          void*            in_pMiscData,
          uint16_t         in_MiscDataLength,
          void*            in_pReturnValue,
          uint16_t         in_ReturnValueLength)
{
  ...
}
anyway, so it wouldn't match the API we have right now very well (just one
.omp_target_s * argument).
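
So the single exported wrapper could be something like this (a rough sketch;
the misc data layout and the wrapper name are invented):

  #include <stdint.h>
  /* COINATIVELIBEXPORT and the signature follow the COI example above.  */

  COINATIVELIBEXPORT
  void gomp_run_offloaded_fn (uint32_t in_BufferCount,
                              void **in_ppBufferPointers,
                              uint64_t *in_pBufferLengths,
                              void *in_pMiscData,
                              uint16_t in_MiscDataLength,
                              void *in_pReturnValue,
                              uint16_t in_ReturnValueLength)
  {
    /* The host packs the (target) function pointer and its single
       (target) argument into the misc data block.  */
    struct desc { void (*fn) (void *); void *data; };
    struct desc *d = (struct desc *) in_pMiscData;
    d->fn (d->data);
  }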

> Yep, but again, this document was intended to describe the decisions we've
> made for the implementation of OpenMP 4.0 offloading support.
> 
> In general, what do you think: is it worth maintaining such a document (and
> probably later uploading it as a wiki page), or do we not need it and should
> it just be dropped?

I don't know, if it helps you to work on the implementation, maybe, but if
it gets stale quickly and won't match the actual implementation, it won't
help much.

	Jakub

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2013-09-17 12:30                               ` Jakub Jelinek
@ 2013-10-28 10:43                                 ` Ilya Verbin
  2013-10-29  8:04                                   ` Jakub Jelinek
  0 siblings, 1 reply; 56+ messages in thread
From: Ilya Verbin @ 2013-10-28 10:43 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Michael V. Zolotukhin, Kirill Yukhin, Richard Henderson,
	GCC Development, triegel, sergos.gnu

Hi Jakub,

We have a MIC offload runtime library (liboffload), which is an abstraction over
COI.  Currently it is a part of ICC, but there are plans to open-source it.
However, liboffload requires somewhat different tables compared to what we have
agreed on.  The liboffload tables serve to associate host functions with target
functions.  They should be inserted at compile time into special sections of
every executable or DSO with #pragma omp target.  The tables contain pairs of:
{ char *name, void *host_addr } for host binaries, and { char *name, void
*target_addr } for target.  The "name" might not be the actual function name,
but just a key for the host->target mapping.
So, in this approach, GOMP_target will take host_addr as input, then the MIC
plugin will convert it into the "name" via the host-side table and call it on
MIC using the liboffload interface.  Perhaps an additional table will be
created by the MIC plugin to speed up the name lookup.  This should also
eliminate problems with function re-ordering at LTO, where address tables from
different objects will be mixed into one in the executable/shared library.
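
E.g. the plugin-side lookup could be as simple as this (a sketch; the entry
layout follows the description above):

  struct table_entry { const char *name; void *host_addr; };

  static const char *
  name_for_host_addr (const struct table_entry *tbl, size_t n,
                      void *host_addr)
  {
    size_t i;
    for (i = 0; i < n; i++)
      if (tbl[i].host_addr == host_addr)
        return tbl[i].name;
    return NULL;
  }
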
What do you think, is it ok to save this additional data in the tables?

Thanks,
  -- Ilya

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2013-10-28 10:43                                 ` Ilya Verbin
@ 2013-10-29  8:04                                   ` Jakub Jelinek
  2014-01-31 18:03                                     ` Ilya Verbin
  0 siblings, 1 reply; 56+ messages in thread
From: Jakub Jelinek @ 2013-10-29  8:04 UTC (permalink / raw)
  To: Ilya Verbin
  Cc: Michael V. Zolotukhin, Kirill Yukhin, Richard Henderson,
	GCC Development, triegel, sergos.gnu

On Mon, Oct 28, 2013 at 02:42:37PM +0400, Ilya Verbin wrote:
> We have a MIC offload runtime library (liboffload), which is an abstraction over
> COI.  Currently it is a part of ICC, but there are plans to open-source it.
> However, liboffload requires somewhat different tables compared to what we have
> agreed on.  The liboffload tables serve to associate host functions with target
> functions.  They should be inserted at compile time into special sections of
> every executable or DSO with #pragma omp target.  The tables contain pairs of:
> { char *name, void *host_addr } for host binaries, and { char *name, void
> *target_addr } for target.  The "name" might not be the actual function name,
> but just a key for the host->target mapping.
> So, in this approach, GOMP_target will take host_addr as input, then the MIC
> plugin will convert it into the "name" via the host-side table and call it on
> MIC using the liboffload interface.  Perhaps an additional table will be
> created by the MIC plugin to speed up the name lookup.  This should also
> eliminate problems with function re-ordering at LTO, where address tables from
> different objects will be mixed into one in the executable/shared library.
> What do you think, is it ok to save this additional data in the tables?

See my earlier comments on why names are bad.  For exported functions, names
can work (though, if you use them for the whole process, things like
visibility attributes can complicate it), but for non-exported functions you
either need to fall back to names derived from the first exported symbol, if
any, or a UUID etc., which are highly undesirable for bitwise reproducible
builds.  If liboffload requires something else, isn't it a library at too
high a level, and shouldn't you use COI instead?

	Jakub

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2013-10-29  8:04                                   ` Jakub Jelinek
@ 2014-01-31 18:03                                     ` Ilya Verbin
  2014-01-31 19:43                                       ` Jakub Jelinek
  0 siblings, 1 reply; 56+ messages in thread
From: Ilya Verbin @ 2014-01-31 18:03 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Michael V. Zolotukhin, Kirill Yukhin, Richard Henderson,
	GCC Development, triegel, Sergey Ostanevich

Looks like there is a bug (in GOMP_target lowering? or in
gomp_map_vars_existing?)
The reproducer:

#include <stdio.h>
#include <stdlib.h>

#define N 1000

void foo ()
{
  int *a = malloc (N * sizeof (int));
  printf ("1: %p\n", a);
  #pragma omp target data map(tofrom: a[0:N])
  {
    printf ("2: %p\n", a);
    #pragma omp target
    {
      int i;
      for (i = 0; i < N; i++)
        a[i] = i;
    }
    printf ("3: %p\n", a);
  }
  printf ("4: %p\n", a);
  free (a);
}

Here GOMP_target believes that the pointer 'a' has a type TOFROM, so
it sets copy_from to true for the existing mapping of the pointer 'a',
that was mapped in GOMP_target_data.  Therefore the output is
incorrect:

1: [host addr]
2: [host addr]
3: [host addr]
4: [target addr]

  -- Ilya

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2014-01-31 18:03                                     ` Ilya Verbin
@ 2014-01-31 19:43                                       ` Jakub Jelinek
  2014-02-14 15:24                                         ` Ilya Verbin
  0 siblings, 1 reply; 56+ messages in thread
From: Jakub Jelinek @ 2014-01-31 19:43 UTC (permalink / raw)
  To: Ilya Verbin
  Cc: Michael V. Zolotukhin, Kirill Yukhin, Richard Henderson,
	GCC Development, triegel, Sergey Ostanevich

On Fri, Jan 31, 2014 at 09:18:33PM +0400, Ilya Verbin wrote:
> Looks like there is a bug (in GOMP_target lowering? or in
> gomp_map_vars_existing?)
> The reproducer:
> 
> #define N 1000
> 
> void foo ()
> {
>   int *a = malloc (N * sizeof (int));
>   printf ("1: %p\n", a);
>   #pragma omp target data map(tofrom: a[0:N])
>   {
>     printf ("2: %p\n", a);
>     #pragma omp target
>     {
>       int i;
>       for (i = 0; i < N; i++)
>         a[i] = i;
>     }
>     printf ("3: %p\n", a);
>   }
>   printf ("4: %p\n", a);
>   free (a);
> }
> 
> Here GOMP_target believes that the pointer 'a' has a type TOFROM, so
> it sets copy_from to true for the existing mapping of the pointer 'a',
> that was mapped in GOMP_target_data.  Therefore the output is
> incorrect:

Implicit map(tofrom: a) on #pragma omp target is what the standard
requires, so I don't see a bug on the compiler side.
I'd need to go back to the endless omp-lang discussions regarding the copy_from
stuff and/or discuss this further.

I'd suggest just using map(tofrom: a[0:N]) also on the #pragma omp target,
then it is clear what should happen.

	Jakub

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2014-01-31 19:43                                       ` Jakub Jelinek
@ 2014-02-14 15:24                                         ` Ilya Verbin
  2014-02-14 15:43                                           ` Jakub Jelinek
  0 siblings, 1 reply; 56+ messages in thread
From: Ilya Verbin @ 2014-02-14 15:24 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Michael V. Zolotukhin, Kirill Yukhin, Richard Henderson,
	GCC Development, triegel, Sergey Ostanevich

2014-01-31 22:03 GMT+04:00 Jakub Jelinek <jakub@redhat.com>:
> Implicit map(tofrom: a) on #pragma omp target is what the standard
> requires, so I don't see a bug on the compiler side.
>         Jakub

There is an exception in the standard (page 177, lines 17-21):

> If a corresponding list item of the original list item is in the enclosing device data
> environment, the new device data environment uses the corresponding list item from the
> enclosing device data environment. No additional storage is allocated in the new device
> data environment and neither initialization nor assignment is performed, regardless of
> the map-type that is specified.

So, the pointer 'a' should inherit map-type ALLOC from the enclosing
device data environment.

  -- Ilya

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2014-02-14 15:24                                         ` Ilya Verbin
@ 2014-02-14 15:43                                           ` Jakub Jelinek
  2014-02-14 18:54                                             ` Richard Henderson
  2014-02-17 15:59                                             ` Ilya Verbin
  0 siblings, 2 replies; 56+ messages in thread
From: Jakub Jelinek @ 2014-02-14 15:43 UTC (permalink / raw)
  To: Ilya Verbin, Richard Henderson
  Cc: Michael V. Zolotukhin, Kirill Yukhin, GCC Development, triegel,
	Sergey Ostanevich

On Fri, Feb 14, 2014 at 07:24:16PM +0400, Ilya Verbin wrote:
> 2014-01-31 22:03 GMT+04:00 Jakub Jelinek <jakub@redhat.com>:
> > Implicit map(tofrom: a) on #pragma omp target is what the standard
> > requires, so I don't see a bug on the compiler side.
> >         Jakub
> 
> There is an exception in the standard (page 177, lines 17-21):
> 
> > If a corresponding list item of the original list item is in the enclosing device data
> > environment, the new device data environment uses the corresponding list item from the
> > enclosing device data environment. No additional storage is allocated in the new device
> > data environment and neither initialization nor assignment is performed, regardless of
> > the map-type that is specified.
> 
> So, the pointer 'a' should inherit map-type ALLOC from the enclosing
> device data environment.

The standard itself is very unclear.  I'll cite my omp-lang mail from
September:

> Ok, I'm for now implementing this refcounted model.
> One still unclear thing is what is supposed to happen if multiple host threads
> enter a target data construct mapping at least one same object with different
> map kind.
> Say thread A enters #pragma omp target data map(tofrom:p[:64]), then
> thread B enters #pragma omp target data map(alloc:p[:64]) while thread A is
> still running the body of its target data (so, the mapping just increments
> the refcount of the p[:64] array section), then thread A leaves the target data
> construct, decrements the p[:64] refcount, but as it is non-zero, doesn't
> deallocate it, and finally thread B reaches the end of its target data construct
> and unmaps p[:64].  The question is, when (if ever) is the array section
> supposed to be copied back to host?  Shall it be done at the end of thread A's
> target data section, or at the end of thread B's target data section (i.e.
> propagate the flag, has at least one of the mappings requested a copy from the
> device to host at the end of its lifetime), or not copied at all?
> What if thread B doesn't request the whole array section, but only a portion
> thereof, map(alloc:p[:32])?  Would it copy the whole p[:64] array section
> back, or just a portion of it?  Though, admittedly, this latter case of a
> subset might be harder to construct a valid non-racy testcase for; one needs
> to make sure one of the target data constructs is always entered before the
> other, though perhaps with #pragma omp atomic and spinning it might be doable.

and will just paraphrase the Sep 9th answer I got to that, because I'm not sure
I'm allowed to repost it.  The answer was that on entry the standard is pretty
clear about what happens: the first encountering thread/data construct allocates
and optionally copies based on the flags; all others, when it is already mapped,
do nothing.  On exit, the standard is silent and none of the solutions is
clearly right; the committee will discuss it further.
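
Spelled out, the A/B scenario from the quoted mail looks like this (a sketch;
the synchronization needed to force exactly this interleaving is omitted):

  int p[64];

  void thread_a (void)
  {
    #pragma omp target data map(tofrom: p[:64])  /* first mapping, refcount 1 */
    {
      /* ... thread B enters its construct here (refcount 1 -> 2) ... */
    }                           /* refcount 2 -> 1, nothing is unmapped */
  }

  void thread_b (void)
  {
    #pragma omp target data map(alloc: p[:64])   /* already mapped, refcount 2 */
    {
      /* ... thread A leaves its construct here ... */
    }                           /* refcount 1 -> 0: copy p[:64] back, or not? */
  }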

So, for now the implementation choice was to OR in the copy-from-device bit.

Now, you could argue this case is different, because it is not different threads,
but the same thread, just nested construct on the same thread.  But how to
reliably differentiate that?  Even if you stored some thread identification
into the tree along with each mapping (what thread mapped this in), what if some
other thread also does the same (outer #pragma omp target data, inner
#pragma omp target, where the outer one does just array section mapping and
inner tofrom mapping on the pointer), then we'd still copy back.

So, perhaps we should just stop for now oring the copyfrom in and just use
the copyfrom from the very first mapping only, and wait for what the committee
actually agrees on.

Richard, your thoughts on this?

	Jakub

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2014-02-14 15:43                                           ` Jakub Jelinek
@ 2014-02-14 18:54                                             ` Richard Henderson
  2014-02-17 15:59                                             ` Ilya Verbin
  1 sibling, 0 replies; 56+ messages in thread
From: Richard Henderson @ 2014-02-14 18:54 UTC (permalink / raw)
  To: Jakub Jelinek, Ilya Verbin
  Cc: Michael V. Zolotukhin, Kirill Yukhin, GCC Development, triegel,
	Sergey Ostanevich

On 02/14/2014 07:43 AM, Jakub Jelinek wrote:
> So, perhaps we should just stop for now oring the copyfrom in and just use
> the copyfrom from the very first mapping only, and wait for what the committee
> actually agrees on.
> 
> Richard, your thoughts on this?

I think stopping the or'ing until the issue is resolved is the best plan.


r~

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2014-02-14 15:43                                           ` Jakub Jelinek
  2014-02-14 18:54                                             ` Richard Henderson
@ 2014-02-17 15:59                                             ` Ilya Verbin
  2014-02-17 16:03                                               ` Jakub Jelinek
  1 sibling, 1 reply; 56+ messages in thread
From: Ilya Verbin @ 2014-02-17 15:59 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Richard Henderson, Michael V. Zolotukhin, Kirill Yukhin,
	GCC Development, triegel, Sergey Ostanevich

On 14 Feb 16:43, Jakub Jelinek wrote:
> So, perhaps we should just stop for now oring the copyfrom in and just use
> the copyfrom from the very first mapping only, and wait for what the committee
> actually agrees on.
> 
> 	Jakub

Like this?

@@ -171,11 +171,16 @@ gomp_map_vars_existing (splay_tree_key oldn, splay_tree_key newn,
 		"[%p..%p) is already mapped",
 		(void *) newn->host_start, (void *) newn->host_end,
 		(void *) oldn->host_start, (void *) oldn->host_end);
+#if 0
+  /* FIXME: Remove this when OpenMP 4.0 will be standardized.  Currently it's
+     unclear regarding overwriting copy_from for the existing mapping.
+     See http://gcc.gnu.org/ml/gcc/2014-02/msg00208.html for details.  */
   if (((kind & 7) == 2 || (kind & 7) == 3)
       && !oldn->copy_from
       && oldn->host_start == newn->host_start
       && oldn->host_end == newn->host_end)
     oldn->copy_from = true;
+#endif
   oldn->refcount++;
 }

  -- Ilya

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2014-02-17 15:59                                             ` Ilya Verbin
@ 2014-02-17 16:03                                               ` Jakub Jelinek
  0 siblings, 0 replies; 56+ messages in thread
From: Jakub Jelinek @ 2014-02-17 16:03 UTC (permalink / raw)
  To: Ilya Verbin
  Cc: Richard Henderson, Michael V. Zolotukhin, Kirill Yukhin,
	GCC Development, triegel, Sergey Ostanevich

On Mon, Feb 17, 2014 at 07:59:16PM +0400, Ilya Verbin wrote:
> On 14 Feb 16:43, Jakub Jelinek wrote:
> > So, perhaps we should just stop for now oring the copyfrom in and just use
> > the copyfrom from the very first mapping only, and wait for what the committee
> > actually agrees on.
> > 
> > 	Jakub
> 
> Like this?
> 
> @@ -171,11 +171,16 @@ gomp_map_vars_existing (splay_tree_key oldn, splay_tree_key newn,
>  		"[%p..%p) is already mapped",
>  		(void *) newn->host_start, (void *) newn->host_end,
>  		(void *) oldn->host_start, (void *) oldn->host_end);
> +#if 0
> +  /* FIXME: Remove this when OpenMP 4.0 will be standardized.  Currently it's
> +     unclear regarding overwriting copy_from for the existing mapping.
> +     See http://gcc.gnu.org/ml/gcc/2014-02/msg00208.html for details.  */
>    if (((kind & 7) == 2 || (kind & 7) == 3)
>        && !oldn->copy_from
>        && oldn->host_start == newn->host_start
>        && oldn->host_end == newn->host_end)
>      oldn->copy_from = true;
> +#endif
>    oldn->refcount++;
>  }

Well, OpenMP 4.0 is a released standard, just in some cases ambiguous or
buggy.  I'd just remove the code rather than putting it into #if 0, patch
preapproved.  It will stay in the SVN history...

	Jakub

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2013-09-13 11:30                                     ` Michael V. Zolotukhin
  2013-09-13 12:36                                       ` Jakub Jelinek
@ 2014-07-17  7:52                                       ` Thomas Schwinge
  2014-07-17 12:30                                         ` Ilya Verbin
  1 sibling, 1 reply; 56+ messages in thread
From: Thomas Schwinge @ 2014-07-17  7:52 UTC (permalink / raw)
  To: Michael V. Zolotukhin, Jakub Jelinek
  Cc: Kirill Yukhin, Richard Henderson, gcc, triegel, julian

Hi!

On Fri, 13 Sep 2013 15:29:30 +0400, "Michael V. Zolotukhin" <michael.v.zolotukhin@gmail.com> wrote:
> [patch for adding plugins support in libgomp]

One question:

> --- a/libgomp/target.c
> +++ b/libgomp/target.c

> +/* This function scans the folder specified in the environment variable
> +   LIBGOMP_PLUGIN_PATH and loads all suitable libgomp plugins from that folder.
> +   For a plugin to be suitable, its name should be "libgomp-plugin-*.so.1" and
> +   it should implement a certain set of functions.
> +   The result of this function is a properly initialized variable NUM_DEVICES
> +   and an array DEVICES, containing all plugins and their callback handles.  */
> +static void
> +gomp_find_available_plugins (void)
> +{
> +  char *plugin_path = NULL;
> +[...]
> +  plugin_path = getenv ("LIBGOMP_PLUGIN_PATH");

What is the benefit of making this an environment variable that the user
has to set, LIBGOMP_PLUGIN_PATH, as opposed to hard-coding it to
somewhere inside the GCC installation directory ([...]/lib/libgomp/*.so)?
(There, it can still be overridden; dlopen obeys DT_RPATH/DT_RUNPATH, and
LD_LIBRARY_PATH.)  Hard-coding it would make libgomp testing a bit
easier, and it generally seems to make sense to me that the compiler
(libgomp) should be able to locate its own resources, and I don't think
the plugin search path is something that a user generally would want to
override -- or is your use case indeed that the plugin is not built as
part of libgomp's build process?  (But, in this case you still could use
LD_LIBRARY_PATH to have it found.)
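
Schematically, what I have in mind is something like this (a sketch;
LIBGOMP_PLUGIN_DIR would be substituted at configure time):

    /* Instead of consulting the environment: */
    #ifndef LIBGOMP_PLUGIN_DIR
    # define LIBGOMP_PLUGIN_DIR "/usr/lib/libgomp"   /* placeholder */
    #endif
    static const char *const plugin_path = LIBGOMP_PLUGIN_DIR;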


Grüße,
 Thomas

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2014-07-17  7:52                                       ` Thomas Schwinge
@ 2014-07-17 12:30                                         ` Ilya Verbin
  2014-07-17 12:37                                           ` Jakub Jelinek
  0 siblings, 1 reply; 56+ messages in thread
From: Ilya Verbin @ 2014-07-17 12:30 UTC (permalink / raw)
  To: Thomas Schwinge, Jakub Jelinek
  Cc: Kirill Yukhin, Richard Henderson, GCC Development, triegel, julian

2014-07-17 11:51 GMT+04:00 Thomas Schwinge <thomas@codesourcery.com>:
>> +  plugin_path = getenv ("LIBGOMP_PLUGIN_PATH");
>
> What is the benefit of making this an environment variable that the user
> has to set, LIBGOMP_PLUGIN_PATH, as opposed to hard-coding it to
> somewhere inside the GCC installation directory ([...]/lib/libgomp/*.so)?
> (There, it can still be overridden; dlopen obeys DT_RPATH/DT_RUNPATH, and
> LD_LIBRARY_PATH.)  Hard-coding it would make libgomp testing a bit
> easier, and it generally seems to make sense to me that the compiler
> (libgomp) should be able to locate its own resources, and I don't think
> the plugin search path is something that a user generally would want to
> override -- or is your use case indeed that the plugin is not built as
> part of libgomp's build process?  (But, in this case you still could use
> LD_LIBRARY_PATH to have it found.)

Hi,

We invented this environment variable almost a year ago, when we
didn’t fully understand how all the parts will work together. So for
now, likely, your proposal is better.
Jakub, what do you think?

  -- Ilya

P.S. Michael is no longer working on this, I'm continuing his work.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2014-07-17 12:30                                         ` Ilya Verbin
@ 2014-07-17 12:37                                           ` Jakub Jelinek
  2014-07-17 12:58                                             ` Thomas Schwinge
  0 siblings, 1 reply; 56+ messages in thread
From: Jakub Jelinek @ 2014-07-17 12:37 UTC (permalink / raw)
  To: Ilya Verbin
  Cc: Thomas Schwinge, Kirill Yukhin, Richard Henderson,
	GCC Development, triegel, julian

On Thu, Jul 17, 2014 at 04:30:15PM +0400, Ilya Verbin wrote:
> 2014-07-17 11:51 GMT+04:00 Thomas Schwinge <thomas@codesourcery.com>:
> >> +  plugin_path = getenv ("LIBGOMP_PLUGIN_PATH");
> >
> > What is the benefit of making this an environment variable that the user
> > has to set, LIBGOMP_PLUGIN_PATH, as opposed to hard-coding it to
> > somewhere inside the GCC installation directory ([...]/lib/libgomp/*.so)?
> > (There, it can still be overridden; dlopen obeys DT_RPATH/DT_RUNPATH, and
> > LD_LIBRARY_PATH.)  Hard-coding it would make libgomp testing a bit
> > easier, and it generally seems to make sense to me that the compiler
> > (libgomp) should be able to locate its own resources, and I don't think
> > the plugin search path is something that a user generally would want to
> > override -- or is your use case indeed that the plugin is not built as
> > part of libgomp's build process?  (But, in this case you still could use
> > LD_LIBRARY_PATH to have it found.)
> 
> We invented this environment variable almost a year ago, when we
> didn’t fully understand how all the parts will work together. So for
> now, likely, your proposal is better.
> Jakub, what do you think?

Yeah, certainly.  Though, ideally the path it looks at should be relative
to the directory where libgomp is installed, and I'm afraid it is hard to
figure out portably where it was loaded from, and DT_RPATH/DT_RUNPATH on
libgomp would affect all dlopen calls, not just the loading of the plugins.
I'm not sure whether one can use ${ORIGIN} in dlopen, at least on Linux, and
what exactly it would expand to.

	Jakub

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2014-07-17 12:37                                           ` Jakub Jelinek
@ 2014-07-17 12:58                                             ` Thomas Schwinge
  2014-07-17 13:09                                               ` Thomas Schwinge
  0 siblings, 1 reply; 56+ messages in thread
From: Thomas Schwinge @ 2014-07-17 12:58 UTC (permalink / raw)
  To: Jakub Jelinek, Ilya Verbin
  Cc: Kirill Yukhin, Richard Henderson, GCC Development, triegel, julian

Hi!

On Thu, 17 Jul 2014 14:37:12 +0200, Jakub Jelinek <jakub@redhat.com> wrote:
> On Thu, Jul 17, 2014 at 04:30:15PM +0400, Ilya Verbin wrote:
> > 2014-07-17 11:51 GMT+04:00 Thomas Schwinge <thomas@codesourcery.com>:
> > >> +  plugin_path = getenv ("LIBGOMP_PLUGIN_PATH");
> > >
> > > What is the benefit of making this an environment variable that the user
> > > has to set, LIBGOMP_PLUGIN_PATH, as opposed to hard-coding it to
> > > somewhere inside the GCC installation directory ([...]/lib/libgomp/*.so)?
> > > (There, it can still be overridden; dlopen obeys DT_RPATH/DT_RUNPATH, and
> > > LD_LIBRARY_PATH.)  Hard-coding it would make libgomp testing a bit
> > > easier, and it generally seems to make sense to me that the compiler
> > > (libgomp) should be able to locate its own resources, and I don't think
> > > the plugin search path is something that a user generally would want to
> > > override -- or is your use case indeed that the plugin is not built as
> > > part of libgomp's build process?  (But, in this case you still could use
> > > LD_LIBRARY_PATH to have it found.)
> > 
> > We invented this environment variable almost a year ago, when we
> > didn’t fully understand how all the parts will work together. So for
> > now, likely, your proposal is better.
> > Jakub, what do you think?
> 
> Yeah, certainly.  Though, ideally the path it looks at should be relative
> to the directory where libgomp is installed

Right...

> and I'm afraid it is hard to
> figure out portably where it was loaded from, and DT_RPATH/DT_RUNPATH on
> libgomp would affect all dlopen calls, not just the loading of the plugins.
> I'm not sure whether one can use ${ORIGIN} in dlopen, at least on Linux, and
> what exactly it would expand to.

I haven't verified, but I'd guess it to expand to the *executable*
linking against libgomp, so that won't help...

I have, however, found some logic in gcc/plugin.c that seems at least
similar to what we need:

gcc/doc/plugins.texi:

    @node Plugins loading
    @section Loading Plugins
    
    Plugins are supported on platforms that support @option{-ldl
    -rdynamic}.  They are loaded by the compiler using @code{dlopen}
    and invoked at pre-determined locations in the compilation
    process.
    
    Plugins are loaded with
    
    @option{-fplugin=/path/to/@var{name}.so} [...]
    
    [...]
    
    A plugin can be simply given by its short name (no dots or
    slashes). When simply passing @option{-fplugin=@var{name}}, the plugin is
    loaded from the @file{plugin} directory, so @option{-fplugin=@var{name}} is
    the same as @option{-fplugin=`gcc -print-file-name=plugin`/@var{name}.so},
    using backquote shell syntax to query the @file{plugin} directory.

gcc/plugin.c:

    /* Retrieve the default plugin directory.  The gcc driver should have passed
       it as -iplugindir <dir> to the cc1 program, and it is queriable through the
       -print-file-name=plugin option to gcc.  */
    const char*
    default_plugin_dir_name (void)
    {
      if (!plugindir_string)
        fatal_error ("-iplugindir <dir> option not passed from the gcc driver");
      return plugindir_string;
    }

But I'm not yet sure how we could use this to tie the libgomp plugin
search path to the location of libgomp.so...  Especially, given that the
location of libgomp.so during compilation need not match the location
during execution.  A show-stopper?  (No time currently to explore this in
more detail.)


Grüße,
 Thomas

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2014-07-17 12:58                                             ` Thomas Schwinge
@ 2014-07-17 13:09                                               ` Thomas Schwinge
  2014-07-17 13:35                                                 ` Jakub Jelinek
  0 siblings, 1 reply; 56+ messages in thread
From: Thomas Schwinge @ 2014-07-17 13:09 UTC (permalink / raw)
  To: Jakub Jelinek, Ilya Verbin
  Cc: Kirill Yukhin, Richard Henderson, GCC Development, triegel, julian

Hi!

On Thu, 17 Jul 2014 14:58:04 +0200, I wrote:
> On Thu, 17 Jul 2014 14:37:12 +0200, Jakub Jelinek <jakub@redhat.com> wrote:
> > On Thu, Jul 17, 2014 at 04:30:15PM +0400, Ilya Verbin wrote:
> > > 2014-07-17 11:51 GMT+04:00 Thomas Schwinge <thomas@codesourcery.com>:
> > > >> +  plugin_path = getenv ("LIBGOMP_PLUGIN_PATH");
> > > >
> > > > What is the benefit of making this an environment variable that the user
> > > > has to set, LIBGOMP_PLUGIN_PATH, as opposed to hard-coding it to
> > > > somewhere inside the GCC installation directory ([...]/lib/libgomp/*.so)?
> > > > (There, it can still be overridden; dlopen obeys DT_RPATH/DT_RUNPATH, and
> > > > LD_LIBRARY_PATH.)  Hard-coding it would make libgomp testing a bit
> > > > easier, and it generally seems to make sense to me that the compiler
> > > > (libgomp) should be able to locate its own resources, and I don't think
> > > > the plugin search path is something that a user generally would want to
> > > > override -- or is your use case indeed that the plugin is not built as
> > > > part of libgomp's build process?  (But, in this case you still could use
> > > > LD_LIBRARY_PATH to have it found.)
> > > 
> > > We invented this environment variable almost a year ago, when we
> > > didn’t fully understand how all the parts will work together. So for
> > > now, likely, your proposal is better.
> > > Jakub, what do you think?
> > 
> > Yeah, certainly.  Though, ideally the path it looks at should be relative
> > to the directory where libgomp is installed
> 
> Right...
> 
> > and I'm afraid it is hard to
> > figure out portably where it was loaded from, and DT_RPATH/DT_RUNPATH on
> > libgomp would affect all dlopen calls, not just the loading of the plugins.
> > Not sure if one can use at least on Linux ${ORIGIN} in dlopen and what
> > exactly will it expand to.
> 
> I haven't verified, but I'd guess it to expand to the *executable*
> linking against libgomp, so that won't help...
> 
> I have, however, found some logic in gcc/plugin.c that seems at least
> similar to what we need:
> 
> gcc/doc/plugins.texi:
> 
>     @node Plugins loading
>     @section Loading Plugins
>     
>     Plugins are supported on platforms that support @option{-ldl
>     -rdynamic}.  They are loaded by the compiler using @code{dlopen}
>     and invoked at pre-determined locations in the compilation
>     process.
>     
>     Plugins are loaded with
>     
>     @option{-fplugin=/path/to/@var{name}.so} [...]
>     
>     [...]
>     
>     A plugin can be simply given by its short name (no dots or
>     slashes). When simply passing @option{-fplugin=@var{name}}, the plugin is
>     loaded from the @file{plugin} directory, so @option{-fplugin=@var{name}} is
>     the same as @option{-fplugin=`gcc -print-file-name=plugin`/@var{name}.so},
>     using backquote shell syntax to query the @file{plugin} directory.
> 
> gcc/plugin.c:
> 
>     /* Retrieve the default plugin directory.  The gcc driver should have passed
>        it as -iplugindir <dir> to the cc1 program, and it is queriable through the
>        -print-file-name=plugin option to gcc.  */
>     const char*
>     default_plugin_dir_name (void)
>     {
>       if (!plugindir_string)
>         fatal_error ("-iplugindir <dir> option not passed from the gcc driver");
>       return plugindir_string;
>     }
> 
> But I'm not yet sure how we could use this to tie the libgomp plugin
> search path to the location of libgomp.so...  Especially, given that the
> location of libgomp.so during compilation need not match the location
> during execution.  A show-stopper?  (No time currently to explore this in
> more detail.)

Heh, would a "hack" like the following work?

libcilkrts/runtime/sysdep-unix.c:

    /* (Non-static) dummy function is used by get_runtime_path() to find the path
     * to the .so containing the Cilk runtime.
     */
    void dummy_function() { }
    
    /* return a string with the path to the Cilk runtime, or "unknown" if the path
     * cannot be determined.
     */
    static const char *get_runtime_path ()
    {
    #ifdef __CYGWIN__
        // Cygwin doesn't support dladdr, which sucks
        return "unknown";
    #else
        Dl_info info;
        if (0 == dladdr(dummy_function, &info)) return "unknown";
        return info.dli_fname;
    #endif
    }

Putting that into libgomp, it should give the path to the libgomp.so
actually loaded, and then we can load the plugins relative to its
dirname?
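
Something along these lines, perhaps -- a sketch only, in the spirit of
the libcilkrts code above (gomp_plugin_anchor and gomp_plugin_path are
made-up names; dladdr is assumed available, as it is on glibc):

    /* dladdr and Dl_info are GNU extensions on glibc.  */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdio.h>
    #include <string.h>

    /* (Non-static) anchor function; dladdr on its address reports the .so
       containing it, i.e. libgomp.so itself.  */
    void gomp_plugin_anchor (void) { }

    /* Write into BUF (of size LEN) the path of PLUGIN_NAME as a sibling of
       the libgomp.so actually loaded; return NULL if that fails.  */
    static const char *
    gomp_plugin_path (const char *plugin_name, char *buf, size_t len)
    {
      Dl_info info;
      char *slash;
      if (dladdr ((void *) gomp_plugin_anchor, &info) == 0
          || strlen (info.dli_fname) >= len)
        return NULL;
      strcpy (buf, info.dli_fname);
      slash = strrchr (buf, '/');
      if (slash == NULL)
        return NULL;
      /* Replace the "libgomp.so.1" component with the plugin's file name.  */
      snprintf (slash + 1, len - (slash + 1 - buf), "%s", plugin_name);
      return buf;
    }

A caller would then dlopen the returned path, falling back to something
else if NULL comes back.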


Grüße,
 Thomas

[-- Attachment #2: Type: application/pgp-signature, Size: 472 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2014-07-17 13:09                                               ` Thomas Schwinge
@ 2014-07-17 13:35                                                 ` Jakub Jelinek
  2014-07-17 14:37                                                   ` Thomas Schwinge
  0 siblings, 1 reply; 56+ messages in thread
From: Jakub Jelinek @ 2014-07-17 13:35 UTC (permalink / raw)
  To: Thomas Schwinge
  Cc: Ilya Verbin, Kirill Yukhin, Richard Henderson, GCC Development,
	triegel, julian

On Thu, Jul 17, 2014 at 03:09:32PM +0200, Thomas Schwinge wrote:
> > But I'm not yet sure how we could use this to tie the libgomp plugin
> > search path to the location of libgomp.so...  Especially, given that the
> > location of libgomp.so during compilation need not match the location
> > during execution.  A show-stopper?  (No time currently to explore this in
> > more detail.)
> 
> Heh, would a "hack" like the following work?
> 
> libcilkrts/runtime/sysdep-unix.c:
> 
>     /* (Non-static) dummy function is used by get_runtime_path() to find the path
>      * to the .so containing the Cilk runtime.
>      */
>     void dummy_function() { }
>     
>     /* return a string with the path to the Cilk runtime, or "unknown" if the path
>      * cannot be determined.
>      */
>     static const char *get_runtime_path ()
>     {
>     #ifdef __CYGWIN__
>         // Cygwin doesn't support dladdr, which sucks
>         return "unknown";
>     #else
>         Dl_info info;
>         if (0 == dladdr(dummy_function, &info)) return "unknown";
>         return info.dli_fname;
>     #endif
>     }
> 
> Putting that into libgomp, it should give the path to the libgomp.so
> actually loaded, and then we can load the plugins relative to its
> dirname?

Well, libgomp has to be far more portable than this, so the question is
whether we want to live with one behavior on Linux and another one
elsewhere (falling back to an absolute path)?  In any case, as a last
resort it should just try
to dlopen the plugin without full path, and the plugins really should have
libgomp-plugin or something similar in their names to make it less likely
to clash with something else.

If we are ok with that, then a function returning that path would need
to go into config/linux/ and config/posix/.
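
To illustrate that last-resort fallback, a rough sketch (gomp_load_plugin
is a made-up name):

    #include <dlfcn.h>
    #include <stdio.h>

    /* Try DIR (derived from libgomp.so's own location, if it could be
       determined) first; otherwise fall back to a plain dlopen, relying
       on the dynamic linker's standard search rules.  */
    static void *
    gomp_load_plugin (const char *dir, const char *name)
    {
      void *handle = NULL;
      if (dir != NULL)
        {
          char buf[1024];
          snprintf (buf, sizeof buf, "%s/%s", dir, name);
          handle = dlopen (buf, RTLD_LAZY);
        }
      if (handle == NULL)
        handle = dlopen (name, RTLD_LAZY);
      return handle;
    }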

	Jakub

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Offloading Support in libgomp
  2014-07-17 13:35                                                 ` Jakub Jelinek
@ 2014-07-17 14:37                                                   ` Thomas Schwinge
  0 siblings, 0 replies; 56+ messages in thread
From: Thomas Schwinge @ 2014-07-17 14:37 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Ilya Verbin, Kirill Yukhin, Richard Henderson, GCC Development,
	triegel, julian

[-- Attachment #1: Type: text/plain, Size: 3009 bytes --]

Hi!

On Thu, 17 Jul 2014 15:35:36 +0200, Jakub Jelinek <jakub@redhat.com> wrote:
> On Thu, Jul 17, 2014 at 03:09:32PM +0200, Thomas Schwinge wrote:
> > > But I'm not yet sure how we could use this to tie the libgomp plugin
> > > search path to the location of libgomp.so...  Especially, given that the
> > > location of libgomp.so during compilation need not match the location
> > > during execution.  A show-stopper?  (No time currently to explore this in
> > > more detail.)
> > 
> > Heh, would a "hack" like the following work?
> > 
> > libcilkrts/runtime/sysdep-unix.c:
> > 
> >     /* (Non-static) dummy function is used by get_runtime_path() to find the path
> >      * to the .so containing the Cilk runtime.
> >      */
> >     void dummy_function() { }
> >     
> >     /* return a string with the path to the Cilk runtime, or "unknown" if the path
> >      * cannot be determined.
> >      */
> >     static const char *get_runtime_path ()
> >     {
> >     #ifdef __CYGWIN__
> >         // Cygwin doesn't support dladdr, which sucks
> >         return "unknown";
> >     #else
> >         Dl_info info;
> >         if (0 == dladdr(dummy_function, &info)) return "unknown";
> >         return info.dli_fname;
> >     #endif
> >     }
> > 
> > Putting that into libgomp, it should give the path to the libgomp.so
> > actually loaded, and then we can load the plugins relative to its
> > dirname?
> 
> Well, libgomp has to be far more portable than this, so the question is
> whether we want to live with one behavior on Linux and another one
> elsewhere (falling back to an absolute path)?

Hmm, that doesn't really seem appealing.

> In any case, as a last resort it should just try
> to dlopen the plugin without full path, and the plugins really should have
> libgomp-plugin or something similar in their names to make it less likely
> to clash with something else.

The problem is that we don't know the plugins' names.  Currently, we're
scanning a directory for all filenames matching libgomp-plugin-*.so.1.
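
(That scan amounts to something like the following -- a sketch only;
gomp_scan_plugin_dir is a made-up name:)

    #include <dirent.h>
    #include <fnmatch.h>
    #include <stdio.h>

    /* Report every file in DIR whose name matches the plugin pattern.  */
    static void
    gomp_scan_plugin_dir (const char *dir)
    {
      DIR *d = opendir (dir);
      struct dirent *e;
      if (d == NULL)
        return;
      while ((e = readdir (d)) != NULL)
        if (fnmatch ("libgomp-plugin-*.so.1", e->d_name, 0) == 0)
          printf ("found plugin: %s/%s\n", dir, e->d_name);
      closedir (d);
    }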

> If we are ok with that, then a function returning that path would need
> to go into config/linux/ and config/posix/.

(config/gnu/ instead of config/linux/, as that's more a GNU/glibc thing
than a Linux kernel thing.)

Hmm, let's take one step back.  Putting the plugins next to libgomp.so will
make their discovery easy, as that'll be governed by the very same
searching rules that led the dynamic linker to libgomp.so.  All plugins
are named libgomp-plugin-*.so.1.  But we'd have to know the plugins'
names (for use with dlopen), losing the ability to dynamically extend the
set of libgomp plugins.  This in turn could be achieved by setting an
environment variable that specifies an additional -- or replacement? --
directory to scan, or even just specifies a list of additional plugin
*names* to load, again relying on the standard searching rules for them
to be found.  Is that a reasonable price to pay?
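
For the plugin-*names* variant, I imagine something like this -- again
only a sketch; the GOMP_PLUGINS variable and gomp_load_extra_plugins are
made up:

    #include <dlfcn.h>
    #include <stdlib.h>
    #include <string.h>

    /* Load every plugin named in the colon-separated GOMP_PLUGINS list,
       e.g. GOMP_PLUGINS=libgomp-plugin-foo.so.1:libgomp-plugin-bar.so.1,
       relying on the dynamic linker's standard search rules to locate
       each one.  */
    static void
    gomp_load_extra_plugins (void)
    {
      char *list = getenv ("GOMP_PLUGINS");
      char *copy, *name, *saveptr;
      if (list == NULL)
        return;
      copy = strdup (list);
      if (copy == NULL)
        return;
      for (name = strtok_r (copy, ":", &saveptr);
           name != NULL;
           name = strtok_r (NULL, ":", &saveptr))
        dlopen (name, RTLD_LAZY);
      free (copy);
    }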


Grüße,
 Thomas

[-- Attachment #2: Type: application/pgp-signature, Size: 472 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread

Thread overview: 56+ messages
2013-08-22 22:37 [RFC] Offloading Support in libgomp Michael V. Zolotukhin
2013-08-23  0:22 ` Jakub Jelinek
2013-08-23 12:16   ` Michael V. Zolotukhin
2013-08-23 12:37     ` Jakub Jelinek
2013-08-24  6:17       ` Michael V. Zolotukhin
2013-08-25 16:24         ` Jakub Jelinek
2013-08-27  0:36           ` Michael V. Zolotukhin
2013-08-27  0:38             ` Jakub Jelinek
2013-08-27  6:16               ` Michael V. Zolotukhin
2013-08-27  8:06                 ` Jakub Jelinek
2013-08-27 15:47                   ` Michael V. Zolotukhin
2013-08-27 16:22                     ` Jakub Jelinek
2013-08-27 19:54                       ` Michael V. Zolotukhin
2013-08-28 11:21                         ` Jakub Jelinek
2013-08-29 10:44                           ` Michael V. Zolotukhin
2013-09-10 15:02                           ` Michael V. Zolotukhin
2013-09-10 15:15                             ` Jakub Jelinek
2013-09-10 15:31                               ` Michael V. Zolotukhin
2013-09-10 15:36                                 ` Jakub Jelinek
2013-09-10 15:38                                   ` Michael V. Zolotukhin
2013-09-13 11:30                                     ` Michael V. Zolotukhin
2013-09-13 12:36                                       ` Jakub Jelinek
2013-09-13 13:11                                         ` Michael V. Zolotukhin
2013-09-13 13:16                                           ` Jakub Jelinek
2013-09-13 15:09                                             ` Ilya Tocar
2013-09-13 15:34                                         ` Jakub Jelinek
2014-07-17  7:52                                       ` Thomas Schwinge
2014-07-17 12:30                                         ` Ilya Verbin
2014-07-17 12:37                                           ` Jakub Jelinek
2014-07-17 12:58                                             ` Thomas Schwinge
2014-07-17 13:09                                               ` Thomas Schwinge
2014-07-17 13:35                                                 ` Jakub Jelinek
2014-07-17 14:37                                                   ` Thomas Schwinge
2013-09-13  9:35                         ` Michael Zolotukhin
2013-09-13 10:52                           ` Kirill Yukhin
2013-09-13 11:04                           ` Nathan Sidwell
2013-09-13 11:21                             ` Michael V. Zolotukhin
2013-09-16  9:35                           ` Jakub Jelinek
2013-09-17 12:05                             ` Michael V. Zolotukhin
2013-09-17 12:30                               ` Jakub Jelinek
2013-10-28 10:43                                 ` Ilya Verbin
2013-10-29  8:04                                   ` Jakub Jelinek
2014-01-31 18:03                                     ` Ilya Verbin
2014-01-31 19:43                                       ` Jakub Jelinek
2014-02-14 15:24                                         ` Ilya Verbin
2014-02-14 15:43                                           ` Jakub Jelinek
2014-02-14 18:54                                             ` Richard Henderson
2014-02-17 15:59                                             ` Ilya Verbin
2014-02-17 16:03                                               ` Jakub Jelinek
2013-08-28 12:56             ` Richard Biener
2013-08-28 15:26               ` Jakub Jelinek
2013-08-28 17:03                 ` Richard Biener
2013-08-28 17:15                   ` Jakub Jelinek
2013-08-29 21:09                     ` Richard Biener
2013-08-28 18:54                   ` Torvald Riegel
2013-08-28 18:43                 ` Torvald Riegel
