From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-return-179989-listarch-gcc=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 976 invoked by alias); 27 Aug 2013 11:26:42 -0000
Mailing-List: contact gcc-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc/>
List-Post: <mailto:gcc@gcc.gnu.org>
List-Help: <http://gcc.gnu.org/ml/>
Sender: gcc-owner@gcc.gnu.org
Received: (qmail 967 invoked by uid 89); 27 Aug 2013 11:26:41 -0000
Received: from mail-pd0-f173.google.com (HELO mail-pd0-f173.google.com) (209.85.192.173) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with (AES128-SHA encrypted) ESMTPS; Tue, 27 Aug 2013 11:26:41 +0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-2.8 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00,FREEMAIL_FROM autolearn=ham version=3.3.2
X-HELO: mail-pd0-f173.google.com
Received: by mail-pd0-f173.google.com with SMTP id p10so4751345pdj.32        for <gcc@gcc.gnu.org>; Tue, 27 Aug 2013 04:26:38 -0700 (PDT)
X-Received: by 10.68.183.131 with SMTP id em3mr20731913pbc.56.1377602798721;        Tue, 27 Aug 2013 04:26:38 -0700 (PDT)
Received: from msticlxl57.ims.intel.com ([192.55.54.40])        by mx.google.com with ESMTPSA id kd1sm26306152pab.20.1969.12.31.16.00.00        (version=TLSv1 cipher=RC4-SHA bits=128/128);        Tue, 27 Aug 2013 04:26:37 -0700 (PDT)
Date: Tue, 27 Aug 2013 15:47:00 -0000
From: "Michael V. Zolotukhin" <michael.v.zolotukhin@gmail.com>
To: Jakub Jelinek <jakub@redhat.com>
Cc: Kirill Yukhin <kirill.yukhin@gmail.com>,	Richard Henderson <rth@redhat.com>, gcc@gcc.gnu.org,	triegel@redhat.com
Subject: Re: [RFC] Offloading Support in libgomp
Message-ID: <20130827112609.GA4093@msticlxl57.ims.intel.com>
References: <20130822140810.GA27868@msticlxl57.ims.intel.com> <20130822142814.GB1814@tucnak.redhat.com> <20130823092810.GA36483@msticlxl57.ims.intel.com> <20130823095250.GJ1814@tucnak.redhat.com> <20130823153052.GA2974@msticlxl57.ims.intel.com> <20130823161631.GO1814@tucnak.redhat.com> <20130826115911.GA40923@msticlxl57.ims.intel.com> <20130826125116.GE21876@tucnak.zalov.cz> <20130826132936.GB40923@msticlxl57.ims.intel.com> <20130826141117.GF21876@tucnak.zalov.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20130826141117.GF21876@tucnak.zalov.cz>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-SW-Source: 2013-08/txt/msg00315.txt.bz2

Hi Jakub,

> Anyway, the GOMP_target_data implementation and part of GOMP_target would
> be something along the lines of following pseudocode:
> 
> device_data = lookup_device_id (device_id);
> ...
Thanks, I've seen that similarly.  But the problem with passing
arguments to the target is still open.  I'll try to explain, what is the
problem.

Remember what we did for 'pragma parallel':
  struct .omp_data_s.0 .omp_data_o.2;
  .omp_data_o.2.s = 0.0;
  .omp_data_o.2.b = &b;
  .omp_data_o.2.c = &c;
  .omp_data_o.2.y = y_7(D);
  .omp_data_o.2.j = j_9(D);
  __builtin_GOMP_parallel (bar._omp_fn.0, &.omp_data_o.2, 0, 0);
  s_12 = .omp_data_o.2.s;
  y_13 = .omp_data_o.2.y;
  j_14 = .omp_data_o.2.j;

I.e. compiler prepares a structure with all arguments and pass it to the
runtime.  Runtime passes this structure as-is to callee (i.e. to
bar._omp_fn.0).

In bar._omp_fn.0 the compiler just emits code that extracts
corresponding fields from the given struct and thus initialize all
needed local vars:
  bar._omp_fn.0 (struct .omp_data_s.0 * .omp_data_i)
  {
    int _12;
    int _13;
    ...
    _12 = .omp_data_i_11(D)->y;
    _13 = .omp_data_i_11(D)->j;
    ...
  }

That scheme would work perfectly for implementing host fallback, but as
I see it, can't be applied as is for target offloading.  The reason is
the following:
*) Compiler doesn't know runtime info, i.e. it doesn't know target
addresses so it can't fill the structure for passing to target version
of the routine.
*) Runtime doesn't know the structure layout - runtime should firstly
translate addresses and only then pass it to the callee, but it don't
know which addresses to translate, because it doesn't know which
variables are used by the callee.

Currently, I see two possible solutions for this:
1) add to the structure with arguments fields, describing size of each
field.  Then GOMP_target parses this struct and replace every found
address with the corresponding target address, and only then call
target_call.
2) Lift mapping/allocation stuff from runtime to compile time, i.e.
allow the compiler to generate calls like this:
  .omp_data_o.2.s = 0.0;
  .omp_data_o.2.b = &b;
  .omp_data_o.2.c = &c;
  .omp_data_o.2.y = y_7(D);
  .omp_data_o.2.j = j_9(D);
  .omp_data_o.target.2.s = GOMP_translate_target_address (0.0);
  .omp_data_o.target.2.b = GOMP_translate_target_address (&b);
  .omp_data_o.target.2.c = GOMP_translate_target_address (&c);
  .omp_data_o.target.2.y = GOMP_translate_target_address (y_7(D));
  .omp_data_o.target.2.j = GOMP_translate_target_address (j_9(D));
  GOMP_target (bar._omp_fn.0, &.omp_data_o.2, &.omp_data_o.target.2, 0, 0, );
Thus runtime would have two versions of structure with arguments and
will be able to pass it as-is to target callee.  But probably we'll need
a version of that struct for each target and that would look very ugly.

What do you think on that?  Maybe I'm missing or overcomplicating
something, but for now I can't get how all this stuff could work
together without answers to these questions.

Referencing to your code, the question could be rephrased as following:
Having thist string
>       target_call (fn_name, target_addrs);
how would FN_NAME, called on the target, figure out where to find its
arguments (as TARGET_ADDRS contains all mapped at runtime addresses)?

---
Thanks, Michael

On 26 Aug 16:11, Jakub Jelinek wrote:
> On Mon, Aug 26, 2013 at 05:29:36PM +0400, Michael V. Zolotukhin wrote:
> > > Nope, there is only one target data pragma, so you would use here just:
> > > 
> > >   struct data_descriptor data_desc2[2] = { ... };
> > >   GOMP_target (-1, bar.omp_fn.1, "bar.omp_fn.1", data_desc2, 2);
> > This 'pragma target' is placed inside a 'pragma target data' - so all
> > variables for 'pragma target data' should be available for the 'pragma
> > target'.  So we need to pass to GOMP_target an array, that contains
> > united set of mapped variables from both pragmas - in our example these
> > would be variables B, C, and S.  So as I see it, we need to use the same
> > array of descriptors both in outer 'pragma target data' and in inner
> > 'pragma target'.  Is it correct?  If data_desc2 contains descriptors of
> > only C and S, how B would be passed to bar.omp_fn.1?
> 
> Actually no, that should be the responsibility of the runtime library.
> Note, the #pragma omp target data directive doesn't have to be in the same
> function as #pragma omp target.  And, I'm sorry for having to confuse this
> by hacking in a premature optimization in the gimplifier.  It is true
> that if the target data is around target directive that the target will
> always be able to only look stuff up, will not need to allocate it,
> but 1) the gimplifier doesn't verify it is the same device between those two
> 2) and as discussed earlier we need it also for the mapping in GOMP_target
> So, the way it should actually work IMHO is that both GOMP_target_data
> is passed a descriptor for b, and also GOMP_target.
> In a more complicated testcase where you have a pointer based array section:
> void
> foo (int *p)
> {
>   #pragma omp target data map (tofrom: p[:1024])
>   {
>     #pragma omp target
>     for (int i = 0; i < 1024; i++)
>       p[i] += 2;
>   }
> }
> GOMP_target_data does two mappings - one where it maps
> (char *) p+0 ... (char *) p+1024*sizeof(int)-1
> region (tofrom) and one where it maps
> &p ... (char *)(&p+1)+1
> region with pointerassign type (i.e. that it is initialized to
> the address of the target pointer section).
> And then GOMP_target during gimplification determines that p
> is used, but not explicitly mapped, so it is added automatically
> to #pragma omp target as implicit map(tofrom:p).  That doesn't
> do anything with the corresponding array section, and while it is tofrom,
> it will actually be always ignored, since the region is already mapped.
> Perhaps as optimization the compiler could hint the runtime library that
> it can just look it up and doesn't need to allocate/copy anything.
> 
> Anyway, the GOMP_target_data implementation and part of GOMP_target would
> be something along the lines of following pseudocode:
> 
> device_data = lookup_device_id (device_id);
> if (device_data == NULL)
>   do host fallback;
> else
>   {
>     size_t i, length_sum = 0;
>     target_data_env = create_target_data_env ();
>     void *target_addrs[num_device_descs]; // VLA or alloca etc.
>     char *target_addr = NULL;
>     memset (target_addrs, 0, sizeof (target_addrs));
>     vec_safe_push (device_data->target_data_envs, target_data_env);
>     for (i = 0; i < num_device_descs; i++)
>       {
> 	target_addrs[i] = lookup_in_target_address_tree (device_data->addr_tree, device_desc[i].host_addr, device_desc[i].length);
>         if (target_addrs[i] == NULL)
> 	  length_sum += device_desc[i].length;
>       }
>     if (length_sum)
>       {
>         target_addr = target_malloc (device_data, length_sum);
> 	length_sum = 0;
> 	for (i = 0; i < num_device_descs; i++)
> 	  if (target_addrs[i] == NULL)
> 	    {
> 	      target_addrs[i] = target_addr + length_sum;
> 	      length_sum += device_desc[i].length;
> 	      switch (device_desc[i].kind)
> 		{
> 		case ALLOC: case FROM: /* nothing */ break;
> 		case TO:
> 		case TOFROM: target_copy_todevice (device_data, device_desc[i].host_addr, target_addrs[i], device_desc[i].length); break;
> 		case POINTER: lookup + copy to; break;
> 		}
> 	      ptr = add_to_target_address_tree (device_data->addr_tree, device_desc[i].host_addr, device_desc[i].length, target_addrs[i], device_desc[i].kind);
> 	      vec_safe_push (target_data_env->vec, ptr);
> 	    }
>       }
>     if (GOMP_target call)
>       {
>       target_call (fn_name, target_addrs);
>       FOR_EACH_SAFE_VEC (target_data_env->vec, ptr)
> 	{
> 	  switch (ptr->kind)
> 	    {
> 	    case FROM: case TOFROM: target_copy_fromdevice (device_data, ...); break;
> 	    }
> 	  remove_from_target_address_tree (device_data->addr_tree, ptr);
> 	}
>       vec_pop_and_free(device_data->target_data_envs);
>       }
>   }
> 
> and for GOMP_target_data_end it would do pretty much the stuff in between
> FOR_EACH_SAFE_VEC and vec_pop_and_free.
> All names above subject to change for something better, I just wanted to
> make the picture clear.  There needs to be some address -> target address
> data structure (addr_tree), probably some tree (AVL, whatever), in any case
> a lookup doesn't need to be exact, you can e.g. look up part of an existing
> mapping.  Trying to map something that overlaps an existing mapping, but is
> larger than that, is a user bug.
> 
> 	Jakub