From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <fortran-return-46597-listarch-fortran=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 104705 invoked by alias); 29 Jun 2016 15:31:40 -0000
Mailing-List: contact fortran-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <fortran.gcc.gnu.org>
List-Subscribe: <mailto:fortran-subscribe@gcc.gnu.org>
List-Post: <mailto:fortran@gcc.gnu.org>
List-Help: <mailto:fortran-help@gcc.gnu.org>, <http://sourceware.org/lists.html#faqs>
Sender: fortran-owner@gcc.gnu.org
Received: (qmail 104625 invoked by uid 89); 29 Jun 2016 15:31:39 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-1.7 required=5.0 tests=AWL,BAYES_00,RCVD_IN_DNSWL_NONE,SPF_PASS autolearn=ham version=3.3.2 spammy=H*F:U*cesar, UD:php, states
X-Spam-User: qpsmtpd, 2 recipients
X-HELO: relay1.mentorg.com
Received: from relay1.mentorg.com (HELO relay1.mentorg.com) (192.94.38.131) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with (AES256-GCM-SHA384 encrypted) ESMTPS; Wed, 29 Jun 2016 15:31:28 +0000
Received: from svr-orw-fem-03.mgc.mentorg.com ([147.34.97.39])	by relay1.mentorg.com with esmtp 	id 1bIHSX-0005ry-0O from Cesar_Philippidis@mentor.com ; Wed, 29 Jun 2016 08:31:25 -0700
Received: from [127.0.0.1] (147.34.91.1) by svr-orw-fem-03.mgc.mentorg.com (147.34.97.39) with Microsoft SMTP Server id 14.3.224.2; Wed, 29 Jun 2016 08:31:24 -0700
Subject: Re: [PATCH,openacc] check for compatible loop parallelism with acc routine calls
To: Thomas Schwinge <thomas@codesourcery.com>, Jakub Jelinek	<jakub@redhat.com>
References: <5762190F.4030102@codesourcery.com> <20160617144206.GC7387@tucnak.redhat.com> <576C08D2.6070008@codesourcery.com> <87inwst78s.fsf@kepler.schwinge.homeip.net>
CC: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>, Fortran List	<fortran@gcc.gnu.org>, Alexander Monakov <amonakov@ispras.ru>, Ilya Verbin	<iverbin@gmail.com>
From: Cesar Philippidis <cesar@codesourcery.com>
Message-ID: <5773E9CC.2000202@codesourcery.com>
Date: Wed, 29 Jun 2016 15:31:00 -0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.8.0
MIME-Version: 1.0
In-Reply-To: <87inwst78s.fsf@kepler.schwinge.homeip.net>
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
X-IsSubscribed: yes
X-SW-Source: 2016-06/txt/msg00119.txt.bz2

On 06/29/2016 07:11 AM, Thomas Schwinge wrote:

> Cesar, I have not yet fully digested this, but do I understand right that
> you're really fixing two issues here, that are related (OpenACC routines)
> but still can be addressed independently of each other?  Do I understand
> right that the first one, the "problems with acc routines [...]
> incorrectly permitting 'acc seq' loops to call gang, worker and vector
> routines" is just a Fortran front end patch?  If yes, please split that
> one out, so as to reduce the volume of remaining changes that remain to
> be discussed.

This patch addresses the following issues:

 1. Issues warnings when a non-acc routine function is called inside an
    OpenACC offloaded region.

 2. It corrects a bug what was allowing seq loops to call gang, worker
    and vector routines.

 3. It adds supports for acc routines in fortran modules (which I
    noticed was missing when I added 'acc routine seq' to acc_on_device
    in the fortran openacc include files).

I'll split these into separate patches.

> On Thu, 23 Jun 2016 09:05:38 -0700, Cesar Philippidis <cesar@codesourcery.com> wrote:
>> On 06/17/2016 07:42 AM, Jakub Jelinek wrote:
>>> On Wed, Jun 15, 2016 at 08:12:15PM -0700, Cesar Philippidis wrote:
>>>> The second set of changes involves teaching the gimplifier to error when
>>>> it detects a function call to an non-acc routines inside an OpenACC
>>>> offloaded region.
> 
> As I understand, that's the same problem as has been discussed before
> (Ilya CCed), and has recently again been filed in
> <https://gcc.gnu.org/PR71499> "ICE in LTO1 when attempting NVPTX
> offloading (-fopenacc)", and <https://gcc.gnu.org/PR71535> "ICE in LTO1
> with -fopenmp offloading" (Alexander CCed).  Some earlier discussion
> threads include:
> <http://news.gmane.org/find-root.php?message_id=%3C20150109145702.GA45210%40msticlxl57.ims.intel.com%3E>,
> <http://news.gmane.org/find-root.php?message_id=%3C20150724152119.GA41292%40msticlxl57.ims.intel.com%3E>,
> <http://news.gmane.org/find-root.php?message_id=%3C56269C05.6030502%40acm.org%3E>.
> 
>>>> Actually, I relaxed non-acc routines by excluding
>>>> calls to builtin functions, including those prefixed with _gfortran_.
>>>> Nvptx does have a newlib c library, and it also has a subset of
>>>> libgfortran. Still, this solution is probably not optimal.
>>>
>>> I don't really like that, hardcoding prefixes or whatever is available
>>> (you have quite some subset of libc, libm etc. available too) in the
>>> compiler looks very hackish.  What is wrong with complaining during
>>> linking of the offloaded code?
> 
> ACK.  Jakub, do I understand you correctly, that you basically say that
> every function declaration that is in scope inside offloaded regions (for
> example, GCC builtin functions, or standard library functions declared in
> target compiler's header files) is permitted to be called in offloaded
> regions, and the offloading compiler will then either be able to resolve
> these (nvptx back end knows about trigonometric functions, for example,
> and a lot of functions are available in the nvptx libc), or otherwise
> error out during the offloading compilation (during linking), gracefully
> without terminating the target compilation (that "gracefully" bit is
> currently missing -- that's for another day).  That is, all such
> functions are implicitly callable as OpenACC "seq" functions (which means
> that they don't internally use gang/worker/vector parallelism).  In
> particular, all these functions do *not* need to be marked with an
> explicit "#pragma acc routine seq" directive.  (Functions internally
> using gang/worker/vector parallelism will need to be marked
> appropriately, using a "#pragma acc routine gang/worker/vector"
> directive.)  That's how I understand your comment above, and your earlier
> comments on this topic, and also is what I think should be done.

OK. I'll drop the warning changes from my patch set then unless you want
to keep it.

> A few random comments on the patch:
> 
>> --- a/gcc/fortran/gfortran.h
>> +++ b/gcc/fortran/gfortran.h
>> @@ -303,6 +303,15 @@ enum save_state
>>  { SAVE_NONE = 0, SAVE_EXPLICIT, SAVE_IMPLICIT
>>  };
>>  
>> +/* Flags to keep track of ACC routine states.  */
>> +enum oacc_function
>> +{ OACC_FUNCTION_NONE = 0,
>> +  OACC_FUNCTION_SEQ,
>> +  OACC_FUNCTION_GANG,
>> +  OACC_FUNCTION_WORKER,
>> +  OACC_FUNCTION_VECTOR
>> +};
> 
> What's the purpose of OACC_FUNCTION_NONE?  It's not used anywhere, as far
> as I can tell?

It's used by the fortran module code. It controls how parallelism gets
encoded in the .mod files.

>> --- a/gcc/fortran/openmp.c
>> +++ b/gcc/fortran/openmp.c
>> @@ -1664,21 +1664,31 @@ gfc_match_oacc_cache (void)
>>  
>>  /* Determine the loop level for a routine.   */
>>  
>> -static int
>> +static oacc_function
>>  gfc_oacc_routine_dims (gfc_omp_clauses *clauses)
>>  {
>>    int level = -1;
>> +  oacc_function ret = OACC_FUNCTION_SEQ;
>>  
>>    if (clauses)
>>      {
>>        unsigned mask = 0;
>>  
>>        if (clauses->gang)
>> -	level = GOMP_DIM_GANG, mask |= GOMP_DIM_MASK (level);
>> +	{
>> +	  level = GOMP_DIM_GANG, mask |= GOMP_DIM_MASK (level);
>> +	  ret = OACC_FUNCTION_GANG;
>> +	}
>>        if (clauses->worker)
>> -	level = GOMP_DIM_WORKER, mask |= GOMP_DIM_MASK (level);
>> +	{
>> +	  level = GOMP_DIM_WORKER, mask |= GOMP_DIM_MASK (level);
>> +	  ret = OACC_FUNCTION_WORKER;
>> +	}
>>        if (clauses->vector)
>> -	level = GOMP_DIM_VECTOR, mask |= GOMP_DIM_MASK (level);
>> +	{
>> +	  level = GOMP_DIM_VECTOR, mask |= GOMP_DIM_MASK (level);
>> +	  ret = OACC_FUNCTION_VECTOR;
>> +	}
>>        if (clauses->seq)
>>  	level = GOMP_DIM_MAX, mask |= GOMP_DIM_MASK (level);
>>  
> 
> I have not looked in detail, so maybe I'm misunderstanding what is being
> done here -- but how do "clauses->seq" and "level = GOMP_DIM_MAX" fit
> together?

Conceptually, if you take a look at the oacc_function attribute in a
tree dump, you'll see an array with three elements. Basically, each
element in that array represents a gang, worker or vector parallelism.
By definition, a gang loop permits a worker and vector loop to be nested
inside it. So, for a gang routine, the oacc_function attribute is
constructed such that it permits gang, worker and vector level
parallelism. Similarly, for a worker routine, the oacc_function
attribute has the worker and vector level parallelism 'bits' set.

With that in mind, setting seq to GOMP_DIM_MASK allows the loop creating
that oacc_function attribute to mask out any gang, worker and vector
parallelism.

>> @@ -1689,7 +1699,7 @@ gfc_oacc_routine_dims (gfc_omp_clauses *clauses)
>>    if (level < 0)
>>      level = GOMP_DIM_MAX;
>>  
>> -  return level;
>> +  return ret;
>>  }
> 
> Just from that last hunk, it seems that the assignment to "level" is a
> dead store?

I'll need to check this when I split out the patch.

>> +static tree
>> +add_attributes_to_decl (symbol_attribute sym_attr, tree list)
>> +{
>> +  unsigned id;
>> +  tree attr;
>> +
>> +  for (id = 0; id < EXT_ATTR_NUM; id++)
>> +    if (sym_attr.ext_attr & (1 << id))
>> +      {
>> +	attr = build_tree_list (
>> +		 get_identifier (ext_attr_list[id].middle_end_name),
>> +				 NULL_TREE);
>> +	list = chainon (list, attr);
>> +      }
>> +
>> +  list = add_omp_offloading_attributes (sym_attr.omp_declare_target,
>> +					sym_attr.oacc_function, list);
>> +
>> +  return list;
>> +}
> 
> Something that I had noticed before, possibly related here: code in
> gcc/fortran/ does never call replace_oacc_fn_attrib, but the C and C++
> front ends do.  Is that function what you've re-implemented here?

Similar, but I broke this code out from another function to handle
BUILT_IN_EXPECT. But I can revert this change now, since BUILT_IN_EXPECT
will be treated as an implicit SEQ routine.

>> --- a/gcc/lto-cgraph.c
>> +++ b/gcc/lto-cgraph.c
>> @@ -1201,9 +1201,11 @@ input_overwrite_node (struct lto_file_decl_data *file_data,
>>  				     LDPR_NUM_KNOWN);
>>    node->instrumentation_clone = bp_unpack_value (bp, 1);
>>    node->split_part = bp_unpack_value (bp, 1);
>> -  gcc_assert (flag_ltrans
>> -	      || (!node->in_other_partition
>> -		  && !node->used_from_other_partition));
>> +
>> +  int success = flag_ltrans || (!node->in_other_partition
>> +				&& !node->used_from_other_partition);
>> +  if (!success)
>> +    error ("Missing %<%s%>", node->name ());
>>  }
>>  
>>  /* Return string alias is alias of.  */
>> @@ -1416,9 +1418,11 @@ input_varpool_node (struct lto_file_decl_data *file_data,
>>      node->set_section_for_node (section);
>>    node->resolution = streamer_read_enum (ib, ld_plugin_symbol_resolution,
>>  					        LDPR_NUM_KNOWN);
>> -  gcc_assert (flag_ltrans
>> -	      || (!node->in_other_partition
>> -		  && !node->used_from_other_partition));
>> +
>> +  int success = flag_ltrans || (!node->in_other_partition
>> +				&& !node->used_from_other_partition);
>> +  if (!success)
>> +    error ("Missing %<%s%>", node->name ());
>>  
>>    return node;
>>  }
> 
> That looks similar to what I remember from earlier, simiar patches, as
> referenced above.

It is. I never got around to pushing that patch very strongly because I
thought those link failures were legitimate compiler bugs.

>> @@ -19420,7 +19442,8 @@ oacc_loop_fixed_partitions (oacc_loop *loop, unsigned outer_mask)
>>      {
>>        unsigned outermost = this_mask & -this_mask;
>>  
>> -      if (outermost && outermost <= outer_mask)
>> +      if ((outermost && outermost <= outer_mask)
>> +	  || (this_mask && (loop->parent->flags & OLF_SEQ)))
>>  	{
>>  	  if (noisy)
>>  	    {
> 
>> --- a/gcc/testsuite/c-c++-common/goacc/routine-3.c
>> +++ b/gcc/testsuite/c-c++-common/goacc/routine-3.c
>> @@ -49,7 +49,7 @@ main ()
>>    int red = 0;
>>  #pragma acc parallel copy (red)
>>    {
>> -    /* Independent/seq loop tests.  */
>> +    /* Independent loop tests.  */
>>  #pragma acc loop reduction (+:red) // { dg-warning "insufficient partitioning" }
>>      for (int i = 0; i < 10; i++)
>>        red += gang ();
>> @@ -62,6 +62,19 @@ main ()
>>      for (int i = 0; i < 10; i++)
>>        red += vector ();
>>  
>> +    /* Seq loop tests.  */
>> +#pragma acc loop seq reduction (+:red)  /* { dg-message "containing loop" } */
>> +    for (int i = 0; i < 10; i++)
>> +      red += gang (); /* { dg-error "incorrectly nested" } */
>> +
>> +#pragma acc loop seq reduction (+:red)  /* { dg-message "containing loop" } */
>> +    for (int i = 0; i < 10; i++)
>> +      red += worker (); /* { dg-error "incorrectly nested" } */
>> +
>> +#pragma acc loop seq reduction (+:red)  /* { dg-message "containing loop" } */
>> +    for (int i = 0; i < 10; i++)
>> +      red += vector (); /* { dg-error "incorrectly nested" } */
>> +    
>>      /* Gang routine tests.  */
>>  #pragma acc loop gang reduction (+:red)  /* { dg-message "containing loop" } */
>>      for (int i = 0; i < 10; i++)
> 
> Do these test case changes actually relate to any of the compiler changes
> discussed above?  Maybe to the oacc_loop_fixed_partitions cited just
> above?  Is that a separate issue to fix?  Eh, or is that actually the fix
> for your first issue, the "problems with acc routines [...] incorrectly
> permitting 'acc seq' loops to call gang, worker and vector routines"?

This is issue 2, and I'll break it out into a separate patch.

>> --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/data-clauses-kernels-ipa-pta.c
>> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/data-clauses-kernels-ipa-pta.c
>> @@ -1,4 +1,4 @@
>>  /* { dg-do run { target lto } } */
>> -/* { dg-additional-options "-fipa-pta -flto -flto-partition=max" } */
>> +/* { dg-additional-options "-fipa-pta -flto -flto-partition=max -fno-exceptions" } */
>>  
>>  #include "data-clauses-kernels.c"
> 
>> --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/data-clauses-kernels.c
>> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/data-clauses-kernels.c
>> @@ -1,2 +1,4 @@
>> +/* { dg-additional-options "-fno-exceptions" }  */
>> +
>>  #define CONSTRUCT kernels
>>  #include "data-clauses.h"
> 
>> --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/data-clauses-parallel-ipa-pta.c
>> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/data-clauses-parallel-ipa-pta.c
>> @@ -1,4 +1,4 @@
>>  /* { dg-do run { target lto } } */
>> -/* { dg-additional-options "-fipa-pta -flto -flto-partition=max" } */
>> +/* { dg-additional-options "-fipa-pta -flto -flto-partition=max -fno-exceptions" } */
>>  
>>  #include "data-clauses-parallel.c"
> 
>> --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/data-clauses-parallel.c
>> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/data-clauses-parallel.c
>> @@ -1,2 +1,4 @@
>> +/* { dg-additional-options "-fno-exceptions" }  */
>> +
>>  #define CONSTRUCT parallel
>>  #include "data-clauses.h"
> 
> Hmm?

I'm not sure what happened here either. Maybe adding the 'acc routine'
directive to acc_on_device is preventing that function from expanding to
its builtin function counterpart, which caused gcc to generate exception
code?

Cesar