Hi Jakub,

> > * testsuite/libgomp.c-c++-common/icv-4.c: Bugfix.
>
> Better say what exactly you changed in words.

Changed.

> > --- a/gcc/gimplify.cc
> > +++ b/gcc/gimplify.cc
> > @@ -14153,7 +14153,7 @@ optimize_target_teams (tree target, gimple_seq *pre_p)
> >    struct gimplify_omp_ctx *target_ctx = gimplify_omp_ctxp;
> >
> >    if (teams == NULL_TREE)
> > -    num_teams_upper = integer_one_node;
> > +    num_teams_upper = build_int_cst (integer_type_node, -2);
> >    else
> >      for (c = OMP_TEAMS_CLAUSES (teams); c; c = OMP_CLAUSE_CHAIN (c))
> >        {
>
> The function comment above optimize_target_teams contains detailed description
> on what the values mean and why, so it definitely should document what -2 means
> and when it is used.
> I know you have documentation in libgomp for it, but it should be in both places.

I updated the comment with an explanation for "-2".

> > +  intptr_t new_teams = orig_teams, new_threads = orig_threads;
> > +  /* ORIG_TEAMS == -2: No explicit teams construct specified. Set to 1.
>
> Two spaces after .

Corrected here and at other places.

> > +     ORIG_TEAMS == -1: TEAMS construct with NUM_TEAMS clause specified, but the
> > +     value could not be specified. No Change.
>
> Likewise.
> lowercase change ?

Corrected.

> > +     ORIG_TEAMS == 0: TEAMS construct without NUM_TEAMS clause.
> > +     Set device-specific value.
> > +     ORIG_TEAMS > 0: Value was already set through e.g. NUM_TEAMS clause.
> > +     No change. */
> > +  if (orig_teams == -2)
> > +    new_teams = 1;
> > +  else if (orig_teams == 0)
> > +    {
> > +      struct gomp_offload_icv_list *item = gomp_get_offload_icv_item (device);
> > +      if (item != NULL)
> > +        new_teams = item->icvs.nteams;
> > +    }
> > +  /* The device-specific teams-thread-limit is only set if (a) an explicit TEAMS
> > +     region exists, i.e. ORIG_TEAMS > -2, and (b) THREADS was not already set by
> > +     e.g. a THREAD_LIMIT clause. */
> > +  if (orig_teams >= -2 && orig_threads == 0)
>
> The comment talks about ORIG_TEAMS > -2, but the condition is >= -2.
> So which one is it?

Thanks for the hint.  It should indeed be "> -2", since teams_thread_limit
"sets the maximum number of OpenMP threads to use in each contention group
created by a teams construct" (OpenMP 5.2, section 21.6.2).  So if there is
no (explicit) teams construct, then teams_thread_limit does not need to be
copied to the device.

> > +  /* This tests a large number of teams and threads. If it is larger than
> > +     2^15+1 then the according argument in the kernels arguments list
> > +     is encoded with two items instead of one. On NVIDIA there is an
> > +     adjustment for too large teams and threads. For AMD such adjustment
> > +     exists only for threads and will cause runtime errors with a two
> > +large
>
> s/two/too/ ?
> Shouldn't amdgcn adjusts also number of teams?

I have now also adjusted the number of teams in the amdgcn plugin.  As the
upper bound I chose two times the number of compute units, which seems
sufficient given that one team is executed on one compute unit.  This at
least avoids queueing a large number of teams and the corresponding memory
allocation.  The drawback is that a user is probably not aware of the actual
number of compute units (which is not very large on gfx cards, e.g. 120 for
gfx908 and 104 for gfx90a) and thus might expect different values from
omp_get_team_num().  For instance in something like the following:

#pragma omp target
#pragma omp teams num_teams(220)
#pragma omp distribute parallel for
for (int i = 0; i < 220; ++i)
  {
    #pragma omp critical
    ... omp_get_team_num () ...
  }
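To make that concrete, here is a rough, self-contained variant of the
fragment above; storing the team number per iteration and counting the
distinct values on the host is purely illustrative and not part of the patch
or the testsuite:

#include <stdio.h>
#include <omp.h>

#define N 220

int
main (void)
{
  int team_of[N];

  #pragma omp target map(from: team_of)
  #pragma omp teams num_teams(220)
  #pragma omp distribute parallel for
  for (int i = 0; i < N; ++i)
    team_of[i] = omp_get_team_num ();

  /* Count the distinct team numbers that were actually used; with an upper
     bound on the number of teams this can be smaller than the requested
     220.  */
  int used = 0;
  for (int i = 0; i < N; ++i)
    {
      int seen = 0;
      for (int j = 0; j < i; ++j)
        if (team_of[j] == team_of[i])
          {
            seen = 1;
            break;
          }
      if (!seen)
        used++;
    }
  printf ("distinct team numbers: %d\n", used);
  return 0;
}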
On a gfx90a card with 104 compute units, 12 threads are assigned to "reused"
teams (instead of having their own teams), which would not be the case
without the limit.  Alternatively, we could just define some (larger)
constant number (though I don't know a reasonable value here), but that does
not actually solve the drawback mentioned above.  I think we need to find a
compromise between an unnecessarily small upper bound and the chance of
memory allocation failures due to a too large number of teams.

> As for testcases, have you tested this in a native setup where
> dg-set-target-env-var actually works?

Besides remote testing with offloading (which does not yet work with
dg-set-target-env-var), I also tested locally on x86_64-pc-linux-gnu with
one nvptx offload device without issues (using "make check" and verifying
that offloading is indeed used).

Marcel