Date: Fri, 12 Nov 2021 20:47:06 +0100
From: Jakub Jelinek
To: Alexander Monakov
Cc: Tom de Vries, Tobias Burnus, gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] libgomp, nvptx, v3: Honor OpenMP 5.1 num_teams lower bound
Message-ID: <20211112194706.GL2710@tucnak>

On Fri, Nov 12, 2021 at 10:16:11PM +0300, Alexander Monakov wrote:
> I suspect there may be a misunderstanding here, or maybe your explanation
> is incomplete.  I don't think the intention of the standard was to force
> such complexity.  You can launch as many blocks on the GPU as you like,
> limited only by the bitwidth of the indexing register used in hardware;
> NVIDIA guarantees at least INT_MAX blocks (in fact almost 1<<63 blocks if
> you launch a three-dimensional grid with INT_MAX x 65535 x 65535 blocks).
>
> The hardware will schedule blocks automatically (so, for example, if the
> hardware can run 40 blocks simultaneously and you launch 100, it may
> launch blocks 0 to 39, then, as each of those finishes, launch the 40th
> block and so on).
>
> So isn't the solution simply to adjust the logic around
> nvptx_adjust_launch_bounds in GOMP_OFFLOAD_run, that is, if there's a
> lower bound specified, use it instead of what adjust_launch_bounds is
> computing as max_blocks?
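
(For reference, a rough sketch in plain C of the adjustment you're
suggesting; the identifiers pick_num_blocks, max_blocks and num_teams_lower
are made up for illustration and are not the actual nvptx plugin code.)

#include <stdio.h>

/* Illustrative only: choose how many blocks (CTAs) to launch, honoring a
   requested lower bound on the number of teams when one is given.  The
   hardware schedules excess blocks in waves, so launching more blocks than
   can run concurrently is legal.  */
static int
pick_num_blocks (int max_blocks, int num_teams_lower)
{
  int blocks = max_blocks;	/* Heuristic default.  */
  if (num_teams_lower > blocks)
    blocks = num_teams_lower;	/* Honor the requested lower bound.  */
  return blocks;
}

int
main (void)
{
  /* E.g. the heuristic picked 40 concurrently resident blocks, but the
     user asked for at least 100 teams: launch 100 blocks.  */
  printf ("%d\n", pick_num_blocks (40, 100));
  return 0;
}
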
The problem is that the argument of the num_teams clause isn't always known
before the target region is launched.  gimplify.c tries hard to figure it out
whenever possible, and the standard makes that easy for the combined
target teams case, where it says the expressions in the num_teams/thread_limit
clauses are evaluated on the host before the target construct - there the
plugin is told the expected number, and unless CUDA decides to allocate fewer
than requested, we are fine.  But when target is not combined with teams, per
the spec the expressions need to be evaluated on the target, not on the host
(gimplify.c still tries to optimize some of those cases, e.g. by checking
whether it is a simple arithmetic expression in which all the variables would
be firstprivatized), and in that case we create some default number of CTAs
and only later find out what the user actually asked for.

extern int foo (void);
#pragma omp declare target to (foo)

void
bar (void)
{
  #pragma omp target
  #pragma omp teams num_teams (foo ())
  ;
}

is such a case: we simply don't know, and foo () needs to be called in the
target region.  In OpenMP 5.0 we had the option to always create fewer teams
if we decided so (of course at least 1), but in 5.1 we don't have that option.
If there is just one expression, we need to create exactly that many teams;
if it is num_teams (foo () - 10 : foo () + 10), we need to end up within that
range (inclusive).

	Jakub
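
P.S. For contrast with the bar example above, a sketch of the easy, combined
case (the function name baz is just illustrative and the bounds are
arbitrary): because target and teams form one combined construct, the
num_teams expressions are evaluated on the host before the launch, so the
plugin can be told the requested range up front.

extern int foo (void);

void
baz (void)
{
  /* Combined construct: both bound expressions run on the host before the
     offload launch; OpenMP 5.1 then requires the number of teams created
     to fall within [foo () - 10, foo () + 10], inclusive.  */
  #pragma omp target teams num_teams (foo () - 10 : foo () + 10)
  ;
}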