Date: Fri, 12 Nov 2021 20:47:06 +0100
From: Jakub Jelinek
To: Alexander Monakov
Cc: Tom de Vries, Tobias Burnus, gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] libgomp, nvptx, v3: Honor OpenMP 5.1 num_teams lower bound
Message-ID: <20211112194706.GL2710@tucnak>

On Fri, Nov 12, 2021 at 10:16:11PM +0300, Alexander Monakov wrote:
> I suspect there may be a misunderstanding here, or maybe your explanation
> is incomplete.  I don't think the intention of the standard was to force
> such complexity.  You can launch as many blocks on the GPU as you like,
> limited only by the bitwidth of the indexing register used in hardware;
> NVIDIA guarantees at least INT_MAX blocks (in fact almost 1<<63 blocks if
> you launch a three-dimensional grid with INT_MAX x 65535 x 65535 blocks).
>
> The hardware will schedule blocks automatically (so, for example, if the
> hardware can run 40 blocks simultaneously and you launch 100, it may
> launch blocks 0 to 39, then, as each of those finishes, launch the 40th
> block and so on).
>
> So isn't the solution simply to adjust the logic around
> nvptx_adjust_launch_bounds in GOMP_OFFLOAD_run, that is, if there's a
> lower bound specified, use it instead of what adjust_launch_bounds is
> computing as max_blocks?
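
(For reference, a rough sketch in plain C of the adjustment you're
suggesting; the identifiers pick_num_blocks, max_blocks and num_teams_lower
are made up for illustration and are not the actual nvptx plugin code.)

#include <stdio.h>

/* Illustrative only: choose how many blocks (CTAs) to launch, honoring a
   requested lower bound on the number of teams when one is given.  The
   hardware schedules excess blocks in waves, so launching more blocks than
   can run concurrently is legal.  */
static int
pick_num_blocks (int max_blocks, int num_teams_lower)
{
  int blocks = max_blocks;	/* Heuristic default.  */
  if (num_teams_lower > blocks)
    blocks = num_teams_lower;	/* Honor the requested lower bound.  */
  return blocks;
}

int
main (void)
{
  /* E.g. the heuristic picked 40 concurrently resident blocks, but the
     user asked for at least 100 teams: launch 100 blocks.  */
  printf ("%d\n", pick_num_blocks (40, 100));
  return 0;
}
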
The problem is that the argument of the num_teams clause isn't always known
before the target region is launched.  gimplify.c tries hard to figure it out
whenever possible, and the standard makes that easy for the combined
target teams case, where it says the expressions in the num_teams/thread_limit
clauses are evaluated on the host before the target construct - there the
plugin is told the expected number, and unless CUDA decides to allocate fewer
than requested, we are fine.  But when target is not combined with teams, per
the spec the expressions need to be evaluated on the target, not on the host
(gimplify.c still tries to optimize some of those cases, e.g. by checking
whether it is a simple arithmetic expression in which all the variables would
be firstprivatized), and in that case we create some default number of CTAs
and only later find out what the user actually asked for.

extern int foo (void);
#pragma omp declare target to (foo)

void
bar (void)
{
  #pragma omp target
  #pragma omp teams num_teams (foo ())
  ;
}

is such a case: we simply don't know, and foo () needs to be called in the
target region.  In OpenMP 5.0 we had the option to always create fewer teams
if we decided so (of course at least 1), but in 5.1 we don't have that option.
If there is just one expression, we need to create exactly that many teams;
if it is num_teams (foo () - 10 : foo () + 10), we need to end up within that
range (inclusive).

	Jakub
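
P.S. For contrast with the bar example above, a sketch of the easy, combined
case (the function name baz is just illustrative and the bounds are
arbitrary): because target and teams form one combined construct, the
num_teams expressions are evaluated on the host before the launch, so the
plugin can be told the requested range up front.

extern int foo (void);

void
baz (void)
{
  /* Combined construct: both bound expressions run on the host before the
     offload launch; OpenMP 5.1 then requires the number of teams created
     to fall within [foo () - 10, foo () + 10], inclusive.  */
  #pragma omp target teams num_teams (foo () - 10 : foo () + 10)
  ;
}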