Date: Sat, 13 Nov 2021 00:21:58 +0300 (MSK)
From: Alexander Monakov
To: Jakub Jelinek
cc: Tobias Burnus, gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] libgomp, nvptx, v3: Honor OpenMP 5.1 num_teams lower bound

On Fri, 12 Nov 2021, Jakub Jelinek via Gcc-patches wrote:

> On Fri, Nov 12, 2021 at 08:47:09PM +0100, Jakub Jelinek wrote:
> > The problem is that the argument of the num_teams clause isn't always known
> > before target is launched.
>
> There was a design mistake that the clause has been put on teams rather than
> on target (well, for host teams we need it on teams), and 5.1 actually
> partially fixes this up for thread_limit by allowing that clause on both,
> but not for num_teams.

If this is a mistake in the standard, can GCC say "the spec is bad; fix the
spec" and refuse to implement support, since it penalizes the common case?

Technically, this could be implemented without penalizing the common case via
CUDA "dynamic parallelism": you initially launch just one block on the device,
which figures out the dimensions and then performs a GPU-side launch of the
required number of blocks. But that's a nontrivial amount of work.

I looked over your patch. I sent a small nitpick about 'nocommon' in a
separate message, and I still think it's better to adjust GOMP_OFFLOAD_run to
take the lower bound into account when it's known on the host side. Otherwise
you do static scheduling of blocks, which is going to be inferior to dynamic
scheduling: imagine the lower bound is 3 and the maximum number of resident
blocks is 2. Then you first do teams 0 and 1 in parallel, then you do team 2
from the 0'th block, while in fact you want to do it from whichever block
finished its initial team first.

Alexander
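
To make the static vs. dynamic scheduling point concrete, here is a toy
host-side model of the example above (lower bound 3, 2 resident blocks). It is
only a sketch: the function names and the per-team durations are made up for
illustration, and it ignores launch overhead and everything else the real
nvptx plugin does.

```python
import heapq


def static_makespan(durations, resident_blocks):
    """Team t is pinned to block t % resident_blocks before launch."""
    busy_until = [0] * resident_blocks
    for team, cost in enumerate(durations):
        busy_until[team % resident_blocks] += cost
    return max(busy_until)


def dynamic_makespan(durations, resident_blocks):
    """Each team, in order, runs on whichever block becomes free first."""
    free_at = [0] * resident_blocks
    heapq.heapify(free_at)
    for cost in durations:
        start = heapq.heappop(free_at)  # earliest block to finish
        heapq.heappush(free_at, start + cost)
    return max(free_at)


# Hypothetical workload: team 0 is slow, teams 1 and 2 are fast.
durations = [10, 1, 1]
print(static_makespan(durations, 2))   # 11: team 2 waits behind team 0
print(dynamic_makespan(durations, 2))  # 10: team 2 runs after team 1
```

With equal team durations the two schemes tie; the gap only shows up when the
block that would statically own the extra team is also the slow one.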