Date: Wed, 21 Sep 2022 11:01:51 +0200
From: Jakub Jelinek
To: Chung-Lin Tang
Cc: gcc-patches, Tom de Vries, Catherine Moore
Subject: Re: [PATCH, nvptx, 1/2] Reimplement libgomp barriers for nvptx
In-Reply-To: <8b974d21-e288-4596-7500-277a43c92771@gmail.com>

On Wed, Sep 21, 2022 at 03:45:36PM +0800, Chung-Lin Tang via Gcc-patches wrote:
> Hi Tom,
> I had a patch submitted earlier, where I reported that the current way of
> implementing barriers in libgomp on nvptx created a quite significant
> performance drop on some SPEChpc2021 benchmarks:
> https://gcc.gnu.org/pipermail/gcc-patches/2022-September/600818.html
>
> That previous patch wasn't well received (admittedly, it was kind of a hack).
> So in this patch, I have tried to (mostly) re-implement team barriers for
> nvptx.
>
> Basically, instead of trying to have the GPU do CPU-with-OS-like things that
> it isn't suited for, barriers are implemented simply with bar.*
> synchronization instructions.  Tasks are processed after threads have
> joined, and only if team->task_count != 0.
>
> (Arguably, a little performance might be forfeited where earlier-arriving
> threads could have been used to process tasks ahead of the other threads.
> But that would again require implementing complex futex-wait/wake-like
> behavior.  Really, that kind of tasking is not what target offloading is
> usually used for.)

I admit I don't have a good picture of whether people in the real world
actually use tasking in offloading regions, and how much and in what way, but
the above would definitely be a show-stopper for typical tasking workloads,
where one thread (usually from a master/masked/single construct's body)
creates lots of tasks and can spend a considerable amount of time in those
preparations, while the other threads are expected to handle those tasks.
Do we have an idea how other implementations handle this?

I think it should be easily observable with atomics: have master/masked/single
create lots of tasks and then spend a long time doing something, have very
small task bodies that just increment some atomic counter, and at the end of
the master/masked/single region check how many tasks have already been
executed.

Note, I don't have any smart ideas how to handle this instead, and what you
posted might be ok for what people usually do on offloading targets in OpenMP,
if they use tasking at all; I just wanted to mention that there could be
workloads where the above is a serious problem.  Consider say hundreds of
threads doing nothing until a single thread reaches a barrier, with hundreds
of pending tasks...
E.g. note that we have the 64 pending task limit after which we start to
create undeferred tasks, so if we never start handling tasks until one thread
is done creating them, the single thread would create 64 deferred tasks and
then handle all the others itself, making it even longer until the other
threads can deal with them.

	Jakub