Message-ID: <2f3dd462-e7be-151b-f03c-9ce3eea4e069@gmail.com>
Date: Wed, 21 Sep 2022 18:02:45 +0800
Subject: Re: [PATCH, nvptx, 1/2] Reimplement libgomp barriers for nvptx
To: Jakub Jelinek, Chung-Lin Tang
Cc: gcc-patches, Tom de Vries, Catherine Moore
References: <8b974d21-e288-4596-7500-277a43c92771@gmail.com>
From: Chung-Lin Tang

On 2022/9/21 5:01 PM, Jakub Jelinek wrote:
> On Wed, Sep 21, 2022 at 03:45:36PM +0800, Chung-Lin Tang via Gcc-patches wrote:
>> Hi Tom,
>> I had a patch submitted earlier, where I reported that the current way
>> of implementing barriers in libgomp on nvptx created quite a
>> significant performance drop on some SPEChpc2021 benchmarks:
>> https://gcc.gnu.org/pipermail/gcc-patches/2022-September/600818.html
>>
>> That
>> previous patch wasn't received well (admittedly, it was kind of a hack).
>> So in this patch, I have tried to (mostly) re-implement team barriers
>> for nvptx.
>>
>> Basically, instead of trying to have the GPU do CPU-with-OS-like things
>> that it isn't suited for, barriers are implemented simplistically with
>> bar.* synchronization instructions. Tasks are processed after threads
>> have joined, and only if team->task_count != 0.
>>
>> (Arguably, a little performance is forfeited here, since
>> earlier-arriving threads could have been used to process tasks ahead of
>> the other threads. But that again would require implementing complex
>> futex-wait/wake-like behavior. Really, that kind of tasking is not what
>> target offloading is usually used for.)
>
> I admit I don't have a good picture of whether people in the real world
> actually use tasking in offloading regions, and how much and in what
> way, but the above would definitely be a show-stopper for typical
> tasking workloads, where one thread (usually from a
> master/masked/single construct's body) creates lots of tasks and can
> spend a considerable amount of time in those preparations, while the
> other threads are expected to handle those tasks.

I think the most common use case for target offloading is "parallel for".
Really, the specification not simply removing tasking from target regions
altogether is just asking for trouble. If asynchronous offloaded tasks
are to be supported, something at the level of the whole GPU offload
region is much more reasonable, like the async clause functionality in
OpenACC.

> Do we have an idea how other implementations handle this?
> I think it should be easily observable with atomics: have
> master/masked/single create lots of tasks and then spend a long time
> doing something, have very small task bodies that just increment some
> atomic counter, and at the end of the master/masked/single see how many
> tasks were already encountered.

This could be an interesting test...
> Note, I don't have any smart ideas for how to handle this instead, and
> what you posted might be OK for what people usually do on offloading
> targets in OpenMP, if they use tasking at all; I just wanted to mention
> that there could be workloads where the above is a serious problem. If
> there are, say, hundreds of threads doing nothing until a single thread
> reaches a barrier, and there are hundreds of pending tasks...

I think it might still be doable, just not in the very fine-grained
"wake one thread" style that the Linux-based implementation was doing.

> E.g. note we have that 64 pending task limit after which we start to
> create undeferred tasks, so if we never start handling tasks until one
> thread is done with them, that would mean the single thread would
> create 64 deferred tasks and then handle all the others itself, making
> it even longer until the other threads can deal with them.

Okay, thanks for reminding me of that.

Chung-Lin