From: Thomas Schwinge
To: Andrew Stubbs, Julian Brown
CC: Jakub Jelinek, Tom de Vries
Subject: Re: [PATCH 1/3] openacc: Add support for gang local storage allocation in shared memory
Date: Mon, 19 Apr 2021 13:06:07 +0200
Message-ID: <87a6puv8io.fsf@euler.schwinge.homeip.net>
In-Reply-To: <7cf6702d-dfdc-b537-f922-d68e226bd81a@codesourcery.com>
List-Id: Gcc-patches mailing list

Hi!

On 2021-04-18T23:53:01+0100, Andrew Stubbs wrote:
> On 16/04/2021 18:30, Thomas Schwinge wrote:
>> On 2021-04-16T17:05:24+0100, Andrew Stubbs wrote:
>>> On 15/04/2021 18:26, Thomas Schwinge wrote:
>>>>> and optimisation, since shared memory might be faster than
>>>>> the main memory on a GPU.
>>>>
>>>> Do we potentially have a problem that making more use of (scarce)
>>>> gang-private memory may negatively affect performance, because
>>>> potentially fewer OpenACC gangs may then be launched to the GPU
>>>> hardware in parallel?  (Of course, OpenACC semantics conformance
>>>> firstly is more important than performance, but there may be ways to
>>>> be conformant and performant; "quality of implementation".)  Have you
>>>> run any such performance testing with the benchmarking codes that
>>>> we've got set up?
>>>>
>>>> (As I'm more familiar with that, I'm using nvptx offloading examples
>>>> in the following, whilst assuming that similar discussion may apply
>>>> for GCN offloading, which uses similar hardware concepts, as far as I
>>>> remember.)
>>>
>>> Yes, that could happen.
>>
>> Thanks for sharing the GCN perspective.
>>
>>> However, there's space for quite a lot of
>>> scalars before performance is affected: 64KB of LDS memory shared by a
>>> hardware-defined maximum of 40 threads
>>
>> (Instead of threads, something like thread blocks, I suppose?)
>
> Workers. Wavefronts.

(ACK.)

> The terminology is so confusing for these cases!

Absolutely!  Everyone has their own, and slightly redefines the meaning
of certain words -- and then again uses different words for the same
things/concepts...

> They look like CPU threads running SIMD instructions, at least on GCN.
> OpenMP calls them threads.

Alright -- and in OpenACC (which is the context here), "a thread is any
one vector lane of one worker of one gang" (that is, any element of a
GCN SIMD instruction).

> Each GCN compute unit can run up to 40 of them.  A gang can have up to
> 16 workers (in AMD terminology, a work group can have up to 16
> wavefronts), so each compute unit will usually have at least two gangs,
> meaning each gang would get 32KB local memory.  If there are no worker
> loops then you get 40 gangs (of one worker each) per compute unit, hence
> the minimum of 1.5KB per gang.
>
> The local memory is specific to the compute unit, and gangs launched
> there will stay there until they're done, so the 40 gangs really is the
> limit for memory division.  If you launch more gangs than there are
> resources then they get queued, so the memory doesn't get divided any
> more.
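(Just to double-check my reading of those numbers, here's a trivial
back-of-the-envelope sketch in C; the constants are simply the ones
you've stated above, nothing measured:)

    #include <stdio.h>

    int main (void)
    {
      /* GCN numbers as stated above; illustrative only.  */
      const int lds_bytes_per_cu = 64 * 1024;  /* LDS per compute unit.  */
      const int max_wavefronts_per_cu = 40;    /* Hardware limit.  */
      const int max_workers_per_gang = 16;     /* Work group of 16 wavefronts.  */

      /* num_workers(16): two gangs per compute unit, 32KB each.  */
      int gangs = max_wavefronts_per_cu / max_workers_per_gang;
      printf ("num_workers(16): %d gangs/CU, %d KB LDS per gang\n",
              gangs, lds_bytes_per_cu / gangs / 1024);

      /* num_workers(1): forty gangs per compute unit, ~1.5KB each.  */
      printf ("num_workers(1): %d gangs/CU, %d bytes LDS per gang\n",
              max_wavefronts_per_cu, lds_bytes_per_cu / max_wavefronts_per_cu);

      return 0;
    }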
>>> gives about 1.5KB of space for
>>> worker-reduction variables and gang-private variables.
>>
>> PTX, as I understand this, may generally have a lot of Thread Blocks in
>> flight: all for the same GPU kernel as well as any GPU kernels running
>> asynchronously/generally concurrently (system-wide), and libgomp does
>> try launching a high number of Thread Blocks ('num_gangs') (for
>> purposes of hiding memory access latency?).  Random example:
>>
>>     nvptx_exec: kernel t0_r$_omp_fn$0: launch gangs=1920, workers=32, vectors=32
>>
>> With that, PTX's 48 KiB of '.shared' memory per SM (processor) is then
>> not so much anymore: just '48 * 1024 / 1920 = 25' bytes of gang-private
>> memory available for each of the 1920 gangs: 'double x, y, z'?  (... for
>> the simple case where just one GPU kernel is executing.)
>
> Your maths feels way off to me.  That's not enough memory for any use,
> and it's not the only resource that will be stretched thin:

Might be way off, yes.  I did mention "other [limiting] factors" later
on, and: according to the documentation that I'd pointed to, CC 3.5 may
have "Maximum number of resident blocks per SM": "16".  (Aha, and if,
for example, we assume there are 80 SMs, then libgomp launching 1920
gangs means '1920 / 80 = 24' Thread Blocks per SM -- which seems
reasonable.)

What I don't know is whether "resident" means scheduled/executing and
whether the same applies to the '.shared' memory allocation -- or
whether the two parts are separate (thus you can occupy '.shared' memory
without having it used via execution).  If we assume that allocation and
execution are done in one, and there is no pre-emption once launched,
that indeed simplifies the considerations quite a bit.

We'd then have a decent '48 * 1024 / 16 = 3072' bytes of gang-private
memory available for each of the 16 "resident" gangs (per SM).
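(Same back-of-the-envelope exercise for these nvptx numbers; the 80 SMs
figure is just my assumption from above, so again this is purely
illustrative, not measured:)

    #include <stdio.h>

    int main (void)
    {
      /* Documentation numbers cited above (CC 3.5 era), plus an assumed
         80 SMs; illustrative only.  */
      const int shared_bytes_per_sm = 48 * 1024;  /* '.shared' memory per SM.  */
      const int max_resident_blocks = 16;         /* Resident Thread Blocks per SM.  */
      const int num_sms = 80;                     /* Assumption.  */
      const int gangs_launched = 1920;            /* libgomp 'num_gangs' example.  */

      printf ("Thread Blocks per SM if spread evenly: %d\n",
              gangs_launched / num_sms);                     /* 24 */
      printf ("'.shared' bytes per resident gang: %d\n",
              shared_bytes_per_sm / max_resident_blocks);    /* 3072 */
      printf ("'.shared' bytes if split %d ways: %d\n",
              gangs_launched, shared_bytes_per_sm / gangs_launched);  /* 25 */

      return 0;
    }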
> how many GPU
> registers does an SM have?

"Number of 32-bit registers per SM": "64 K", and with "Maximum number of
resident threads per SM": "2048", that means '64 K / 2048 = 32' registers
in this configuration, vs. "Maximum number of 32-bit registers per
thread": "255" with correspondingly reduced occupancy.

> (I doubt that register contents are getting
> paged in and out.)

(Again, I have not looked up to which extent Nvidia GPUs/Driver are
doing any such things.)

> For comparison, with the maximum num_workers(16) GCN can run only 2
> gangs on each compute unit.  Each compute unit can run 40 gangs
> simultaneously with num_workers(1), but that is the limit.  If you
> launch more gangs than that then they are queued; even if you launch
> 100,000 single-worker gangs, each one will still get 1/40th of the
> resources.
>
> I doubt that NVPTX is magically running 1920 gangs of 32 workers on one
> SM without any queueing and with the gang resources split 1920 ways
> (and the worker resources split 61440 ways).

No, indeed.  As I'd said:

>> (I suppose that calculation is valid for a GPU hardware variant where
>> there is just one SM.  If there are several (typically in the order of
>> a few dozens?), I suppose the Thread Blocks launched will be
>> distributed over all these, thus improving the situation
>> correspondingly.)
>>
>> (And of course, there are certainly other factors that also limit the
>> number of Thread Blocks that are actually executing in parallel.)

>>> We might have a
>>> problem if there are large private arrays.
>>
>> Yes, that's understood.
>>
>> Also, directly related, the problem that comes with supporting
>> worker-private memory, which basically amounts to the gang-private
>> memory requirement multiplied by the number of workers?  (Out of scope
>> at present.)
>
> GCN just uses the stack space for that, which lives in main memory.
> That's a limited resource, of course, but it's not architectural.  I
> don't know what NVPTX does here.

Per my understanding, neither GCN nor nvptx supports OpenACC
worker-private memory yet.

>>> I believe we have a "good enough" solution for the usual case
>>
>> So you believe that.  ;-)
>>
>> It's certainly what I'd hope, too!  But we don't know yet whether
>> there's any noticeable performance impact if we run with (potentially)
>> lesser parallelism, hence my question whether this patch has been run
>> through performance testing.
>
> Well, indeed I don't know the comparative situation with benchmark
> results, because the benchmarks couldn't run at full occupancy, on GCN,
> without it.  The purpose of this patch was precisely to allow us to
> reduce the local memory allocation enough to increase occupancy for
> benchmarks that don't use worker loops.

ACK, that's the GCN perspective.  But for nvptx, we ought to be careful
not to regress existing functionality/performance.

Plus, we all agree, the proposed code changes do improve certain aspects
of OpenACC specification conformance: the concept of gang-private
memory.
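(For the record, the kind of code I have in mind when talking about
gang-private memory is something like the following made-up example --
not taken from the patch or its testsuite.  Here 'x' is private to each
gang, but shared by all workers of that gang, so it's a candidate for
placement in GCN LDS / nvptx '.shared' memory:)

    #define N 32
    #define M 32

    void
    f (int *a)
    {
      int x;

    #pragma acc parallel copy(a[0:N * M])
      {
    #pragma acc loop gang private(x)
        for (int i = 0; i < N; i++)
          {
            /* Written once per gang-assigned iteration...  */
            x = i % 3;
    #pragma acc loop worker
            for (int j = 0; j < M; j++)
              /* ...and read by all workers of that gang.  */
              a[i * M + j] += x;
          }
      }
    }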
>>> and a
>>> v2.0 full solution is going to be big and hairy enough for a whole
>>> patch of its own (requiring per-gang dynamic allocation, a different
>>> memory address space and possibly different instruction selection
>>> too).
>>
>> Agree that a fully dynamic allocation scheme likely is going to be
>> ugly, so I'd certainly like to avoid that.
>>
>> Before attempting that, we'd first try to optimize gang-private memory
>> allocation: so that it's function-local (and thus GPU kernel-local)
>> instead of device-global (assuming that's indeed possible), and try not
>> using gang-private memory in cases where it's not actually necessary
>> (semantically not observable, and not necessary for performance
>> reasons).
>
> Global layout isn't ideal, but I don't know how we'd know how much to
> reserve otherwise.  I suppose one would set the shared gang memory up as
> a stack, complete with a stack pointer in the ABI, which would allow
> recursion etc., but that would have other issues.

Due to lack of in-depth knowledge, I haven't made an attempt to reason
about how to implement that on GCN, but for nvptx there certainly is
evidence of '.shared' memory allocation per function, building a
complete call graph from the GPU kernel entry point onwards, and thus
'.shared' memory allocation per each individual GPU kernel launch.

(Yet, again, I'm totally fine with deferring all these things until
later -- unless the nvptx performance testing numbers mandate
otherwise.)


Grüße
 Thomas