public inbox for gcc-patches@gcc.gnu.org
From: Andrew Stubbs <ams@codesourcery.com>
To: Thomas Schwinge <thomas@codesourcery.com>,
	Julian Brown <julian@codesourcery.com>
Cc: Jakub Jelinek <jakub@redhat.com>, <gcc-patches@gcc.gnu.org>,
	Tom de Vries <tdevries@suse.de>
Subject: Re: [PATCH 1/3] openacc: Add support for gang local storage allocation in shared memory
Date: Sun, 18 Apr 2021 23:53:01 +0100	[thread overview]
Message-ID: <7cf6702d-dfdc-b537-f922-d68e226bd81a@codesourcery.com> (raw)
In-Reply-To: <877dl22l3a.fsf@euler.schwinge.homeip.net>

On 16/04/2021 18:30, Thomas Schwinge wrote:
> Hi!
> 
> On 2021-04-16T17:05:24+0100, Andrew Stubbs <ams@codesourcery.com> wrote:
>> On 15/04/2021 18:26, Thomas Schwinge wrote:
>>>> and optimisation, since shared memory might be faster than
>>>> the main memory on a GPU.
>>>
>>> Do we potentially have a problem that making more use of (scarce)
>>> gang-private memory may negatively affect performance, because potentially
>>> fewer OpenACC gangs may then be launched to the GPU hardware in parallel?
>>> (Of course, OpenACC semantics conformance firstly is more important than
>>> performance, but there may be ways to be conformant and performant;
>>> "quality of implementation".)  Have you run any such performance testing
>>> with the benchmarking codes that we've got set up?
>>>
>>> (As I'm more familiar with that, I'm using nvptx offloading examples in
>>> the following, whilst assuming that similar discussion may apply for GCN
>>> offloading, which uses similar hardware concepts, as far as I remember.)
>>
>> Yes, that could happen.
> 
> Thanks for sharing the GCN perspective.
> 
>> However, there's space for quite a lot of
>> scalars before performance is affected: 64KB of LDS memory shared by a
>> hardware-defined maximum of 40 threads
> 
> (Instead of threads, something like thread blocks, I suppose?)

Workers. Wavefronts. The terminology is so confusing for these cases! 
They look like CPU threads running SIMD instructions, at least on GCN. 
OpenMP calls them threads.

Each GCN compute unit can run up to 40 of them. A gang can have up to 16 
workers (in AMD terminology, a work group can have up to 16 wavefronts), so 
each compute unit will usually have at least two gangs, meaning each 
gang would get 32KB local memory. If there are no worker loops then you 
get 40 gangs (of one worker each) per compute unit, hence the minimum of 
1.5KB per gang.
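
To put rough numbers on that (just a sketch: it assumes the whole 64KB is 
divisible and ignores whatever LDS the implementation reserves for itself, 
which is presumably why the usable figure comes out nearer 1.5KB):

  /* Back-of-the-envelope LDS arithmetic for one GCN compute unit.  */
  #define LDS_BYTES_PER_CU  (64 * 1024)  /* LDS per compute unit */
  #define MAX_WAVES_PER_CU  40           /* hardware wavefront limit */

  /* num_workers(16): 40 / 16 = 2 gangs resident per CU -> 32KB each.  */
  enum { LDS_PER_GANG_W16 = LDS_BYTES_PER_CU / (MAX_WAVES_PER_CU / 16) };

  /* num_workers(1): 40 single-wavefront gangs per CU -> ~1.6KB each.  */
  enum { LDS_PER_GANG_W1 = LDS_BYTES_PER_CU / MAX_WAVES_PER_CU };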

The local memory is specific to the compute unit and gangs launched 
there will stay there until they're done, so the 40 gangs really is the 
limit for memory division. If you launch more gangs than there are 
resources for, they get queued, so the memory doesn't get divided any further.

>> gives about 1.5KB of space for
>> worker-reduction variables and gang-private variables.
> 
> PTX, as I understand this, may generally have a lot of Thread Blocks in
> flight: all for the same GPU kernel as well as any GPU kernels running
> asynchronously/generally concurrently (system-wide), and libgomp does try
> launching a high number of Thread Blocks ('num_gangs') (for purposes of
> hiding memory access latency?).  Random example:
> 
>      nvptx_exec: kernel t0_r$_omp_fn$0: launch gangs=1920, workers=32, vectors=32
> 
> With that, PTX's 48 KiB of '.shared' memory per SM (processor) are then
> not so much anymore: just '48 * 1024 / 1920 = 25' bytes of gang-private
> memory available for each of the 1920 gangs: 'double x, y, z'?  (... for
> the simple case where just one GPU kernel is executing.)

Your maths feels way off to me: it assumes all 1920 gangs are resident 
on the SM at once. 25 bytes is not enough memory for any real use, and 
shared memory isn't the only resource that would be stretched that thin: 
how many GPU registers does an SM have? (I doubt that register contents 
are getting paged in and out.)

For comparison, with the maximum num_workers(16), GCN can run only 2 
gangs on each compute unit. Each compute unit can run 40 gangs 
simultaneously with num_workers(1), but that is the limit. If you launch 
more gangs than that then they are queued; even if you launch 100,000 
single-worker gangs, each one will still get 1/40th of the resources.
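
A hypothetical example of what I mean (kernel and names invented, only the 
clauses matter): there is no worker loop, so each gang is one wavefront, 
and however large num_gangs is, at most 40 gangs are resident per compute 
unit at a time; the rest queue behind them.

  #define N 1000000
  float a[N];

  void
  scale (float factor)
  {
  #pragma acc parallel loop gang num_gangs(100000) num_workers(1) copy(a)
    for (int i = 0; i < N; i++)
      a[i] *= factor;           /* per-gang LDS share stays >= 1/40th */
  }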

I doubt that NVPTX is magically running 1920 gangs of 32 workers on one 
SM without any queueing and with the gang resources split 1920 ways (and 
the worker resources split 61440 ways).

> (I suppose that calculation is valid for a GPU hardware variant where
> there is just one SM.  If there are several (typically in the order of a
> few dozens?), I suppose the Thread Blocks launched will be distributed
> over all these, thus improving the situation correspondingly.)
> 
> (And of course, there are certainly other factors that also limit the
> number of Thread Blocks that are actually executing in parallel.)
> 
>> We might have a
>> problem if there are large private arrays.
> 
> Yes, that's understood.
> 
> Also, directly related, the problem that comes with supporting
> worker-private memory, which basically calculates to the amount necessary
> for gang-private memory multiplied by the number of workers?  (Out of
> scope at present.)

GCN just uses the stack space for that, which lives in main memory. 
That's a limited resource, of course, but it's not architectural. I don't 
know what NVPTX does here.

>> I believe we have a "good enough" solution for the usual case
> 
> So you believe that.  ;-)
> 
> It's certainly what I'd hope, too!  But we don't know yet whether there's
> any noticeable performance impact if we run with (potentially) lesser
> parallelism, hence my question whether this patch has been run through
> performance testing.

Well, indeed I don't have comparative benchmark results, because the 
benchmarks couldn't run at full occupancy on GCN without this patch. Its 
purpose was precisely to allow us to reduce the local memory allocation 
enough to increase occupancy for benchmarks that don't use worker loops.

>> and a
>> v2.0 full solution is going to be big and hairy enough for a whole patch
>> of its own (requiring per-gang dynamic allocation, a different memory
>> address space and possibly different instruction selection too).
> 
> Agree that a fully dynamic allocation scheme likely is going to be ugly,
> so I'd certainly like to avoid that.
> 
> Before attempting that, we'd first try to optimize gang-private memory
> allocation: so that it's function-local (and thus GPU kernel-local)
> instead of device-global (assuming that's indeed possible), and try not
> using gang-private memory in cases where it's not actually necessary
> (semantically not observable, and not necessary for performance reasons).

Global layout isn't ideal, but I don't know how we'd know how much to 
reserve otherwise. I suppose one would set the shared gang memory up as 
a stack, complete with a stack pointer in the ABI, which would allow 
recursion etc., but that would have other issues.
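
Purely as a sketch of that stack idea (not what this patch does; every name 
and number below is invented): a per-gang bump pointer over a fixed pool. 
In a real implementation the pool and the pointer would live in the 
gang-shared (LDS / .shared) address space and be set up by the ABI; here a 
plain static array stands in for them.

  #define GANG_POOL_BYTES (60 * 1024)

  static char     gang_pool[GANG_POOL_BYTES];
  static unsigned gang_sp;              /* per-gang "stack pointer" */

  /* Allocate SIZE bytes of gang-private storage.  Callers must release
     in strict LIFO order, like stack frames, which is what would make
     recursion work -- and what makes overflow handling and the
     ABI-visible pointer the hairy parts.  */
  static void *
  gang_alloc (unsigned size)
  {
    void *p = &gang_pool[gang_sp];
    gang_sp += (size + 7u) & ~7u;       /* keep 8-byte alignment */
    return p;
  }

  static void
  gang_release (unsigned size)
  {
    gang_sp -= (size + 7u) & ~7u;
  }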

Andrew


Thread overview: 24+ messages
2021-02-26 12:34 [PATCH 0/3] openacc: Gang-private variables " Julian Brown
2021-02-26 12:34 ` [PATCH 1/3] openacc: Add support for gang local storage allocation " Julian Brown
2021-04-15 17:26   ` Thomas Schwinge
2021-04-16 16:05     ` Andrew Stubbs
2021-04-16 17:30       ` Thomas Schwinge
2021-04-18 22:53         ` Andrew Stubbs [this message]
2021-04-19 11:06           ` Thomas Schwinge
2021-04-19 11:23     ` Julian Brown
2021-05-21 18:55       ` Thomas Schwinge
2021-05-21 19:18       ` Thomas Schwinge
2021-05-21 19:20       ` Thomas Schwinge
2021-05-21 19:29       ` Thomas Schwinge
2021-05-22  1:40         ` [r12-989 Regression] FAIL: libgomp.oacc-fortran/privatized-ref-2.f90 -DACC_DEVICE_TYPE_host=1 -DACC_MEM_SHARED=1 -foffload=disable -Os (test for warnings, line 98) on Linux/x86_64 sunil.k.pandey
2021-05-22  8:41           ` Thomas Schwinge
2021-05-25  1:03             ` Sunil Pandey
2022-03-04 13:51         ` Test '-fopt-info-omp-all' in 'libgomp.oacc-*/kernels-private-vars-*' Thomas Schwinge
2022-03-10 11:10         ` Enhance further testcases to verify handling of OpenACC privatization level [PR90115] Thomas Schwinge
2022-03-12 13:05         ` Thomas Schwinge
2022-03-16  9:20         ` OpenACC privatization diagnostics vs. 'assert' [PR102841] Thomas Schwinge
2022-03-17  7:59         ` Enhance further testcases to verify handling of OpenACC privatization level [PR90115] Thomas Schwinge
2021-05-21 19:12   ` [PATCH 1/3] openacc: Add support for gang local storage allocation in shared memory Thomas Schwinge
2021-02-26 12:34 ` [PATCH 2/3] amdgcn: AMD GCN parts for OpenACC private variables patch Julian Brown
2021-02-26 12:34 ` [PATCH 3/3] nvptx: NVPTX " Julian Brown
2021-05-21 18:59   ` Thomas Schwinge
