public inbox for gcc-patches@gcc.gnu.org
* The nvptx port
@ 2014-11-14  8:42 Jakub Jelinek
  2014-11-14 10:17 ` Jakub Jelinek
  2014-11-14 11:17 ` Bernd Schmidt
  0 siblings, 2 replies; 19+ messages in thread
From: Jakub Jelinek @ 2014-11-14  8:42 UTC (permalink / raw)
  To: Bernd Schmidt, Richard Henderson; +Cc: gcc-patches

Hi!

I have some questions about nvptx:
1) you've said that alloca isn't supported, but it seems
   to be wired up and uses the %alloca documented in the PTX
   manual, what is the issue with that?  %alloca not being actually
   implemented by the current PTX assembler or translator?  Or
   some local vs. global address space issues?  If the latter,
   could at least VLAs be supported?
2) what is the reason why TLS isn't supported by the port (well,
   __emutls is emitted, but I doubt pthread_[gs]etspecific is
   implementable and thus it will not really do anything).
   Can't the port just emit all DECL_THREAD_LOCAL_P variables
   into .local instead of .global address space?  Would one
   need to convert those pointers to generic anyway?
   I'm asking because e.g. libgomp uses __thread heavily and
   it would be nice to be able to use that.
3) in assembly emitted by the nvptx port, I've noticed:
.visible .func (.param.u32 %out_retval)foo(.param.u64 %in_ar1, .param.u32 %in_ar2)
{
	.reg.u64 %ar1;
	.reg.u32 %ar2;
.reg.u32 %retval;
	.reg.u64 %hr10;
	.reg.u32 %r22;
	.reg.u64 %r25;
   is the missing \t before the %retval line intentional?
4) I had a brief look at what it would take to port libgomp to PTX,
   which is needed for OpenMP offloading.  OpenMP offloaded kernels
   should start with 1 team and 1 thread in it; if we ignore
   GOMP_teams for now, I think the major things are:
   - right now libgomp is heavily pthread_* based, which is a no-go
     for nvptx I assume; I think we'll need some ifdefs in the sources
   - the main thing is that I believe we just have to replace
     gomp_team_start for nvptx; seems there are
     cudaLaunchDevice (and cudaGetParameterBuffer) functions one can use
     to spawn selected kernel in selected number of threads (and teams),
     from the docs it isn't exactly clear what the calling thread will do;
     if it is suspended and the HW core given to it is reused by something
     else (e.g. one of the newly spawned threads), then I think it should
     be usable.  Not sure what happens with the .local memory of the parent
     task; if the children all have different .local memory, then
     perhaps one could just copy over what is needed from the invoking
     thread to the first invoked thread at start.  The question is
     how to figure out what to pass to cudaLaunchDevice (e.g. how to
     get a handle to the current stream), and how to query how many
     teams and/or threads it is reasonable to ask for if the program
     wants defaults (and what the hard limits on teams/threads are beyond
     which one can't go)
   - is it worth reusing cudaLaunchDevice "threads", or are they cheap
     enough to start that any "thread" pooling should be removed for nvptx?
   - we'll need some synchronization primitives, I see atomic support is
     there, we need mutexes and semaphores I think, is that implementable
     using bar instruction?
   - the library uses __attribute__((constructor)) in 3 places or so;
     initialize_team is pthread specific and can probably be ifdef'd out,
     we won't support dlclose in nvptx anyway, but at least we need some
     way to initialize the nvptx libgomp; if the initialization is done
     in global memory, would it persist between different kernels,
     so could the initialization be run once as a separate kernel, or
     something else?
   - is there any way to do any affinity management, or shall we just
     ignore affinity strategies?
   - the target/offloading stuff should most likely be stubbed out in the
     library for nvptx; target data/target regions inside of target
     regions are undefined behavior in OpenMP, no need to bloat things
   - any way to query time?
   Other thoughts?

	Jakub


* Re: The nvptx port
  2014-11-14  8:42 The nvptx port Jakub Jelinek
@ 2014-11-14 10:17 ` Jakub Jelinek
  2014-11-14 11:31   ` Bernd Schmidt
  2014-11-14 11:17 ` Bernd Schmidt
  1 sibling, 1 reply; 19+ messages in thread
From: Jakub Jelinek @ 2014-11-14 10:17 UTC (permalink / raw)
  To: Bernd Schmidt, Richard Henderson; +Cc: gcc-patches

On Fri, Nov 14, 2014 at 09:29:48AM +0100, Jakub Jelinek wrote:
> I have some questions about nvptx:

Oh, and
5) I have noticed gcc doesn't generate the .uni suffixes anywhere,
   while llvm generates them; are those appropriate only when a function
   is guaranteed to be run unconditionally from the toplevel kernel,
   or even in spots in arbitrary functions which might not be run
   unconditionally by all threads in the thread block, but where all threads
   that encounter the particular function will run the specific spot
   unconditionally?  I mean, if we have an arbitrary function:
void foo (void) { something; bar (); something; }
   then the call is unconditional in there, but there is no guarantee
   somebody will not do
void baz (int x) { if (x > 20) foo (); }
   and run foo only in a subset of the threads.

	Jakub


* Re: The nvptx port
  2014-11-14  8:42 The nvptx port Jakub Jelinek
  2014-11-14 10:17 ` Jakub Jelinek
@ 2014-11-14 11:17 ` Bernd Schmidt
  2014-11-14 11:53   ` Jakub Jelinek
  2014-11-14 16:48   ` Jeff Law
  1 sibling, 2 replies; 19+ messages in thread
From: Bernd Schmidt @ 2014-11-14 11:17 UTC (permalink / raw)
  To: Jakub Jelinek, Richard Henderson; +Cc: gcc-patches

Hi Jakub,

> I have some questions about nvptx:
> 1) you've said that alloca isn't supported, but it seems
>     to be wired up and uses the %alloca documented in the PTX
>     manual, what is the issue with that?  %alloca not being actually
>     implemented by the current PTX assembler or translator?

Yes, it's unimplemented. There's an internal declaration for it but that 
seems to be as far as it goes, and that declaration is 32-bit only anyway.

> 2) what is the reason why TLS isn't supported by the port (well,
>     __emutls is emitted, but I doubt pthread_[gs]etspecific is
>     implementable and thus it will not really do anything.
>     Can't the port just emit all DECL_THREAD_LOCAL_P variables
>     into .local instead of .global address space?

.local is stack frame memory, not TLS. The ptx docs mention the use of 
.local at file-scope as occurring only in "legacy" ptx code and I get 
the impression it's discouraged.

(As an aside, there's a question of how to represent a different
concept, gang-local memory, in gcc. That would be .shared memory. We're
currently going with just using an internal attribute.)

> 3) in assembly emitted by the nvptx port, I've noticed:
> .visible .func (.param.u32 %out_retval)foo(.param.u64 %in_ar1, .param.u32 %in_ar2)
> {
> 	.reg.u64 %ar1;
> 	.reg.u32 %ar2;
> .reg.u32 %retval;
> 	.reg.u64 %hr10;
> 	.reg.u32 %r22;
> 	.reg.u64 %r25;
>     is the missing \t before the %retval line intentional?

No, I can fix that up.

> 4) I had a brief look at what it would take to port libgomp to PTX,
>     which is needed for OpenMP offloading.  OpenMP offloaded kernels
>     should start with 1 team and 1 thread in it, if we ignore
>     GOMP_teams for now, I think the major things are:
>     - right now libgomp is heavily pthread_* based, which is a no-go
>       for nvptx I assume, I think we'll need some ifdefs in the sources

I haven't looked into whether libpthread is doable. I suspect it's a 
poor match. I also haven't really looked into OpenMP, so I'm feeling a 
bit uncertain about answering your further questions.

>     - the main thing is that I believe we just have to replace
>       gomp_team_start for nvptx; seems there are
>       cudaLaunchDevice (and cudaGetParameterBuffer) functions one can use
>       to spawn selected kernel in selected number of threads (and teams),
>       from the docs it isn't exactly clear what the calling thread will do,
>       if it is suspended and the HW core given to it is reused by something
>       else (e.g. one of the newly spawned threads), then I think it should
>       be usable.  Not sure what happens with .local memory of the parent
>       task, if the children all have different .local memory, then
>       perhaps one could just copy over what is needed from the
>       invoking to the first invoked thread at start.

I'm a bit confused here; it sounds as if you want to call
cudaLaunchDevice from ptx code? These are called from the host. As
mentioned above, .local is probably not useful for what you want.

>     - is it worth to reuse cudaLaunchDevice "threads" or are they cheap
>       enough to start that any "thread" pooling should be removed for nvptx?

Sorry, I don't understand the question.

>     - we'll need some synchronization primitives, I see atomic support is
>       there, we need mutexes and semaphores I think, is that implementable
>       using bar instruction?

It's probably membar you need.

>     - the library uses __attribute__((constructor)) in 3 places or so,
>       initialize_team is pthread specific and can be probably ifdefed out,
>       we won't support dlclose in nvptx anyway, but at least we need some
>       way to initialize the nvptx libgomp; if the initialization is done
>       in global memory, would it persist in between different kernels,
>       so can the initialization as separate kernel be run once, something
>       else?

I think that it would persist, and this would be my scheme for 
implementing constructors, but I haven't actually tried.

>     - is there any way to do any affinity management, or shall we just
>       ignore affinity strategies?

Not sure what they do in libgomp. It's probably not a match for GPU 
architectures.

>     - any way how to query time?

There are %clock and %clock64 cycle counters.
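
Something along these lines should read one of them from C (untested
sketch; I'm assuming the port accepts inline asm of this form, and note
it gives raw cycles, not wall time):

static inline unsigned int
read_clock (void)
{
  unsigned int t;
  /* Read the per-SM %clock cycle counter.  */
  asm volatile ("mov.u32 %0, %%clock;" : "=r" (t));
  return t;
}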


Bernd


* Re: The nvptx port
  2014-11-14 10:17 ` Jakub Jelinek
@ 2014-11-14 11:31   ` Bernd Schmidt
  0 siblings, 0 replies; 19+ messages in thread
From: Bernd Schmidt @ 2014-11-14 11:31 UTC (permalink / raw)
  To: Jakub Jelinek, Richard Henderson; +Cc: gcc-patches

On 11/14/2014 11:01 AM, Jakub Jelinek wrote:
> On Fri, Nov 14, 2014 at 09:29:48AM +0100, Jakub Jelinek wrote:
>> I have some questions about nvptx:
>
> Oh, and
> 5) I have noticed gcc doesn't generate the .uni suffixes anywhere,
>     while llvm generates them; are those appropriate only when a function
>     is guaranteed to be run unconditionally from the toplevel kernel,
>     or even in spots in arbitrary functions which might not be run
>     unconditionally by all threads in thread block, but all threads
>     that encounter the particular function will run the specific spot
>     unconditionally?  I mean, if we have arbitrary function:
> void foo (void) { something; bar (); something; }
>     then the call is unconditional in there, but there is no guarantee
>     somebody will not do
> void baz (int x) { if (x > 20) foo (); }
>     and run foo only in a subset of the threads.

It's unclear to me what the .uni suffix even does on calls. Google finds 
this:

   http://divmap.wordpress.com/home/divopt/

which suggests it indicates that the call's predicate will evaluate to
the same value on all threads. So I think for an unconditional call
instruction it's just meaningless.


Bernd


* Re: The nvptx port
  2014-11-14 11:17 ` Bernd Schmidt
@ 2014-11-14 11:53   ` Jakub Jelinek
  2014-11-14 12:22     ` Bernd Schmidt
                       ` (2 more replies)
  2014-11-14 16:48   ` Jeff Law
  1 sibling, 3 replies; 19+ messages in thread
From: Jakub Jelinek @ 2014-11-14 11:53 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Richard Henderson, gcc-patches

On Fri, Nov 14, 2014 at 12:09:03PM +0100, Bernd Schmidt wrote:
> >I have some questions about nvptx:
> >1) you've said that alloca isn't supported, but it seems
> >    to be wired up and uses the %alloca documented in the PTX
> >    manual, what is the issue with that?  %alloca not being actually
> >    implemented by the current PTX assembler or translator?
> 
> Yes, it's unimplemented. There's an internal declaration for it but that
> seems to be as far as it goes, and that declaration is 32-bit only anyway.

:(.  Does NVidia plan to fix that in the next version?

> >2) what is the reason why TLS isn't supported by the port (well,
> >    __emutls is emitted, but I doubt pthread_[gs]etspecific is
> >    implementable and thus it will not really do anything.
> >    Can't the port just emit all DECL_THREAD_LOCAL_P variables
> >    into .local instead of .global address space?
> 
> .local is stack frame memory, not TLS. The ptx docs mention the use of
> .local at file-scope as occurring only in "legacy" ptx code and I get the
> impression it's discouraged.

:(.  So what other options does one have to implement something like TLS, even
using inline asm or similar?  There is %tid, so perhaps indexing some array
with %tid?  The trouble with that is that some thread can do
#pragma omp parallel again, and I bet the %tid afterwards would be
again 0-(n-1), and if it is an index into a global array, it wouldn't work
well then.  Maybe without anything like TLS we can't really support nested
parallelism, only one level of #pragma omp parallel inside of nvptx regions.
But, if we add support for #pragma omp teams, we'd either need the array
in gang-local memory, or some other special register to give us the gang id.

BTW, one can still invoke OpenMP target regions (even OpenACC regions) from
multiple host threads, so the question is how without local TLS we can
actually do anything at all.  Sure, we can pass parameters to the kernel,
but we'd need to propagate them through all functions.  Or can
cudaGetParameterBuffer be used for that?

> >4) I had a brief look at what it would take to port libgomp to PTX,
> >    which is needed for OpenMP offloading.  OpenMP offloaded kernels
> >    should start with 1 team and 1 thread in it, if we ignore
> >    GOMP_teams for now, I think the major things are:
> >    - right now libgomp is heavily pthread_* based, which is a no-go
> >      for nvptx I assume, I think we'll need some ifdefs in the sources
> 
> I haven't looked into whether libpthread is doable. I suspect it's a poor
> match. I also haven't really looked into OpenMP, so I'm feeling a bit
> uncertain about answering your further questions.

What OpenMP needs is essentially:
- some way to spawn multiple threads (fork-join model), where the parent
  thread is the first one among those other threads, or, if that isn't
  possible, the first spawned thread pretends to be the parent thread
  and the parent thread sleeps
- something like pthread_mutex_lock/unlock (only basic; or say atomic ops + futex
  we use for Linux)
- something like sem_* semaphore
- and some TLS or something similar (pthread_[gs]etspecific etc.)

> >    - the main thing is that I believe we just have to replace
> >      gomp_team_start for nvptx; seems there are
> >      cudaLaunchDevice (and cudaGetParameterBuffer) functions one can use
> >      to spawn selected kernel in selected number of threads (and teams),
> >      from the docs it isn't exactly clear what the calling thread will do,
> >      if it is suspended and the HW core given to it is reused by something
> >      else (e.g. one of the newly spawned threads), then I think it should
> >      be usable.  Not sure what happens with .local memory of the parent
> >      task, if the children all have different .local memory, then
> >      perhaps one could just copy over what is needed from the
> >      invoking to the first invoked thread at start.
> 
> I'm a bit confused here, it sounds as if you want to call cudaLaunchDevice
> from ptx code? These are called from the host. As mentioned above, .local is
> probably not useful for what you want.

In CUDA_Dynamic_Parallelism_Programming_Guide.pdf, in C.3.2, it is mentioned
that it should be possible; there is:
.extern .func(.param .b32 func_retval0) cudaLaunchDevice
(
.param .b64 func,
.param .b64 parameterBuffer,
.param .align 4 .b8 gridDimension[12],
.param .align 4 .b8 blockDimension[12],
.param .b32 sharedMemSize,
.param .b64 stream
)
;
(or s/.b64/.b32/ for -m32) that should be usable from within PTX.
The Liao-OpenMP-Accelerator-Model-2013.pdf paper also mentions using dynamic
parallelism (because all other variants are just bad for OpenMP: you'd need
to preallocate all the gangs/threads (without knowing how many you'll need),
and perhaps let them sleep on some barrier until you have work for them).

> >    - is it worth to reuse cudaLaunchDevice "threads" or are they cheap
> >      enough to start that any "thread" pooling should be removed for nvptx?
> 
> Sorry, I don't understand the question.

I meant what is the cost of cudaLaunchDevice from within PTX compared to
keeping the threads around sleeping on a barrier.  As OpenMP doesn't support
threadprivate user vars in the offloaded regions, we don't have to preserve
any state and thus could always launch threads and tear them apart again.

> >    - we'll need some synchronization primitives, I see atomic support is
> >      there, we need mutexes and semaphores I think, is that implementable
> >      using bar instruction?
> 
> It's probably membar you need.

That is a memory barrier; I need threads to wait on each other, wake up one
another, etc.

> >    - is there any way to do any affinity management, or shall we just
> >      ignore affinity strategies?
> 
> Not sure what they do in libgomp. It's probably not a match for GPU
> architectures.

Ok.

> >    - any way how to query time?
> 
> There are %clock and %clock64 cycle counters.

Thanks.

	Jakub


* Re: The nvptx port
  2014-11-14 11:53   ` Jakub Jelinek
@ 2014-11-14 12:22     ` Bernd Schmidt
  2014-11-14 13:32       ` Jakub Jelinek
  2014-11-14 15:46       ` Cesar Philippidis
  2014-11-14 17:00     ` Jeff Law
  2014-11-14 17:02     ` Jeff Law
  2 siblings, 2 replies; 19+ messages in thread
From: Bernd Schmidt @ 2014-11-14 12:22 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Richard Henderson, gcc-patches, Thomas Schwinge, Cesar Philippidis

I'm adding Thomas and Cesar to the Cc list, they may have more insight 
into CUDA library questions as I haven't really looked into that part 
all that much.

On 11/14/2014 12:39 PM, Jakub Jelinek wrote:
> On Fri, Nov 14, 2014 at 12:09:03PM +0100, Bernd Schmidt wrote:
>>> I have some questions about nvptx:
>>> 1) you've said that alloca isn't supported, but it seems
>>>     to be wired up and uses the %alloca documented in the PTX
>>>     manual, what is the issue with that?  %alloca not being actually
>>>     implemented by the current PTX assembler or translator?
>>
>> Yes, it's unimplemented. There's an internal declaration for it but that
>> seems to be as far as it goes, and that declaration is 32-bit only anyway.
>
> :(.  Does NVidia plan to fix that in next version?

I very much doubt it. It was like this in CUDA 5.0 when we started 
working on it, and it's still like this in CUDA 6.5.

>>> 2) what is the reason why TLS isn't supported by the port (well,
>>>     __emutls is emitted, but I doubt pthread_[gs]etspecific is
>>>     implementable and thus it will not really do anything.
>>>     Can't the port just emit all DECL_THREAD_LOCAL_P variables
>>>     into .local instead of .global address space?
>>
>> .local is stack frame memory, not TLS. The ptx docs mention the use of
>> .local at file-scope as occurring only in "legacy" ptx code and I get the
>> impression it's discouraged.
>
> :(.  So what other option one has to implement something like TLS, even
> using inline asm or similar?  There is %tid, so perhaps indexing some array
> with %tid?

That ought to work. For performance you'd want that array in .shared 
memory but I believe that's limited in size.

> BTW, one can still invoke OpenMP target regions (even OpenACC regions) from
> multiple host threads, so the question is how without local TLS we can
> actually do anything at all.  Sure, we can pass parameters to the kernel,
> but we'd need to propagate it through all functions.  Or can
> cudaGetParameterBuffer be used for that?

Presumably a kernel could copy its arguments out to memory somewhere 
when it's called?

>>> 4) I had a brief look at what it would take to port libgomp to PTX,
>>>     which is needed for OpenMP offloading.  OpenMP offloaded kernels
>>>     should start with 1 team and 1 thread in it, if we ignore
>>>     GOMP_teams for now, I think the major things are:
>>>     - right now libgomp is heavily pthread_* based, which is a no-go
>>>       for nvptx I assume, I think we'll need some ifdefs in the sources
>>
>> I haven't looked into whether libpthread is doable. I suspect it's a poor
>> match. I also haven't really looked into OpenMP, so I'm feeling a bit
>> uncertain about answering your further questions.
>
> What OpenMP needs is essentially:
> - some way to spawn multiple threads (fork-join model), where the parent
>    thread is the first one among those other threads, or, if that isn't
>    possible, the first thread pretends to be the same as the first thread
>    and the parent thread sleeps
> - something like pthread_mutex_lock/unlock (only basic; or say atomic ops + futex
>    we use for Linux)
> - something like sem_* semaphore
> - and some TLS or something similar (pthread_[gs]etspecific etc.)
>
>>>     - the main thing is that I believe we just have to replace
>>>       gomp_team_start for nvptx; seems there are
>>>       cudaLaunchDevice (and cudaGetParameterBuffer) functions one can use
>>>       to spawn selected kernel in selected number of threads (and teams),
>>>       from the docs it isn't exactly clear what the calling thread will do,
>>>       if it is suspended and the HW core given to it is reused by something
>>>       else (e.g. one of the newly spawned threads), then I think it should
>>>       be usable.  Not sure what happens with .local memory of the parent
>>>       task, if the children all have different .local memory, then
>>>       perhaps one could just copy over what is needed from the
>>>       invoking to the first invoked thread at start.
>>
>> I'm a bit confused here, it sounds as if you want to call cudaLaunchDevice
>> from ptx code? These are called from the host. As mentioned above, .local is
>> probably not useful for what you want.
>
> In CUDA_Dynamic_Parallelism_Programming_Guide.pdf in C.3.2 it is mentioned
> it should be possible, there is:
> .extern .func(.param .b32 func_retval0) cudaLaunchDevice
> (
> .param .b64 func,
> .param .b64 parameterBuffer,
> .param .align 4 .b8 gridDimension[12],
> .param .align 4 .b8 blockDimension[12],
> .param .b32 sharedMemSize,
> .param .b64 stream
> )
> ;
> (or s/.b64/.b32/ for -m32) that should be usable from within PTX.
> The Liao-OpenMP-Accelerator-Model-2013.pdf paper also mentions using dynamic
> parallelism (because all other variants are just bad for OpenMP, you'd need
> to preallocate all the gangs/threads (without knowing how many you'll need),
> and perhaps let them sleep on some barrier until you have work for them.

The latter would have been essentially the model I'd have tried to use 
(instead of sleeping, conditionalize on %tid==0). I didn't know there 
was a way to launch kernels from ptx code and haven't thought about what 
this implies.

>>>     - is it worth to reuse cudaLaunchDevice "threads" or are they cheap
>>>       enough to start that any "thread" pooling should be removed for nvptx?
>>
>> Sorry, I don't understand the question.
>
> I meant what is the cost of cudaLaunchDevice from within PTX compared to
> keeping the threads around sleeping on a barrier.  As OpenMP doesn't support
> threadprivate user vars in the offloaded regions, we don't have to preserve
> any state and thus could always launch threads and tear them appart again.

No idea.

>>>     - we'll need some synchronization primitives, I see atomic support is
>>>       there, we need mutexes and semaphores I think, is that implementable
>>>       using bar instruction?
>>
>> It's probably membar you need.
>
> That is a memory barrier, I need threads to wait on each other, wake up one
> another etc.

Hmm. It's worthwhile to keep in mind that GPU threads really behave 
somewhat differently from CPUs (they don't really execute 
independently); the OMP model may just be a poor match for the 
architecture in general.
One could busywait on a spinlock, but AFAIK there isn't really a way to 
put a thread to sleep. By not executing independently, I mean this: I 
believe if one thread in a warp is waiting on the spinlock, all the 
other ones are also busywaiting. There may be other effects that seem 
odd if one approaches it from a CPU perspective - for example you 
probably want only one thread in a warp to try to take the spinlock.
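
As a rough illustration of the "only one thread per warp takes the
spinlock" idea (untested; the names are made up, and it assumes the
__sync_* builtins map onto PTX atom instructions and that a warp is 32
threads; a real implementation would also need membar for the data
protected by the lock):

static volatile unsigned int lock_word;   /* 0 = free, 1 = held */

static void
warp_lock (void)
{
  unsigned int tid;
  asm ("mov.u32 %0, %%tid.x;" : "=r" (tid));
  if ((tid & 31) == 0)
    while (__sync_lock_test_and_set (&lock_word, 1))
      ;                                   /* lane 0 spins; its warp waits */
}

static void
warp_unlock (void)
{
  unsigned int tid;
  asm ("mov.u32 %0, %%tid.x;" : "=r" (tid));
  if ((tid & 31) == 0)
    __sync_lock_release (&lock_word);
}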


Bernd


* Re: The nvptx port
  2014-11-14 12:22     ` Bernd Schmidt
@ 2014-11-14 13:32       ` Jakub Jelinek
  2014-11-14 13:45         ` Bernd Schmidt
  2014-11-14 17:09         ` Jeff Law
  2014-11-14 15:46       ` Cesar Philippidis
  1 sibling, 2 replies; 19+ messages in thread
From: Jakub Jelinek @ 2014-11-14 13:32 UTC (permalink / raw)
  To: Bernd Schmidt
  Cc: Richard Henderson, gcc-patches, Thomas Schwinge, Cesar Philippidis

On Fri, Nov 14, 2014 at 01:12:40PM +0100, Bernd Schmidt wrote:
> >:(.  So what other option one has to implement something like TLS, even
> >using inline asm or similar?  There is %tid, so perhaps indexing some array
> >with %tid?
> 
> That ought to work. For performance you'd want that array in .shared memory
> but I believe that's limited in size.

Any way to query those limits?  Size of .shared memory, number of threads in
warp, number of warps, etc.?  In OpenACC, are all workers in a single gang
the same warp?

> >BTW, one can still invoke OpenMP target regions (even OpenACC regions) from
> >multiple host threads, so the question is how without local TLS we can
> >actually do anything at all.  Sure, we can pass parameters to the kernel,
> >but we'd need to propagate it through all functions.  Or can
> >cudaGetParameterBuffer be used for that?
> 
> Presumably a kernel could copy its arguments out to memory somewhere when
> it's called?

The question is where.  If it is global memory, then how would you find out
what value is for your team and what value is for some other team?

> >>>    - we'll need some synchronization primitives, I see atomic support is
> >>>      there, we need mutexes and semaphores I think, is that implementable
> >>>      using bar instruction?
> >>
> >>It's probably membar you need.
> >
> >That is a memory barrier, I need threads to wait on each other, wake up one
> >another etc.
> 
> Hmm. It's worthwhile to keep in mind that GPU threads really behave somewhat
> differently from CPUs (they don't really execute independently); the OMP
> model may just be a poor match for the architecture in general.
> One could busywait on a spinlock, but AFAIK there isn't really a way to put
> a thread to sleep. By not executing independently, I mean this: I believe if
> one thread in a warp is waiting on the spinlock, all the other ones are also
> busywaiting. There may be other effects that seem odd if one approaches it
> from a CPU perspective - for example you probably want only one thread in a
> warp to try to take the spinlock.

So, for a warp, if some threads perform one branch of an if and other
threads another one, do all threads perform the first one first (with some
maybe not doing anything), and then all the threads the other one (again,
with the other threads not doing anything)?

As for the match, OpenMP isn't written for a particular accelerator, though
supposedly the addition of the #pragma omp teams construct was done for NVidia.
So, some OpenMP code may be efficient on PTX, while other code might not be
as much (e.g. if all threads in a warp need to execute the same thing,
supposedly #pragma omp task isn't a very good idea for these devices).

	Jakub


* Re: The nvptx port
  2014-11-14 13:32       ` Jakub Jelinek
@ 2014-11-14 13:45         ` Bernd Schmidt
  2014-11-14 17:09         ` Jeff Law
  1 sibling, 0 replies; 19+ messages in thread
From: Bernd Schmidt @ 2014-11-14 13:45 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Richard Henderson, gcc-patches, Thomas Schwinge, Cesar Philippidis

On 11/14/2014 01:36 PM, Jakub Jelinek wrote:
> Any way to query those limits?  Size of .shared memory, number of threads in
> warp, number of warps, etc.?

I'd have to google most of that. There seems to be a WARP_SZ constant 
available in ptx to get the size of the warp.

> In OpenACC, are all workers in a single gang
> the same warp?

No, warps are a relatively small size (32 threads).

> So, for a warp, if some threads perform one branch of an if and other
> threads another one, all threads perform the first one first (with some
> maybe not doing anything), then all the threads the others (again, other
> threads not doing anything)?

I believe that's what happens.


Bernd


* Re: The nvptx port
  2014-11-14 12:22     ` Bernd Schmidt
  2014-11-14 13:32       ` Jakub Jelinek
@ 2014-11-14 15:46       ` Cesar Philippidis
  2014-11-14 16:33         ` Jakub Jelinek
  1 sibling, 1 reply; 19+ messages in thread
From: Cesar Philippidis @ 2014-11-14 15:46 UTC (permalink / raw)
  To: Bernd Schmidt, Jakub Jelinek
  Cc: Richard Henderson, gcc-patches, Thomas Schwinge

On 11/14/2014 04:12 AM, Bernd Schmidt wrote:

>>>>     - we'll need some synchronization primitives, I see atomic
>>>> support is
>>>>       there, we need mutexes and semaphores I think, is that
>>>> implementable
>>>>       using bar instruction?
>>>
>>> It's probably membar you need.
>>
>> That is a memory barrier, I need threads to wait on each other, wake
>> up one
>> another etc.
> 
> Hmm. It's worthwhile to keep in mind that GPU threads really behave
> somewhat differently from CPUs (they don't really execute
> independently); the OMP model may just be a poor match for the
> architecture in general.
> One could busywait on a spinlock, but AFAIK there isn't really a way to
> put a thread to sleep. By not executing independently, I mean this: I
> believe if one thread in a warp is waiting on the spinlock, all the
> other ones are also busywaiting. There may be other effects that seem
> odd if one approaches it from a CPU perspective - for example you
> probably want only one thread in a warp to try to take the spinlock.

Thread synchronization in CUDA is different from that on conventional CPUs.
Using the gang/thread terminology, there's no way to synchronize two
threads in two different gangs in PTX without invoking separate kernels.
Basically, after a kernel is invoked, the host/accelerator (the latter
using dynamic parallelism) waits for the kernel to finish, and that
effectively creates a barrier.

PTX does have an intra-gang synchronization primitive, which is helpful
if the control flow diverges within a gang. Also, unless I'm mistaken,
the PTX atomic operations only work within a gang.
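
That primitive is bar.sync; wrapped up it would look roughly like this
(just a sketch; every thread of the gang has to reach it):

static inline void
gang_barrier (void)
{
  /* Barrier 0, synchronizing all threads of the CTA ("gang").  */
  asm volatile ("bar.sync 0;");
}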

Also, keep in mind that PTX doesn't have a global TID. The user needs to
calculate it using ctaid/tid and friends.
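
For the one-dimensional case that calculation is just (untested sketch;
the y/z dimensions would need the same treatment if they are used):

static inline unsigned int
global_tid (void)
{
  unsigned int tid, ntid, ctaid;
  asm ("mov.u32 %0, %%tid.x;"   : "=r" (tid));    /* thread id in CTA */
  asm ("mov.u32 %0, %%ntid.x;"  : "=r" (ntid));   /* threads per CTA  */
  asm ("mov.u32 %0, %%ctaid.x;" : "=r" (ctaid));  /* CTA id in grid   */
  return ctaid * ntid + tid;
}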

Cesar


* Re: The nvptx port
  2014-11-14 15:46       ` Cesar Philippidis
@ 2014-11-14 16:33         ` Jakub Jelinek
  2014-11-14 16:44           ` Cesar Philippidis
  0 siblings, 1 reply; 19+ messages in thread
From: Jakub Jelinek @ 2014-11-14 16:33 UTC (permalink / raw)
  To: Cesar Philippidis
  Cc: Bernd Schmidt, Richard Henderson, gcc-patches, Thomas Schwinge

On Fri, Nov 14, 2014 at 07:37:49AM -0800, Cesar Philippidis wrote:
> > Hmm. It's worthwhile to keep in mind that GPU threads really behave
> > somewhat differently from CPUs (they don't really execute
> > independently); the OMP model may just be a poor match for the
> > architecture in general.
> > One could busywait on a spinlock, but AFAIK there isn't really a way to
> > put a thread to sleep. By not executing independently, I mean this: I
> > believe if one thread in a warp is waiting on the spinlock, all the
> > other ones are also busywaiting. There may be other effects that seem
> > odd if one approaches it from a CPU perspective - for example you
> > probably want only one thread in a warp to try to take the spinlock.
> 
> Thread synchronization in CUDA is different from conventional CPUs.
> Using the gang/thread terminology, there's no way to synchronize two
> threads in two different gangs in PTX without invoking separate kernels.
> Basically, after a kernel is invoked, the host/accelerator (the later
> using dynamic parallelism) waits for the kernel to finish, and that
> effectively creates a barrier.

I believe in OpenMP terminology a gang is a team, and inter-team barriers
are not supposed to work etc. (though, I think locks and atomic instructions
still are, as are critical regions, so I really hope atomics are atomic even
inter-gang).  So for synchronization (mutexes and semaphores, from which
barriers are implemented; though perhaps bar.arrive and bar.sync could also
be used) we mainly need synchronization within the gang.

> Also, keep in mind that PTX doesn't have a global TID. The user needs to
> calculate it using ctaid/tid and friends.

Ok.  Is %gridid needed for that combo too?

	Jakub


* Re: The nvptx port
  2014-11-14 16:33         ` Jakub Jelinek
@ 2014-11-14 16:44           ` Cesar Philippidis
  2014-11-14 17:04             ` Jakub Jelinek
  0 siblings, 1 reply; 19+ messages in thread
From: Cesar Philippidis @ 2014-11-14 16:44 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Bernd Schmidt, Richard Henderson, gcc-patches, Thomas Schwinge

On 11/14/2014 08:18 AM, Jakub Jelinek wrote:

>> Also, keep in mind that PTX doesn't have a global TID. The user needs to
>> calculate it using ctaid/tid and friends.
> 
> Ok.  Is %gridid needed for that combo too?

Eventually, probably. Currently, we're launching all of our kernels with
cuLaunchKernel, and that function doesn't take grids into account.

Nvidia's documentation is kind of confusing. They use different
terminology for their high-level CUDA stuff and the low-level PTX. E.g.,
what CUDA refers to as blocks/warps, PTX calls CTAs. I'm not sure what
grids correspond to, but I think it might be devices. If that's the
case, the runtime does have the capability to select which device to run
a kernel on. But, it can't run a single kernel on multiple devices
unless you use asynchronous kernel invocations.

Cesar


* Re: The nvptx port
  2014-11-14 11:17 ` Bernd Schmidt
  2014-11-14 11:53   ` Jakub Jelinek
@ 2014-11-14 16:48   ` Jeff Law
  2014-11-17 22:49     ` Nathan Sidwell
  1 sibling, 1 reply; 19+ messages in thread
From: Jeff Law @ 2014-11-14 16:48 UTC (permalink / raw)
  To: Bernd Schmidt, Jakub Jelinek, Richard Henderson; +Cc: gcc-patches

On 11/14/14 04:09, Bernd Schmidt wrote:
> Hi Jakub,
>
>> I have some questions about nvptx:
>> 1) you've said that alloca isn't supported, but it seems
>>     to be wired up and uses the %alloca documented in the PTX
>>     manual, what is the issue with that?  %alloca not being actually
>>     implemented by the current PTX assembler or translator?
>
> Yes, it's unimplemented. There's an internal declaration for it but that
> seems to be as far as it goes, and that declaration is 32-bit only anyway.
Right.  My recollection is it's defined in the vISA, but unimplemented.

Jeff


* Re: The nvptx port
  2014-11-14 11:53   ` Jakub Jelinek
  2014-11-14 12:22     ` Bernd Schmidt
@ 2014-11-14 17:00     ` Jeff Law
  2014-11-14 17:02     ` Jeff Law
  2 siblings, 0 replies; 19+ messages in thread
From: Jeff Law @ 2014-11-14 17:00 UTC (permalink / raw)
  To: Jakub Jelinek, Bernd Schmidt; +Cc: Richard Henderson, gcc-patches

On 11/14/14 04:39, Jakub Jelinek wrote:
> On Fri, Nov 14, 2014 at 12:09:03PM +0100, Bernd Schmidt wrote:
>>> I have some questions about nvptx:
>>> 1) you've said that alloca isn't supported, but it seems
>>>     to be wired up and uses the %alloca documented in the PTX
>>>     manual, what is the issue with that?  %alloca not being actually
>>>     implemented by the current PTX assembler or translator?
>>
>> Yes, it's unimplemented. There's an internal declaration for it but that
>> seems to be as far as it goes, and that declaration is 32-bit only anyway.
>
> :(.  Does NVidia plan to fix that in next version?
They haven't indicated any such plans to me directly.  However, there's 
a clear direction to support arbitrary C/C++ over time.

jeff


* Re: The nvptx port
  2014-11-14 11:53   ` Jakub Jelinek
  2014-11-14 12:22     ` Bernd Schmidt
  2014-11-14 17:00     ` Jeff Law
@ 2014-11-14 17:02     ` Jeff Law
  2 siblings, 0 replies; 19+ messages in thread
From: Jeff Law @ 2014-11-14 17:02 UTC (permalink / raw)
  To: Jakub Jelinek, Bernd Schmidt; +Cc: Richard Henderson, gcc-patches

On 11/14/14 04:39, Jakub Jelinek wrote:

> :(.  So what other option one has to implement something like TLS, even
> using inline asm or similar?  There is %tid, so perhaps indexing some array
> with %tid?  The trouble with that is that some thread can do
> #pragma omp parallel again, and I bet the %tid afterwards would be
> again 0-(n-1), and if it is an index into a global array, it wouldn't work
> well then.  Maybe without anything like TLS we can't really support nested
> parallelism, only one level of #pragma omp parallel inside of nvptx regions.
> But, if we add support for #pragma omp team, we'd either need the array
> in gang-local memory, or some other special register to give us gang id.
Does the interface to the hardware even allow a model where we can 
launch another offload task while one is in progress?

Jeff


* Re: The nvptx port
  2014-11-14 16:44           ` Cesar Philippidis
@ 2014-11-14 17:04             ` Jakub Jelinek
  0 siblings, 0 replies; 19+ messages in thread
From: Jakub Jelinek @ 2014-11-14 17:04 UTC (permalink / raw)
  To: Cesar Philippidis
  Cc: Bernd Schmidt, Richard Henderson, gcc-patches, Thomas Schwinge

On Fri, Nov 14, 2014 at 08:37:52AM -0800, Cesar Philippidis wrote:
> On 11/14/2014 08:18 AM, Jakub Jelinek wrote:
> 
> >> Also, keep in mind that PTX doesn't have a global TID. The user needs to
> >> calculate it using ctaid/tid and friends.
> > 
> > Ok.  Is %gridid needed for that combo too?
> 
> Eventually, probably. Currently, we're launching all of our kernels with
> cuLaunchKernel, and that function doesn't take grids into account.

I wonder if cudaLaunchDevice called from PTX will result in a different
%gridid or not; I'll see next week if I manage to get the HW and SW stack.

	Jakub


* Re: The nvptx port
  2014-11-14 13:32       ` Jakub Jelinek
  2014-11-14 13:45         ` Bernd Schmidt
@ 2014-11-14 17:09         ` Jeff Law
  2014-11-17 22:32           ` Nathan Sidwell
  1 sibling, 1 reply; 19+ messages in thread
From: Jeff Law @ 2014-11-14 17:09 UTC (permalink / raw)
  To: Jakub Jelinek, Bernd Schmidt
  Cc: Richard Henderson, gcc-patches, Thomas Schwinge, Cesar Philippidis

On 11/14/14 05:36, Jakub Jelinek wrote:
>
> So, for a warp, if some threads perform one branch of an if and other
> threads another one, all threads perform the first one first (with some
> maybe not doing anything), then all the threads the others (again, other
> threads not doing anything)?
Nobody ever specified exactly what happens in this case to me, but I
gathered from reading the docs that once you have some threads in one
path and others in a different path, things slow down to a horrid crawl.
So you try to avoid that :-)


jeff


* Re: The nvptx port
  2014-11-14 17:09         ` Jeff Law
@ 2014-11-17 22:32           ` Nathan Sidwell
  0 siblings, 0 replies; 19+ messages in thread
From: Nathan Sidwell @ 2014-11-17 22:32 UTC (permalink / raw)
  To: Jeff Law, Jakub Jelinek, Bernd Schmidt
  Cc: Richard Henderson, gcc-patches, Thomas Schwinge, Cesar Philippidis

On 11/14/14 11:04, Jeff Law wrote:
> On 11/14/14 05:36, Jakub Jelinek wrote:
>>
>> So, for a warp, if some threads perform one branch of an if and other
>> threads another one, all threads perform the first one first (with some
>> maybe not doing anything), then all the threads the others (again, other
>> threads not doing anything)?

> Nobody ever specified exactly what happens in this case to me, but I gathered
> from reading the docs that once you have some threads in one path and others in
> a different path, things slow down to a horrid crawl.  So you try to avoid that :-)

This is correct.  Don't do that.


* Re: The nvptx port
  2014-11-14 16:48   ` Jeff Law
@ 2014-11-17 22:49     ` Nathan Sidwell
  0 siblings, 0 replies; 19+ messages in thread
From: Nathan Sidwell @ 2014-11-17 22:49 UTC (permalink / raw)
  To: Jeff Law, Bernd Schmidt, Jakub Jelinek, Richard Henderson; +Cc: gcc-patches

On 11/14/14 10:43, Jeff Law wrote:
> On 11/14/14 04:09, Bernd Schmidt wrote:
>> Hi Jakub,
>>
>>> I have some questions about nvptx:
>>> 1) you've said that alloca isn't supported, but it seems

>> Yes, it's unimplemented. There's an internal declaration for it but that
>> seems to be as far as it goes, and that declaration is 32-bit only anyway.
> Right.  My recollection is it's defined in the vISA, but unimplemented.

Yup, all PTX docs I've seen (which is up to 3.2) say:
'Note: The current version of PTX does not support alloca.'

and as Bernd says, the associated text only talks about a declaration for
32-bit land.

nathan


* the nvptx port
@ 2014-11-07  8:30 VandeVondele  Joost
  0 siblings, 0 replies; 19+ messages in thread
From: VandeVondele  Joost @ 2014-11-07  8:30 UTC (permalink / raw)
  To: bernds; +Cc: gcc-patches

Hi Bernd,

Reading the patches, it seems like there is no mention of sm_35, only sm_30.
So I'm wondering which 'sub'targets will initially be supported, and
if/how/when the various processors will be selected.

Thanks,

Joost




Thread overview: 19+ messages
2014-11-14  8:42 The nvptx port Jakub Jelinek
2014-11-14 10:17 ` Jakub Jelinek
2014-11-14 11:31   ` Bernd Schmidt
2014-11-14 11:17 ` Bernd Schmidt
2014-11-14 11:53   ` Jakub Jelinek
2014-11-14 12:22     ` Bernd Schmidt
2014-11-14 13:32       ` Jakub Jelinek
2014-11-14 13:45         ` Bernd Schmidt
2014-11-14 17:09         ` Jeff Law
2014-11-17 22:32           ` Nathan Sidwell
2014-11-14 15:46       ` Cesar Philippidis
2014-11-14 16:33         ` Jakub Jelinek
2014-11-14 16:44           ` Cesar Philippidis
2014-11-14 17:04             ` Jakub Jelinek
2014-11-14 17:00     ` Jeff Law
2014-11-14 17:02     ` Jeff Law
2014-11-14 16:48   ` Jeff Law
2014-11-17 22:49     ` Nathan Sidwell
  -- strict thread matches above, loose matches on Subject: below --
2014-11-07  8:30 the " VandeVondele  Joost
