* The nvptx port
@ 2014-11-14 8:42 Jakub Jelinek
2014-11-14 10:17 ` Jakub Jelinek
2014-11-14 11:17 ` Bernd Schmidt
0 siblings, 2 replies; 19+ messages in thread
From: Jakub Jelinek @ 2014-11-14 8:42 UTC (permalink / raw)
To: Bernd Schmidt, Richard Henderson; +Cc: gcc-patches
Hi!
I have some questions about nvptx:
1) you've said that alloca isn't supported, but it seems
to be wired up and uses the %alloca documented in the PTX
manual, what is the issue with that? %alloca not being actually
implemented by the current PTX assembler or translator? Or
some local vs. global address space issues? If the latter,
could at least VLAs be supported?
2) what is the reason why TLS isn't supported by the port (well,
__emutls is emitted, but I doubt pthread_[gs]etspecific is
implementable and thus it will not really do anything.
Can't the port just emit all DECL_THREAD_LOCAL_P variables
into .local instead of .global address space? Would one
need to convert those pointers to generic anyway?
I'm asking because e.g. libgomp uses __thread heavily and
it would be nice to be able to use that.
3) in assembly emitted by the nvptx port, I've noticed:
.visible .func (.param.u32 %out_retval)foo(.param.u64 %in_ar1, .param.u32 %in_ar2)
{
.reg.u64 %ar1;
.reg.u32 %ar2;
.reg.u32 %retval;
.reg.u64 %hr10;
.reg.u32 %r22;
.reg.u64 %r25;
is the missing \t before the %retval line intentional?
4) I had a brief look at what it would take to port libgomp to PTX,
which is needed for OpenMP offloading. OpenMP offloaded kernels
should start with 1 team and 1 thread in it, if we ignore
GOMP_teams for now, I think the major things are:
- right now libgomp is heavily pthread_* based, which is a no-go
for nvptx I assume, I think we'll need some ifdefs in the sources
- the main thing is that I believe we just have to replace
gomp_team_start for nvptx; seems there are
cudaLaunchDevice (and cudaGetParameterBuffer) functions one can use
to spawn selected kernel in selected number of threads (and teams),
from the docs it isn't exactly clear what the calling thread will do,
if it is suspended and the HW core given to it is reused by something
else (e.g. one of the newly spawned threads), then I think it should
be usable. Not sure what happens with .local memory of the parent
task, if the children all have different .local memory, then
perhaps one could just copy over what is needed from the
invoking to the first invoked thread at start. The question is
how to figure out what to pass to cudaLaunchDevice (e.g. how to
get a handle to the current stream), and how to query how many
teams and/or threads it is reasonable to ask for if the program
wants defaults (and how many teams/threads are hard limits beyond which
one can't go)
- is it worth reusing cudaLaunchDevice "threads", or are they cheap
enough to start that any "thread" pooling should be removed for nvptx?
- we'll need some synchronization primitives, I see atomic support is
there, we need mutexes and semaphores I think, is that implementable
using bar instruction?
- the library uses __attribute__((constructor)) in 3 places or so,
initialize_team is pthread-specific and can probably be ifdefed out,
we won't support dlclose in nvptx anyway, but at least we need some
way to initialize the nvptx libgomp; if the initialization is done
in global memory, would it persist in between different kernels,
so can the initialization as separate kernel be run once, something
else?
- is there any way to do any affinity management, or shall we just
ignore affinity strategies?
- the target/offloading stuff should be most likely stubbed in the
library for nvptx, target data/target regions inside of target
regions are undefined behavior in OpenMP, no need to bloat things
- any way how to query time?
Other thoughts?
Jakub
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: The nvptx port
2014-11-14 8:42 The nvptx port Jakub Jelinek
@ 2014-11-14 10:17 ` Jakub Jelinek
2014-11-14 11:31 ` Bernd Schmidt
2014-11-14 11:17 ` Bernd Schmidt
1 sibling, 1 reply; 19+ messages in thread
From: Jakub Jelinek @ 2014-11-14 10:17 UTC (permalink / raw)
To: Bernd Schmidt, Richard Henderson; +Cc: gcc-patches
On Fri, Nov 14, 2014 at 09:29:48AM +0100, Jakub Jelinek wrote:
> I have some questions about nvptx:
Oh, and
5) I have noticed gcc doesn't generate the .uni suffixes anywhere,
while llvm generates them; are those appropriate only when a function
is guaranteed to be run unconditionally from the toplevel kernel,
or even in spots in arbitrary functions which might not be run
unconditionally by all threads in thread block, but all threads
that encounter the particular function will run the specific spot
unconditionally? I mean, if we have arbitrary function:
void foo (void) { something; bar (); something; }
then the call is unconditional in there, but there is no guarantee
somebody will not do
void baz (int x) { if (x > 20) foo (); }
and run foo only in a subset of the threads.
Jakub
* Re: The nvptx port
2014-11-14 8:42 The nvptx port Jakub Jelinek
2014-11-14 10:17 ` Jakub Jelinek
@ 2014-11-14 11:17 ` Bernd Schmidt
2014-11-14 11:53 ` Jakub Jelinek
2014-11-14 16:48 ` Jeff Law
1 sibling, 2 replies; 19+ messages in thread
From: Bernd Schmidt @ 2014-11-14 11:17 UTC (permalink / raw)
To: Jakub Jelinek, Richard Henderson; +Cc: gcc-patches
Hi Jakub,
> I have some questions about nvptx:
> 1) you've said that alloca isn't supported, but it seems
> to be wired up and uses the %alloca documented in the PTX
> manual, what is the issue with that? %alloca not being actually
> implemented by the current PTX assembler or translator?
Yes, it's unimplemented. There's an internal declaration for it but that
seems to be as far as it goes, and that declaration is 32-bit only anyway.
> 2) what is the reason why TLS isn't supported by the port (well,
> __emutls is emitted, but I doubt pthread_[gs]etspecific is
> implementable and thus it will not really do anything).
> Can't the port just emit all DECL_THREAD_LOCAL_P variables
> into .local instead of .global address space?
.local is stack frame memory, not TLS. The ptx docs mention the use of
.local at file-scope as occurring only in "legacy" ptx code and I get
the impression it's discouraged.
(As an aside, there's a question of how to represent a different
concept, gang-local memory, in gcc. That would be .shared memory. We're
currently going with just using an internal attribute)
> 3) in assembly emitted by the nvptx port, I've noticed:
> .visible .func (.param.u32 %out_retval)foo(.param.u64 %in_ar1, .param.u32 %in_ar2)
> {
> .reg.u64 %ar1;
> .reg.u32 %ar2;
> .reg.u32 %retval;
> .reg.u64 %hr10;
> .reg.u32 %r22;
> .reg.u64 %r25;
> is the missing \t before the %retval line intentional?
No, I can fix that up.
> 4) I had a brief look at what it would take to port libgomp to PTX,
> which is needed for OpenMP offloading. OpenMP offloaded kernels
> should start with 1 team and 1 thread in it, if we ignore
> GOMP_teams for now, I think the major things are:
> - right now libgomp is heavily pthread_* based, which is a no-go
> for nvptx I assume, I think we'll need some ifdefs in the sources
I haven't looked into whether libpthread is doable. I suspect it's a
poor match. I also haven't really looked into OpenMP, so I'm feeling a
bit uncertain about answering your further questions.
> - the main thing is that I believe we just have to replace
> gomp_team_start for nvptx; seems there are
> cudaLaunchDevice (and cudaGetParameterBuffer) functions one can use
> to spawn selected kernel in selected number of threads (and teams),
> from the docs it isn't exactly clear what the calling thread will do,
> if it is suspended and the HW core given to it is reused by something
> else (e.g. one of the newly spawned threads), then I think it should
> be usable. Not sure what happens with .local memory of the parent
> task, if the children all have different .local memory, then
> perhaps one could just copy over what is needed from the
> invoking to the first invoked thread at start.
I'm a bit confused here, it sounds as if you want to call
cudaLaunchDevice from ptx code? These are called from the host. As
mentioned above, .local is probably not useful for what you want.
> - is it worth reusing cudaLaunchDevice "threads", or are they cheap
> enough to start that any "thread" pooling should be removed for nvptx?
Sorry, I don't understand the question.
> - we'll need some synchronization primitives, I see atomic support is
> there, we need mutexes and semaphores I think, is that implementable
> using bar instruction?
It's probably membar you need.
> - the library uses __attribute__((constructor)) in 3 places or so,
> initialize_team is pthread-specific and can probably be ifdefed out,
> we won't support dlclose in nvptx anyway, but at least we need some
> way to initialize the nvptx libgomp; if the initialization is done
> in global memory, would it persist in between different kernels,
> so can the initialization as separate kernel be run once, something
> else?
I think that it would persist, and this would be my scheme for
implementing constructors, but I haven't actually tried.
> - is there any way to do any affinity management, or shall we just
> ignore affinity strategies?
Not sure what they do in libgomp. It's probably not a match for GPU
architectures.
> - any way how to query time?
There are %clock and %clock64 cycle counters.
Bernd
* Re: The nvptx port
2014-11-14 10:17 ` Jakub Jelinek
@ 2014-11-14 11:31 ` Bernd Schmidt
0 siblings, 0 replies; 19+ messages in thread
From: Bernd Schmidt @ 2014-11-14 11:31 UTC (permalink / raw)
To: Jakub Jelinek, Richard Henderson; +Cc: gcc-patches
On 11/14/2014 11:01 AM, Jakub Jelinek wrote:
> On Fri, Nov 14, 2014 at 09:29:48AM +0100, Jakub Jelinek wrote:
>> I have some questions about nvptx:
>
> Oh, and
> 5) I have noticed gcc doesn't generate the .uni suffixes anywhere,
> while llvm generates them; are those appropriate only when a function
> is guaranteed to be run unconditionally from the toplevel kernel,
> or even in spots in arbitrary functions which might not be run
> unconditionally by all threads in thread block, but all threads
> that encounter the particular function will run the specific spot
> unconditionally? I mean, if we have arbitrary function:
> void foo (void) { something; bar (); something; }
> then the call is unconditional in there, but there is no guarantee
> somebody will not do
> void baz (int x) { if (x > 20) foo (); }
> and run foo only in a subset of the threads.
It's unclear to me what the .uni suffix even does on calls. Google finds
this:
http://divmap.wordpress.com/home/divopt/
which suggests that it says that the call's predicate will evaluate to
the same value on all threads. So I think for an unconditional call
instruction it's just meaningless.
Bernd
* Re: The nvptx port
2014-11-14 11:17 ` Bernd Schmidt
@ 2014-11-14 11:53 ` Jakub Jelinek
2014-11-14 12:22 ` Bernd Schmidt
` (2 more replies)
2014-11-14 16:48 ` Jeff Law
1 sibling, 3 replies; 19+ messages in thread
From: Jakub Jelinek @ 2014-11-14 11:53 UTC (permalink / raw)
To: Bernd Schmidt; +Cc: Richard Henderson, gcc-patches
On Fri, Nov 14, 2014 at 12:09:03PM +0100, Bernd Schmidt wrote:
> >I have some questions about nvptx:
> >1) you've said that alloca isn't supported, but it seems
> > to be wired up and uses the %alloca documented in the PTX
> > manual, what is the issue with that? %alloca not being actually
> > implemented by the current PTX assembler or translator?
>
> Yes, it's unimplemented. There's an internal declaration for it but that
> seems to be as far as it goes, and that declaration is 32-bit only anyway.
:(. Does NVidia plan to fix that in the next version?
> >2) what is the reason why TLS isn't supported by the port (well,
> > __emutls is emitted, but I doubt pthread_[gs]etspecific is
> > implementable and thus it will not really do anything).
> > Can't the port just emit all DECL_THREAD_LOCAL_P variables
> > into .local instead of .global address space?
>
> .local is stack frame memory, not TLS. The ptx docs mention the use of
> .local at file-scope as occurring only in "legacy" ptx code and I get the
> impression it's discouraged.
:(. So what other options does one have to implement something like TLS, even
using inline asm or similar? There is %tid, so perhaps indexing some array
with %tid? The trouble with that is that some thread can do
#pragma omp parallel again, and I bet the %tid afterwards would be
again 0-(n-1), and if it is an index into a global array, it wouldn't work
well then. Maybe without anything like TLS we can't really support nested
parallelism, only one level of #pragma omp parallel inside of nvptx regions.
But, if we add support for #pragma omp team, we'd either need the array
in gang-local memory, or some other special register to give us gang id.
BTW, one can still invoke OpenMP target regions (even OpenACC regions) from
multiple host threads, so the question is how without local TLS we can
actually do anything at all. Sure, we can pass parameters to the kernel,
but we'd need to propagate it through all functions. Or can
cudaGetParameterBuffer be used for that?
> >4) I had a brief look at what it would take to port libgomp to PTX,
> > which is needed for OpenMP offloading. OpenMP offloaded kernels
> > should start with 1 team and 1 thread in it, if we ignore
> > GOMP_teams for now, I think the major things are:
> > - right now libgomp is heavily pthread_* based, which is a no-go
> > for nvptx I assume, I think we'll need some ifdefs in the sources
>
> I haven't looked into whether libpthread is doable. I suspect it's a poor
> match. I also haven't really looked into OpenMP, so I'm feeling a bit
> uncertain about answering your further questions.
What OpenMP needs is essentially:
- some way to spawn multiple threads (fork-join model), where the parent
thread is the first one among those other threads, or, if that isn't
possible, the first spawned thread pretends to be the parent thread
and the parent thread sleeps
- something like pthread_mutex_lock/unlock (only basic; or say atomic ops + futex
we use for Linux)
- something like sem_* semaphore
- and some TLS or something similar (pthread_[gs]etspecific etc.)
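The mutex/semaphore pieces above could in principle be built from the atomic
support that is already there. Below is a rough, host-testable C11 sketch of a
busywaiting counting semaphore; all names (including gomp_sem_t) are
illustrative stand-ins, not the actual libgomp implementation, and on PTX the
atomics would map to atom.* instructions, with the warp-spinning caveats
discussed elsewhere in the thread.

```c
#include <stdatomic.h>

/* Busywaiting counting semaphore sketch built only from atomics,
   since PTX offers no way to put a thread to sleep.  Illustrative
   stand-in for the sem_* functionality libgomp needs.  */
typedef struct { atomic_int count; } gomp_sem_t;

static void gomp_sem_init (gomp_sem_t *s, int value)
{
  atomic_init (&s->count, value);
}

static void gomp_sem_post (gomp_sem_t *s)
{
  atomic_fetch_add_explicit (&s->count, 1, memory_order_release);
}

static void gomp_sem_wait (gomp_sem_t *s)
{
  for (;;)
    {
      int c = atomic_load_explicit (&s->count, memory_order_relaxed);
      if (c > 0
          && atomic_compare_exchange_weak_explicit (&s->count, &c, c - 1,
                                                    memory_order_acquire,
                                                    memory_order_relaxed))
        return;
      /* Busywait; on a GPU you would want only one thread per warp
         spinning here.  */
    }
}
```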
> > - the main thing is that I believe we just have to replace
> > gomp_team_start for nvptx; seems there are
> > cudaLaunchDevice (and cudaGetParameterBuffer) functions one can use
> > to spawn selected kernel in selected number of threads (and teams),
> > from the docs it isn't exactly clear what the calling thread will do,
> > if it is suspended and the HW core given to it is reused by something
> > else (e.g. one of the newly spawned threads), then I think it should
> > be usable. Not sure what happens with .local memory of the parent
> > task, if the children all have different .local memory, then
> > perhaps one could just copy over what is needed from the
> > invoking to the first invoked thread at start.
>
> I'm a bit confused here, it sounds as if you want to call cudaLaunchDevice
> from ptx code? These are called from the host. As mentioned above, .local is
> probably not useful for what you want.
In CUDA_Dynamic_Parallelism_Programming_Guide.pdf in C.3.2 it is mentioned
it should be possible, there is:
.extern .func(.param .b32 func_retval0) cudaLaunchDevice
(
.param .b64 func,
.param .b64 parameterBuffer,
.param .align 4 .b8 gridDimension[12],
.param .align 4 .b8 blockDimension[12],
.param .b32 sharedMemSize,
.param .b64 stream
)
;
(or s/.b64/.b32/ for -m32) that should be usable from within PTX.
The Liao-OpenMP-Accelerator-Model-2013.pdf paper also mentions using dynamic
parallelism (because all other variants are just bad for OpenMP, you'd need
to preallocate all the gangs/threads (without knowing how many you'll need),
and perhaps let them sleep on some barrier until you have work for them).
> > - is it worth reusing cudaLaunchDevice "threads", or are they cheap
> > enough to start that any "thread" pooling should be removed for nvptx?
>
> Sorry, I don't understand the question.
I meant what is the cost of cudaLaunchDevice from within PTX compared to
keeping the threads around sleeping on a barrier. As OpenMP doesn't support
threadprivate user vars in the offloaded regions, we don't have to preserve
any state and thus could always launch threads and tear them apart again.
> > - we'll need some synchronization primitives, I see atomic support is
> > there, we need mutexes and semaphores I think, is that implementable
> > using bar instruction?
>
> It's probably membar you need.
That is a memory barrier, I need threads to wait on each other, wake up one
another etc.
> > - is there any way to do any affinity management, or shall we just
> > ignore affinity strategies?
>
> Not sure what they do in libgomp. It's probably not a match for GPU
> architectures.
Ok.
> > - any way how to query time?
>
> There are %clock and %clock64 cycle counters.
Thanks.
Jakub
* Re: The nvptx port
2014-11-14 11:53 ` Jakub Jelinek
@ 2014-11-14 12:22 ` Bernd Schmidt
2014-11-14 13:32 ` Jakub Jelinek
2014-11-14 15:46 ` Cesar Philippidis
2014-11-14 17:00 ` Jeff Law
2014-11-14 17:02 ` Jeff Law
2 siblings, 2 replies; 19+ messages in thread
From: Bernd Schmidt @ 2014-11-14 12:22 UTC (permalink / raw)
To: Jakub Jelinek
Cc: Richard Henderson, gcc-patches, Thomas Schwinge, Cesar Philippidis
I'm adding Thomas and Cesar to the Cc list, they may have more insight
into CUDA library questions as I haven't really looked into that part
all that much.
On 11/14/2014 12:39 PM, Jakub Jelinek wrote:
> On Fri, Nov 14, 2014 at 12:09:03PM +0100, Bernd Schmidt wrote:
>>> I have some questions about nvptx:
>>> 1) you've said that alloca isn't supported, but it seems
>>> to be wired up and uses the %alloca documented in the PTX
>>> manual, what is the issue with that? %alloca not being actually
>>> implemented by the current PTX assembler or translator?
>>
>> Yes, it's unimplemented. There's an internal declaration for it but that
>> seems to be as far as it goes, and that declaration is 32-bit only anyway.
>
> :(. Does NVidia plan to fix that in next version?
I very much doubt it. It was like this in CUDA 5.0 when we started
working on it, and it's still like this in CUDA 6.5.
>>> 2) what is the reason why TLS isn't supported by the port (well,
>>> __emutls is emitted, but I doubt pthread_[gs]etspecific is
>>> implementable and thus it will not really do anything).
>>> Can't the port just emit all DECL_THREAD_LOCAL_P variables
>>> into .local instead of .global address space?
>>
>> .local is stack frame memory, not TLS. The ptx docs mention the use of
>> .local at file-scope as occurring only in "legacy" ptx code and I get the
>> impression it's discouraged.
>
> :(. So what other option one has to implement something like TLS, even
> using inline asm or similar? There is %tid, so perhaps indexing some array
> with %tid?
That ought to work. For performance you'd want that array in .shared
memory but I believe that's limited in size.
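A minimal host-testable sketch of the %tid-indexed scheme being discussed;
current_tid stands in for reading %tid.x, and MAX_THREADS and the slot layout
are assumptions for illustration, not an existing ABI:

```c
/* Emulate TLS by indexing a global array with the hardware thread id
   (%tid.x on PTX).  On the device the array would ideally live in
   .shared memory, which is size-limited.  */
#define MAX_THREADS 128

struct tls_slot {
  void *gomp_thread;   /* e.g. a per-thread libgomp state pointer */
};

static struct tls_slot tls_array[MAX_THREADS];

/* Stand-in for reading %tid.x; on the device this would be inline asm
   or a builtin.  */
static unsigned current_tid;

static struct tls_slot *tls_self (void)
{
  return &tls_array[current_tid];
}
```

As noted above, this breaks down for nested parallel regions: a fresh
0-(n-1) %tid range in an inner region would alias the outer threads' slots.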
> BTW, one can still invoke OpenMP target regions (even OpenACC regions) from
> multiple host threads, so the question is how without local TLS we can
> actually do anything at all. Sure, we can pass parameters to the kernel,
> but we'd need to propagate it through all functions. Or can
> cudaGetParameterBuffer be used for that?
Presumably a kernel could copy its arguments out to memory somewhere
when it's called?
>>> 4) I had a brief look at what it would take to port libgomp to PTX,
>>> which is needed for OpenMP offloading. OpenMP offloaded kernels
>>> should start with 1 team and 1 thread in it, if we ignore
>>> GOMP_teams for now, I think the major things are:
>>> - right now libgomp is heavily pthread_* based, which is a no-go
>>> for nvptx I assume, I think we'll need some ifdefs in the sources
>>
>> I haven't looked into whether libpthread is doable. I suspect it's a poor
>> match. I also haven't really looked into OpenMP, so I'm feeling a bit
>> uncertain about answering your further questions.
>
> What OpenMP needs is essentially:
> - some way to spawn multiple threads (fork-join model), where the parent
> thread is the first one among those other threads, or, if that isn't
> possible, the first spawned thread pretends to be the parent thread
> and the parent thread sleeps
> - something like pthread_mutex_lock/unlock (only basic; or say atomic ops + futex
> we use for Linux)
> - something like sem_* semaphore
> - and some TLS or something similar (pthread_[gs]etspecific etc.)
>
>>> - the main thing is that I believe we just have to replace
>>> gomp_team_start for nvptx; seems there are
>>> cudaLaunchDevice (and cudaGetParameterBuffer) functions one can use
>>> to spawn selected kernel in selected number of threads (and teams),
>>> from the docs it isn't exactly clear what the calling thread will do,
>>> if it is suspended and the HW core given to it is reused by something
>>> else (e.g. one of the newly spawned threads), then I think it should
>>> be usable. Not sure what happens with .local memory of the parent
>>> task, if the children all have different .local memory, then
>>> perhaps one could just copy over what is needed from the
>>> invoking to the first invoked thread at start.
>>
>> I'm a bit confused here, it sounds as if you want to call cudaLaunchDevice
>> from ptx code? These are called from the host. As mentioned above, .local is
>> probably not useful for what you want.
>
> In CUDA_Dynamic_Parallelism_Programming_Guide.pdf in C.3.2 it is mentioned
> it should be possible, there is:
> .extern .func(.param .b32 func_retval0) cudaLaunchDevice
> (
> .param .b64 func,
> .param .b64 parameterBuffer,
> .param .align 4 .b8 gridDimension[12],
> .param .align 4 .b8 blockDimension[12],
> .param .b32 sharedMemSize,
> .param .b64 stream
> )
> ;
> (or s/.b64/.b32/ for -m32) that should be usable from within PTX.
> The Liao-OpenMP-Accelerator-Model-2013.pdf paper also mentions using dynamic
> parallelism (because all other variants are just bad for OpenMP, you'd need
> to preallocate all the gangs/threads (without knowing how many you'll need),
> and perhaps let them sleep on some barrier until you have work for them).
The latter would have been essentially the model I'd have tried to use
(instead of sleeping, conditionalize on %tid==0). I didn't know there
was a way to launch kernels from ptx code and haven't thought about what
this implies.
>>> - is it worth reusing cudaLaunchDevice "threads", or are they cheap
>>> enough to start that any "thread" pooling should be removed for nvptx?
>>
>> Sorry, I don't understand the question.
>
> I meant what is the cost of cudaLaunchDevice from within PTX compared to
> keeping the threads around sleeping on a barrier. As OpenMP doesn't support
> threadprivate user vars in the offloaded regions, we don't have to preserve
> any state and thus could always launch threads and tear them apart again.
No idea.
>>> - we'll need some synchronization primitives, I see atomic support is
>>> there, we need mutexes and semaphores I think, is that implementable
>>> using bar instruction?
>>
>> It's probably membar you need.
>
> That is a memory barrier, I need threads to wait on each other, wake up one
> another etc.
Hmm. It's worthwhile to keep in mind that GPU threads really behave
somewhat differently from CPUs (they don't really execute
independently); the OMP model may just be a poor match for the
architecture in general.
One could busywait on a spinlock, but AFAIK there isn't really a way to
put a thread to sleep. By not executing independently, I mean this: I
believe if one thread in a warp is waiting on the spinlock, all the
other ones are also busywaiting. There may be other effects that seem
odd if one approaches it from a CPU perspective - for example you
probably want only one thread in a warp to try to take the spinlock.
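A host-testable C11 sketch of the busywaiting spinlock described above; the
warp-leader restriction is expressed only as a comment, and on PTX the
exchange would be an atom.exch:

```c
#include <stdatomic.h>

/* Minimal spinlock using C11 atomics, standing in for PTX atom.exch.
   Illustrative only: on a GPU you would want a single designated
   thread per warp to spin, since all threads in a warp execute in
   lockstep and would otherwise busywait together.  */

static atomic_int lock_word;   /* 0 = free, 1 = held */

static void spin_lock (atomic_int *l)
{
  /* Acquire: loop until we swap in 1 and observe the old value 0.  */
  while (atomic_exchange_explicit (l, 1, memory_order_acquire) != 0)
    ;  /* busywait; PTX has no way to put a thread to sleep */
}

static void spin_unlock (atomic_int *l)
{
  atomic_store_explicit (l, 0, memory_order_release);
}
```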
Bernd
* Re: The nvptx port
2014-11-14 12:22 ` Bernd Schmidt
@ 2014-11-14 13:32 ` Jakub Jelinek
2014-11-14 13:45 ` Bernd Schmidt
2014-11-14 17:09 ` Jeff Law
2014-11-14 15:46 ` Cesar Philippidis
1 sibling, 2 replies; 19+ messages in thread
From: Jakub Jelinek @ 2014-11-14 13:32 UTC (permalink / raw)
To: Bernd Schmidt
Cc: Richard Henderson, gcc-patches, Thomas Schwinge, Cesar Philippidis
On Fri, Nov 14, 2014 at 01:12:40PM +0100, Bernd Schmidt wrote:
> >:(. So what other options does one have to implement something like TLS, even
> >using inline asm or similar? There is %tid, so perhaps indexing some array
> >with %tid?
>
> That ought to work. For performance you'd want that array in .shared memory
> but I believe that's limited in size.
Any way to query those limits? Size of .shared memory, number of threads in
warp, number of warps, etc.? In OpenACC, are all workers in a single gang
the same warp?
> >BTW, one can still invoke OpenMP target regions (even OpenACC regions) from
> >multiple host threads, so the question is how without local TLS we can
> >actually do anything at all. Sure, we can pass parameters to the kernel,
> >but we'd need to propagate it through all functions. Or can
> >cudaGetParameterBuffer be used for that?
>
> Presumably a kernel could copy its arguments out to memory somewhere when
> it's called?
The question is where. If it is global memory, then how would you find out
what value is for your team and what value is for some other team?
> >>> - we'll need some synchronization primitives, I see atomic support is
> >>> there, we need mutexes and semaphores I think, is that implementable
> >>> using bar instruction?
> >>
> >>It's probably membar you need.
> >
> >That is a memory barrier, I need threads to wait on each other, wake up one
> >another etc.
>
> Hmm. It's worthwhile to keep in mind that GPU threads really behave somewhat
> differently from CPUs (they don't really execute independently); the OMP
> model may just be a poor match for the architecture in general.
> One could busywait on a spinlock, but AFAIK there isn't really a way to put
> a thread to sleep. By not executing independently, I mean this: I believe if
> one thread in a warp is waiting on the spinlock, all the other ones are also
> busywaiting. There may be other effects that seem odd if one approaches it
> from a CPU perspective - for example you probably want only one thread in a
> warp to try to take the spinlock.
So, for a warp, if some threads perform one branch of an if and other
threads another one, all threads perform the first one first (with some
maybe not doing anything), then all threads perform the other branch
(again, with the other threads not doing anything)?
As for the match, OpenMP isn't written for a particular accelerator, though
supposedly the addition of #pragma omp teams construct was done for NVidia.
So, some OpenMP code may be efficient on PTX, while other code might be
much less so (e.g. if all threads in a warp need to execute the same thing,
supposedly #pragma omp task isn't a very good idea for these devices).
Jakub
* Re: The nvptx port
2014-11-14 13:32 ` Jakub Jelinek
@ 2014-11-14 13:45 ` Bernd Schmidt
2014-11-14 17:09 ` Jeff Law
1 sibling, 0 replies; 19+ messages in thread
From: Bernd Schmidt @ 2014-11-14 13:45 UTC (permalink / raw)
To: Jakub Jelinek
Cc: Richard Henderson, gcc-patches, Thomas Schwinge, Cesar Philippidis
On 11/14/2014 01:36 PM, Jakub Jelinek wrote:
> Any way to query those limits? Size of .shared memory, number of threads in
> warp, number of warps, etc.?
I'd have to google most of that. There seems to be a WARP_SZ constant
available in ptx to get the size of the warp.
> In OpenACC, are all workers in a single gang
> the same warp?
No, warps are a relatively small size (32 threads).
> So, for a warp, if some threads perform one branch of an if and other
> threads another one, all threads perform the first one first (with some
> maybe not doing anything), then all threads perform the other branch
> (again, with the other threads not doing anything)?
I believe that's what happens.
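The lockstep behaviour described above can be modelled on the host as two
predicated passes over the warp. This is a simulation of the execution model,
not real device code; WARP_SIZE 32 matches the WARP_SZ constant mentioned
earlier:

```c
/* Host-side simulation of how a warp handles a divergent branch: the
   hardware runs each side in turn under a predicate mask, so every
   thread "visits" both sides but only the active ones write results.  */
#define WARP_SIZE 32

static void warp_if (const int *cond, int *out,
                     int (*then_fn) (int), int (*else_fn) (int))
{
  /* First pass: threads whose predicate is true run the then-side.  */
  for (int t = 0; t < WARP_SIZE; t++)
    if (cond[t])
      out[t] = then_fn (t);
  /* Second pass: the remaining threads run the else-side.  */
  for (int t = 0; t < WARP_SIZE; t++)
    if (!cond[t])
      out[t] = else_fn (t);
}
```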
Bernd
* Re: The nvptx port
2014-11-14 12:22 ` Bernd Schmidt
2014-11-14 13:32 ` Jakub Jelinek
@ 2014-11-14 15:46 ` Cesar Philippidis
2014-11-14 16:33 ` Jakub Jelinek
1 sibling, 1 reply; 19+ messages in thread
From: Cesar Philippidis @ 2014-11-14 15:46 UTC (permalink / raw)
To: Bernd Schmidt, Jakub Jelinek
Cc: Richard Henderson, gcc-patches, Thomas Schwinge
On 11/14/2014 04:12 AM, Bernd Schmidt wrote:
>>>> - we'll need some synchronization primitives, I see atomic
>>>> support is
>>>> there, we need mutexes and semaphores I think, is that
>>>> implementable
>>>> using bar instruction?
>>>
>>> It's probably membar you need.
>>
>> That is a memory barrier, I need threads to wait on each other, wake
>> up one
>> another etc.
>
> Hmm. It's worthwhile to keep in mind that GPU threads really behave
> somewhat differently from CPUs (they don't really execute
> independently); the OMP model may just be a poor match for the
> architecture in general.
> One could busywait on a spinlock, but AFAIK there isn't really a way to
> put a thread to sleep. By not executing independently, I mean this: I
> believe if one thread in a warp is waiting on the spinlock, all the
> other ones are also busywaiting. There may be other effects that seem
> odd if one approaches it from a CPU perspective - for example you
> probably want only one thread in a warp to try to take the spinlock.
Thread synchronization in CUDA is different from conventional CPUs.
Using the gang/thread terminology, there's no way to synchronize two
threads in two different gangs in PTX without invoking separate kernels.
Basically, after a kernel is invoked, the host/accelerator (the latter
using dynamic parallelism) waits for the kernel to finish, and that
effectively creates a barrier.
PTX does have an intra-gang synchronization primitive, which is helpful
if the control flow diverges within a gang. Also, unless I'm mistaken,
the PTX atomic operations only work within a gang.
Also, keep in mind that PTX doesn't have a global TID. The user needs to
calculate it using ctaid/tid and friends.
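The calculation Cesar refers to is, for the 1-D case, roughly the following;
the struct fields are host-side stand-ins for the %tid, %ntid and %ctaid
special registers:

```c
/* Sketch of the usual global-thread-id calculation (1-D case only).
   The fields mirror the PTX special registers; the names here are
   illustrative.  */
struct ptx_regs {
  unsigned tid_x;     /* %tid.x:   thread index within the CTA */
  unsigned ntid_x;    /* %ntid.x:  number of threads per CTA   */
  unsigned ctaid_x;   /* %ctaid.x: CTA index within the grid   */
};

static unsigned global_tid (const struct ptx_regs *r)
{
  /* global id = CTA index * CTA size + local thread index */
  return r->ctaid_x * r->ntid_x + r->tid_x;
}
```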
Cesar
* Re: The nvptx port
2014-11-14 15:46 ` Cesar Philippidis
@ 2014-11-14 16:33 ` Jakub Jelinek
2014-11-14 16:44 ` Cesar Philippidis
0 siblings, 1 reply; 19+ messages in thread
From: Jakub Jelinek @ 2014-11-14 16:33 UTC (permalink / raw)
To: Cesar Philippidis
Cc: Bernd Schmidt, Richard Henderson, gcc-patches, Thomas Schwinge
On Fri, Nov 14, 2014 at 07:37:49AM -0800, Cesar Philippidis wrote:
> > Hmm. It's worthwhile to keep in mind that GPU threads really behave
> > somewhat differently from CPUs (they don't really execute
> > independently); the OMP model may just be a poor match for the
> > architecture in general.
> > One could busywait on a spinlock, but AFAIK there isn't really a way to
> > put a thread to sleep. By not executing independently, I mean this: I
> > believe if one thread in a warp is waiting on the spinlock, all the
> > other ones are also busywaiting. There may be other effects that seem
> > odd if one approaches it from a CPU perspective - for example you
> > probably want only one thread in a warp to try to take the spinlock.
>
> Thread synchronization in CUDA is different from conventional CPUs.
> Using the gang/thread terminology, there's no way to synchronize two
> threads in two different gangs in PTX without invoking separate kernels.
> Basically, after a kernel is invoked, the host/accelerator (the latter
> using dynamic parallelism) waits for the kernel to finish, and that
> effectively creates a barrier.
I believe in OpenMP terminology a gang is a team, and inter-team barriers
are not supposed to work etc. (though, I think locks and atomic instructions
still are, so is critical region, so I really hope atomics are atomic even
inter-gang). So for synchronization (mutexes and semaphores, from which
barriers are implemented; but perhaps could also use bar.arrive and bar.sync)
we mainly need synchronization within the gang.
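An intra-gang mutex along those lines might look as follows in PTX; this is a sketch under stated assumptions (the lock word living in .shared memory, invented names, and initialization of the lock word to 0 omitted), with only thread 0 spinning so that a whole warp does not busywait on the CAS, as Bernd cautioned:

```ptx
// Hypothetical sketch: CTA-local spinlock in .shared memory.
// (Initialization of lockvar to 0, e.g. by thread 0 followed by a
// bar.sync, is omitted here.)
.shared .align 4 .b32 lockvar;
.reg .u32  %t, %old;
.reg .pred %p0, %p1;
mov.u32      %t, %tid.x;
setp.ne.u32  %p0, %t, 0;       // only thread 0 tries to take the lock
@%p0 bra     HAVE_LOCK;
SPIN:
atom.shared.cas.b32 %old, [lockvar], 0, 1;  // 0 -> 1 means acquired
setp.ne.u32  %p1, %old, 0;
@%p1 bra     SPIN;             // lost the race, retry
HAVE_LOCK:
bar.sync 0;                    // release the rest of the CTA
```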
> Also, keep in mind that PTX doesn't have a global TID. The user needs to
> calculate it using ctaid/tid and friends.
Ok. Is %gridid needed for that combo too?
Jakub
* Re: The nvptx port
2014-11-14 16:33 ` Jakub Jelinek
@ 2014-11-14 16:44 ` Cesar Philippidis
2014-11-14 17:04 ` Jakub Jelinek
0 siblings, 1 reply; 19+ messages in thread
From: Cesar Philippidis @ 2014-11-14 16:44 UTC (permalink / raw)
To: Jakub Jelinek
Cc: Bernd Schmidt, Richard Henderson, gcc-patches, Thomas Schwinge
On 11/14/2014 08:18 AM, Jakub Jelinek wrote:
>> Also, keep in mind that PTX doesn't have a global TID. The user needs to
>> calculate it using ctaid/tid and friends.
>
> Ok. Is %gridid needed for that combo too?
Eventually, probably. Currently, we're launching all of our kernels with
cuLaunchKernel, and that function doesn't take grids into account.
Nvidia's documentation is kind of confusing. They use different
terminology for their high-level CUDA stuff and the low-level PTX. E.g.,
what CUDA refers to as blocks/warps, PTX calls CTAs. I'm not sure what
grids correspond to, but I think they might be devices. If that's the
case, the runtime does have the capability to select which device to run
a kernel on. But, it can't run a single kernel on multiple devices
unless you use asynchronous kernel invocations.
Cesar
* Re: The nvptx port
2014-11-14 11:17 ` Bernd Schmidt
2014-11-14 11:53 ` Jakub Jelinek
@ 2014-11-14 16:48 ` Jeff Law
2014-11-17 22:49 ` Nathan Sidwell
1 sibling, 1 reply; 19+ messages in thread
From: Jeff Law @ 2014-11-14 16:48 UTC (permalink / raw)
To: Bernd Schmidt, Jakub Jelinek, Richard Henderson; +Cc: gcc-patches
On 11/14/14 04:09, Bernd Schmidt wrote:
> Hi Jakub,
>
>> I have some questions about nvptx:
>> 1) you've said that alloca isn't supported, but it seems
>> to be wired up and uses the %alloca documented in the PTX
>> manual, what is the issue with that? %alloca not being actually
>> implemented by the current PTX assembler or translator?
>
> Yes, it's unimplemented. There's an internal declaration for it but that
> seems to be as far as it goes, and that declaration is 32-bit only anyway.
Right. My recollection is it's defined in the vISA, but unimplemented.
Jeff
* Re: The nvptx port
2014-11-14 11:53 ` Jakub Jelinek
2014-11-14 12:22 ` Bernd Schmidt
@ 2014-11-14 17:00 ` Jeff Law
2014-11-14 17:02 ` Jeff Law
2 siblings, 0 replies; 19+ messages in thread
From: Jeff Law @ 2014-11-14 17:00 UTC (permalink / raw)
To: Jakub Jelinek, Bernd Schmidt; +Cc: Richard Henderson, gcc-patches
On 11/14/14 04:39, Jakub Jelinek wrote:
> On Fri, Nov 14, 2014 at 12:09:03PM +0100, Bernd Schmidt wrote:
>>> I have some questions about nvptx:
>>> 1) you've said that alloca isn't supported, but it seems
>>> to be wired up and uses the %alloca documented in the PTX
>>> manual, what is the issue with that? %alloca not being actually
>>> implemented by the current PTX assembler or translator?
>>
>> Yes, it's unimplemented. There's an internal declaration for it but that
>> seems to be as far as it goes, and that declaration is 32-bit only anyway.
>
> :(. Does NVidia plan to fix that in the next version?
They haven't indicated any such plans to me directly. However, there's
a clear direction to support arbitrary C/C++ over time.
jeff
* Re: The nvptx port
2014-11-14 11:53 ` Jakub Jelinek
2014-11-14 12:22 ` Bernd Schmidt
2014-11-14 17:00 ` Jeff Law
@ 2014-11-14 17:02 ` Jeff Law
2 siblings, 0 replies; 19+ messages in thread
From: Jeff Law @ 2014-11-14 17:02 UTC (permalink / raw)
To: Jakub Jelinek, Bernd Schmidt; +Cc: Richard Henderson, gcc-patches
On 11/14/14 04:39, Jakub Jelinek wrote:
> :(. So what other options does one have to implement something like TLS,
> even using inline asm or similar? There is %tid, so perhaps indexing some array
> with %tid? The trouble with that is that some thread can do
> #pragma omp parallel again, and I bet the %tid afterwards would be
> again 0-(n-1), and if it is an index into a global array, it wouldn't work
> well then. Maybe without anything like TLS we can't really support nested
> parallelism, only one level of #pragma omp parallel inside of nvptx regions.
> But, if we add support for #pragma omp team, we'd either need the array
> in gang-local memory, or some other special register to give us gang id.
Does the interface to the hardware even allow a model where we can
launch another offload task while one is in progress?
Jeff
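The %tid-indexed array emulation Jakub describes above could be sketched like this in PTX; the array name and size are purely hypothetical, and as he notes the scheme breaks down once nested parallel regions reuse thread ids:

```ptx
// Hypothetical emulated-TLS sketch: one 8-byte slot per thread in a
// global array, indexed by a flat thread id.
.global .align 8 .u64 __tls_slots[1024];   // size is an assumption
.reg .u32 %t, %cta, %n, %gtid;
.reg .u64 %off, %base, %addr, %val;
mov.u32       %t,    %tid.x;
mov.u32       %cta,  %ctaid.x;
mov.u32       %n,    %ntid.x;
mad.lo.u32    %gtid, %cta, %n, %t;     // flat thread id
mul.wide.u32  %off,  %gtid, 8;         // byte offset of this thread's slot
mov.u64       %base, __tls_slots;      // address of the array
add.u64       %addr, %base, %off;
ld.global.u64 %val,  [%addr];          // read this thread's TLS slot
```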
* Re: The nvptx port
2014-11-14 16:44 ` Cesar Philippidis
@ 2014-11-14 17:04 ` Jakub Jelinek
0 siblings, 0 replies; 19+ messages in thread
From: Jakub Jelinek @ 2014-11-14 17:04 UTC (permalink / raw)
To: Cesar Philippidis
Cc: Bernd Schmidt, Richard Henderson, gcc-patches, Thomas Schwinge
On Fri, Nov 14, 2014 at 08:37:52AM -0800, Cesar Philippidis wrote:
> On 11/14/2014 08:18 AM, Jakub Jelinek wrote:
>
> >> Also, keep in mind that PTX doesn't have a global TID. The user needs to
> >> calculate it using ctaid/tid and friends.
> >
> > Ok. Is %gridid needed for that combo too?
>
> Eventually, probably. Currently, we're launching all of our kernels with
> cuLaunchKernel, and that function doesn't take grids into account.
I wonder if cudaLaunchDevice called from PTX will result in a different
%gridid or not; I will see next week if I manage to get the HW and SW stack.
Jakub
* Re: The nvptx port
2014-11-14 13:32 ` Jakub Jelinek
2014-11-14 13:45 ` Bernd Schmidt
@ 2014-11-14 17:09 ` Jeff Law
2014-11-17 22:32 ` Nathan Sidwell
1 sibling, 1 reply; 19+ messages in thread
From: Jeff Law @ 2014-11-14 17:09 UTC (permalink / raw)
To: Jakub Jelinek, Bernd Schmidt
Cc: Richard Henderson, gcc-patches, Thomas Schwinge, Cesar Philippidis
On 11/14/14 05:36, Jakub Jelinek wrote:
>
> So, for a warp, if some threads perform one branch of an if and other
> threads another one, all threads perform the first one first (with some
> maybe not doing anything), then all the threads perform the other one
> (again, with the other threads not doing anything)?
Nobody has ever specified to me exactly what happens in this case, but I
gathered from reading the docs that once you have some threads in one
path and others in a different path, things slow down to a horrid crawl.
So you try to avoid that :-)
jeff
* Re: The nvptx port
2014-11-14 17:09 ` Jeff Law
@ 2014-11-17 22:32 ` Nathan Sidwell
0 siblings, 0 replies; 19+ messages in thread
From: Nathan Sidwell @ 2014-11-17 22:32 UTC (permalink / raw)
To: Jeff Law, Jakub Jelinek, Bernd Schmidt
Cc: Richard Henderson, gcc-patches, Thomas Schwinge, Cesar Philippidis
On 11/14/14 11:04, Jeff Law wrote:
> On 11/14/14 05:36, Jakub Jelinek wrote:
>>
>> So, for a warp, if some threads perform one branch of an if and other
>> threads another one, all threads perform the first one first (with some
>> maybe not doing anything), then all the threads perform the other one
>> (again, with the other threads not doing anything)?
> Nobody has ever specified to me exactly what happens in this case, but I
> gathered from reading the docs that once you have some threads in one path
> and others in a different path, things slow down to a horrid crawl. So you
> try to avoid that :-)
This is correct. Don't do that.
* Re: The nvptx port
2014-11-14 16:48 ` Jeff Law
@ 2014-11-17 22:49 ` Nathan Sidwell
0 siblings, 0 replies; 19+ messages in thread
From: Nathan Sidwell @ 2014-11-17 22:49 UTC (permalink / raw)
To: Jeff Law, Bernd Schmidt, Jakub Jelinek, Richard Henderson; +Cc: gcc-patches
On 11/14/14 10:43, Jeff Law wrote:
> On 11/14/14 04:09, Bernd Schmidt wrote:
>> Hi Jakub,
>>
>>> I have some questions about nvptx:
>>> 1) you've said that alloca isn't supported, but it seems
>> Yes, it's unimplemented. There's an internal declaration for it but that
>> seems to be as far as it goes, and that declaration is 32-bit only anyway.
> Right. My recollection is it's defined in the vISA, but unimplemented.
Yup, all the PTX docs I've seen (up to 3.2) say:
'Note: The current version of PTX does not support alloca.'
and as Bernd says, the associated text only talks about a declaration for
32-bit land.
nathan
* the nvptx port
@ 2014-11-07 8:30 VandeVondele Joost
0 siblings, 0 replies; 19+ messages in thread
From: VandeVondele Joost @ 2014-11-07 8:30 UTC (permalink / raw)
To: bernds; +Cc: gcc-patches
Hi Bernd,
Reading the patches, it seems there is no mention of sm_35, only sm_30. So I'm wondering which 'sub'targets will initially be supported, and if/how/when various processors will be selected.
Thanks,
Joost