[OpenMP/offloading][RFC] How to handle target/device-specifics with C pre-processor (in general, inside 'omp declare variant')

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [OpenMP/offloading][RFC] How to handle target/device-specifics with C pre-processor (in general, inside 'omp declare variant')
@ 2023-08-21 16:23 Tobias Burnus
  2023-08-22  7:25 ` Richard Biener
  0 siblings, 1 reply; 4+ messages in thread
From: Tobias Burnus @ 2023-08-21 16:23 UTC (permalink / raw)
  To: gcc-patches, Jakub Jelinek, Thomas Schwinge, Joseph Myers

RFC – and idea how to handle this best in GCC? See the two examples
below for what we would like to support.

* * *

In GCC, we handle OpenMP (and OpenACC) by parsing the input file once,
produce an internal representation (in LTO format) for offloading code
and only at link time process it by passing it via the LTO wrapper to
the offloading-device compilers (mkoffload / device lto1).
See https://gcc.gnu.org/wiki/Offloading

This works okayish - even though it causes some issues like with
metadirectives (they are implemened on the OG13 branch, however).
And with declare variant or a nohost version, where getting rid of
the host version is not that easy as it has to be in there until
omp-offload.cc's functions are run, which comes rather late.

There are currently already some issues like with -ffast-math
and GLIBC's finite math functions, which are not be available
on the device side when using newlib's libm..
(However, GLIBC has removed those.)

Likewise, it would be nice to do like Clang+LLVM does: Auto-enable
some device-specific math functions. (Albeit that won't work well
with Fortran.)

However, with OpenMP 5.1, there is a real issue. In 5.1, Appendix B
it reads as:
"For C/C++, the declare variant directive was extended to support elision
of preprocessed code and to allow enclosed function definitions to be
interpreted as variant functions (see Section 7.5)."

The problem is the "elision of preprocessed" as it permits code like the
following:

|#ifdef _OPENMP #pragma omp begin declare variant
match(device={arch=NVPTX}) #include "cuda/math.h" #pragma omp begin
declare variant match(device={isa=sm70}) #include "cuda/sm70/math.h"
#pragma omp end declare variant #pragma omp end declare variant #pragma
omp begin declare variant match(arch=AMD) #include "amdgpu/math.h"
#pragma omp end declare variant #endif|

And such code needs to keep working if there is a '#define ABC ...' in
one file and an '#ifndef ABC / #define ABC ...' in the other file.

Additionally, it would be neat if it would handle target-specific defines
like '#if __PTX_SM__ == 350' for the relevant parts (here: arch=nvptx).
(We already do support context selectors via the gcc/config/*/t-omp-device files;
see also https://gcc.gnu.org/onlinedocs/libgomp/OpenMP-Context-Selectors.html )

Thoughts?

* * *

The question is also what to support – "just" function declarations which are specific
to a device or some generic replacement of the kind:

|#pragma omp begin declare variant match(device={arch=NVPTX})|
   #define NUM_THREADS 128
#pragma omp end declare variant
|#pragma omp begin declare variant match(device={arch=AMDGCN})|
   #define NUM_THREADS 64
#pragma omp end declare variant

#ifndef NUM_THREADSß
   #define NUM_THREADS 16
#endif

...
printf ("Running with %d threads\n", NUM_THREADS);
#pragma omp parallel for num_threads(NUM_THREADS)

* * *

If we only handle 'begin/end declare variant', the following
works in principle:
- Parse the file once with only host-code parsing but
- keep track of delimited '|omp begin declare variant|'
   where the context selector matches one of the supported
   offload targets.
- parse the file n-times again but this time set the
   target-#defined (extended version of gcc/config/*/t-omp-device
   to make them available?)
- When doing so, ignore all non-offloading bits (issue: implicit
   'declare target' + have the data available for variant resolution).
- Store this in some way.

But it is not really clear to me how to do this in actual code.

Any suggestion?

Tobias

PS: I would like to have some input before the Cauldron, but we might want
to additionally discuss this in detail during the cauldron, possibly some
brainstorming before the BoF and then surely also in the BoF.

-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [OpenMP/offloading][RFC] How to handle target/device-specifics with C pre-processor (in general, inside 'omp declare variant')
  2023-08-21 16:23 [OpenMP/offloading][RFC] How to handle target/device-specifics with C pre-processor (in general, inside 'omp declare variant') Tobias Burnus
@ 2023-08-22  7:25 ` Richard Biener
  2023-08-22  8:43   ` Tobias Burnus
  0 siblings, 1 reply; 4+ messages in thread
From: Richard Biener @ 2023-08-22  7:25 UTC (permalink / raw)
  To: Tobias Burnus; +Cc: gcc-patches, Jakub Jelinek, Thomas Schwinge, Joseph Myers

On Mon, Aug 21, 2023 at 6:23 PM Tobias Burnus <tobias@codesourcery.com> wrote:
>
> RFC – and idea how to handle this best in GCC? See the two examples
> below for what we would like to support.
>
> * * *
>
> In GCC, we handle OpenMP (and OpenACC) by parsing the input file once,
> produce an internal representation (in LTO format) for offloading code
> and only at link time process it by passing it via the LTO wrapper to
> the offloading-device compilers (mkoffload / device lto1).
> See https://gcc.gnu.org/wiki/Offloading
>
> This works okayish - even though it causes some issues like with
> metadirectives (they are implemened on the OG13 branch, however).
> And with declare variant or a nohost version, where getting rid of
> the host version is not that easy as it has to be in there until
> omp-offload.cc's functions are run, which comes rather late.
>
> There are currently already some issues like with -ffast-math
> and GLIBC's finite math functions, which are not be available
> on the device side when using newlib's libm..
> (However, GLIBC has removed those.)
>
> Likewise, it would be nice to do like Clang+LLVM does: Auto-enable
> some device-specific math functions. (Albeit that won't work well
> with Fortran.)
>
>
> However, with OpenMP 5.1, there is a real issue. In 5.1, Appendix B
> it reads as:
> "For C/C++, the declare variant directive was extended to support elision
> of preprocessed code and to allow enclosed function definitions to be
> interpreted as variant functions (see Section 7.5)."
>
> The problem is the "elision of preprocessed" as it permits code like the
> following:
>
> |#ifdef _OPENMP #pragma omp begin declare variant
> match(device={arch=NVPTX}) #include "cuda/math.h" #pragma omp begin
> declare variant match(device={isa=sm70}) #include "cuda/sm70/math.h"
> #pragma omp end declare variant #pragma omp end declare variant #pragma
> omp begin declare variant match(arch=AMD) #include "amdgpu/math.h"
> #pragma omp end declare variant #endif|
>
> And such code needs to keep working if there is a '#define ABC ...' in
> one file and an '#ifndef ABC / #define ABC ...' in the other file.
>
> Additionally, it would be neat if it would handle target-specific defines
> like '#if __PTX_SM__ == 350' for the relevant parts (here: arch=nvptx).
> (We already do support context selectors via the gcc/config/*/t-omp-device files;
> see also https://gcc.gnu.org/onlinedocs/libgomp/OpenMP-Context-Selectors.html )
>
> Thoughts?

Err, so the OMP standard doesn't put any constraints on what to allow inside the
variants?  Is declare variant always at the toplevel?

> * * *
>
> The question is also what to support – "just" function declarations which are specific
> to a device or some generic replacement of the kind:
>
> |#pragma omp begin declare variant match(device={arch=NVPTX})|
>    #define NUM_THREADS 128
> #pragma omp end declare variant
> |#pragma omp begin declare variant match(device={arch=AMDGCN})|
>    #define NUM_THREADS 64
> #pragma omp end declare variant
>
> #ifndef NUM_THREADSß
>    #define NUM_THREADS 16
> #endif
>
> ...
> printf ("Running with %d threads\n", NUM_THREADS);
> #pragma omp parallel for num_threads(NUM_THREADS)
>
> * * *
>
> If we only handle 'begin/end declare variant', the following
> works in principle:
> - Parse the file once with only host-code parsing but
> - keep track of delimited '|omp begin declare variant|'
>    where the context selector matches one of the supported
>    offload targets.
> - parse the file n-times again but this time set the
>    target-#defined (extended version of gcc/config/*/t-omp-device
>    to make them available?)
> - When doing so, ignore all non-offloading bits (issue: implicit
>    'declare target' + have the data available for variant resolution).
> - Store this in some way.
>
> But it is not really clear to me how to do this in actual code.

But does that really help?  Consider

#ifdef _OPENMP
#pragma omp begin declare variant match(device={arch=NVPTX})
#include "cuda/math.h"
...

#pragma omp begin declare variant match(device={arch=NVPTX})
#include "conflicting with cuda/math.h"
...

or is there a constraint that "un-varianting" same-match variants need
to produce a valid translation unit?  That is, don't you get combinatorical
explosion with sequenced variants?

Does the OMP standard at all think of how the resulting C/C++ translation
unit is formed or does it simply take each variant as "finishing" a TU after
omp end declare variant?  Thus do declarations leak out of the "active"
variant into the following parts of the C/C++ TU?

To me it really looks like a very badly designed feature, not to mention
that it involves the preprocessor ...

> Any suggestion?

Something like you propose.  I'd even do it "harder", inventing a new
omppd (openmp preprocessor driver) which will pre-parse a TU and
invoke several compiler instances (GCC drivers) with -fomp-variant=X
making only variants "X" active.  Doesn't really solve the issue with
sequenced variants unless there are constraints in the OMP spec
making that work.  It should be possible to have the separate compilers
produce LTO bytecode (for the offload target then) from the "same"
C TU and combine them at WPA time.  All the offload table handling
might need to improve here of course, but the omppd might produce
enough meta data to help here.

That said, I really wouldn't try to fiddle "omppd" into the host
compiler parts, that doesn't sound fun for maintainance purposes.

Richard.

> Tobias
>
> PS: I would like to have some input before the Cauldron, but we might want
> to additionally discuss this in detail during the cauldron, possibly some
> brainstorming before the BoF and then surely also in the BoF.
>
> -----------------
> Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [OpenMP/offloading][RFC] How to handle target/device-specifics with C pre-processor (in general, inside 'omp declare variant')
  2023-08-22  7:25 ` Richard Biener
@ 2023-08-22  8:43   ` Tobias Burnus
  2023-08-22  9:05     ` Jakub Jelinek
  0 siblings, 1 reply; 4+ messages in thread
From: Tobias Burnus @ 2023-08-22  8:43 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek, Thomas Schwinge, Joseph Myers

On 22.08.23 09:25, Richard Biener wrote:
> On Mon, Aug 21, 2023 at 6:23 PM Tobias Burnus <tobias@codesourcery.com> wrote:
>> ...
> Err, so the OMP standard doesn't put any constraints on what to allow inside the
> variants?  Is declare variant always at the toplevel?

Actually, the OpenMP specification only states the following – which is less than I claimed:

"If the context selector of a begin declare variant directive contains traits in the device
or implementation set that are known never to be compatible with an OpenMP context during
the current compilation, the preprocessed code that follows the begin declare variant
directive up to its paired end directive is elided."

With once per target parsing as with clang, code like:

#pragma omp begin declare variant ... arch={...}
  #define FOO 5
  ...
#pragma omp end declare variant ... arch={}

could be effectively replaced by:

#ifdef __nvptx__
  #define FOO 5
  #pragma omp begin declare variant ... arch={...}
    ...
  #pragma omp end declare variant ... arch={}
#end if

such that only code remains which matches the architecture
and ISA – but for all other selectors like
while for

#pragma omp begin declare variant ... construct={teams,parallel,for})
   #define BAR 1
   ...
#pragma omp end declare variant
#pragma omp begin declare variant ... construct={distribute})
   #define FOOBAR 1
#pragma omp end declare variant

the two defines would remain, visible in the whole TU.

* * *

Thus, for GCC not eliding anything – because it might get used in later
processing – would be a conforming implementation.

However, I fear that users expect that code like in the shown example
works, i.e. at least 'arch' (for us: host + nvptx, amdgcn) and possibly
'isa' (host ISA + gfx906, sm_80 etc.) "work", i.e. preprocessed code is
elided.

As such a support come for free with Clang (and most/all other
compilers, most of them are based on Clang), it will work there,
increasing the chance that users want to use it.

* * *

Regarding top level or not, the spec does not really tell – except that
it has to be used in declarative context.

In practical terms, I assume that code elision will (nearly) only be
used at top-level context and via #include – with the idea that this
brings in function declarations (but is likely to bring in #defines as
side effect).

In terms of the spec, more is permitted - including using it inside C++
classes, albeit not for constructors/destructors, virtual, defaulted and
deleted functions. - But that seems to be an odd place for adding an
#include or #define or some other code.

Likewise for inside a function or some scope inside a function.

Thus, IMHO, not supporting non-toplevel elision would be fine.

[Thinking of it, the problem is not only conflicting function
declarations and #define but also conflicting typedef and enum/struct.]

>> ...
> But does that really help?  Consider
>
> #ifdef _OPENMP
> #pragma omp begin declare variant match(device={arch=NVPTX})
> #include "cuda/math.h"
> ...
>
> #pragma omp begin declare variant match(device={arch=NVPTX})
> #include "conflicting with cuda/math.h"
> ...
I think this will produce a conflict (when the compiler accepts
arch=nvptx) as it has the same context selector; that's independent
whether nested (begin / begin ... end / end) or squential (begin ...
end; begin ... end).
> or is there a constraint that "un-varianting" same-match variants need
> to produce a valid translation unit?  That is, don't you get combinatorical
> explosion with sequenced variants?

You do get an combinatorical explosion – but only handling arch + isa
currently leaves host (+enabled ISA) plus devices (each: + enabled ISA)
such that with current GCC support, only up to 3 combinations remain
(host, amdgcn, nvptx).

And in case of Clang or any once-per-device parsing, only a single
combination remains. The spec permits to handle other things, but
arch/isa seems to be the most useful and for (multi-parse compiler) the
simplest.

(Side remark: It would be useful if we could support multiple ISA per
offload target, e.g. compiling for gfx908 *and* gfx90a, but that's
currently not possible with GCC (but it is with Clang).)
> Does the OMP standard at all think of how the resulting C/C++ translation
> unit is formed or does it simply take each variant as "finishing" a TU after
> omp end declare variant?  Thus do declarations leak out of the "active"
> variant into the following parts of the C/C++ TU?

I think it does not really think of finishing but of eliding before it
is processed – such that the remaining code simply applies to the TU as
if there were no surrounding begin/end declare variant.

And it assume a multi-parse setup.

> To me it really looks like a very badly designed feature, not to mention
> that it involves the preprocessor ...

Yes, it is an odd combination of preprocessor and code gen. If both is
done in one step, I think it works - instructing the extended processor
to skip until 'end declare variant' if the context selector for 'begin
declare variant' does not match.

But if one splits it into two parts: 'cpp' and only then 'compiling',
the parser has to skip over all code, including code it potentially does
not handle, until the #pragma omp end declare variant.

I am not sure whether both variants should to be supported, but only
supporting the former seems to be more important and sufficient.

* * *

>> Any suggestion?
> Something like you propose.  I'd even do it "harder", inventing a new
> omppd (openmp preprocessor driver) which will pre-parse a TU and
> invoke several compiler instances (GCC drivers) with -fomp-variant=X
> making only variants "X" active.
Hmm. That would kind of undo the parse once we currently have, but solve
the issue.
>    Doesn't really solve the issue with
> sequenced variants unless there are constraints in the OMP spec
> making that work.  It should be possible to have the separate compilers
> produce LTO bytecode (for the offload target then) from the "same"
> C TU and combine them at WPA time.  All the offload table handling
> might need to improve here of course, but the omppd might produce
> enough meta data to help here.
>
> That said, I really wouldn't try to fiddle "omppd" into the host
> compiler parts, that doesn't sound fun for maintainance purposes.

Thanks for your comments!

Tobias

-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [OpenMP/offloading][RFC] How to handle target/device-specifics with C pre-processor (in general, inside 'omp declare variant')
  2023-08-22  8:43   ` Tobias Burnus
@ 2023-08-22  9:05     ` Jakub Jelinek
  0 siblings, 0 replies; 4+ messages in thread
From: Jakub Jelinek @ 2023-08-22  9:05 UTC (permalink / raw)
  To: Tobias Burnus; +Cc: Richard Biener, gcc-patches, Thomas Schwinge, Joseph Myers

On Tue, Aug 22, 2023 at 10:43:54AM +0200, Tobias Burnus wrote:
> On 22.08.23 09:25, Richard Biener wrote:
> > On Mon, Aug 21, 2023 at 6:23 PM Tobias Burnus <tobias@codesourcery.com> wrote:
> > > ...
> > Err, so the OMP standard doesn't put any constraints on what to allow inside the
> > variants?  Is declare variant always at the toplevel?
> 
> Actually, the OpenMP specification only states the following – which is less than I claimed:
> 
> "If the context selector of a begin declare variant directive contains traits in the device
> or implementation set that are known never to be compatible with an OpenMP context during
> the current compilation, the preprocessed code that follows the begin declare variant
> directive up to its paired end directive is elided."

The reason for the way how GCC implements the offloading is make sure the
layout of types/variables/functions is the same so that the host and
offloading side can actually interoperate.  I think it is much cleaner
design.
The unfortunate thing is that LLVM decided to do it differently, by separate
parsing/compilation for host cases and device cases.
That allows the various preprocessor games and the like, but on the other
side allows the user to make host vs. offloading inoperable - say #ifdefing
out some members of a struct, using different attributes which cause
different alignment and the like.  If source comes from a pipe, what do you
do so that you can preprocess multiple times?  The offloading compilation
still needs to be some weird hybrid of the offloading target and host target,
because e.g. the structure/variable layout/alignment etc. decisions need to
be done according to host target.
The worst thing is that the bad way LLVM decided to implement this later
leaks into the standard, where some people who propose new features just
don't think that it could be implemented differently and that results in
cases like the begin declare variant eliding what is in between.  It takes
time to adjust the wording so that it is acceptable even for the GCC way
of doing offloading and sometimes we aren't successful at it.
So, the long term question is if we should't give up and do it with separate
parsing as well.  But that would be a lot of work...

	Jakub

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2023-08-22  9:05 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-21 16:23 [OpenMP/offloading][RFC] How to handle target/device-specifics with C pre-processor (in general, inside 'omp declare variant') Tobias Burnus
2023-08-22  7:25 ` Richard Biener
2023-08-22  8:43   ` Tobias Burnus
2023-08-22  9:05     ` Jakub Jelinek

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).