public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug target/97203] New: [nvptx] 'illegal memory access was encountered' with 'omp simd'/SIMT and cexpf call
@ 2020-09-25  8:52 burnus at gcc dot gnu.org
  2020-09-25  8:57 ` [Bug target/97203] " burnus at gcc dot gnu.org
                   ` (10 more replies)
  0 siblings, 11 replies; 12+ messages in thread
From: burnus at gcc dot gnu.org @ 2020-09-25  8:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97203

            Bug ID: 97203
           Summary: [nvptx] 'illegal memory access was encountered' with
                    'omp simd'/SIMT and cexpf call
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Keywords: openmp, wrong-code
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: burnus at gcc dot gnu.org
                CC: vries at gcc dot gnu.org
  Target Milestone: ---
            Target: nvptx

Created attachment 49269
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49269&action=edit
C testcase - compile with -fopenmp and "-O0", "-O1", and "-O1
-funsafe-math-optimizations"

My impression is that this is again (→ PR95654) related to SIMT going somehow
wrong, but I do not quite understand why.


The code uses 'omp simd ... reduction(…)' — using 'omp parallel do ...' instead
works.


The big program works at -O0, fails with -O1/-O2 but starts working again if
additionally -ffast-math is used. The fail is:
  libgomp: cuCtxSynchronize error: invalid program counter
or
  libgomp: cuCtxSynchronize error: unspecified launch failure (perhaps abort
was called) 


The attached program is a vastly reduced version, which has a similar fail and
similar pattern, which may or may not have the same cause. – In any case:

It uses 'omp simd' and, hence, nvptx's SIMT and inside 'omp simd':
            float cosArg = __builtin_cosf(expArg);
            float sinArg = __builtin_sinf(expArg);

With with -O0 but also with -O1/-O2 -funsafe-math-optimizations it works and
the code contains with -funsafe-math-optimizations:
                cos.approx.f32  %r73, %r75;
                sin.approx.f32  %r72, %r75;
and with -O0 (and unsafe math disabled):
                call (%value_in), cosf, (%out_arg1);
                call (%value_in), sinf, (%out_arg1);

But with -O1/-O2 it fails with:
   libgomp: cuCtxSynchronize error: an illegal memory access was encountered
here, the sin/cos was turned into BUILT_IN_SINCOSF and we end up with the code:
   call cexpf, (%out_arg1, %out_arg2, %out_arg3);


I have no idea why 'call cosf/sinf' inside 'omp simd' works but 'call cexpf'
fails – nor whether that is indeed related to SIMT.


I think there are two issues. Mainly:

FIRST ISSUE: Why does it fail with 'cexpf'?

 * * *

SECOND ISSUE: Missed optimization for BUILT_IN_SINCOSF:

  if (optab_handler (sincos_optab, mode) != CODE_FOR_nothing)
...
  else if (targetm.libc_has_function (function_sincos))
...
  else
...
        fn = builtin_decl_explicit (BUILT_IN_CEXPF);


Seems as if we do the latter. In newlib's ./newlib/libm/complex/cexpf.c:

cexpf(float complex z)
...
        x = crealf(z);
        y = cimagf(z);
        r = expf(x);
        w = r * cosf(y) + r * sinf(y) * I;
        return w;

which is not really a performance boost compared to just calling sinf/cosf ...

Note that newlib does have newlib/libm/math/wf_sincos.c which does:
        void sincosf(float x, float *sinx, float *cosx)
{
  *sinx = sinf (x);
  *cosx = cosf (x);

Which avoids a bunch of '*' and '+' and inparticular an 'expf' call. (Should be
still slower than directly calling sinf/cosf due to the call overhead, but much
better than cexpf, unless implemented in hardware.)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug target/97203] [nvptx] 'illegal memory access was encountered' with 'omp simd'/SIMT and cexpf call
  2020-09-25  8:52 [Bug target/97203] New: [nvptx] 'illegal memory access was encountered' with 'omp simd'/SIMT and cexpf call burnus at gcc dot gnu.org
@ 2020-09-25  8:57 ` burnus at gcc dot gnu.org
  2020-10-08  9:56 ` vries at gcc dot gnu.org
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: burnus at gcc dot gnu.org @ 2020-09-25  8:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97203

--- Comment #1 from Tobias Burnus <burnus at gcc dot gnu.org> ---
Besides PR95654, see PR81778 and PR80053.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug target/97203] [nvptx] 'illegal memory access was encountered' with 'omp simd'/SIMT and cexpf call
  2020-09-25  8:52 [Bug target/97203] New: [nvptx] 'illegal memory access was encountered' with 'omp simd'/SIMT and cexpf call burnus at gcc dot gnu.org
  2020-09-25  8:57 ` [Bug target/97203] " burnus at gcc dot gnu.org
@ 2020-10-08  9:56 ` vries at gcc dot gnu.org
  2020-10-08 12:57 ` vries at gcc dot gnu.org
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: vries at gcc dot gnu.org @ 2020-10-08  9:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97203

--- Comment #2 from Tom de Vries <vries at gcc dot gnu.org> ---
Minimal version (without inlining sinf code from newlib):
...
/* { dg-additional-options "-lm -foffload=-lm" } */

#define N 1

int
main (void) {
  float k[N];
  float res;

  for (int i = 0; i < N; i++)
    k[i] = 300;

#pragma omp target map(to:k) map(from:res)
  {
    float sum = 0.0;
#pragma omp simd reduction(+:sum)
    for (int i = 0; i < N; i++)
      sum += __builtin_sinf (k[i]);

    res = sum;
  }

  return 0;
}
...

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug target/97203] [nvptx] 'illegal memory access was encountered' with 'omp simd'/SIMT and cexpf call
  2020-09-25  8:52 [Bug target/97203] New: [nvptx] 'illegal memory access was encountered' with 'omp simd'/SIMT and cexpf call burnus at gcc dot gnu.org
  2020-09-25  8:57 ` [Bug target/97203] " burnus at gcc dot gnu.org
  2020-10-08  9:56 ` vries at gcc dot gnu.org
@ 2020-10-08 12:57 ` vries at gcc dot gnu.org
  2020-10-08 15:27 ` vries at gcc dot gnu.org
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: vries at gcc dot gnu.org @ 2020-10-08 12:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97203

--- Comment #3 from Tom de Vries <vries at gcc dot gnu.org> ---
[ Note, this is with GOMP_NVPTX_JIT=-O0. ]

In sinf, we have:
...
 45:                        return -__kernel_cosf(y[0],y[1]);
...
which translates to:
...
.loc 1 45 12
ld.f32 %r67,[%frame+4];
ld.f32 %r65,[%frame];
{
.param .f32 %value_in;
.param .f32 %out_arg1;
st.param.f32 [%out_arg1],%r65;
.param .f32 %out_arg2;
st.param.f32 [%out_arg2],%r67;
call (%value_in),__kernel_cosf,(%out_arg1,%out_arg2);
ld.param.f32 %r68,[%value_in];
}
.loc 1 45 11
neg.f32 %r37,%r68;
...

If I place (using GOMP_NVPTX_PTXRW) a trap before the first load:
...
 .loc 1 45 12
+trap
 ld.f32 %r67,[%frame+4];
...
I get:
...
libgomp: cuCtxSynchronize error: an illegal instruction was encountered
...

If I place it after the first load, I get:
...
libgomp: cuCtxSynchronize error: an illegal memory access was encountered
...

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug target/97203] [nvptx] 'illegal memory access was encountered' with 'omp simd'/SIMT and cexpf call
  2020-09-25  8:52 [Bug target/97203] New: [nvptx] 'illegal memory access was encountered' with 'omp simd'/SIMT and cexpf call burnus at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2020-10-08 12:57 ` vries at gcc dot gnu.org
@ 2020-10-08 15:27 ` vries at gcc dot gnu.org
  2020-10-08 15:37 ` vries at gcc dot gnu.org
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: vries at gcc dot gnu.org @ 2020-10-08 15:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97203

--- Comment #4 from Tom de Vries <vries at gcc dot gnu.org> ---
So, I think calling functions from simd code is atm not supported for nvptx.

Stack variables in simd code are mapped on a per-thread stack rather than on
the
usual per-warp stack.

The functions are compiled with the usual per-warp stack, so calling those
functions from simd might mean the different lanes are gonna disagree about
what the value in a stack variable should be.

Having said that, for the example in comment 2, there only should be one thread
executing the call, so this doesn't explain the illegal memory access.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug target/97203] [nvptx] 'illegal memory access was encountered' with 'omp simd'/SIMT and cexpf call
  2020-09-25  8:52 [Bug target/97203] New: [nvptx] 'illegal memory access was encountered' with 'omp simd'/SIMT and cexpf call burnus at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2020-10-08 15:27 ` vries at gcc dot gnu.org
@ 2020-10-08 15:37 ` vries at gcc dot gnu.org
  2020-10-09  8:36 ` amonakov at gcc dot gnu.org
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: vries at gcc dot gnu.org @ 2020-10-08 15:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97203

--- Comment #5 from Tom de Vries <vries at gcc dot gnu.org> ---
FWIW, another aspect here is convergence (as usual).

Looking at the SASS code for main$_omp_fn$0$impl, I don't find evidence for the
usual divergence/convergence ops (SSY/SYNC), which might mean that the
following shfl is executed in divergent mode, so, even if we would not get the
memory access error, we would not get correct results.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug target/97203] [nvptx] 'illegal memory access was encountered' with 'omp simd'/SIMT and cexpf call
  2020-09-25  8:52 [Bug target/97203] New: [nvptx] 'illegal memory access was encountered' with 'omp simd'/SIMT and cexpf call burnus at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2020-10-08 15:37 ` vries at gcc dot gnu.org
@ 2020-10-09  8:36 ` amonakov at gcc dot gnu.org
  2020-10-12  8:44 ` vries at gcc dot gnu.org
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: amonakov at gcc dot gnu.org @ 2020-10-09  8:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97203

Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #6 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
(In reply to Tom de Vries from comment #4)
> So, I think calling functions from simd code is atm not supported for nvptx.
> 
> Stack variables in simd code are mapped on a per-thread stack rather than on
> the
> usual per-warp stack.
> 
> The functions are compiled with the usual per-warp stack, so calling those
> functions from simd might mean the different lanes are gonna disagree about
> what the value in a stack variable should be.

This is inaccurate. In -msoft-stack mode there's no baked-in assumption that
stacks are always per-warp. The "soft stack" pointer can point either to global
memory (outside of SIMD regions), or to local memory (inside SIMD regions). The
pointer is switched between per-warp global memory and per-lane local memory by
nvptx.c:nvptx_output_softstack_switch.

The main requirement is that functions callable from OpenMP offloaded code are
compiled for -mgomp multilib variant. The design allows calling functions even
from inside SIMD regions, and it should be supported.

It is very disappointing that the first reaction was "I think ... is not
supported" without reaching out and asking questions. Lack of efficient
communication was a huge issue when OpenMP offloading support was contributed,
and it's disappointing to see it again years later.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug target/97203] [nvptx] 'illegal memory access was encountered' with 'omp simd'/SIMT and cexpf call
  2020-09-25  8:52 [Bug target/97203] New: [nvptx] 'illegal memory access was encountered' with 'omp simd'/SIMT and cexpf call burnus at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2020-10-09  8:36 ` amonakov at gcc dot gnu.org
@ 2020-10-12  8:44 ` vries at gcc dot gnu.org
  2020-10-12  9:36 ` amonakov at gcc dot gnu.org
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: vries at gcc dot gnu.org @ 2020-10-12  8:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97203

--- Comment #7 from Tom de Vries <vries at gcc dot gnu.org> ---
(In reply to Alexander Monakov from comment #6)
> (In reply to Tom de Vries from comment #4)
> > So, I think calling functions from simd code is atm not supported for nvptx.
> > 
> > Stack variables in simd code are mapped on a per-thread stack rather than on
> > the
> > usual per-warp stack.
> > 
> > The functions are compiled with the usual per-warp stack, so calling those
> > functions from simd might mean the different lanes are gonna disagree about
> > what the value in a stack variable should be.
> 
> This is inaccurate. In -msoft-stack mode there's no baked-in assumption that
> stacks are always per-warp. The "soft stack" pointer can point either to
> global memory (outside of SIMD regions), or to local memory (inside SIMD
> regions). The pointer is switched between per-warp global memory and
> per-lane local memory by nvptx.c:nvptx_output_softstack_switch.
> 
> The main requirement is that functions callable from OpenMP offloaded code
> are compiled for -mgomp multilib variant. The design allows calling
> functions even from inside SIMD regions, and it should be supported.

I see, that's helpful, thanks.

I guess I was thrown off by seeing a %simtstack_ar of 136 bytes:
...
.local .align 8 .b8 %simtstack_ar[136];
...
which seems more of an amount claimed by a single function.

Is it possible you meant the default of -msoft-stack-reserve-local=128 to mean
128kb (similar to what is claimed in nvptx_stacks_size in the plugin)? Because
currently it means 128 bytes.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug target/97203] [nvptx] 'illegal memory access was encountered' with 'omp simd'/SIMT and cexpf call
  2020-09-25  8:52 [Bug target/97203] New: [nvptx] 'illegal memory access was encountered' with 'omp simd'/SIMT and cexpf call burnus at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2020-10-12  8:44 ` vries at gcc dot gnu.org
@ 2020-10-12  9:36 ` amonakov at gcc dot gnu.org
  2020-10-12 13:00 ` vries at gcc dot gnu.org
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: amonakov at gcc dot gnu.org @ 2020-10-12  9:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97203

--- Comment #8 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
No, -msoft-stack-reserve-local is really meant to be in bytes: it may not
exceed the amount of .local memory reserved by CUDA driver (which is just 1-2
KB, unless overridden via cuCtxSetLimit, which nvptx-run.c does, but
plugin-nvptx.c does not).

Keep in mind that .local memory reservation is multiplied by number of active
contexts, which could be in range 20000-30000 when the code was written: 128KB
local memory per active thread would imply a 2.5GB allocation on the GPU.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug target/97203] [nvptx] 'illegal memory access was encountered' with 'omp simd'/SIMT and cexpf call
  2020-09-25  8:52 [Bug target/97203] New: [nvptx] 'illegal memory access was encountered' with 'omp simd'/SIMT and cexpf call burnus at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2020-10-12  9:36 ` amonakov at gcc dot gnu.org
@ 2020-10-12 13:00 ` vries at gcc dot gnu.org
  2020-10-12 15:10 ` vries at gcc dot gnu.org
  2020-10-12 15:15 ` amonakov at gcc dot gnu.org
  10 siblings, 0 replies; 12+ messages in thread
From: vries at gcc dot gnu.org @ 2020-10-12 13:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97203

--- Comment #9 from Tom de Vries <vries at gcc dot gnu.org> ---
(In reply to Tom de Vries from comment #2)
> Minimal version (without inlining sinf code from newlib):
> ...
> /* { dg-additional-options "-lm -foffload=-lm" } */
> 
> #define N 1
> 
> int
> main (void) {
>   float k[N];
>   float res;
> 
>   for (int i = 0; i < N; i++)
>     k[i] = 300;
>   
> #pragma omp target map(to:k) map(from:res)
>   {
>     float sum = 0.0;
> #pragma omp simd reduction(+:sum)
>     for (int i = 0; i < N; i++)
>       sum += __builtin_sinf (k[i]);
>     
>     res = sum;
>   }
> 
>   return 0;
> }
> ...

Starts passing at -foffload=-msoft-stack-reserve-local=346.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug target/97203] [nvptx] 'illegal memory access was encountered' with 'omp simd'/SIMT and cexpf call
  2020-09-25  8:52 [Bug target/97203] New: [nvptx] 'illegal memory access was encountered' with 'omp simd'/SIMT and cexpf call burnus at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2020-10-12 13:00 ` vries at gcc dot gnu.org
@ 2020-10-12 15:10 ` vries at gcc dot gnu.org
  2020-10-12 15:15 ` amonakov at gcc dot gnu.org
  10 siblings, 0 replies; 12+ messages in thread
From: vries at gcc dot gnu.org @ 2020-10-12 15:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97203

--- Comment #10 from Tom de Vries <vries at gcc dot gnu.org> ---
(In reply to Alexander Monakov from comment #8)
> No, -msoft-stack-reserve-local is really meant to be in bytes: it may not
> exceed the amount of .local memory reserved by CUDA driver (which is just
> 1-2 KB, unless overridden via cuCtxSetLimit, which nvptx-run.c does, but
> plugin-nvptx.c does not).
> 
> Keep in mind that .local memory reservation is multiplied by number of
> active contexts, which could be in range 20000-30000 when the code was
> written: 128KB local memory per active thread would imply a 2.5GB allocation
> on the GPU.

With the number of active contexts, do you mean the sm_count * thread_max as
used in nvptx-run.c (which, FWIW, is 10.240 on my card)?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug target/97203] [nvptx] 'illegal memory access was encountered' with 'omp simd'/SIMT and cexpf call
  2020-09-25  8:52 [Bug target/97203] New: [nvptx] 'illegal memory access was encountered' with 'omp simd'/SIMT and cexpf call burnus at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2020-10-12 15:10 ` vries at gcc dot gnu.org
@ 2020-10-12 15:15 ` amonakov at gcc dot gnu.org
  10 siblings, 0 replies; 12+ messages in thread
From: amonakov at gcc dot gnu.org @ 2020-10-12 15:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97203

--- Comment #11 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Yes, that.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2020-10-12 15:15 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-25  8:52 [Bug target/97203] New: [nvptx] 'illegal memory access was encountered' with 'omp simd'/SIMT and cexpf call burnus at gcc dot gnu.org
2020-09-25  8:57 ` [Bug target/97203] " burnus at gcc dot gnu.org
2020-10-08  9:56 ` vries at gcc dot gnu.org
2020-10-08 12:57 ` vries at gcc dot gnu.org
2020-10-08 15:27 ` vries at gcc dot gnu.org
2020-10-08 15:37 ` vries at gcc dot gnu.org
2020-10-09  8:36 ` amonakov at gcc dot gnu.org
2020-10-12  8:44 ` vries at gcc dot gnu.org
2020-10-12  9:36 ` amonakov at gcc dot gnu.org
2020-10-12 13:00 ` vries at gcc dot gnu.org
2020-10-12 15:10 ` vries at gcc dot gnu.org
2020-10-12 15:15 ` amonakov at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).