public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
From: "tschwinge at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/104364] New: [12 Regression] OpenMP/nvptx regressions after "[nvptx] Add some support for .local atomics"
Date: Thu, 03 Feb 2022 09:53:43 +0000	[thread overview]
Message-ID: <bug-104364-4@http.gcc.gnu.org/bugzilla/> (raw)

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104364

            Bug ID: 104364
           Summary: [12 Regression] OpenMP/nvptx regressions after
                    "[nvptx] Add some support for .local atomics"
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Keywords: openmp
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tschwinge at gcc dot gnu.org
                CC: vries at gcc dot gnu.org
  Target Milestone: ---
            Target: nvptx

As we've (slightly) discussed before, I acknowledge that commit
r12-6966-ge0451f93d9faa13495132f4e246e9bef30b51417 "[nvptx] Add some support
for .local atomics" is a good idea, thanks!  (Though I wonder, wouldn't it be
possible for the Driver/JIT to handle that internally?  Aha, you had the same
comment; first sentence in
<https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96494#c2>.)

For certain GPU/Driver combinations, this change however causes regressions in
OpenMP/nvptx testing:

    PASS: libgomp.c/../libgomp.c-c++-common/for-11.c (test for excess errors)
    [-PASS:-]{+WARNING: program timed out.+}
    {+FAIL:+} libgomp.c/../libgomp.c-c++-common/for-11.c execution test

    PASS: libgomp.c/../libgomp.c-c++-common/for-12.c (test for excess errors)
    [-PASS:-]{+WARNING: program timed out.+}
    {+FAIL:+} libgomp.c/../libgomp.c-c++-common/for-12.c execution test

    PASS: libgomp.c/../libgomp.c-c++-common/for-3.c (test for excess errors)
    [-PASS:-]{+WARNING: program timed out.+}
    {+FAIL:+} libgomp.c/../libgomp.c-c++-common/for-3.c execution test

    PASS: libgomp.c/../libgomp.c-c++-common/for-5.c (test for excess errors)
    [-PASS:-]{+FAIL:+} libgomp.c/../libgomp.c-c++-common/for-5.c execution test

    PASS: libgomp.c/../libgomp.c-c++-common/for-6.c (test for excess errors)
    [-PASS:-]{+WARNING: program timed out.+}
    {+FAIL:+} libgomp.c/../libgomp.c-c++-common/for-6.c execution test

    PASS: libgomp.c/../libgomp.c-c++-common/for-9.c (test for excess errors)
    [-PASS:-]{+WARNING: program timed out.+}
    {+FAIL:+} libgomp.c/../libgomp.c-c++-common/for-9.c execution test

But varying, not always all of these.

Sometimes with, sometimes without the timeout.

Similar for C++.

... but only seen regressing for:

  - Nvidia Tesla K20c, Driver Version: 346.46
  - Nvidia Tesla K20c, Driver Version: 455.38
  - Nvidia Tesla K40c, Driver Version: 455.38
  - Nvidia Tesla K80, Driver Version: 440.33.01

... and never seen regressing for:

  - Nvidia Quadro P1000, Driver Version: 450.119.03
  - Nvidia GeForce GTX 1080, Driver Version: 440.33.01
  - Nvidia TITAN V, Driver Version: 455.23.05

(What is the underlying characteristic here?)


I did notice that this change affects nvptx target libraries -- but OpenMP
'mgomp' multilib only -- as follows:

'nvptx-none/mgomp/libatomic/cas_1_.o' (complete diff):

    @@ -113,7 +113,7 @@
     .loc 3 80 9
     or.b64 %r61,%r60,%r39;
     .loc 3 82 11
    -@ %r71 atom.cas.b64 %r62,[%r35],%r29,%r61;
    +atom.cas.b64 %r62,[%r35],%r29,%r61;
     mov.b64 {%r69,%r70},%r62;
     shfl.idx.b32 %r69,%r69,%r68,31;
     shfl.idx.b32 %r70,%r70,%r68,31;

Similar for 'nvptx-none/mgomp/libatomic/cas_2_.o'.

'nvptx-none/mgomp/libatomic/cas_4_.o' (complete diff):

    @@ -50,9 +50,9 @@
     mov.u32 %r25,%ar2;
     .loc 2 41 12
     ld.u32 %r28,[%r24];
    -@ %r37 membar.sys;
    -@ %r37 atom.cas.b32 %r29,[%r23],%r28,%r25;
    -@ %r37 membar.sys;
    +membar.sys;
    +atom.cas.b32 %r29,[%r23],%r28,%r25;
    +membar.sys;
     shfl.idx.b32 %r29,%r29,%r36,31;
     setp.eq.u32 %r31,%r29,%r28;
     selp.u32 %r30,1,0,%r31;

Similar for 'nvptx-none/mgomp/libatomic/cas_8_.o'.

Simlarly for several others in 'nvptx-none/mgomp/libatomic/',
'nvptx-none/mgomp/libbacktrace/dwarf.o', 'nvptx-none/mgomp/libgcc/atomic.o',
'nvptx-none/mgomp/libgfortran/single.o', several in
'nvptx-none/mgomp/libgomp/'.

That is, remove predication from all the "atomic" instructions.  I of course do
see the code changes related to that in the commit, but I do wonder whether
that's all intentional/correct?  (Or, is the 'atom' provably a no-op if not '@
%r37'?)

And, in 'nvptx-none/mgomp/libgomp/task.o' we've got several things like:

    -@ %r330 atom.cas.b32 %r208,[%frame+16],%r80,%r207;
    -@ %r330 membar.sys;
    +{
    +.reg .pred %eq_p;
    +.reg .u32 %val;
    +ld.u32 %val,[%frame+16];
    +setp.eq.u32 %eq_p,%val,%r80;
    +@ %eq_p st.u32 [%frame+16],%r207;
    +mov.u32 %r208,%val;
    +}

So these are the non-'atom' replacement sequences that now get synthesized
(haven't verified) -- but again this one lost the '@ %r330' predication from
all the "atomic" instructions?

And, is it correct here to use the non-'atom' replacement, though?  '%frame'
comes from:

    .visible .func GOMP_taskwait
    {
    .reg .u64 %stack;
    .reg .u64 %frame;
    .reg .u64 %sspslot;
    .reg .u64 %sspprev;
    {
    .reg .u32 %fstmp0;
    .reg .u64 %fstmp1;
    .reg .u64 %fstmp2;
    mov.u32 %fstmp0,%tid.y;
    mul.wide.u32 %fstmp1,%fstmp0,8;
    mov.u64 %fstmp2,__nvptx_stacks;
    add.u64 %sspslot,%fstmp2,%fstmp1;
    ld.shared.u64 %sspprev,[%sspslot];
    sub.u64 %frame,%sspprev,32;
    sub.u64 %stack,%frame,16;
    st.shared.u64 [%sspslot],%stack;
    }
    [...]

... which -- at least from my superficial look -- seems to be some kind of
'shared' location, thus mandating 'atom' access?  (But haven't looked in
detail, so I may certainly be wrong, of course.)

Are these changes and/or similar changes in nvptx code generation responsible
for the regressing test cases?  (Again, I have not looked.)


I'll report in case anything here changes due to Tom's further recent GCC/nvptx
back end commits.

             reply	other threads:[~2022-02-03  9:53 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-02-03  9:53 tschwinge at gcc dot gnu.org [this message]
2022-02-03 10:35 ` [Bug target/104364] " vries at gcc dot gnu.org
2022-02-03 11:45 ` rguenth at gcc dot gnu.org
2022-02-03 11:48 ` vries at gcc dot gnu.org
2022-02-03 12:47 ` vries at gcc dot gnu.org
2022-02-03 13:03 ` vries at gcc dot gnu.org
2022-02-03 13:35 ` vries at gcc dot gnu.org
2022-02-04  7:27 ` tschwinge at gcc dot gnu.org
2022-02-08  9:01 ` cvs-commit at gcc dot gnu.org
2022-02-08  9:07 ` vries at gcc dot gnu.org

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=bug-104364-4@http.gcc.gnu.org/bugzilla/ \
    --to=gcc-bugzilla@gcc.gnu.org \
    --cc=gcc-bugs@gcc.gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).