Stabilizing flaky libgomp GCN target/offloading testing (was: libgomp GCN gfx1030/gfx1100 offloading status)

public inbox for gcc-patches@gcc.gnu.org

From: Thomas Schwinge <tschwinge@baylibre.com>
To: Richard Biener <rguenther@suse.de>, Andrew Stubbs <ams@baylibre.com>
Cc: Tobias Burnus <tburnus@baylibre.com>,
	gcc-patches@gcc.gnu.org, Jakub Jelinek <jakub@redhat.com>
Subject: Stabilizing flaky libgomp GCN target/offloading testing (was: libgomp GCN gfx1030/gfx1100 offloading status)
Date: Wed, 21 Feb 2024 13:34:01 +0100
Message-ID: <87il2ij8sm.fsf@euler.schwinge.ddns.net>
In-Reply-To: <7sn70594-70r4-q5pp-7q5p-qr865r9q53qn@fhfr.qr>

Hi!

On 2024-02-01T15:49:02+0100, Richard Biener <rguenther@suse.de> wrote:
> On Thu, 1 Feb 2024, Thomas Schwinge wrote:
>> On 2024-01-26T10:45:10+0100, Richard Biener <rguenther@suse.de> wrote:
>> > On Fri, 26 Jan 2024, Richard Biener wrote:
>> >> On Wed, 24 Jan 2024, Andrew Stubbs wrote:
>> >> > [...] is enough to get gfx1100 working for most purposes, on top of the
>> >> > patch that Tobias committed a week or so ago; there are still some test
>> >> > failures to investigate, and probably some tuning to do.
>> >> > 
>> >> > It might also get gfx1030 working too. @Richi, could you test it,
>> >> > please?
>> >> 
>> >> I can report partial success here.  [...]
>> 
>> >> I'll followup with a test summary once the (serial) run of libgomp
>> >> testing finished.
>> 
>> (Why serial, by the way?)
>
> Just out of caution ... (I'm using the GPU for the desktop at the
> same time and dmesg gets spammed with some not-so reassuring
> "errors" during the offloading)

Yeah, indeed 'dmesg' is full of "notes"...

However, note that per my work on <https://gcc.gnu.org/PR66005>
"libgomp make check time is excessive", all execution testing in libgomp
is serialized in 'libgomp/testsuite/lib/libgomp.exp:libgomp_load'.  So
running 'check-target-libgomp' in parallel makes no difference in that
regard.  (... with the caveat that execution tests for
effective-targets are *not* governed by that, as I found yesterday.
I have a WIP hack for that, too.)

>> [...] what I
>> got with '-march=gfx1100' for AMD Radeon RX 7900 XTX.  [...]

>> [...] execution test FAILs.  Not all FAILs appear all the time [...]

What disturbs the testing a lot is that the GPU may get into a bad
state, in which any use either fails with a
'HSA_STATUS_ERROR_OUT_OF_RESOURCES' error -- or just hangs, deep in
'libhsa-runtime64.so.1'...

I've now tried to debug the latter case (hang).  When the GPU gets into
this bad state (whatever exactly that is),
'hsa_executable_load_code_object' still returns 'HSA_STATUS_SUCCESS', but
then GCN target execution ('gcn-run') hangs in 'hsa_executable_freeze',
whereas GCN offloading execution ('libgomp-plugin-gcn.so.1') hangs right
before 'hsa_executable_freeze', in the GCN heap setup 'hsa_memory_copy'.
There it hangs until killed (for example, until DejaGnu's timeout
mechanism kills the process -- just that the next GPU-using execution
test then runs into the same thing again...).

In this state (and also the 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' state),
we're able to recover via:

    $ flock /tmp/gpu.lock sudo cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover
    0

This is, obviously, a hack, probably needs a serial lock to not disturb
other things, has hard-coded 'dri/0', and as I said in
<https://inbox.sourceware.org/87plww8qin.fsf@euler.schwinge.ddns.net>
"GCN RDNA2+ vs. GCC SLP vectorizer":

| I've no idea what
| 'amdgpu_gpu_recover' would do if the GPU is also used for display.

However, it's very useful in my testing.  :-|
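
For a test harness that wants to trigger the same recovery
programmatically, the command above could be mirrored roughly as
follows.  This is only a sketch under assumptions: 'gpu_recover' is a
hypothetical helper name, both paths are passed in by the caller rather
than hard-coded, and reading the debugfs file still needs the privileges
that 'sudo' provides in the shell version.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/file.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not part of GCC or libgomp): read RECOVER_PATH
   while holding an exclusive flock on LOCK_PATH, mirroring
   'flock /tmp/gpu.lock cat .../amdgpu_gpu_recover'.  Whatever the file
   returns is stored NUL-terminated in RESULT.  Returns 0 on success,
   -1 on failure.  */
static int
gpu_recover (const char *lock_path, const char *recover_path,
             char *result, size_t result_size)
{
  if (result_size == 0)
    return -1;

  int lock_fd = open (lock_path, O_RDWR | O_CREAT, 0666);
  if (lock_fd < 0)
    return -1;

  /* Serialize against other users of the same lock file.  */
  if (flock (lock_fd, LOCK_EX) != 0)
    {
      close (lock_fd);
      return -1;
    }

  int ret = -1;
  int fd = open (recover_path, O_RDONLY);
  if (fd >= 0)
    {
      /* On the real debugfs file, the read itself triggers recovery.  */
      ssize_t n = read (fd, result, result_size - 1);
      if (n >= 0)
        {
          result[n] = '\0';
          ret = 0;
        }
      close (fd);
    }

  flock (lock_fd, LOCK_UN);
  close (lock_fd);
  return ret;
}
```

A caller would pass '/tmp/gpu.lock' and
'/sys/kernel/debug/dri/0/amdgpu_gpu_recover' to reproduce the command
above; the caveats about 'dri/0' and display use apply unchanged.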

The question is, how to detect the "hang" state without first running
into a timeout (and disambiguating such a timeout from a user code
timeout)?  Add a watchdog: call 'alarm([a few seconds])' before device
initialization, and before the actual GPU kernel launch cancel it with
'alarm(0)'?  (..., and add a handler for 'SIGALRM' to print a distinct
error message that we can then react to, like for
'HSA_STATUS_ERROR_OUT_OF_RESOURCES'.)  Probably 'alarm'/'SIGALRM' is a
no-go in libgomp -- instead, use a helper thread to similarly implement a
watchdog?  ('libgomp/plugin/plugin-gcn.c' already is using pthreads for
other purposes.)  Any other clever ideas?  What's a suitable value for
"a few seconds"?

Grüße
 Thomas
