Re: Stabilize flaky GCN target/offloading testing

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: Andrew Stubbs <ams@baylibre.com>
To: Thomas Schwinge <tschwinge@baylibre.com>,
	Richard Biener <rguenther@suse.de>
Cc: Tobias Burnus <tburnus@baylibre.com>,
	gcc-patches@gcc.gnu.org, Jakub Jelinek <jakub@redhat.com>
Subject: Re: Stabilize flaky GCN target/offloading testing
Date: Wed, 6 Mar 2024 12:39:25 +0000	[thread overview]
Message-ID: <3508fe1e-63d3-4bde-9b19-6a531d6eebfe@baylibre.com> (raw)
In-Reply-To: <87il1z7e9m.fsf@euler.schwinge.ddns.net>

On 06/03/2024 12:09, Thomas Schwinge wrote:
> Hi!
> 
> On 2024-02-21T17:32:13+0100, Richard Biener <rguenther@suse.de> wrote:
>> Am 21.02.2024 um 13:34 schrieb Thomas Schwinge <tschwinge@baylibre.com>:
>>> [...] per my work on <https://gcc.gnu.org/PR66005>
>>> "libgomp make check time is excessive", all execution testing in libgomp
>>> is serialized in 'libgomp/testsuite/lib/libgomp.exp:libgomp_load'.  [...]
>>> (... with the caveat that execution tests for
>>> effective-targets are *not* governed by that, as I've found yesterday.
>>> I have a WIP hack for that, too.)
> 
>>> What disturbs the testing a lot is, that the GPU may get into a bad
>>> state, upon which any use either fails with a
>>> 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' error -- or by just hanging, deep in
>>> 'libhsa-runtime64.so.1'...
>>>
>>> I've now tried to debug the latter case (hang).  When the GPU gets into
>>> this bad state (whatever exactly that is),
>>> 'hsa_executable_load_code_object' still returns 'HSA_STATUS_SUCCESS', but
>>> then GCN target execution ('gcn-run') hangs in 'hsa_executable_freeze'
>>> vs. GCN offloading execution ('libgomp-plugin-gcn.so.1') hangs right
>>> before 'hsa_executable_freeze', in the GCN heap setup 'hsa_memory_copy'.
>>> There it hangs until killed (for example, until DejaGnu's timeout
>>> mechanism kills the process -- just that the next GPU-using execution
>>> test then runs into the same thing again...).
>>>
>>> In this state (and also the 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' state),
>>> we're able to recover via:
>>>
>>>     $ flock /tmp/gpu.lock sudo cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover
>>>     0
> 
> At least most of the times.  I've found that -- sometimes... ;-( -- if
> you run into 'HSA_STATUS_ERROR_OUT_OF_RESOURCES', then do
> 'amdgpu_gpu_recover', and then immediately re-execute, you'll again run
> into 'HSA_STATUS_ERROR_OUT_OF_RESOURCES'.  That appears to be avoidable
> by injecting some artificial "cool-down period"...  (The latter I've not
> yet tested extensively.)
> 
>>> This is, obviously, a hack, probably needs a serial lock to not disturb
>>> other things, has hard-coded 'dri/0', and as I said in
>>> <https://inbox.sourceware.org/87plww8qin.fsf@euler.schwinge.ddns.net>
>>> "GCN RDNA2+ vs. GCC SLP vectorizer":
>>>
>>> | I've no idea what
>>> | 'amdgpu_gpu_recover' would do if the GPU is also used for display.
>>
>> It ends up terminating your X session…
> 
> Eh....  ;'-|
> 
>> (there’s some automatic driver recovery that’s also sometimes triggered which sounds like the same thing).
> 
>> I need to try using the integrated graphics for X11 to see if that avoids the issue.
> 
> A few years ago, I tried that for a Nvidia GPU laptop, and -- if I now
> remember correctly -- basically got it to work, via hand-editing
> '/etc/X11/xorg.conf' and all that...  But: I couldn't get external HDMI
> to work in that setup, and therefore reverted to "standard".
> 
>> Guess AMD needs to improve the driver/runtime (or we - it’s open source at least up to the firmware).
> 
>>> However, it's very useful in my testing.  :-|
>>>
>>> The questions is, how to detect the "hang" state without first running
>>> into a timeout (and disambiguating such a timeout from a user code
>>> timeout)?  Add a watchdog: call 'alarm([a few seconds])' before device
>>> initialization, and before the actual GPU kernel launch cancel it with
>>> 'alarm(0)'?  (..., and add a handler for 'SIGALRM' to print a distinct
>>> error message that we can then react on, like for
>>> 'HSA_STATUS_ERROR_OUT_OF_RESOURCES'.)  Probably 'alarm'/'SIGALRM' is a
>>> no-go in libgomp -- instead, use a helper thread to similarly implement a
>>> watchdog?  ('libgomp/plugin/plugin-gcn.c' already is using pthreads for
>>> other purposes.)  Any other clever ideas?  What's a suitable value for
>>> "a few seconds"?
> 
> I'm attaching my current "GCN: Watchdog for device image load", covering
> both 'gcc/config/gcn/gcn-run.cc' and 'libgomp/plugin/plugin-gcn.c'.
> (That's using 'timer_create' etc. instead of 'alarm'/'SIGALRM'. )
> 
> That, plus routing *all* potential GPU usage (in particular: including
> execution tests for effective-targets, see above) through a serial lock
> ('flock', implemented in DejaGnu board file, outside of the the
> "DejaGnu timeout domain", similar to
> 'libgomp/testsuite/lib/libgomp.exp:libgomp_load', see above), plus
> catching 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' (both the "real" ones and
> the "fake" ones via "GCN: Watchdog for device image load") and in that
> case 'amdgpu_gpu_recover' and re-execution of the respective executable,
> does greatly stabilize flaky GCN target/offloading testing.
> 
> Do we have consensus to move forward with this approach, generally?

I've also observed a number of random hangs in host-side code outside 
our control, but after the kernel has exited. In general this watchdog 
approach might help with these. I do feel like it's "papering over the 
cracks", but if we can't fix it.... at the end of the day it's just a 
little extra code.

My only concern is that it might actually cause failures, perhaps on 
heavily loaded systems, or with network filesystems, or during debugging.

Andrew

next prev parent reply	other threads:[~2024-03-06 12:39 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-01-24 12:43 [PATCH] amdgcn: additional gfx1100 support Andrew Stubbs
2024-01-24 16:01 ` [patch] amdgcn: config.gcc - enable gfx1100 multilib; add gfx1100 to docs (was: [PATCH] amdgcn: additional gfx1100 support) Tobias Burnus
2024-01-26 12:26   ` [patch] amdgcn: config.gcc - enable gfx1030 and gfx1100 multilib; add them to the docs (was: [patch] amdgcn: config.gcc - enable gfx1100 multilib; add gfx1100 to docs) Tobias Burnus
2024-01-26 12:32     ` [patch] amdgcn: config.gcc - enable gfx1030 and gfx1100 multilib; add them to the docs Tobias Burnus
2024-01-26 12:40       ` Richard Biener
2024-01-26 12:59         ` Tobias Burnus
2024-01-26 16:21       ` Thomas Schwinge
2024-01-26 16:36         ` Richard Biener
2024-01-26 16:45         ` [patch] install.texi: For gcn, recommend LLVM 15, unless gfx1100 is disabled (was: [patch] amdgcn: config.gcc - enable gfx1030 and gfx1100 multilib; add them to the docs) Tobias Burnus
2024-01-29 10:01           ` [patch] install.texi: For gcn, recommend LLVM 15, unless gfx1100 is disabled Andrew Stubbs
2024-01-26  8:56 ` [PATCH] amdgcn: additional gfx1100 support Richard Biener
2024-01-26  9:45   ` Richard Biener
2024-01-26 10:19     ` Andrew Stubbs
2024-01-26 10:22       ` Richard Biener
2024-01-26 10:31         ` Andrew Stubbs
2024-02-01 14:41     ` libgomp GCN gfx1030/gfx1100 offloading status (was: [PATCH] amdgcn: additional gfx1100 support) Thomas Schwinge
2024-02-01 14:49       ` Richard Biener
2024-02-21 12:34         ` Stabilizing flaky libgomp GCN target/offloading testing (was: libgomp GCN gfx1030/gfx1100 offloading status) Thomas Schwinge
2024-02-21 16:32           ` Richard Biener
2024-03-06 12:09             ` Stabilize flaky GCN target/offloading testing Thomas Schwinge
2024-03-06 12:39               ` Andrew Stubbs [this message]
2024-03-06 13:29                 ` Richard Biener
2024-03-08 10:34           ` GCN, nvptx: Errors during device probing are fatal (was: Stabilizing flaky libgomp GCN target/offloading testing) Thomas Schwinge
2024-03-06 13:49 ` amdgcn: additional gfx1030/gfx1100 support: adjust test cases (was: [PATCH] amdgcn: additional gfx1100 support) Thomas Schwinge
2024-03-06 14:03   ` amdgcn: additional gfx1030/gfx1100 support: adjust test cases Andrew Stubbs

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3508fe1e-63d3-4bde-9b19-6a531d6eebfe@baylibre.com \
    --to=ams@baylibre.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=jakub@redhat.com \
    --cc=rguenther@suse.de \
    --cc=tburnus@baylibre.com \
    --cc=tschwinge@baylibre.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).