public inbox for gcc-patches@gcc.gnu.org
From: Andrew Stubbs <ams@baylibre.com>
To: Thomas Schwinge <tschwinge@baylibre.com>,
	Richard Biener <rguenther@suse.de>
Cc: Tobias Burnus <tburnus@baylibre.com>,
	gcc-patches@gcc.gnu.org, Jakub Jelinek <jakub@redhat.com>
Subject: Re: Stabilize flaky GCN target/offloading testing
Date: Wed, 6 Mar 2024 12:39:25 +0000
Message-ID: <3508fe1e-63d3-4bde-9b19-6a531d6eebfe@baylibre.com>
In-Reply-To: <87il1z7e9m.fsf@euler.schwinge.ddns.net>

On 06/03/2024 12:09, Thomas Schwinge wrote:
> Hi!
> 
> On 2024-02-21T17:32:13+0100, Richard Biener <rguenther@suse.de> wrote:
>> On 21.02.2024 at 13:34, Thomas Schwinge <tschwinge@baylibre.com> wrote:
>>> [...] per my work on <https://gcc.gnu.org/PR66005>
>>> "libgomp make check time is excessive", all execution testing in libgomp
>>> is serialized in 'libgomp/testsuite/lib/libgomp.exp:libgomp_load'.  [...]
>>> (... with the caveat that execution tests for
>>> effective-targets are *not* governed by that, as I've found yesterday.
>>> I have a WIP hack for that, too.)
> 
>>> What disturbs the testing a lot is, that the GPU may get into a bad
>>> state, upon which any use either fails with a
>>> 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' error -- or by just hanging, deep in
>>> 'libhsa-runtime64.so.1'...
>>>
>>> I've now tried to debug the latter case (hang).  When the GPU gets into
>>> this bad state (whatever exactly that is),
>>> 'hsa_executable_load_code_object' still returns 'HSA_STATUS_SUCCESS', but
>>> then GCN target execution ('gcn-run') hangs in 'hsa_executable_freeze'
>>> vs. GCN offloading execution ('libgomp-plugin-gcn.so.1') hangs right
>>> before 'hsa_executable_freeze', in the GCN heap setup 'hsa_memory_copy'.
>>> There it hangs until killed (for example, until DejaGnu's timeout
>>> mechanism kills the process -- just that the next GPU-using execution
>>> test then runs into the same thing again...).
>>>
>>> In this state (and also the 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' state),
>>> we're able to recover via:
>>>
>>>     $ flock /tmp/gpu.lock sudo cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover
>>>     0
> 
> At least most of the time.  I've found that -- sometimes... ;-( -- if
> you run into 'HSA_STATUS_ERROR_OUT_OF_RESOURCES', then do
> 'amdgpu_gpu_recover', and then immediately re-execute, you'll again run
> into 'HSA_STATUS_ERROR_OUT_OF_RESOURCES'.  That appears to be avoidable
> by injecting some artificial "cool-down period"...  (The latter I've not
> yet tested extensively.)
> 
>>> This is, obviously, a hack, probably needs a serial lock to not disturb
>>> other things, has hard-coded 'dri/0', and as I said in
>>> <https://inbox.sourceware.org/87plww8qin.fsf@euler.schwinge.ddns.net>
>>> "GCN RDNA2+ vs. GCC SLP vectorizer":
>>>
>>> | I've no idea what
>>> | 'amdgpu_gpu_recover' would do if the GPU is also used for display.
>>
>> It ends up terminating your X session…
> 
> Eh....  ;'-|
> 
>> (there’s some automatic driver recovery that’s also sometimes triggered which sounds like the same thing).
> 
>> I need to try using the integrated graphics for X11 to see if that avoids the issue.
> 
> A few years ago, I tried that for a Nvidia GPU laptop, and -- if I now
> remember correctly -- basically got it to work, via hand-editing
> '/etc/X11/xorg.conf' and all that...  But: I couldn't get external HDMI
> to work in that setup, and therefore reverted to "standard".
> 
>> Guess AMD needs to improve the driver/runtime (or we - it’s open source at least up to the firmware).
> 
>>> However, it's very useful in my testing.  :-|
>>>
>>> The question is, how to detect the "hang" state without first running
>>> into a timeout (and disambiguating such a timeout from a user code
>>> timeout)?  Add a watchdog: call 'alarm([a few seconds])' before device
>>> initialization, and before the actual GPU kernel launch cancel it with
>>> 'alarm(0)'?  (..., and add a handler for 'SIGALRM' to print a distinct
>>> error message that we can then react on, like for
>>> 'HSA_STATUS_ERROR_OUT_OF_RESOURCES'.)  Probably 'alarm'/'SIGALRM' is a
>>> no-go in libgomp -- instead, use a helper thread to similarly implement a
>>> watchdog?  ('libgomp/plugin/plugin-gcn.c' already is using pthreads for
>>> other purposes.)  Any other clever ideas?  What's a suitable value for
>>> "a few seconds"?
> 
> I'm attaching my current "GCN: Watchdog for device image load", covering
> both 'gcc/config/gcn/gcn-run.cc' and 'libgomp/plugin/plugin-gcn.c'.
> (That's using 'timer_create' etc. instead of 'alarm'/'SIGALRM'.)
> 
> That, plus routing *all* potential GPU usage (in particular: including
> execution tests for effective-targets, see above) through a serial lock
> ('flock', implemented in DejaGnu board file, outside of the
> "DejaGnu timeout domain", similar to
> 'libgomp/testsuite/lib/libgomp.exp:libgomp_load', see above), plus
> catching 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' (both the "real" ones and
> the "fake" ones via "GCN: Watchdog for device image load") and in that
> case 'amdgpu_gpu_recover' and re-execution of the respective executable,
> does greatly stabilize flaky GCN target/offloading testing.
> 
> Do we have consensus to move forward with this approach, generally?

I've also observed a number of random hangs in host-side code outside 
our control, but after the kernel has exited. In general this watchdog 
approach might help with these. I do feel like it's "papering over the 
cracks", but if we can't fix it.... at the end of the day it's just a 
little extra code.

My only concern is that it might actually cause failures, perhaps on 
heavily loaded systems, or with network filesystems, or during debugging.
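
For discussion, the recover-and-re-execute step from the quoted proposal,
including the artificial cool-down period, could be bounded roughly like this.
It is only a sketch in C with hypothetical names; the real logic lives in the
DejaGnu board file, and 'fake_run'/'fake_recover' merely stand in for running
the test executable and for 'amdgpu_gpu_recover'.

```c
#include <assert.h>
#include <unistd.h>

enum run_status { RUN_OK, RUN_FAILED, RUN_OUT_OF_RESOURCES };

typedef enum run_status (*run_fn)(void *ctx);
typedef int (*recover_fn)(void *ctx);

/* Run once; while the result is "out of resources" (real, or faked by the
   watchdog), recover the GPU, wait for the cool-down period, and re-execute,
   at most max_retries times.  Other failures are returned unchanged. */
static enum run_status
run_with_recovery(run_fn run, recover_fn recover, void *ctx,
                  int max_retries, unsigned cooldown_seconds)
{
  assert(run != NULL && recover != NULL);

  enum run_status st = run(ctx);
  for (int attempt = 0;
       st == RUN_OUT_OF_RESOURCES && attempt < max_retries;
       attempt++)
    {
      if (recover(ctx) != 0)
        break;                   /* recovery itself failed: give up */
      sleep(cooldown_seconds);   /* artificial cool-down before retrying */
      st = run(ctx);
    }
  return st;
}

/* Hypothetical stand-ins for illustration only. */
struct fake_gpu { int bad_runs_left; int recoveries; };

static enum run_status fake_run(void *ctx)
{
  struct fake_gpu *gpu = ctx;
  if (gpu->bad_runs_left > 0)
    {
      gpu->bad_runs_left--;
      return RUN_OUT_OF_RESOURCES;
    }
  return RUN_OK;
}

static int fake_recover(void *ctx)
{
  struct fake_gpu *gpu = ctx;
  gpu->recoveries++;
  return 0;
}
```

Capping the number of retries keeps a permanently broken GPU from stalling the
whole test run, at the cost of reporting a failure that one more recovery
might have avoided.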

Andrew
