Date: Wed, 6 Mar 2024 14:29:08 +0100 (CET)
From: Richard Biener
To: Andrew Stubbs
cc: Thomas Schwinge, Tobias Burnus, gcc-patches@gcc.gnu.org, Jakub Jelinek
Subject: Re: Stabilize flaky GCN target/offloading testing
In-Reply-To: <3508fe1e-63d3-4bde-9b19-6a531d6eebfe@baylibre.com>
References: <87il2ij8sm.fsf@euler.schwinge.ddns.net> <87il1z7e9m.fsf@euler.schwinge.ddns.net> <3508fe1e-63d3-4bde-9b19-6a531d6eebfe@baylibre.com>

On Wed, 6 Mar 2024, Andrew Stubbs wrote:

> On 06/03/2024 12:09, Thomas Schwinge wrote:
> > Hi!
> >
> > On 2024-02-21T17:32:13+0100, Richard Biener wrote:
> >> On 21.02.2024 at 13:34, Thomas Schwinge wrote:
> >>> [...] per my work on
> >>> "libgomp make check time is excessive", all execution testing in libgomp
> >>> is serialized in 'libgomp/testsuite/lib/libgomp.exp:libgomp_load'. [...]
> >>> (... with the caveat that execution tests for
> >>> effective-targets are *not* governed by that, as I've found yesterday.
> >>> I have a WIP hack for that, too.)
> >
> >>> What disturbs the testing a lot is, that the GPU may get into a bad
> >>> state, upon which any use either fails with a
> >>> 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' error -- or by just hanging, deep in
> >>> 'libhsa-runtime64.so.1'...
> >>>
> >>> I've now tried to debug the latter case (hang). When the GPU gets into
> >>> this bad state (whatever exactly that is),
> >>> 'hsa_executable_load_code_object' still returns 'HSA_STATUS_SUCCESS', but
> >>> then GCN target execution ('gcn-run') hangs in 'hsa_executable_freeze'
> >>> vs. GCN offloading execution ('libgomp-plugin-gcn.so.1') hangs right
> >>> before 'hsa_executable_freeze', in the GCN heap setup 'hsa_memory_copy'.
> >>> There it hangs until killed (for example, until DejaGnu's timeout
> >>> mechanism kills the process -- just that the next GPU-using execution
> >>> test then runs into the same thing again...).
> >>>
> >>> In this state (and also the 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' state),
> >>> we're able to recover via:
> >>>
> >>>     $ flock /tmp/gpu.lock sudo cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover
> >>>     0
> >
> > At least most of the time. I've found that -- sometimes... ;-( -- if
> > you run into 'HSA_STATUS_ERROR_OUT_OF_RESOURCES', then do
> > 'amdgpu_gpu_recover', and then immediately re-execute, you'll again run
> > into 'HSA_STATUS_ERROR_OUT_OF_RESOURCES'. That appears to be avoidable
> > by injecting some artificial "cool-down period"... (The latter I've not
> > yet tested extensively.)
> >
> >>> This is, obviously, a hack, probably needs a serial lock to not disturb
> >>> other things, has hard-coded 'dri/0', and as I said in
> >>>
> >>> "GCN RDNA2+ vs. GCC SLP vectorizer":
> >>>
> >>> | I've no idea what
> >>> | 'amdgpu_gpu_recover' would do if the GPU is also used for display.
> >>
> >> It ends up terminating your X session?
> >
> > Eh.... ;'-|
> >
> >> (there's some automatic driver recovery that's also sometimes triggered
> >> which sounds like the same thing).
> >
> >> I need to try using the integrated graphics for X11 to see if that avoids
> >> the issue.
> >
> > A few years ago, I tried that for an Nvidia GPU laptop, and -- if I now
> > remember correctly -- basically got it to work, via hand-editing
> > '/etc/X11/xorg.conf' and all that... But: I couldn't get external HDMI
> > to work in that setup, and therefore reverted to "standard".
> >
> >> Guess AMD needs to improve the driver/runtime (or we - it's open source at
> >> least up to the firmware).
> >
> >>> However, it's very useful in my testing. :-|
> >>>
> >>> The question is, how to detect the "hang" state without first running
> >>> into a timeout (and disambiguating such a timeout from a user code
> >>> timeout)? Add a watchdog: call 'alarm([a few seconds])' before device
> >>> initialization, and before the actual GPU kernel launch cancel it with
> >>> 'alarm(0)'? (..., and add a handler for 'SIGALRM' to print a distinct
> >>> error message that we can then react on, like for
> >>> 'HSA_STATUS_ERROR_OUT_OF_RESOURCES'.) Probably 'alarm'/'SIGALRM' is a
> >>> no-go in libgomp -- instead, use a helper thread to similarly implement a
> >>> watchdog? ('libgomp/plugin/plugin-gcn.c' already is using pthreads for
> >>> other purposes.) Any other clever ideas? What's a suitable value for
> >>> "a few seconds"?
> >
> > I'm attaching my current "GCN: Watchdog for device image load", covering
> > both 'gcc/config/gcn/gcn-run.cc' and 'libgomp/plugin/plugin-gcn.c'.
> > (That's using 'timer_create' etc. instead of 'alarm'/'SIGALRM'.)
> >
> > That, plus routing *all* potential GPU usage (in particular: including
> > execution tests for effective-targets, see above) through a serial lock
> > ('flock', implemented in DejaGnu board file, outside of the
> > "DejaGnu timeout domain", similar to
> > 'libgomp/testsuite/lib/libgomp.exp:libgomp_load', see above), plus
> > catching 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' (both the "real" ones and
> > the "fake" ones via "GCN: Watchdog for device image load") and in that
> > case 'amdgpu_gpu_recover' and re-execution of the respective executable,
> > does greatly stabilize flaky GCN target/offloading testing.
> >
> > Do we have consensus to move forward with this approach, generally?
>
> I've also observed a number of random hangs in host-side code outside our
> control, but after the kernel has exited. In general this watchdog approach
> might help with these. I do feel like it's "papering over the cracks", but if
> we can't fix it.... at the end of the day it's just a little extra code.

I wonder if you maybe have contacts at AMD who are willing to debug
this and improve the driver side of it? I'm seeing quite a number of
similar reports for the issue I hit in the github tracker, multiple
years old and also current, so that doesn't seem to be a good way to
get things fixed ...

Richard.

> My only concern is that it might actually cause failures, perhaps on heavily
> loaded systems, or with network filesystems, or during debugging.
>
> Andrew
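
For reference, the watchdog idea quoted above (arm a timer before device initialization, cancel it before the actual kernel launch, and report a distinct error on expiry) can be sketched roughly as follows. This is only an illustration of the control flow, written in Python for brevity; it is not the attached patch, which is C and uses 'timer_create'. The names 'run_with_watchdog', 'DeviceLoadTimeout', 'load_image' and 'launch_kernel' are hypothetical, and the sketch assumes a Unix host and that it runs in the main thread.

```python
import signal
import time


class DeviceLoadTimeout(Exception):
    """Distinct error for 'device image load hung' (hypothetical name)."""


def _on_alarm(signum, frame):
    # Raised from the signal handler so the blocked load call is interrupted.
    raise DeviceLoadTimeout("watchdog expired during device image load")


def run_with_watchdog(load_image, launch_kernel, timeout_s=5.0):
    """Guard only the load phase; user code in the kernel is not timed.

    load_image and launch_kernel are hypothetical callables standing in
    for device initialization and the actual GPU kernel launch.
    """
    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, timeout_s)  # arm the watchdog
    try:
        load_image()
    finally:
        # Disarm before the kernel launch and restore the old handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous)
    return launch_kernel()
```

Because the timer is cancelled before 'launch_kernel' runs, a slow user kernel is left to the normal test timeout, while a hang during load surfaces as 'DeviceLoadTimeout', which a harness could treat like 'HSA_STATUS_ERROR_OUT_OF_RESOURCES'. The choice of 'timeout_s' is exactly where the concern about heavily loaded systems, network filesystems, or debugging applies.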