From: Richard Biener
Subject: Re: Stabilizing flaky libgomp GCN target/offloading testing (was: libgomp GCN gfx1030/gfx1100 offloading status)
Date: Wed, 21 Feb 2024 17:32:13 +0100
To: Thomas Schwinge
Cc: Andrew Stubbs, Tobias Burnus, gcc-patches@gcc.gnu.org, Jakub Jelinek
In-Reply-To: <87il2ij8sm.fsf@euler.schwinge.ddns.net>
References: <87il2ij8sm.fsf@euler.schwinge.ddns.net>

> On 21.02.2024 at 13:34, Thomas Schwinge wrote:
>
> Hi!
>
>> On 2024-02-01T15:49:02+0100, Richard Biener wrote:
>>> On Thu, 1 Feb 2024, Thomas Schwinge wrote:
>>> On 2024-01-26T10:45:10+0100, Richard Biener wrote:
>>>> On Fri, 26 Jan 2024, Richard Biener wrote:
>>>>> On Wed, 24 Jan 2024, Andrew Stubbs wrote:
>>>>>> [...] is enough to get gfx1100 working for most purposes, on top of the
>>>>>> patch that Tobias committed a week or so ago; there are still some test
>>>>>> failures to investigate, and probably some tuning to do.
>>>>>>
>>>>>> It might also get gfx1030 working too. @Richi, could you test it,
>>>>>> please?
>>>>>
>>>>> I can report partial success here. [...]
>>>
>>>>> I'll followup with a test summary once the (serial) run of libgomp
>>>>> testing finished.
>>>
>>> (Why serial, by the way?)
>>
>> Just out of caution ... (I'm using the GPU for the desktop at the
>> same time and dmesg gets spammed with some not-so reassuring
>> "errors" during the offloading)
>
> Yeah, indeed 'dmesg' is full of "notes"...
>
> However, note that per my work on
> "libgomp make check time is excessive", all execution testing in libgomp
> is serialized in 'libgomp/testsuite/lib/libgomp.exp:libgomp_load'. So,
> no problem/difference in that regard, to run parallel
> 'check-target-libgomp'. (... with the caveat that execution tests for
> effective-targets are *not* governed by that, as I've found yesterday.
> I have a WIP hack for that, too.)
>
>
>>> [...] what I
>>> got with '-march=gfx1100' for AMD Radeon RX 7900 XTX. [...]
>
>>> [...] execution test FAILs. Not all FAILs appear all the time [...]
>
> What disturbs the testing a lot is, that the GPU may get into a bad
> state, upon which any use either fails with a
> 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' error -- or by just hanging, deep in
> 'libhsa-runtime64.so.1'...
>
> I've now tried to debug the latter case (hang).
> When the GPU gets into this bad state (whatever exactly that is),
> 'hsa_executable_load_code_object' still returns 'HSA_STATUS_SUCCESS', but
> then GCN target execution ('gcn-run') hangs in 'hsa_executable_freeze'
> vs. GCN offloading execution ('libgomp-plugin-gcn.so.1') hangs right
> before 'hsa_executable_freeze', in the GCN heap setup 'hsa_memory_copy'.
> There it hangs until killed (for example, until DejaGnu's timeout
> mechanism kills the process -- just that the next GPU-using execution
> test then runs into the same thing again...).
>
> In this state (and also the 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' state),
> we're able to recover via:
>
>     $ flock /tmp/gpu.lock sudo cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover
>     0
>
> This is, obviously, a hack, probably needs a serial lock to not disturb
> other things, has hard-coded 'dri/0', and as I said in
> "GCN RDNA2+ vs. GCC SLP vectorizer":
>
> | I've no idea what
> | 'amdgpu_gpu_recover' would do if the GPU is also used for display.

It ends up terminating your X session… (There's also some automatic
driver recovery that is sometimes triggered, which sounds like the same
thing.) I need to try using the integrated graphics for X11 to see if
that avoids the issue.

Guess AMD needs to improve the driver/runtime (or we do; it's open
source, at least up to the firmware).

Richard

> However, it's very useful in my testing. :-|
>
> The question is, how to detect the "hang" state without first running
> into a timeout (and disambiguating such a timeout from a user code
> timeout)? Add a watchdog: call 'alarm([a few seconds])' before device
> initialization, and before the actual GPU kernel launch cancel it with
> 'alarm(0)'? (..., and add a handler for 'SIGALRM' to print a distinct
> error message that we can then react on, like for
> 'HSA_STATUS_ERROR_OUT_OF_RESOURCES'.)
> Probably 'alarm'/'SIGALRM' is a no-go in libgomp -- instead, use a
> helper thread to similarly implement a watchdog?
> ('libgomp/plugin/plugin-gcn.c' already is using pthreads for
> other purposes.) Any other clever ideas? What's a suitable value for
> "a few seconds"?
>
>
> Regards
> Thomas
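
For reference, the helper-thread watchdog suggested above could look
roughly like the following. This is only a minimal sketch under stated
assumptions, not code from 'libgomp/plugin/plugin-gcn.c': the names
'struct watchdog', 'watchdog_start' and 'watchdog_cancel' are
hypothetical, the timeout is left to the caller (a suitable value is
still an open question), and the expiry action is reduced to a message
on stderr plus a flag.

```c
/* Hypothetical sketch, not libgomp code: a pthread-based watchdog that
   flags a stalled initialization phase without using alarm/SIGALRM.  */
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

struct watchdog
{
  pthread_t thread;
  pthread_mutex_t lock;
  pthread_cond_t cond;
  struct timespec deadline;  /* Absolute CLOCK_REALTIME deadline.  */
  bool cancelled;            /* Set by watchdog_cancel.  */
  bool expired;              /* Set by the helper thread on timeout.  */
};

static void *
watchdog_main (void *arg)
{
  struct watchdog *w = arg;
  int err = 0;

  pthread_mutex_lock (&w->lock);
  /* Loop to tolerate spurious wakeups; stop on cancel or timeout.  */
  while (!w->cancelled && err != ETIMEDOUT)
    err = pthread_cond_timedwait (&w->cond, &w->lock, &w->deadline);
  if (!w->cancelled)
    {
      w->expired = true;
      /* Assumption: a real plugin would print a distinct message here
         (and probably abort), so that the test harness can react to it
         as it does for HSA_STATUS_ERROR_OUT_OF_RESOURCES.  */
      fprintf (stderr, "watchdog: device initialization timed out\n");
    }
  pthread_mutex_unlock (&w->lock);
  return NULL;
}

/* Arm the watchdog to fire TIMEOUT_MS milliseconds from now.
   Returns 0 on success, or the error number from pthread_create.  */
static int
watchdog_start (struct watchdog *w, long timeout_ms)
{
  assert (timeout_ms >= 0);
  pthread_mutex_init (&w->lock, NULL);
  pthread_cond_init (&w->cond, NULL);
  w->cancelled = false;
  w->expired = false;
  clock_gettime (CLOCK_REALTIME, &w->deadline);
  w->deadline.tv_sec += timeout_ms / 1000;
  w->deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
  if (w->deadline.tv_nsec >= 1000000000L)
    {
      w->deadline.tv_sec++;
      w->deadline.tv_nsec -= 1000000000L;
    }
  return pthread_create (&w->thread, NULL, watchdog_main, w);
}

/* Disarm the watchdog (the counterpart of 'alarm (0)') and join the
   helper thread.  Returns true if the timeout had already expired.  */
static bool
watchdog_cancel (struct watchdog *w)
{
  bool expired;

  pthread_mutex_lock (&w->lock);
  w->cancelled = true;
  pthread_cond_signal (&w->cond);
  pthread_mutex_unlock (&w->lock);
  pthread_join (w->thread, NULL);
  expired = w->expired;
  pthread_cond_destroy (&w->cond);
  pthread_mutex_destroy (&w->lock);
  return expired;
}
```

The intended use would be to call 'watchdog_start' before device
initialization and 'watchdog_cancel' right before the actual GPU kernel
launch. Note that if the initializing thread is truly stuck inside the
HSA runtime, 'watchdog_cancel' is never reached, so in practice the
helper thread itself would have to report and terminate the process.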