From: Thomas Schwinge
To: Richard Biener, Andrew Stubbs
Cc: Tobias Burnus, gcc-patches@gcc.gnu.org, Jakub Jelinek
Subject: Stabilizing flaky libgomp GCN target/offloading testing (was: libgomp GCN gfx1030/gfx1100 offloading status)
In-Reply-To: <7sn70594-70r4-q5pp-7q5p-qr865r9q53qn@fhfr.qr>
References: <20240124124304.1780645-1-ams@baylibre.com>
 <78875q15-qq2n-45o2-nooo-59r0s0ss9031@fhfr.qr>
 <56rn2n3n-n340-n6on-6prr-soqpr9r7083q@fhfr.qr>
 <878r44l00i.fsf@euler.schwinge.ddns.net>
 <7sn70594-70r4-q5pp-7q5p-qr865r9q53qn@fhfr.qr>
Date: Wed, 21 Feb 2024 13:34:01 +0100
Message-ID: <87il2ij8sm.fsf@euler.schwinge.ddns.net>

Hi!

On 2024-02-01T15:49:02+0100, Richard Biener wrote:
> On Thu, 1 Feb 2024, Thomas Schwinge wrote:
>> On 2024-01-26T10:45:10+0100, Richard Biener wrote:
>> > On Fri, 26 Jan 2024, Richard Biener wrote:
>> >> On Wed, 24 Jan 2024, Andrew Stubbs wrote:
>> >> > [...] is enough to get gfx1100 working for most purposes, on top of the
>> >> > patch that Tobias committed a week or so ago; there are still some test
>> >> > failures to investigate, and probably some tuning to do.
>> >> >
>> >> > It might also get gfx1030 working too. @Richi, could you test it,
>> >> > please?
>> >>
>> >> I can report partial success here.
>> >> [...]
>>
>> >> I'll followup with a test summary once the (serial) run of libgomp
>> >> testing finished.
>>
>> (Why serial, by the way?)
>
> Just out of caution ... (I'm using the GPU for the desktop at the
> same time and dmesg gets spammed with some not-so reassuring
> "errors" during the offloading)

Yeah, indeed 'dmesg' is full of "notes"...

However, note that per my work on "libgomp make check time is excessive",
all execution testing in libgomp is serialized in
'libgomp/testsuite/lib/libgomp.exp:libgomp_load'.  So there's no
problem/difference in that regard when running 'check-target-libgomp' in
parallel.  (... with the caveat that execution tests for
effective-targets are *not* governed by that, as I found yesterday.
I have a WIP hack for that, too.)

>> [...] what I
>> got with '-march=gfx1100' for AMD Radeon RX 7900 XTX.  [...]

>> [...] execution test FAILs.  Not all FAILs appear all the time [...]

What disturbs the testing a lot is that the GPU may get into a bad
state, upon which any use either fails with a
'HSA_STATUS_ERROR_OUT_OF_RESOURCES' error -- or just hangs, deep in
'libhsa-runtime64.so.1'...

I've now tried to debug the latter case (hang).  When the GPU gets into
this bad state (whatever exactly that is),
'hsa_executable_load_code_object' still returns 'HSA_STATUS_SUCCESS', but
then GCN target execution ('gcn-run') hangs in 'hsa_executable_freeze',
whereas GCN offloading execution ('libgomp-plugin-gcn.so.1') hangs right
before 'hsa_executable_freeze', in the GCN heap setup 'hsa_memory_copy'.
There it hangs until killed (for example, until DejaGnu's timeout
mechanism kills the process -- just that the next GPU-using execution
test then runs into the same thing again...).

In this state (and also the 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' state),
we're able to recover via:

    $ flock /tmp/gpu.lock sudo cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover
    0

This is, obviously, a hack: it probably needs a serial lock to not
disturb other things, has 'dri/0' hard-coded, and as I said in
"GCN RDNA2+ vs. GCC SLP vectorizer":

| I've no idea what
| 'amdgpu_gpu_recover' would do if the GPU is also used for display.

However, it's very useful in my testing.  :-|

The question is, how to detect the "hang" state without first running
into a timeout (and having to disambiguate such a timeout from a user
code timeout)?  Add a watchdog: call 'alarm([a few seconds])' before
device initialization, and cancel it with 'alarm(0)' before the actual
GPU kernel launch?  (..., and add a handler for 'SIGALRM' to print a
distinct error message that we can then react to, like for
'HSA_STATUS_ERROR_OUT_OF_RESOURCES'.)  Probably 'alarm'/'SIGALRM' is a
no-go in libgomp -- instead, use a helper thread to similarly implement a
watchdog?  ('libgomp/plugin/plugin-gcn.c' already is using pthreads for
other purposes.)  Any other clever ideas?  What's a suitable value for
"a few seconds"?


Grüße
 Thomas
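
To make the helper-thread watchdog idea above a bit more concrete, here
is a minimal sketch using only POSIX threads.  It is not code from
'libgomp/plugin/plugin-gcn.c': 'struct watchdog', 'watchdog_start',
'watchdog_stop' and the message text are all hypothetical names chosen
for illustration, and the timeout value is left to the caller because
the right number of seconds is exactly the open question.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical watchdog state; none of these names exist in libgomp.  */
struct watchdog
{
  pthread_t thread;
  pthread_mutex_t lock;
  pthread_cond_t cond;
  unsigned timeout_seconds;
  bool cancelled;   /* Set by watchdog_stop.  */
  bool fired;       /* Set by the helper thread if the deadline passed.  */
};

/* Helper thread: wait until cancelled or until the deadline passes.  */
void *
watchdog_main (void *arg)
{
  struct watchdog *w = arg;
  struct timespec deadline;

  /* A default condition variable measures against CLOCK_REALTIME; a real
     implementation might prefer a monotonic clock (assumption).  */
  clock_gettime (CLOCK_REALTIME, &deadline);
  deadline.tv_sec += w->timeout_seconds;

  pthread_mutex_lock (&w->lock);
  int rc = 0;
  while (!w->cancelled && rc == 0)
    rc = pthread_cond_timedwait (&w->cond, &w->lock, &deadline);
  if (!w->cancelled)
    {
      /* Distinct message that a test harness could react to.  */
      w->fired = true;
      fprintf (stderr, "watchdog: device setup exceeded %u s\n",
               w->timeout_seconds);
    }
  pthread_mutex_unlock (&w->lock);
  return NULL;
}

/* Arm the watchdog, for example before device initialization.
   Returns 0 on success, like pthread_create.  */
int
watchdog_start (struct watchdog *w, unsigned timeout_seconds)
{
  assert (timeout_seconds > 0);
  pthread_mutex_init (&w->lock, NULL);
  pthread_cond_init (&w->cond, NULL);
  w->timeout_seconds = timeout_seconds;
  w->cancelled = false;
  w->fired = false;
  return pthread_create (&w->thread, NULL, watchdog_main, w);
}

/* Disarm the watchdog, for example right before the kernel launch.
   Returns true if the deadline had already passed.  */
bool
watchdog_stop (struct watchdog *w)
{
  pthread_mutex_lock (&w->lock);
  w->cancelled = true;
  pthread_cond_signal (&w->cond);
  pthread_mutex_unlock (&w->lock);
  pthread_join (w->thread, NULL);

  bool fired = w->fired;
  pthread_cond_destroy (&w->cond);
  pthread_mutex_destroy (&w->lock);
  return fired;
}
```

A caller would wrap the setup phase in 'watchdog_start (&w, N)' and
'watchdog_stop (&w)'.  In the hang described above the main thread never
gets as far as 'watchdog_stop', so a real version would presumably also
have to terminate the process from the helper thread after printing the
message; the sketch leaves that policy open.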