From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm1-f43.google.com (mail-wm1-f43.google.com [209.85.128.43]) by sourceware.org (Postfix) with ESMTPS id 33E2D3858CDA for ; Fri, 28 Jul 2023 18:01:12 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 33E2D3858CDA Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=palves.net Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com Received: by mail-wm1-f43.google.com with SMTP id 5b1f17b1804b1-3fb4146e8ceso24316615e9.0 for ; Fri, 28 Jul 2023 11:01:12 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1690567271; x=1691172071; h=content-transfer-encoding:in-reply-to:from:references:cc:to :content-language:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=psLZmW4jW1AAE37ovLS9TCN1hIRBrOhxcsjZ5SkLTsE=; b=ICn3xqz7MjrVNJSMlmxA9ObH52CrYwCl3kgC8Qtxckv3UGwoTUjCy/y2AP+hJkaO2v d4uIIZXK6WL8DY1WDO/2kQqjwpqaBp4keIpIAlRKsVscQm9KbRhT8+9z47q248lFBXTC vA6NjtzQv/DaNoApIazYzUsYHnOB7XgkYFW/ge8XhfInevRAJdBup9ROCc0JVGXyTVRd jvsi9rqeQCNS8aTRVEUmaQQvOYVoXGsR0c/NWy7LbSsAF7MA9fIIiDr6qRli2D6umiIG 9CIMTFKMLGrRxJDL7JdxMfIPeT6VwD4eYWu+cUhndZdMpghUjFhUPoF2LFE5ulQlKuYg LqCw== X-Gm-Message-State: ABy/qLaFeN13hHCl4CpxbwDiyVkeR6PaNZULUHD2s3qezcneiEbo7km8 9dIlbXAn7LM613IJX3sgzdqDT1e6yJw= X-Google-Smtp-Source: APBJJlGIjTGVjO7HGfmSj9Z7i9To236ICdrSRESXKCsQKhpGJLpw6S1GJR/LNNkkfLdMEB7sSHuJmg== X-Received: by 2002:a1c:4c11:0:b0:3fe:90e:59d8 with SMTP id z17-20020a1c4c11000000b003fe090e59d8mr2166321wmf.38.1690567270767; Fri, 28 Jul 2023 11:01:10 -0700 (PDT) Received: from ?IPV6:2001:8a0:f922:de00:5b94:75a9:c970:55df? ([2001:8a0:f922:de00:5b94:75a9:c970:55df]) by smtp.gmail.com with ESMTPSA id n7-20020a1c7207000000b003fbb8c7c799sm7560728wmc.30.2023.07.28.11.01.09 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Fri, 28 Jul 2023 11:01:10 -0700 (PDT) Message-ID: <0bffd0c1-1aeb-a401-2509-c96905ba2e28@palves.net> Date: Fri, 28 Jul 2023 19:01:08 +0100 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.13.0 Subject: Re: [PATCH 2/2] gdb/amdgpu: Fix debugging multiple inferiors using the ROCm runtime Content-Language: en-US To: Lancelot Six , gdb-patches@sourceware.org Cc: lsix@lancelotsix.com References: <20230630145755.6500-1-lancelot.six@amd.com> <20230630145755.6500-3-lancelot.six@amd.com> From: Pedro Alves In-Reply-To: <20230630145755.6500-3-lancelot.six@amd.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Spam-Status: No, score=-10.3 required=5.0 tests=BAYES_00,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM,GIT_PATCH_0,HEADER_FROM_DIFFERENT_DOMAINS,KAM_DMARC_STATUS,KAM_SHORT,NICE_REPLY_A,RCVD_IN_DNSWL_NONE,RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: Hi Lancelot, This LGTM with some nits below addressed. No need for another round of review. Just post the updated patch, and merge it. With that, Approved-By: Pedro Alves On to the nits... On 2023-06-30 15:57, Lancelot Six via Gdb-patches wrote: > - The driver creates the runtime activation for inferior 2 and writes to > the associated file descriptor. > - GDB has inferior 1 selected and calls target_wait for some reason. > - This prompts amd_dbgapi_target::wait to be called. The method pulls > all events from the driver, including the runtime activation event for > inferior 2, leading to the insertion failure. insertion -> assertion. > > The fix for this problem is simple. To avoid such problem, we need to > make sure that amd_dbgapi_target::wait only pulls events for the current > inferior from the driver. This is what this patch implements. > > This patch also includes a testcase which could fail before this patch. > > This patch has been tested on a system with multiple GPUs which had more > chances to reproduce the original bug. It has also been tested on top > of the downstream ROCgdb port which has more AMDGPU related tests. The > testcase have been tested with `make check check-read1 check-readmore`. "have been" -> "has been" > --- > gdb/amd-dbgapi-target.c | 6 +- > gdb/testsuite/gdb.rocm/multi-inferior-gpu.cpp | 111 ++++++++++++++++++ > gdb/testsuite/gdb.rocm/multi-inferior-gpu.exp | 86 ++++++++++++++ > 3 files changed, 201 insertions(+), 2 deletions(-) > create mode 100644 gdb/testsuite/gdb.rocm/multi-inferior-gpu.cpp > create mode 100644 gdb/testsuite/gdb.rocm/multi-inferior-gpu.exp > > diff --git a/gdb/amd-dbgapi-target.c b/gdb/amd-dbgapi-target.c > index 5565cf907fa..371f0683754 100644 > --- a/gdb/amd-dbgapi-target.c > +++ b/gdb/amd-dbgapi-target.c > @@ -1255,8 +1255,10 @@ amd_dbgapi_target::wait (ptid_t ptid, struct target_waitstatus *ws, > std::tie (event_ptid, gpu_waitstatus) = consume_one_event (ptid.pid ()); > if (event_ptid == minus_one_ptid) > { > - /* Drain the events from the amd_dbgapi and preserve the ordering. */ > - process_event_queue (); > + /* Drain the events for the current inferior from the amd_dbgapi and > + preserve the ordering. */ > + auto info = get_amd_dbgapi_inferior_info (current_inferior ()); > + process_event_queue (info->process_id, AMD_DBGAPI_EVENT_KIND_NONE); I think the process_event_queue's process_id parameter could stop having a default argument. > > std::tie (event_ptid, gpu_waitstatus) = consume_one_event (ptid.pid ()); > if (event_ptid == minus_one_ptid) > diff --git a/gdb/testsuite/gdb.rocm/multi-inferior-gpu.cpp b/gdb/testsuite/gdb.rocm/multi-inferior-gpu.cpp > new file mode 100644 > index 00000000000..828dc0cf7d4 > --- /dev/null > +++ b/gdb/testsuite/gdb.rocm/multi-inferior-gpu.cpp > @@ -0,0 +1,111 @@ ... > + if (pid == 0) > + { > + /* Exec to be fore the child to re-initialize the ROCm runtime. */ I can't parse the "Exec to be fore the child" comment. I think you mean: "Exec the child" ? > + if (execl (argv[0], argv[0], n) == -1) > + { > + perror ("Failed to exec"); > + return -1; > + } > + } > + } > + > + /* Wait for all children. */ > + int ws; > + pid_t ret; > + do > + ret = waitpid (-1, &ws, 0); > + while (!(ret == -1 && errno == ECHILD)); At , we have: "Format do-while statements like this: do { a = foo (a); } while (a > 0); " IMO, this is more readable, and lets you keep the variables in the scope: while (1) { int ws; pid_t ret = waitpid (-1, &ws, 0); if (ret == -1 && errno == ECHILD) break; } > + > + /* Last break here. */ > + return 0; > +} > + > +static int > +child (int argc, char **argv) > +{ > + int dev_number; > + if (sscanf (argv[1], "%d", &dev_number) != 1) > + { > + fprintf (stderr, "Invalid argument \"%s\"\n", argv[1]); > + return -1; > + } > + > + CHECK (hipSetDevice (dev_number)); > + kern<<<1, 1>>> (); > + hipDeviceSynchronize (); > + return 0; > +} > + > +/* When called with no argument, identify how many AMDGPU devices are > + available on the system and spawn one worker process per GPU. If a > + command-line argument is provided, it is the index of the GPU to use. */ > + > +int > +main (int argc, char **argv) > +{ > + if (argc <= 1) > + return parent (argc, argv); > + else > + return child (argc, argv); > +} > diff --git a/gdb/testsuite/gdb.rocm/multi-inferior-gpu.exp b/gdb/testsuite/gdb.rocm/multi-inferior-gpu.exp > new file mode 100644 > index 00000000000..3e8934645e6 > --- /dev/null > +++ b/gdb/testsuite/gdb.rocm/multi-inferior-gpu.exp > @@ -0,0 +1,86 @@ > +# Copyright 2023 Free Software Foundation, Inc. > + > +# This program is free software; you can redistribute it and/or modify > +# it under the terms of the GNU General Public License as published by > +# the Free Software Foundation; either version 3 of the License, or > +# (at your option) any later version. > +# > +# This program is distributed in the hope that it will be useful, > +# but WITHOUT ANY WARRANTY; without even the implied warranty of > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > +# GNU General Public License for more details. > +# > +# You should have received a copy of the GNU General Public License > +# along with this program. If not, see . > + > +# This test checks that GDB can debug multiple inferior which uses all > +# the ROCm runtime. # This test checks that GDB can debug multiple inferiors that all use # the ROCm runtime. > + > +load_lib rocm.exp > + > +standard_testfile .cpp > + > +require allow_hipcc_tests > +require hip_devices_support_debug_multi_process > + > +if {[build_executable "failed to prepare" $testfile $srcfile {debug hip}]} { > + return > +} > + > +proc do_test {} { > + clean_restart $::binfile > + gdb_test_no_output "set non-stop on" > + gdb_test_no_output "set detach-on-fork off" > + gdb_test_no_output "set follow-fork parent" > + > + with_rocm_gpu_lock { > + gdb_breakpoint [gdb_get_line_number "Break here"] > + gdb_breakpoint kern allow-pending > + gdb_breakpoint [gdb_get_line_number "Last break here"] > + > + # Run intil we reach the first breakpoint where we can figure "Run intil" -> "Run until". > + # out how many children will be spawned. > + gdb_test "run" "hit Breakpoint.*" > + > + set num_childs [get_integer_valueof "num_devices" 0] num_childs -> num_children ? > + set bp_to_see $num_childs > + set stopped_threads [list] > + > + gdb_test_multiple "continue -a" "continue to gpu breakpoints" { > + -re "Thread ($::decimal\.$::decimal)\[^\r\n\]* hit Breakpoint\[^\r\n\]*, kern \(\)\[^\r\n\]*\r\n" { > + lappend stopped_threads $expect_out(1,string) > + incr bp_to_see -1 > + if {$bp_to_see != 0} { > + exp_continue > + } else { > + pass $gdb_test_name > + } > + } > + -re "^\[^\r\n\]*\r\n" { > + exp_continue > + } > + } Since this is non-stop, this "continue -a" will cause the first stop to print the prompt, and other stops to not print it. The "-re" cases above don't explicitly handle the prompt, which seems brittle to me. "continue -a&" instead of ""continue -a" immediately prints the prompt. It would be better IMO to explicitly consume the prompt with that. Like (untested): gdb_test_multiple "continue -a &" "continue to gpu breakpoints" { -re "Continuing\.\r\n$gdb_prompt " { pass $gdb_test_name } } gdb_test_multiple "" "wait for gpu stops { -re "Thread ($::decimal\.$::decimal)\[^\r\n\]* hit Breakpoint\[^\r\n\]*, kern \(\)\[^\r\n\]*\r\n" { lappend stopped_threads $expect_out(1,string) incr bp_to_see -1 if {$bp_to_see != 0} { exp_continue } else { pass $gdb_test_name } } } > + > + # Continue all the children processes until they exit. Maybe say: # Continue all the GPU kernels until all the children processes exit. If I am not mistaken, the children processes on the CPU side are already running at this point, only the GPU kernels were stopped. > + foreach thread $stopped_threads { I would rename "stopped_threads" -> stopped_gdb_threads. That's it. Thanks for the fix! Pedro Alves > + set infnumber [lindex [split $thread .] 0] > + gdb_test "thread $thread" "Switching to thread.*" > + gdb_test_multiple "continue $thread" "" { > + -re "\\\[Inferior $infnumber \[^\n\r\]* exited normally\\]\r\n$::gdb_prompt " { > + pass $gdb_test_name > + } > + } > + } > + > + gdb_test_multiple "" "reach breakpoint in main" { > + -re "hit Breakpoint.*parent" { > + pass $gdb_test_name > + } > + } > + # Select main inferior > + gdb_test "inferior 1" "Switching to inferior 1.*" > + gdb_continue_to_end "" "continue -a" 1 > + } > +} > + > +do_test