Re: [PATCH 2/2] gdb/amdgpu: Fix debugging multiple inferiors using the ROCm runtime

public inbox for gdb-patches@sourceware.org
 help / color / mirror / Atom feed

From: Pedro Alves <pedro@palves.net>
To: Lancelot Six <lancelot.six@amd.com>, gdb-patches@sourceware.org
Cc: lsix@lancelotsix.com
Subject: Re: [PATCH 2/2] gdb/amdgpu: Fix debugging multiple inferiors using the ROCm runtime
Date: Fri, 28 Jul 2023 19:01:08 +0100	[thread overview]
Message-ID: <0bffd0c1-1aeb-a401-2509-c96905ba2e28@palves.net> (raw)
In-Reply-To: <20230630145755.6500-3-lancelot.six@amd.com>

Hi Lancelot,

This LGTM with some nits below addressed.  No need for another round
of review.  Just post the updated patch, and merge it.

With that,
  Approved-By: Pedro Alves <pedro@palves.net>

On to the nits...

On 2023-06-30 15:57, Lancelot Six via Gdb-patches wrote:

> - The driver creates the runtime activation for inferior 2 and writes to
>   the associated file descriptor.
> - GDB has inferior 1 selected and calls target_wait for some reason.
> - This prompts amd_dbgapi_target::wait to be called.  The method pulls
>   all events from the driver, including the runtime activation event for
>   inferior 2, leading to the insertion failure.

insertion -> assertion.

> 
> The fix for this problem is simple.  To avoid such problem, we need to
> make sure that amd_dbgapi_target::wait only pulls events for the current
> inferior from the driver.  This is what this patch implements.
> 
> This patch also includes a testcase which could fail before this patch.
> 
> This patch has been tested on a system with multiple GPUs which had more
> chances to reproduce the original bug.  It has also been tested on top
> of the downstream ROCgdb port which has more AMDGPU related tests.  The
> testcase have been tested with `make check check-read1 check-readmore`.

"have been" -> "has been"

> ---
>  gdb/amd-dbgapi-target.c                       |   6 +-
>  gdb/testsuite/gdb.rocm/multi-inferior-gpu.cpp | 111 ++++++++++++++++++
>  gdb/testsuite/gdb.rocm/multi-inferior-gpu.exp |  86 ++++++++++++++
>  3 files changed, 201 insertions(+), 2 deletions(-)
>  create mode 100644 gdb/testsuite/gdb.rocm/multi-inferior-gpu.cpp
>  create mode 100644 gdb/testsuite/gdb.rocm/multi-inferior-gpu.exp
> 
> diff --git a/gdb/amd-dbgapi-target.c b/gdb/amd-dbgapi-target.c
> index 5565cf907fa..371f0683754 100644
> --- a/gdb/amd-dbgapi-target.c
> +++ b/gdb/amd-dbgapi-target.c
> @@ -1255,8 +1255,10 @@ amd_dbgapi_target::wait (ptid_t ptid, struct target_waitstatus *ws,
>    std::tie (event_ptid, gpu_waitstatus) = consume_one_event (ptid.pid ());
>    if (event_ptid == minus_one_ptid)
>      {
> -      /* Drain the events from the amd_dbgapi and preserve the ordering.  */
> -      process_event_queue ();
> +      /* Drain the events for the current inferior from the amd_dbgapi and
> +	 preserve the ordering.  */
> +      auto info = get_amd_dbgapi_inferior_info (current_inferior ());
> +      process_event_queue (info->process_id, AMD_DBGAPI_EVENT_KIND_NONE);

I think the process_event_queue's process_id parameter could stop having
a default argument.

>  
>        std::tie (event_ptid, gpu_waitstatus) = consume_one_event (ptid.pid ());
>        if (event_ptid == minus_one_ptid)
> diff --git a/gdb/testsuite/gdb.rocm/multi-inferior-gpu.cpp b/gdb/testsuite/gdb.rocm/multi-inferior-gpu.cpp
> new file mode 100644
> index 00000000000..828dc0cf7d4
> --- /dev/null
> +++ b/gdb/testsuite/gdb.rocm/multi-inferior-gpu.cpp
> @@ -0,0 +1,111 @@

...

> +      if (pid == 0)
> +	{
> +	  /* Exec to be fore the child to re-initialize the ROCm runtime.  */

I can't parse the

 "Exec to be fore the child"

comment.  I think you mean:

 "Exec the child"

?

> +	  if (execl (argv[0], argv[0], n) == -1)
> +	    {
> +	      perror ("Failed to exec");
> +	      return -1;
> +	    }
> +	}
> +    }
> +
> +  /* Wait for all children.  */
> +  int ws;
> +  pid_t ret;
> +  do
> +    ret = waitpid (-1, &ws, 0);
> +  while (!(ret == -1 && errno == ECHILD));

At <https://www.gnu.org/prep/standards/standards.html>, we have:

"Format do-while statements like this:
 
  do
    {
      a = foo (a);
    }
  while (a > 0);
"

IMO, this is more readable, and lets you keep the
variables in the scope:

  while (1)
   {
    int ws;
    pid_t ret = waitpid (-1, &ws, 0);
    if (ret == -1 && errno == ECHILD)
      break;
   }

> +
> +  /* Last break here.  */
> +  return 0;
> +}
> +
> +static int
> +child (int argc, char **argv)
> +{
> +  int dev_number;
> +  if (sscanf (argv[1], "%d", &dev_number) != 1)
> +    {
> +      fprintf (stderr, "Invalid argument \"%s\"\n", argv[1]);
> +      return -1;
> +    }
> +
> +  CHECK (hipSetDevice (dev_number));
> +  kern<<<1, 1>>> ();
> +  hipDeviceSynchronize ();
> +  return 0;
> +}
> +
> +/* When called with no argument, identify how many AMDGPU devices are
> +   available on the system and spawn one worker process per GPU.  If a
> +   command-line argument is provided, it is the index of the GPU to use.  */
> +
> +int
> +main (int argc, char **argv)
> +{
> +  if (argc <= 1)
> +    return parent (argc, argv);
> +  else
> +    return child (argc, argv);
> +}
> diff --git a/gdb/testsuite/gdb.rocm/multi-inferior-gpu.exp b/gdb/testsuite/gdb.rocm/multi-inferior-gpu.exp
> new file mode 100644
> index 00000000000..3e8934645e6
> --- /dev/null
> +++ b/gdb/testsuite/gdb.rocm/multi-inferior-gpu.exp
> @@ -0,0 +1,86 @@
> +# Copyright 2023 Free Software Foundation, Inc.
> +
> +# This program is free software; you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation; either version 3 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program.  If not, see <http://www.gnu.org/licenses/>.
> +
> +# This test checks that GDB can debug multiple inferior which uses all
> +# the ROCm runtime.

# This test checks that GDB can debug multiple inferiors that all use
# the ROCm runtime.

> +
> +load_lib rocm.exp
> +
> +standard_testfile .cpp
> +
> +require allow_hipcc_tests
> +require hip_devices_support_debug_multi_process
> +
> +if {[build_executable "failed to prepare" $testfile $srcfile {debug hip}]} {
> +    return
> +}
> +
> +proc do_test {} {
> +    clean_restart $::binfile
> +    gdb_test_no_output "set non-stop on"
> +    gdb_test_no_output "set detach-on-fork off"
> +    gdb_test_no_output "set follow-fork parent"
> +
> +    with_rocm_gpu_lock {
> +	gdb_breakpoint [gdb_get_line_number "Break here"]
> +	gdb_breakpoint kern allow-pending
> +	gdb_breakpoint [gdb_get_line_number "Last break here"]
> +
> +	# Run intil we reach the first breakpoint where we can figure

"Run intil" -> "Run until".


> +	# out how many children will be spawned.
> +	gdb_test "run" "hit Breakpoint.*"
> +
> +	set num_childs [get_integer_valueof "num_devices" 0]

num_childs -> num_children ?

> +	set bp_to_see $num_childs
> +	set stopped_threads [list]
> +
> +	gdb_test_multiple "continue -a" "continue to gpu breakpoints" {
> +	    -re "Thread ($::decimal\.$::decimal)\[^\r\n\]* hit Breakpoint\[^\r\n\]*, kern \(\)\[^\r\n\]*\r\n" {
> +		lappend stopped_threads $expect_out(1,string)
> +		incr bp_to_see -1
> +		if {$bp_to_see != 0} {
> +		    exp_continue
> +		} else {
> +		    pass $gdb_test_name
> +		}
> +	    }
> +	    -re "^\[^\r\n\]*\r\n" {
> +		exp_continue
> +	    }
> +	}

Since this is non-stop, this "continue -a" will cause the first
stop to print the prompt, and other stops to not print it.  The
"-re" cases above don't explicitly handle the prompt, which seems
brittle to me.

"continue -a&" instead of ""continue -a" immediately prints the
prompt.  It would be better IMO to explicitly consume the prompt
with that.  Like (untested):

    gdb_test_multiple "continue -a &" "continue to gpu breakpoints" {
        -re "Continuing\.\r\n$gdb_prompt " {
            pass $gdb_test_name
        }
    }
    gdb_test_multiple "" "wait for gpu stops {
        -re "Thread ($::decimal\.$::decimal)\[^\r\n\]* hit Breakpoint\[^\r\n\]*, kern \(\)\[^\r\n\]*\r\n" {
            lappend stopped_threads $expect_out(1,string)
            incr bp_to_see -1
            if {$bp_to_see != 0} {
               exp_continue
            } else {
               pass $gdb_test_name
            }
        }
    }

> +
> +	# Continue all the children processes until they exit.

Maybe say:

	# Continue all the GPU kernels until all the children processes exit.

If I am not mistaken, the children processes on the CPU side are already
running at this point, only the GPU kernels were stopped.

> +	foreach thread $stopped_threads {

I would rename "stopped_threads" -> stopped_gdb_threads.

That's it.  Thanks for the fix!

Pedro Alves

> +	    set infnumber [lindex [split $thread .] 0]
> +	    gdb_test "thread $thread" "Switching to thread.*"
> +	    gdb_test_multiple "continue $thread" "" {
> +		-re "\\\[Inferior $infnumber \[^\n\r\]* exited normally\\]\r\n$::gdb_prompt " {
> +		    pass $gdb_test_name
> +		}
> +	    }
> +	}
> +
> +	gdb_test_multiple "" "reach breakpoint in main" {
> +	    -re "hit Breakpoint.*parent" {
> +		pass $gdb_test_name
> +	    }
> +	}
> +	# Select main inferior
> +	gdb_test "inferior 1" "Switching to inferior 1.*"
> +	gdb_continue_to_end "" "continue -a" 1
> +    }
> +}
> +
> +do_test

next prev parent reply	other threads:[~2023-07-28 18:01 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-06-30 14:57 [PATCH 0/2] Fix debugging multi " Lancelot Six
2023-06-30 14:57 ` [PATCH 1/2] gdb/testsuite/rocm: Add the hip_devices_support_debug_multi_process proc Lancelot Six
2023-07-28 18:00   ` Pedro Alves
2023-06-30 14:57 ` [PATCH 2/2] gdb/amdgpu: Fix debugging multiple inferiors using the ROCm runtime Lancelot Six
2023-07-28 18:01   ` Pedro Alves [this message]
2023-07-28 18:04     ` Pedro Alves
2023-07-28 16:47 ` [PING] [PATCH 0/2] Fix debugging multi " Lancelot SIX

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=0bffd0c1-1aeb-a401-2509-c96905ba2e28@palves.net \
    --to=pedro@palves.net \
    --cc=gdb-patches@sourceware.org \
    --cc=lancelot.six@amd.com \
    --cc=lsix@lancelotsix.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).