From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=LbPP=4O=simark.ca=simark@sourceware.org>
Received: from simark.ca (simark.ca [158.69.221.121])
	by sourceware.org (Postfix) with ESMTPS id CEE64383679A
	for <gdb-patches@sourceware.org>; Fri, 16 Dec 2022 17:37:45 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org CEE64383679A
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=simark.ca
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=simark.ca
Received: from [10.0.0.11] (unknown [217.28.27.60])
	(using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits))
	(No client certificate requested)
	by simark.ca (Postfix) with ESMTPSA id E89E11E0D3;
	Fri, 16 Dec 2022 12:37:44 -0500 (EST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=simark.ca; s=mail;
	t=1671212265; bh=jUf0SbTNRbH/USt8poTUQNQkjRT+zrjFbzTxeUJRwHk=;
	h=Date:Subject:To:Cc:References:From:In-Reply-To:From;
	b=rF+TfPNUY5+onF1Nci4VDGHGz10glg1XcF0nLlrHhNMHrbOFwobao398I6NoOf+Xx
	 ZQFu/LPvY7Z37R8IQ0Fjas1oWKLjGejAEN51atbt/pSmDyitFy9S5qIQZPxlEE23R1
	 sFE98AGZ6r6sLJwmOJAPsbh0S63albRglhE00gh8=
Message-ID: <b5febd32-b50c-2da2-c1f2-4210994fddea@simark.ca>
Date: Fri, 16 Dec 2022 12:37:44 -0500
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
 Thunderbird/102.5.1
Subject: Re: [PATCH 12/12] gdb: initial support for ROCm platform (AMDGPU)
 debugging
Content-Language: en-US
To: Eli Zaretskii <eliz@gnu.org>
Cc: simon.marchi@efficios.com, gdb-patches@sourceware.org,
 zoran.zaric@amd.com, laurent.morichetti@amd.com, Tony.Tye@amd.com,
 lancelot.six@amd.com, pedro@palves.net
References: <20221206135729.3937767-1-simon.marchi@efficios.com>
 <20221206135729.3937767-13-simon.marchi@efficios.com>
 <835yeo7d3i.fsf@gnu.org> <e3bcf411-ce93-c391-2a09-5eb4876f5475@simark.ca>
 <83wn734a0w.fsf@gnu.org>
From: Simon Marchi <simark@simark.ca>
In-Reply-To: <83wn734a0w.fsf@gnu.org>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
X-Spam-Status: No, score=-5.0 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,KAM_SHORT,NICE_REPLY_A,SPF_HELO_PASS,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gdb-patches.sourceware.org>


On 12/7/22 08:29, Eli Zaretskii wrote:
>> Date: Tue, 6 Dec 2022 21:17:25 -0500
>> Cc: gdb-patches@sourceware.org, zoran.zaric@amd.com,
>>  laurent.morichetti@amd.com, Tony.Tye@amd.com, lancelot.six@amd.com,
>>  pedro@palves.net
>> From: Simon Marchi <simark@simark.ca>
>>
>>>> +@node set scheduler-locking
>>>
>>> This @node is without any @chapter/@section, and does not appear in any
>>> @menu.  That doesn't look right; did you actually succeed in building the
>>> manual with these changes?
>> Yes, it builds.  It is used as a destination for a @pxref{set
>> scheduler-locking} later in the patch, if I remove it doesn't build:
>>
>>     gdb.texinfo:26469: @pxref reference to nonexistent node `set scheduler-locking'
>>
>> But I see the problem, in the HTML it introduces a new page starting at
>> that point, we don't want that.  Perhaps we should use @anchor instead?
> 
> You could use @anchor if all you want is a place to direct an @xref.

Ok, made that change.

> 
>>>> +The following @acronym{AMD GPU} architectures are supported:
>>>> +
>>>> +@table @emph
>>>> +
>>>> +@item @samp{gfx900}
>>>> +AMD Vega 10 devices, displayed as @samp{vega10} by @value{GDBN}.
>>>> +
>>>> +@item @samp{gfx906}
>>>> +AMD Vega 7nm devices, displayed as @samp{vega20} by @value{GDBN}.
>>>
>>> Do we really need this long list of architectures in the GDB manual?  It
>>> sounds like an ad for AMD...
>>
>> We found it useful, as people often ask which devices / models the
>> debugger supports.  AMD produces a lot of GPU models.  A subset of that
>> can run ROCm programs.  And a subset of that support debugging.  I think
>> it's useful to tell users which devices GDB is expected to work with.
>>
>> And it would not be a very good ad either, as most of these devices are
>> far from the latest and greatest :).
> 
> I understand, but this is a _GDB_ manual, not a manual for debugging
> AMD GPU programs.  We need to draw the line at some point.  Why cannot
> these details be in some README somewhere, or on the Wiki?

After discussing with the original author, we came to a conclusion that
agrees with you.  The supported devices in fact mostly depends on the
version of the amd-dbgapi library that GDB uses to debug ROCm programs.
An upgrade of that library, without changing / upgrading / rebuilding
GDB can change the list of supported devices.  Therefore, it would be
wrong to have a list of supported devices in the manual for a given
version of GDB.

If the device support was baked into GDB itself, then I would argue that
it would be useful to have a list of supported devices in the GDB
manual, but it's not the case.

So, I'm replacing the contents of that subsubsection with:

  The list of @acronym{AMD GPU} architectures supported by @value{GDBN}
  depends on the version of the AMD Debugger API library installed.  See
  its @uref{https://docs.amd.com/bundle/ROCDebugger_User_and_API,
  documentation} for more details.

>>> This and other @smallexamples in the patch have too long lines; please break
>>> them into two or more, to avoid problems with the printed format of the
>>> manual.
>>
>> What is the maximum line length for this?
> 
> I think 70 or 72.

Ack, will try to use 72 then.

> 
>> I'll try, but it's a bit difficult when quoting actual GDB output.  For
>> instance, how would you do this one?
>>
>> @smallexample
>> (@value{GDBP}) info sharedlibrary
>> >From                To                  Syms Read   Shared Object Library
>> 0x00007fd120664ac0  0x00007fd120682790  Yes (*)     /lib64/ld-linux-x86-64.so.2
> 
> Use shorter addresses and directory names: they are immaterial for
> your purposes here.

Ok.

>>>> +@subsubsection @acronym{AMD GPU} Wavefronts
>>>
>>> I think a @cindex about wavefronts would be useful here.
>>
>> I just add a `@cindex Wavefronts` under the line quoted above?
> 
> Yes, but "wavefronts", lower-case.  In general, index entries should
> not use upper-case unless really necessary, because the sorting order
> of mixed-case text depends on the locale and the underlying C library.

Ok.

> 
>>>> +@item file_path
>>>> +The file's path specified as a URI encoded UTF-8 string.  In URI
>>>> +encoding, every character that is not:
>>>> +
>>>> +@itemize
>>>> +@item In the @samp{a-z}, @samp{A-Z}, @samp{0-9} ranges
>>>> +@item @samp{/}, @samp{_}, @samp{.}, @samp{~} or @samp{-}
>>>> +@end itemize
>>>> +
>>>> +is encoded as two uppercase hexadecimal digits proceeded by @samp{%}.
>>>
>>> You want @noindent before the last line.
>>
>> I can add it, but I don't see the difference (at least in the HTML
>> and PDF outputs).
> 
> It's unreliable to rely on this to produce un-indented lines.
> Depending on the global settings such as @paragraphindent you can get
> something you don't want.

Note: I ended up deleting this part.

>>> Finally: maybe it's just me, but isn't this documentation way too detailed?
>>> It weighs in at 800 lines, and includes many details that seem to be more
>>> related to AMD, GPU, and HIP than to GDB.  Would it be reasonable to make
>>> this section shorter by omitting too low-level and unrelated details?
>>
>> I went over the page, and while I agree it's very thorough and detailed,
>> my impression is that it's all information that is one way or another
>> related to how GDB interacts with ROCm / HIP / AMD GPUs.  So, all
>> information that could be useful to someone with good knowledge of ROCm
>> / HIP / AMD GPUs, if they wanted to use GDB to debug their program.  For
>> instance, the description of when GDB reports a SIGABRT is useful, as
>> the mapping between target debug events and Unix signals in GDB is kind
>> of arbitrary.
> 
> But you also describe in detail what each signal means, for example,
> which is either redundant or belongs to the documentation of GPU
> programming.  E.g., we don't explain in the manual what kind of signal
> is SIGBUS or SIGFPE in GP CPUs, so why should we have this spelled out
> for GPU programs?

I think it is relevant, because GDB makes some arbitrary mappings
between target events and Unix signals, in order to present stops to the
user as Unix signals.  The text says "AMD GPU wavefronts can raise the
following signals when executing instructions", but I think it's a bit
misleading.  The wavefronts don't raise signals.  They encounter some
events or conditions that GDB translates to these signals.  Perhaps
that's not the greatest user experience, for GDB to shoehorn everything
into Unix signals, but that's what it is.  We don't really need to
document what signals mean for Unix-based platforms, because there is
no translations, the signals shown in GDB mean what they mean on the
platform.  But for non-Unix platforms, GDB makes some arbitrary
decisions that I think are relevant to document.  If an AMD GPU user
sees their program stop with SIGBUS, what does it mean for them?  SIGBUS
is not a thing on a GPU.

> 
>> If you can point out specific parts that you think are not relevant, we
>> can discuss them specifically.
> 
> For example, this sounds like a description of the GPU, not of GDB
> features:
> 
>> +@acronym{AMD GPU} supports the following @var{reggroup} values for the
>> +@samp{info registers @var{reggroup} @dots{}} command:
>> +
>> +@itemize @bullet
>> +
>> +@item
>> +general
>> +
>> +@item
>> +vector
>> +
>> +@item
>> +scalar
>> +
>> +@item
>> +system
>> +
>> +@end itemize
>> +
>> +The number of scalar and vector registers is configured when a
>> +wavefront is created.  Only allocated registers are displayed.

Well, the first part describes what registers groups the user can use
when debugging an AMD GPU.  It sounds to me like it describes the
behavior of GDB, when debugging an AMD GPU (different arches support
different reggroups).

The second part explains that GDB will not necessarily show all hardware
registers.  I feel like this describes how GDB behaves when debugging
an AMD GPU.

I agree that none of this is super critical, we can always let the users
figure it out by themselves but... isn't the goal of the doc to avoid
users having to figure things out the hard way?  I don't understand
where to draw the line between what's relevant and what is not.

> 
> Or why do we need this in our manual:
> 
>> +The code object path for @acronym{AMD GPU} code objects is shown as a
>> +@acronym{URI, Universal Location Identifier} with a syntax defined by
>> +the following BNF syntax:
>> +
>> +@smallexample
>> +code_object_uri ::== file_uri | memory_uri
>> +file_uri        ::== "file://" file_path [ range_specifier ]
>> +memory_uri      ::== "memory://" process_id range_specifier
>> +range_specifier ::== [ "#" | "?" ] "offset=" number "&" "size=" number
>> +file_path       ::== URI_ENCODED_OS_FILE_PATH
>> +process_id      ::== DECIMAL_NUMBER
>> +number          ::== HEX_NUMBER | DECIMAL_NUMBER | OCTAL_NUMBER
>> +@end smallexample
> 
> (followed by a longish legend of what each atom means above).

I agree that this is a bit too much.  I think the example, plus a less
formal description, is sufficent for a reasonably informed human to
understand.  I will change that part to be less long and dry.

However, one reason I see why we would want this is for MI.  If there
are people writing frontends for this (and there are), they'll need to
know precisely the format to expect, if they want to parse it.  Of
course, they can work something out of the example or go look at the
code, but I think it's bad to have poorly / loosely documented machine
interfaces.

> 
> And this paragraph seems to describe the GPU, not what GDB does:
> 
>> +If any of these signals are delivered to the wavefront, it will cause
>> +the wavefront to enter the halt state and cause the @acronym{AMD ROCm}
>> +runtime to put the associated queue into the queue error state.  All
>> +wavefronts associated with a queue that is in the queue error state
>> +are inhibited from executing further instructions even if they are not
>> +in the halt state.  In addition, when the @acronym{AMD ROCm} runtime
>> +puts a queue into the queue error state it may invoke an application
>> +registered callback that could either abort the application or delete
>> +the queue which will delete any wavefronts associated with the queue.

Ok, I'll delete that.

> 
> There's also a lot of stuff only very remotely related to GDB, which
> basically reads like a large number of tips and tricks for someone who
> needs this mode.  For example:
> 
>> +@item
>> +By default, for some architectures, the @acronym{AMD GPU} device
>> +driver causes all @acronym{AMD GPU} wavefronts created when
>> +@value{GDBN} is not attached to be unable to report the dispatch
>> +associated with the wavefront, or the wavefront's work-group
>> +position.  The @samp{info threads} command will display this
>> +missing information with a @samp{?}.
>> +
>> +For example,
>> +
>> +@smallexample
>> +(gdb) info threads
>> +  Id   Target Id                                       Frame
>> +* 1    Thread 0x7ffff6987840 (LWP 62056) "bit_extract" 0x00007ffff6da489b in sched_yield () at ../sysdeps/unix/syscall-template.S:78
>> +  2    Thread 0x7ffff6986700 (LWP 62064) "bit_extract" 0x00007ffff6db650b in ioctl () at ../sysdeps/unix/syscall-template.S:78
>> +  3    Thread 0x7ffff5f7f700 (LWP 62066) "bit_extract" 0x00007ffff6db650b in ioctl () at ../sysdeps/unix/syscall-template.S:78
>> +  4    Thread 0x7ffff597f700 (LWP 62067) "bit_extract" 0x00007ffff6db650b in ioctl () at ../sysdeps/unix/syscall-template.S:78
>> +  5    AMDGPU Wave 1:2:?:1 (?,?,?)/? "bit_extract"     bit_extract_kernel (C_d=<optimized out>, A_d=<optimized out>, N=<optimized out>) at bit_extract.cpp:41
>> +@end smallexample
>> +
>> +This does not affect wavefronts created while @value{GDBN} is attached
>> +which are always capable of reporting this information.
>> +
>> +If the @env{HSA_ENABLE_DEBUG} environment variable is set to @samp{1}
>> +when the @acronym{AMD ROCm} runtime is initialized, then this
>> +information will be available for all architectures even for wavefronts
>> +created when @value{GDBN} was not attached.  Setting this environment
>> +variable may very marginally increase wavefront launch latency for some
>> +architectures for very short lived wavefronts.

I can imagine a user being in this situation and asking "When I
attach to my program, why do I see question marks in the thread / wave
ids?".  It seems relevant to me to mention it here so we can refer them
to the doc.  And then we give them a workaround, if they want to have
that information.  Seems useful to me.

>> +
>> +@item
>> +If an @acronym{AMD GPU} wavefront has the @code{DX10_CLAMP} bit set in
>> +the @code{MODE} register, enabled arithmetic exceptions will not be
>> +reported as @code{SIGFPE} signals.  This happens if the
>> +@code{DX10_CLAMP} kernel descriptor field is enabled.
> 
> The last paragraph in particular reads like something from the GPU
> programming manual.

Ok, I can agree with that.  We already describe above which condition
leads to GDB reporting a SIGFPE (some arithmetic error).  The fact that
setting this bit makes the GPU not report the arithmetic error is indeed
behavior specific to the GPU.

> And even if this kind of info is useful and
> should be in the GDB manual, why does it need to be so wordy, with so
> many detailed examples?

This is the writing style of Tony Tye, the original author, he's used to
writing very formally and unambiguously.

Some info is there because users asked these questions, and then we
thought that we might as well document it, since others may have the
same question.

> 
> I understand the urge to document all the potentially useful stuff
> about this mode of GDB, but the result looks disproportionally long
> and full of low-level information only tangentially related to GDB.
> 
> That said, if you feel strongly about the need to include all this,
> and I'm the only one who raises the brow, feel free to install this, I
> won't object anymore.

I think you raise some good points.  So far we were working in the ROCm
bubble, where the line between ROCm-GDB and the rest of the ecosystem is
sometimes blurry.  So it's easy to leak in some details that are not
really relevant for upstream GDB.  We may not agree completely on what
is relevant and what isn't, there are definitely things that don't
belong here.  I already pruned a bunch of stuff.  I will send a v2 with
updated doc, but it will likely be after the new year.

Thanks,

Simon