From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from simark.ca (simark.ca [158.69.221.121]) by sourceware.org (Postfix) with ESMTPS id CEE64383679A for ; Fri, 16 Dec 2022 17:37:45 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org CEE64383679A Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=simark.ca Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=simark.ca Received: from [10.0.0.11] (unknown [217.28.27.60]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits)) (No client certificate requested) by simark.ca (Postfix) with ESMTPSA id E89E11E0D3; Fri, 16 Dec 2022 12:37:44 -0500 (EST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=simark.ca; s=mail; t=1671212265; bh=jUf0SbTNRbH/USt8poTUQNQkjRT+zrjFbzTxeUJRwHk=; h=Date:Subject:To:Cc:References:From:In-Reply-To:From; b=rF+TfPNUY5+onF1Nci4VDGHGz10glg1XcF0nLlrHhNMHrbOFwobao398I6NoOf+Xx ZQFu/LPvY7Z37R8IQ0Fjas1oWKLjGejAEN51atbt/pSmDyitFy9S5qIQZPxlEE23R1 sFE98AGZ6r6sLJwmOJAPsbh0S63albRglhE00gh8= Message-ID: Date: Fri, 16 Dec 2022 12:37:44 -0500 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.5.1 Subject: Re: [PATCH 12/12] gdb: initial support for ROCm platform (AMDGPU) debugging Content-Language: en-US To: Eli Zaretskii Cc: simon.marchi@efficios.com, gdb-patches@sourceware.org, zoran.zaric@amd.com, laurent.morichetti@amd.com, Tony.Tye@amd.com, lancelot.six@amd.com, pedro@palves.net References: <20221206135729.3937767-1-simon.marchi@efficios.com> <20221206135729.3937767-13-simon.marchi@efficios.com> <835yeo7d3i.fsf@gnu.org> <83wn734a0w.fsf@gnu.org> From: Simon Marchi In-Reply-To: <83wn734a0w.fsf@gnu.org> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Spam-Status: No, score=-5.0 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,KAM_SHORT,NICE_REPLY_A,SPF_HELO_PASS,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: On 12/7/22 08:29, Eli Zaretskii wrote: >> Date: Tue, 6 Dec 2022 21:17:25 -0500 >> Cc: gdb-patches@sourceware.org, zoran.zaric@amd.com, >> laurent.morichetti@amd.com, Tony.Tye@amd.com, lancelot.six@amd.com, >> pedro@palves.net >> From: Simon Marchi >> >>>> +@node set scheduler-locking >>> >>> This @node is without any @chapter/@section, and does not appear in any >>> @menu. That doesn't look right; did you actually succeed in building the >>> manual with these changes? >> Yes, it builds. It is used as a destination for a @pxref{set >> scheduler-locking} later in the patch, if I remove it doesn't build: >> >> gdb.texinfo:26469: @pxref reference to nonexistent node `set scheduler-locking' >> >> But I see the problem, in the HTML it introduces a new page starting at >> that point, we don't want that. Perhaps we should use @anchor instead? > > You could use @anchor if all you want is a place to direct an @xref. Ok, made that change. > >>>> +The following @acronym{AMD GPU} architectures are supported: >>>> + >>>> +@table @emph >>>> + >>>> +@item @samp{gfx900} >>>> +AMD Vega 10 devices, displayed as @samp{vega10} by @value{GDBN}. >>>> + >>>> +@item @samp{gfx906} >>>> +AMD Vega 7nm devices, displayed as @samp{vega20} by @value{GDBN}. >>> >>> Do we really need this long list of architectures in the GDB manual? It >>> sounds like an ad for AMD... >> >> We found it useful, as people often ask which devices / models the >> debugger supports. AMD produces a lot of GPU models. A subset of that >> can run ROCm programs. And a subset of that support debugging. I think >> it's useful to tell users which devices GDB is expected to work with. >> >> And it would not be a very good ad either, as most of these devices are >> far from the latest and greatest :). > > I understand, but this is a _GDB_ manual, not a manual for debugging > AMD GPU programs. We need to draw the line at some point. Why cannot > these details be in some README somewhere, or on the Wiki? After discussing with the original author, we came to a conclusion that agrees with you. The supported devices in fact mostly depends on the version of the amd-dbgapi library that GDB uses to debug ROCm programs. An upgrade of that library, without changing / upgrading / rebuilding GDB can change the list of supported devices. Therefore, it would be wrong to have a list of supported devices in the manual for a given version of GDB. If the device support was baked into GDB itself, then I would argue that it would be useful to have a list of supported devices in the GDB manual, but it's not the case. So, I'm replacing the contents of that subsubsection with: The list of @acronym{AMD GPU} architectures supported by @value{GDBN} depends on the version of the AMD Debugger API library installed. See its @uref{https://docs.amd.com/bundle/ROCDebugger_User_and_API, documentation} for more details. >>> This and other @smallexamples in the patch have too long lines; please break >>> them into two or more, to avoid problems with the printed format of the >>> manual. >> >> What is the maximum line length for this? > > I think 70 or 72. Ack, will try to use 72 then. > >> I'll try, but it's a bit difficult when quoting actual GDB output. For >> instance, how would you do this one? >> >> @smallexample >> (@value{GDBP}) info sharedlibrary >> >From To Syms Read Shared Object Library >> 0x00007fd120664ac0 0x00007fd120682790 Yes (*) /lib64/ld-linux-x86-64.so.2 > > Use shorter addresses and directory names: they are immaterial for > your purposes here. Ok. >>>> +@subsubsection @acronym{AMD GPU} Wavefronts >>> >>> I think a @cindex about wavefronts would be useful here. >> >> I just add a `@cindex Wavefronts` under the line quoted above? > > Yes, but "wavefronts", lower-case. In general, index entries should > not use upper-case unless really necessary, because the sorting order > of mixed-case text depends on the locale and the underlying C library. Ok. > >>>> +@item file_path >>>> +The file's path specified as a URI encoded UTF-8 string. In URI >>>> +encoding, every character that is not: >>>> + >>>> +@itemize >>>> +@item In the @samp{a-z}, @samp{A-Z}, @samp{0-9} ranges >>>> +@item @samp{/}, @samp{_}, @samp{.}, @samp{~} or @samp{-} >>>> +@end itemize >>>> + >>>> +is encoded as two uppercase hexadecimal digits proceeded by @samp{%}. >>> >>> You want @noindent before the last line. >> >> I can add it, but I don't see the difference (at least in the HTML >> and PDF outputs). > > It's unreliable to rely on this to produce un-indented lines. > Depending on the global settings such as @paragraphindent you can get > something you don't want. Note: I ended up deleting this part. >>> Finally: maybe it's just me, but isn't this documentation way too detailed? >>> It weighs in at 800 lines, and includes many details that seem to be more >>> related to AMD, GPU, and HIP than to GDB. Would it be reasonable to make >>> this section shorter by omitting too low-level and unrelated details? >> >> I went over the page, and while I agree it's very thorough and detailed, >> my impression is that it's all information that is one way or another >> related to how GDB interacts with ROCm / HIP / AMD GPUs. So, all >> information that could be useful to someone with good knowledge of ROCm >> / HIP / AMD GPUs, if they wanted to use GDB to debug their program. For >> instance, the description of when GDB reports a SIGABRT is useful, as >> the mapping between target debug events and Unix signals in GDB is kind >> of arbitrary. > > But you also describe in detail what each signal means, for example, > which is either redundant or belongs to the documentation of GPU > programming. E.g., we don't explain in the manual what kind of signal > is SIGBUS or SIGFPE in GP CPUs, so why should we have this spelled out > for GPU programs? I think it is relevant, because GDB makes some arbitrary mappings between target events and Unix signals, in order to present stops to the user as Unix signals. The text says "AMD GPU wavefronts can raise the following signals when executing instructions", but I think it's a bit misleading. The wavefronts don't raise signals. They encounter some events or conditions that GDB translates to these signals. Perhaps that's not the greatest user experience, for GDB to shoehorn everything into Unix signals, but that's what it is. We don't really need to document what signals mean for Unix-based platforms, because there is no translations, the signals shown in GDB mean what they mean on the platform. But for non-Unix platforms, GDB makes some arbitrary decisions that I think are relevant to document. If an AMD GPU user sees their program stop with SIGBUS, what does it mean for them? SIGBUS is not a thing on a GPU. > >> If you can point out specific parts that you think are not relevant, we >> can discuss them specifically. > > For example, this sounds like a description of the GPU, not of GDB > features: > >> +@acronym{AMD GPU} supports the following @var{reggroup} values for the >> +@samp{info registers @var{reggroup} @dots{}} command: >> + >> +@itemize @bullet >> + >> +@item >> +general >> + >> +@item >> +vector >> + >> +@item >> +scalar >> + >> +@item >> +system >> + >> +@end itemize >> + >> +The number of scalar and vector registers is configured when a >> +wavefront is created. Only allocated registers are displayed. Well, the first part describes what registers groups the user can use when debugging an AMD GPU. It sounds to me like it describes the behavior of GDB, when debugging an AMD GPU (different arches support different reggroups). The second part explains that GDB will not necessarily show all hardware registers. I feel like this describes how GDB behaves when debugging an AMD GPU. I agree that none of this is super critical, we can always let the users figure it out by themselves but... isn't the goal of the doc to avoid users having to figure things out the hard way? I don't understand where to draw the line between what's relevant and what is not. > > Or why do we need this in our manual: > >> +The code object path for @acronym{AMD GPU} code objects is shown as a >> +@acronym{URI, Universal Location Identifier} with a syntax defined by >> +the following BNF syntax: >> + >> +@smallexample >> +code_object_uri ::== file_uri | memory_uri >> +file_uri ::== "file://" file_path [ range_specifier ] >> +memory_uri ::== "memory://" process_id range_specifier >> +range_specifier ::== [ "#" | "?" ] "offset=" number "&" "size=" number >> +file_path ::== URI_ENCODED_OS_FILE_PATH >> +process_id ::== DECIMAL_NUMBER >> +number ::== HEX_NUMBER | DECIMAL_NUMBER | OCTAL_NUMBER >> +@end smallexample > > (followed by a longish legend of what each atom means above). I agree that this is a bit too much. I think the example, plus a less formal description, is sufficent for a reasonably informed human to understand. I will change that part to be less long and dry. However, one reason I see why we would want this is for MI. If there are people writing frontends for this (and there are), they'll need to know precisely the format to expect, if they want to parse it. Of course, they can work something out of the example or go look at the code, but I think it's bad to have poorly / loosely documented machine interfaces. > > And this paragraph seems to describe the GPU, not what GDB does: > >> +If any of these signals are delivered to the wavefront, it will cause >> +the wavefront to enter the halt state and cause the @acronym{AMD ROCm} >> +runtime to put the associated queue into the queue error state. All >> +wavefronts associated with a queue that is in the queue error state >> +are inhibited from executing further instructions even if they are not >> +in the halt state. In addition, when the @acronym{AMD ROCm} runtime >> +puts a queue into the queue error state it may invoke an application >> +registered callback that could either abort the application or delete >> +the queue which will delete any wavefronts associated with the queue. Ok, I'll delete that. > > There's also a lot of stuff only very remotely related to GDB, which > basically reads like a large number of tips and tricks for someone who > needs this mode. For example: > >> +@item >> +By default, for some architectures, the @acronym{AMD GPU} device >> +driver causes all @acronym{AMD GPU} wavefronts created when >> +@value{GDBN} is not attached to be unable to report the dispatch >> +associated with the wavefront, or the wavefront's work-group >> +position. The @samp{info threads} command will display this >> +missing information with a @samp{?}. >> + >> +For example, >> + >> +@smallexample >> +(gdb) info threads >> + Id Target Id Frame >> +* 1 Thread 0x7ffff6987840 (LWP 62056) "bit_extract" 0x00007ffff6da489b in sched_yield () at ../sysdeps/unix/syscall-template.S:78 >> + 2 Thread 0x7ffff6986700 (LWP 62064) "bit_extract" 0x00007ffff6db650b in ioctl () at ../sysdeps/unix/syscall-template.S:78 >> + 3 Thread 0x7ffff5f7f700 (LWP 62066) "bit_extract" 0x00007ffff6db650b in ioctl () at ../sysdeps/unix/syscall-template.S:78 >> + 4 Thread 0x7ffff597f700 (LWP 62067) "bit_extract" 0x00007ffff6db650b in ioctl () at ../sysdeps/unix/syscall-template.S:78 >> + 5 AMDGPU Wave 1:2:?:1 (?,?,?)/? "bit_extract" bit_extract_kernel (C_d=, A_d=, N=) at bit_extract.cpp:41 >> +@end smallexample >> + >> +This does not affect wavefronts created while @value{GDBN} is attached >> +which are always capable of reporting this information. >> + >> +If the @env{HSA_ENABLE_DEBUG} environment variable is set to @samp{1} >> +when the @acronym{AMD ROCm} runtime is initialized, then this >> +information will be available for all architectures even for wavefronts >> +created when @value{GDBN} was not attached. Setting this environment >> +variable may very marginally increase wavefront launch latency for some >> +architectures for very short lived wavefronts. I can imagine a user being in this situation and asking "When I attach to my program, why do I see question marks in the thread / wave ids?". It seems relevant to me to mention it here so we can refer them to the doc. And then we give them a workaround, if they want to have that information. Seems useful to me. >> + >> +@item >> +If an @acronym{AMD GPU} wavefront has the @code{DX10_CLAMP} bit set in >> +the @code{MODE} register, enabled arithmetic exceptions will not be >> +reported as @code{SIGFPE} signals. This happens if the >> +@code{DX10_CLAMP} kernel descriptor field is enabled. > > The last paragraph in particular reads like something from the GPU > programming manual. Ok, I can agree with that. We already describe above which condition leads to GDB reporting a SIGFPE (some arithmetic error). The fact that setting this bit makes the GPU not report the arithmetic error is indeed behavior specific to the GPU. > And even if this kind of info is useful and > should be in the GDB manual, why does it need to be so wordy, with so > many detailed examples? This is the writing style of Tony Tye, the original author, he's used to writing very formally and unambiguously. Some info is there because users asked these questions, and then we thought that we might as well document it, since others may have the same question. > > I understand the urge to document all the potentially useful stuff > about this mode of GDB, but the result looks disproportionally long > and full of low-level information only tangentially related to GDB. > > That said, if you feel strongly about the need to include all this, > and I'm the only one who raises the brow, feel free to install this, I > won't object anymore. I think you raise some good points. So far we were working in the ROCm bubble, where the line between ROCm-GDB and the rest of the ecosystem is sometimes blurry. So it's easy to leak in some details that are not really relevant for upstream GDB. We may not agree completely on what is relevant and what isn't, there are definitely things that don't belong here. I already pruned a bunch of stuff. I will send a v2 with updated doc, but it will likely be after the new year. Thanks, Simon