From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <simon.marchi@polymtl.ca>
Received: from smtp.polymtl.ca (smtp.polymtl.ca [132.207.4.11])
 by sourceware.org (Postfix) with ESMTPS id 002AC3890435
 for <gdb-patches@sourceware.org>; Mon, 10 Jan 2022 03:27:17 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 002AC3890435
Received: from simark.ca (simark.ca [158.69.221.121]) (authenticated bits=0)
 by smtp.polymtl.ca (8.14.7/8.14.7) with ESMTP id 20A3RBGT012354
 (version=TLSv1/SSLv3 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); 
 Sun, 9 Jan 2022 22:27:15 -0500
DKIM-Filter: OpenDKIM Filter v2.11.0 smtp.polymtl.ca 20A3RBGT012354
Received: from [10.0.0.11] (192-222-157-6.qc.cable.ebox.net [192.222.157.6])
 (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits))
 (No client certificate requested)
 by simark.ca (Postfix) with ESMTPSA id 026BC1E940;
 Sun,  9 Jan 2022 22:27:10 -0500 (EST)
Message-ID: <7448ab15-1186-3105-5629-353f0d5ae356@polymtl.ca>
Date: Sun, 9 Jan 2022 22:27:10 -0500
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
 Thunderbird/91.4.0
Subject: Re: [PATCH 4/4] gdb/python: handle non utf-8 characters when source
 highlighting
Content-Language: en-US
To: Andrew Burgess <aburgess@redhat.com>, gdb-patches@sourceware.org
References: <cover.1641565040.git.aburgess@redhat.com>
 <825abc2257c992be90af28973c54f98e7cf4371f.1641565040.git.aburgess@redhat.com>
From: Simon Marchi <simon.marchi@polymtl.ca>
In-Reply-To: <825abc2257c992be90af28973c54f98e7cf4371f.1641565040.git.aburgess@redhat.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
X-Poly-FromMTA: (simark.ca [158.69.221.121]) at Mon, 10 Jan 2022 03:27:11 +0000
X-Spam-Status: No, score=-4.2 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, NICE_REPLY_A, RCVD_IN_MSPIKE_H3,
 RCVD_IN_MSPIKE_WL, SPF_HELO_PASS, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: gdb-patches@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gdb-patches mailing list <gdb-patches.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/gdb-patches>,
 <mailto:gdb-patches-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/gdb-patches/>
List-Post: <mailto:gdb-patches@sourceware.org>
List-Help: <mailto:gdb-patches-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/gdb-patches>,
 <mailto:gdb-patches-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Mon, 10 Jan 2022 03:27:20 -0000

You haven't linked it, but I found this page that you might have read:

  https://pygments.org/docs/unicode/

On 2022-01-07 09:23, Andrew Burgess via Gdb-patches wrote:
> This commit adds support for source files that contain non utf-8
> characters when performing source styling using the Python pygments
> package.  This does not change the behaviour of GDB when the GNU
> Source Highlight library is used.
> 
> For the following problem description, assume that either GDB is built
> without GNU Source Highlight support, of that this has been disabled
> using 'maintenance set gnu-source-highlight enabled off'.
> 
> The initial problem reported was that a source file containing non
> utf-8 characters would cause GDB to print a Python exception, and then
> display the source without styling, e.g.:
> 
>   Python Exception <class 'UnicodeDecodeError'>: 'utf-8' codec can't decode byte 0xc0 in position 142: invalid start byte
>   /* Source code here, without styling...  */
> 
> Further, as the user steps through different source files, each time
> the problematic source file was evicted from the source cache, and
> then later reloaded, the exception would be printed again.
> 
> Finally, this problem is only present when using Python 3, this issue
> is not present for Python 2.
> 
> What makes this especially frustrating is that GDB can clearly print
> the source file contents, they're right there...  If we disable
> styling completely, or make use of the GNU Source Highlight library,
> then everything is fine.  So why is there an error when we try to
> apply styling using Python?
> 
> The problem is the use of PyString_FromString (which is an alias for
> PyUnicode_FromString in Python 3), this function converts a C string
> into a either a Unicode object (Py3) or a str object (Py2).  For
> Python 2 there is no unicode encoding performed during this function

"there is not unicode encoding", I think it would make more sense to say
"there is not utf-8 decoding".

> call, but for Python 3 the input is assumed to be a uft-8 encoding

uft -> utf.

> string for the purpose of the conversion.  And here of course, is the
> problem, if the source file contains non utf-8 characters, then it
> should not be treated as utf-8, but that's what we do, and that's why
> we get an error.
> 
> My first thought when looking at this was to spot when the
> PyString_FromString call failed with a UnicodeDecodeError and silently
> ignore the error.  This would mean that GDB would print the source
> without styling, but would also avoid the annoying exception message.
> 
> However, I also make use of `pygmentize`, a command line wrapper
> around the Python pygments module, which I use to apply syntax
> highlighting in the output of `less`.  And this command line wrapper
> is quite happy to syntax highlight my source file that contains non
> utf-8 characters, so it feels like the problem should be solvable.
> 
> It turns out that inside the pygments module there is already support
> for guessing the encoding of the incoming file content, if the
> incoming content is not already a Unicode string.  This is what
> happens for Python 2 where the incoming content is of `str` type.
> 
> We could try and make GDB smarter when it comes to converting C
> strings into Python Unicode objects; this would probably require us to
> just try a couple of different encoding schemes rather than just
> giving up after utf-8.
> 
> However, I figure, why bother?  The pygments module already does this
> for us, and the colorize API is not part of the documented external
> API of GDB.  So, why not just change the colorize API, instead of the
> content being a Unicode string (for Python 3), lets just make the
> content be a bytes object.  The pygments module can then take
> responsibility for guessing the encoding.
> 
> So, currently, the colorize API receives a utf-8 unicode object, and
> returns a host charset unicode object.  I propose that the colorize
> API receive a bytes object, and return a bytes object.
> 
> This works perfectly, and I now see styled output for my source file
> containing non utf-8 characters.
> 
> Well, the above is a little bit of a lie.  The actual code I've
> written allows the colorize API to return either a bytes object, or a
> host charset unicode object.  This just makes things a little more
> flexible in the Python code.
> 
> Then I added a DejaGNU test, and .... my test failed.
> 
> I wasn't seeing styled source lines when the test was run through
> DejaGNU, but I could when I ran the same steps at the console.
> 
> The test I was using is the exact one that is included with this
> commit, you'll notice I did remember to set the TERM environment
> variable to 'ansi' as is done in gdb.base/style.exp, which should be
> all that's needed to see the styled output (I thought).  The problem
> then, must be something in GDB.
> 
> It turns out that its the exact same problem that I solved above, but
> this time, on the way out of the colorize handler, rather than on the
> way in.  Only, interestingly, the problem now effects both Python 2
> and Python 3.
> 
> The output from pygments is always a Unicode object for both Python 2
> and 3.  Once we return from Python code back to gdbpy_colorize in
> python.c, we do the following steps:
> 
>   if (!gdbpy_is_string (result.get ()))
>     return {};
> 
>   gdbpy_ref<> unic = python_string_to_unicode (result.get ());
>   if (unic == nullpt) return {};
> 
>   gdbpy_ref<> host_str (PyUnicode_AsEncodedString (unic.get (),
> 						   host_charset (),
> 						   nullptr));
>   if (host_str == nullpt) return {};
> 
> For both Python 2 and 3 the gdbpy_is_string check passes.
> 
> For both Python 2 and 3 the python_string_to_unicode call just returns
> its input, as the string will already by a Unicode object.
> 
> Now we know we have a Unicode object we convert the Unicode object
> into a bytes object using the host_charset.
> 
> After that (not shown above) we convert the bytes object into a C++
> std::string, and return this as the result.
> 
> The problem is that the host_charset is based on the capabilities of
> the output terminal, and, when the tests are run through DejaGNU, with
> the current environment, the host_charset becomes 'ANSI_X3.4-1968',
> which ends up trying to convert the string using the 'ascii' codec.
> This codec only supports 7-bit characters, and so we once again get a
> Python exception.

Huh, that's not ideal.  It would seem better to me if the testsuite ran
with host-charset == utf-8.  That would be more representative of the
reality than testing with a random charset.

> Now I can force the host_charset value to change, if I set LC_ALL to
> 'en_US.utf-8' (like we currently set TERM) then now the host_charset
> will be utf-8, and the test passes.  Maybe that's OK, but I notice
> that the gdb.base/style.exp test doesn't need to mess with LC_ALL, and
> it manages to style other parts of GDB's output just fine.  So, I want
> that please.
> 
> The problem then is how GDB tries to convert the Unicode object into a
> bytes object.  When I run interactively in a terminal the host_charset
> is utf-8, and that's great, we convert the output just right for
> display in my terminal.  But in the test situation we need to do
> better.
> 
> So, I wonder if we should consider using something other than nullptr
> for the last argument to PyUnicode_AsEncodedString.  I propose that we
> make use of 'backslashreplace'.  This will replace any character that
> is not encodable in host_charset() with an \xNN escape sequence.
> 
> This seems fine for the test case I have - the non utf-8 character is
> within a C string in the source file, and replacing it with an escape
> sequence seems OK.  This might be more questionable if the non
> encodable character were in say, an identifier within the source code,
> but at this point I think (a) we'll probably be hitting other problems
> with the Python API and its bytes/unicode conversion, and (b) the user
> really needs to get their host_charset() set correctly anyway.

I must admit I did not follow everything (it's getting late here).  But
passing a "bytes" to pygments and receiving a coloured "bytes"
(hopefully encoded the same way as the original "bytes", only with ANSI
sequences added) sounds good.  Then the coloured string goes through the
GDB machine just as the uncoloured string would have gone, had Pygments
not been available.  It seems like GDB never really needs to decode a
source string, it just sends it to the terminal and the terminal deals
with it?  Eexcept maybe in the TUI?

I don't understand when the Python colorize function returns a unicode
string now, though.  If you always send a bytes, don't you always get
back a bytes?

Simon