[PATCH 0/4] Source highlight non utf-8 characters using Python

public inbox for gdb-patches@sourceware.org
 help / color / mirror / Atom feed

* [PATCH 0/4] Source highlight non utf-8 characters using Python
@ 2022-01-07 14:23 Andrew Burgess
  2022-01-07 14:23 ` [PATCH 1/4] gdb: new 'maint flush source-cache' command Andrew Burgess
                   ` (4 more replies)
  0 siblings, 5 replies; 36+ messages in thread
From: Andrew Burgess @ 2022-01-07 14:23 UTC (permalink / raw)
  To: gdb-patches; +Cc: Andrew Burgess

This addresses an issue reported here:

  https://sourceware.org/pipermail/gdb/2021-November/049799.html

A problem using the Python Pygments library to highlight source code
containing non utf-8 characters.

All feedback welcome.

Thanks,
Andrew


---

Andrew Burgess (4):
  gdb: new 'maint flush source-cache' command
  gdb: erase items from the source_cache::m_offset_cache
  gdb: add 'maint set/show gnu-source-highlight enabled' command
  gdb/python: handle non utf-8 characters when source highlighting

 gdb/NEWS                                      | 11 +++
 gdb/doc/gdb.texinfo                           | 24 +++++
 gdb/python/python.c                           | 56 ++++++++---
 gdb/source-cache.c                            | 93 ++++++++++++++++++-
 gdb/testsuite/gdb.base/cached-source-file.exp | 38 ++++++++
 gdb/testsuite/gdb.python/py-source-styling.c  | 29 ++++++
 .../gdb.python/py-source-styling.exp          | 64 +++++++++++++
 7 files changed, 299 insertions(+), 16 deletions(-)
 create mode 100644 gdb/testsuite/gdb.python/py-source-styling.c
 create mode 100644 gdb/testsuite/gdb.python/py-source-styling.exp

-- 
2.25.4


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 1/4] gdb: new 'maint flush source-cache' command
  2022-01-07 14:23 [PATCH 0/4] Source highlight non utf-8 characters using Python Andrew Burgess
@ 2022-01-07 14:23 ` Andrew Burgess
  2022-01-07 14:49   ` Eli Zaretskii
  2022-01-10 15:18   ` Tom Tromey
  2022-01-07 14:23 ` [PATCH 2/4] gdb: erase items from the source_cache::m_offset_cache Andrew Burgess
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 36+ messages in thread
From: Andrew Burgess @ 2022-01-07 14:23 UTC (permalink / raw)
  To: gdb-patches; +Cc: Andrew Burgess

This commit adds a new 'maint flush source-cache' command, this
flushes the cache of source file contents.

After flushing GDB is forced to reread source files the next time any
source lines are to be displayed.

I've added a test for this new feature.  The test is a little weird,
in that it modifies a source file after compilation, and makes use of
the cache flush so that the changes show up when listing the source
file.  I'm not sure when such a situation would ever crop up in real
life, but maybe we can imagine such cases.

In reality, this command is useful for testing the syntax highlighting
within GDB, we can adjust the syntax highlighting settings, flush the
cache, and then get the file contents re-highlighted using the new
settings.
---
 gdb/NEWS                                      |  3 ++
 gdb/doc/gdb.texinfo                           |  9 +++++
 gdb/source-cache.c                            | 15 ++++++++
 gdb/testsuite/gdb.base/cached-source-file.exp | 38 +++++++++++++++++++
 4 files changed, 65 insertions(+)

diff --git a/gdb/NEWS b/gdb/NEWS
index c26e15b530a..105fcf74ff4 100644
--- a/gdb/NEWS
+++ b/gdb/NEWS
@@ -64,6 +64,9 @@ set debug threads on|off
 show debug threads
   Print additional debug messages about thread creation and deletion.
 
+maint flush source-cache
+  Flush the contents of the source code cache.
+
 * Changed commands
 
 maint packet
diff --git a/gdb/doc/gdb.texinfo b/gdb/doc/gdb.texinfo
index 48873aa34fe..e17149afa2a 100644
--- a/gdb/doc/gdb.texinfo
+++ b/gdb/doc/gdb.texinfo
@@ -39473,6 +39473,15 @@
 register fetching, or frame unwinding.  The command @code{flushregs}
 is deprecated in favor of @code{maint flush register-cache}.
 
+@kindex maint flush source-cache
+@cindex source code, caching
+@item maint flush source-cache
+Flush @value{GDBN}'s cache of source code file contents.  After
+@value{GDBN} reads a source file, and optionally applies styling
+(@pxref{Output Styling}), the file contents are cached.  This command
+clears that cache.  The next time @value{GDBN} wants to show lines
+from a source file, the content will be re-read.
+
 @kindex maint print objfiles
 @cindex info for known object files
 @item maint print objfiles @r{[}@var{regexp}@r{]}
diff --git a/gdb/source-cache.c b/gdb/source-cache.c
index fc789eef8f9..733d1d272cd 100644
--- a/gdb/source-cache.c
+++ b/gdb/source-cache.c
@@ -25,6 +25,7 @@
 #include "gdbsupport/selftest.h"
 #include "objfiles.h"
 #include "exec.h"
+#include "cli/cli-cmds.h"
 
 #ifdef HAVE_SOURCE_HIGHLIGHT
 /* If Gnulib redirects 'open' and 'close' to its replacements
@@ -323,6 +324,16 @@ source_cache::get_source_lines (struct symtab *s, int first_line,
 			first_line, last_line, lines);
 }
 
+/* Implement 'maint flush source-cache' command.  */
+
+static void
+source_cache_flush_command (const char *command, int from_tty)
+{
+  forget_cached_source_info ();
+  if (from_tty)
+    printf_filtered (_("Source cache flushed.\n"));
+}
+
 #if GDB_SELF_TEST
 namespace selftests
 {
@@ -346,6 +357,10 @@ void _initialize_source_cache ();
 void
 _initialize_source_cache ()
 {
+  add_cmd ("source-cache", class_maintenance, source_cache_flush_command,
+	   _("Force gdb to flush its source code cache."),
+	   &maintenanceflushlist);
+
 #if GDB_SELF_TEST
   selftests::register_test ("source-cache", selftests::extract_lines_test);
 #endif
diff --git a/gdb/testsuite/gdb.base/cached-source-file.exp b/gdb/testsuite/gdb.base/cached-source-file.exp
index d4e64f32120..75a13378691 100644
--- a/gdb/testsuite/gdb.base/cached-source-file.exp
+++ b/gdb/testsuite/gdb.base/cached-source-file.exp
@@ -100,3 +100,41 @@ gdb_test "run" $re "rerun program" $q y
 # changed for GDB.
 gdb_test "list" "${bp_line}\[ \t\]+printf \\(\"foo\\\\n\"\\); /\\\* new-marker \\\*/.*" \
     "verify that the source code is properly reloaded"
+
+# Modify the source file again.  As before, this only works locally
+# because of the TCL commands.
+set bkpsrc [standard_output_file $testfile].c.bkp
+set bkpsrcfd [open $bkpsrc w]
+set srcfd [open $srcfile r]
+
+while { [gets $srcfd line] != -1 } {
+    if { [string first "new-marker" $line] != -1 } {
+	# Modify the printf line that we added previously.
+	puts $bkpsrcfd "  printf (\"foo\\n\"); /* new-marker updated */"
+    } else {
+	puts $bkpsrcfd $line
+    }
+}
+
+close $bkpsrcfd
+close $srcfd
+file rename -force -- $bkpsrc $srcfile
+
+# As before, delay so that at least one second has passed.  GDB still
+# will not spot that the source file has changed, as GDB doesn't do a
+# time check unless the binary has also changed, this delay just
+# allows us to confirm this behaviour.
+sleep 1
+
+# List the printf line again, we should not see the file changes yet
+# as the binary is unchanged, so the cached contents will still be
+# used.
+gdb_test "list ${bp_line}" "${bp_line}\[ \t\]+printf \\(\"foo\\\\n\"\\); /\\\* new-marker \\\*/.*" \
+    "verify that the source code change is not seen yet"
+
+gdb_test "maint flush source-cache" "Source cache flushed\\."
+
+# List the printf line again.  After the cache flush GDB will re-read
+# the source file and we should now see the changes.
+gdb_test "list ${bp_line}" "${bp_line}\[ \t\]+printf \\(\"foo\\\\n\"\\); /\\\* new-marker updated \\\*/.*" \
+    "verify that the updated source code change is not seen"
-- 
2.25.4


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 2/4] gdb: erase items from the source_cache::m_offset_cache
  2022-01-07 14:23 [PATCH 0/4] Source highlight non utf-8 characters using Python Andrew Burgess
  2022-01-07 14:23 ` [PATCH 1/4] gdb: new 'maint flush source-cache' command Andrew Burgess
@ 2022-01-07 14:23 ` Andrew Burgess
  2022-01-10 15:24   ` Tom Tromey
  2022-01-07 14:23 ` [PATCH 3/4] gdb: add 'maint set/show gnu-source-highlight enabled' command Andrew Burgess
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 36+ messages in thread
From: Andrew Burgess @ 2022-01-07 14:23 UTC (permalink / raw)
  To: gdb-patches; +Cc: Andrew Burgess

The source_cache class has two member variables m_source_map, which
stores the file contents, and m_offset_cache, which stores offsets
into the file contents.

As source files are read the contents of the file, as well as the
offset data, are stored in the cache using these two member variables.

Whenever GDB needs either the files contents, or the offset data,
source_cache::ensure is called.  This function looks for the file in
m_source_map, and if it's found then this implies the file is also in
m_offset_cache, and we're done.

If the file is not in m_source_map then GDB calls
source_cache::get_plain_source_lines to open the file and read its
contents.  ::get_plain_source_lines also calculates the offset data,
which is then inserted into m_offset_cache.

Back in ::ensure, the file contents are added into m_source_map.  And
finally, if m_source_map contains more than MAX_ENTRIES, an entry is
removed from m_source_map.

The problem is entries are not removed from m_offset_cache at the same
time.

This means that if a program contains enough source files, GDB will
hold at most MAX_ENTRIES cached source file contents, but can contain
offsets data for every source file.

Now, the offsets data is going to be smaller than the cached file
contents, so maybe there's no harm here.  But, when we reload the file
contents we always recalculate the offsets data.  And, when we
::get_line_charpos asking for offset data we still call ::ensure which
will ends up loading and caching the file contents.

So, given the current code does the work of reloading the offset data
anyway, we may as well save memory by capping m_offset_cache to
MAX_ENTRIES just like we do m_source_map.

That's what this commit does.

There should be no user visible changes after this commit, except for
ever so slightly lower memory usage in some cases.
---
 gdb/source-cache.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/gdb/source-cache.c b/gdb/source-cache.c
index 733d1d272cd..9a5ac40497a 100644
--- a/gdb/source-cache.c
+++ b/gdb/source-cache.c
@@ -240,7 +240,11 @@ source_cache::ensure (struct symtab *s)
   m_source_map.push_back (std::move (result));

   if (m_source_map.size () > MAX_ENTRIES)
-    m_source_map.erase (m_source_map.begin ());
+    {
+      auto iter = m_source_map.begin ();
+      m_offset_cache.erase (iter->fullname);
+      m_source_map.erase (iter);
+    }

   return true;
 }
-- 
2.25.4

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 3/4] gdb: add 'maint set/show gnu-source-highlight enabled' command
  2022-01-07 14:23 [PATCH 0/4] Source highlight non utf-8 characters using Python Andrew Burgess
  2022-01-07 14:23 ` [PATCH 1/4] gdb: new 'maint flush source-cache' command Andrew Burgess
  2022-01-07 14:23 ` [PATCH 2/4] gdb: erase items from the source_cache::m_offset_cache Andrew Burgess
@ 2022-01-07 14:23 ` Andrew Burgess
  2022-01-07 14:53   ` Eli Zaretskii
  2022-01-10 15:58   ` Tom Tromey
  2022-01-07 14:23 ` [PATCH 4/4] gdb/python: handle non utf-8 characters when source highlighting Andrew Burgess
  2022-01-12 14:30 ` [PATCHv2 0/2] Source highlight non utf-8 characters using Python Andrew Burgess
  4 siblings, 2 replies; 36+ messages in thread
From: Andrew Burgess @ 2022-01-07 14:23 UTC (permalink / raw)
  To: gdb-patches; +Cc: Andrew Burgess

In a later commit I want to address an issue with the Python pygments
based code styling solution.  As this approach is only used when the
GNU Source Highlight library is not available, testing bugs in this
area can be annoying, as it requires GDB to be rebuilt with use of GNU
Source Highlight disabled.

This commit adds a pair of new maintenance commands:

  maintenance set gnu-source-highlight enabled on|off
  maintenance show gnu-source-highlight enabled

these commands can be used to disable use of the GNU Source Highlight
library, allowing me, in a later commit, to easily test bugs that
would otherwise be masked by GNU Source Highlight being used.

I made this a maintenance command, rather than a general purpose
command, as it didn't seem like this was something a general user
would need to adjust.  We can always convert the maintenance command
to a general command later if needed.

There's no test for this here, but this feature will be used in a
later commit.
---
 gdb/NEWS            |  8 +++++
 gdb/doc/gdb.texinfo | 15 ++++++++++
 gdb/source-cache.c  | 72 ++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 94 insertions(+), 1 deletion(-)

diff --git a/gdb/NEWS b/gdb/NEWS
index 105fcf74ff4..d968b0cde44 100644
--- a/gdb/NEWS
+++ b/gdb/NEWS
@@ -67,6 +67,14 @@ show debug threads
 maint flush source-cache
   Flush the contents of the source code cache.
 
+maint set gnu-source-highlight enabled on|off
+maint show gnu-source-highlight enabled
+  When on GDB will use the GNU Source Highlight library for adding
+  styling to source code.  When off the library will not be used, even
+  when available.  When GNU Source Highlight isn't used, or can't add
+  styling to a particular source file, then the Python Pygments
+  library will be used instead.
+
 * Changed commands
 
 maint packet
diff --git a/gdb/doc/gdb.texinfo b/gdb/doc/gdb.texinfo
index e17149afa2a..4b7430fb60a 100644
--- a/gdb/doc/gdb.texinfo
+++ b/gdb/doc/gdb.texinfo
@@ -39793,6 +39793,21 @@
 described in @ref{set libthread-db-search-path}.  For more information
 about the tests, see @ref{maint check libthread-db}.
 
+@kindex maint set gnu-source-highlight enabled
+@kindex maint show gnu-source-highlight enabled
+@item maint set gnu-source-highlight enabled @r{[}on|off@r{]}
+@itemx maint show gnu-source-highlight enabled
+Control whether @value{GDBN} should use the GNU Source Highlight
+library for applying styling to source code (@pxref{Output Styling}).
+This will be @samp{on} by default if the GNU Source Highlight library
+is available.  If the GNU Source Highlight library is not available,
+then this will be @samp{off} by default, and attempting to change this
+value to @samp{on} will give an error.
+
+If the GNU Source Highlight library is not being used, then
+@value{GDBN} will use the Python Pygments package for source code
+styling, if it is available.
+
 @kindex maint space
 @cindex memory used by commands
 @item maint space @var{value}
diff --git a/gdb/source-cache.c b/gdb/source-cache.c
index 9a5ac40497a..d64ddd3cc61 100644
--- a/gdb/source-cache.c
+++ b/gdb/source-cache.c
@@ -47,6 +47,46 @@
 
 source_cache g_source_cache;
 
+/* When this is true we will use the GNU Source Highlight to add styling to
+   source code (assuming the library is available).  This is initialized to
+   true (if appropriate) in _initialize_source_cache below.  */
+
+static bool use_gnu_source_highlight;
+
+/* The "maint show gnu-source-highlight enabled" command. */
+
+static void
+show_use_gnu_source_highlight_enabled  (struct ui_file *file, int from_tty,
+					struct cmd_list_element *c,
+					const char *value)
+{
+  fprintf_filtered (file,
+		    _("Use of GNU Source Highlight library is \"%s\".\n"),
+		    value);
+}
+
+/* The "maint set gnu-source-highlight enabled" command.  */
+
+static void
+set_use_gnu_source_highlight_enabled (const char *ignore_args,
+				      int from_tty,
+				      struct cmd_list_element *c)
+{
+#ifndef HAVE_SOURCE_HIGHLIGHT
+  /* If the library is not available and the user tried to enable use of
+     the library, then disable use of the library, and give an error.  */
+  if (use_gnu_source_highlight)
+    {
+      use_gnu_source_highlight = false;
+      error (_("the GNU Source Highlight library is not available"));
+    }
+#else
+  /* We (might) have just changed how we style source code, discard any
+     previously cached contents.  */
+  forget_cached_source_info ();
+#endif
+}
+
 /* See source-cache.h.  */
 
 std::string
@@ -193,7 +233,7 @@ source_cache::ensure (struct symtab *s)
 #ifdef HAVE_SOURCE_HIGHLIGHT
       bool already_styled = false;
       const char *lang_name = get_language_name (SYMTAB_LANGUAGE (s));
-      if (lang_name != nullptr)
+      if (lang_name != nullptr && use_gnu_source_highlight)
 	{
 	  /* The global source highlight object, or null if one was
 	     never constructed.  This is stored here rather than in
@@ -365,6 +405,36 @@ _initialize_source_cache ()
 	   _("Force gdb to flush its source code cache."),
 	   &maintenanceflushlist);
 
+  /* All the 'maint set|show gnu-source-highlight' sub-commands.  */
+  static struct cmd_list_element *maint_set_gnu_source_highlight_cmdlist;
+  static struct cmd_list_element *maint_show_gnu_source_highlight_cmdlist;
+
+  /* Adds 'maint set|show gnu-source-highlight'.  */
+  add_setshow_prefix_cmd ("gnu-source-highlight", class_maintenance,
+			  _("Set gnu-source-highlight specific variables."),
+			  _("Show gnu-source-highlight specific variables."),
+			  &maint_set_gnu_source_highlight_cmdlist,
+			  &maint_show_gnu_source_highlight_cmdlist,
+			  &maintenance_set_cmdlist,
+			  &maintenance_show_cmdlist);
+
+  /* Adds 'maint set|show gnu-source-highlight enabled'.  */
+  add_setshow_boolean_cmd ("enabled", class_maintenance,
+			   &use_gnu_source_highlight, _("\
+Set whether the GNU Source Highlight library can be used."), _("\
+Show whether the GNU Source Highlight library is being used."),_("\
+When enabled, GDB will use the GNU Source Highlight library to apply\n\
+styling to source code lines that are shown."),
+			   set_use_gnu_source_highlight_enabled,
+			   show_use_gnu_source_highlight_enabled,
+			   &maint_set_gnu_source_highlight_cmdlist,
+			   &maint_show_gnu_source_highlight_cmdlist);
+
+  /* Enable use of GNU Source Highlight library, if we have it.  */
+#ifdef HAVE_SOURCE_HIGHLIGHT
+  use_gnu_source_highlight = true;
+#endif
+
 #if GDB_SELF_TEST
   selftests::register_test ("source-cache", selftests::extract_lines_test);
 #endif
-- 
2.25.4


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 4/4] gdb/python: handle non utf-8 characters when source highlighting
  2022-01-07 14:23 [PATCH 0/4] Source highlight non utf-8 characters using Python Andrew Burgess
                   ` (2 preceding siblings ...)
  2022-01-07 14:23 ` [PATCH 3/4] gdb: add 'maint set/show gnu-source-highlight enabled' command Andrew Burgess
@ 2022-01-07 14:23 ` Andrew Burgess
  2022-01-10  3:27   ` Simon Marchi
  2022-01-11 19:24   ` Tom Tromey
  2022-01-12 14:30 ` [PATCHv2 0/2] Source highlight non utf-8 characters using Python Andrew Burgess
  4 siblings, 2 replies; 36+ messages in thread
From: Andrew Burgess @ 2022-01-07 14:23 UTC (permalink / raw)
  To: gdb-patches; +Cc: Andrew Burgess

This commit adds support for source files that contain non utf-8
characters when performing source styling using the Python pygments
package.  This does not change the behaviour of GDB when the GNU
Source Highlight library is used.

For the following problem description, assume that either GDB is built
without GNU Source Highlight support, of that this has been disabled
using 'maintenance set gnu-source-highlight enabled off'.

The initial problem reported was that a source file containing non
utf-8 characters would cause GDB to print a Python exception, and then
display the source without styling, e.g.:

  Python Exception <class 'UnicodeDecodeError'>: 'utf-8' codec can't decode byte 0xc0 in position 142: invalid start byte
  /* Source code here, without styling...  */

Further, as the user steps through different source files, each time
the problematic source file was evicted from the source cache, and
then later reloaded, the exception would be printed again.

Finally, this problem is only present when using Python 3, this issue
is not present for Python 2.

What makes this especially frustrating is that GDB can clearly print
the source file contents, they're right there...  If we disable
styling completely, or make use of the GNU Source Highlight library,
then everything is fine.  So why is there an error when we try to
apply styling using Python?

The problem is the use of PyString_FromString (which is an alias for
PyUnicode_FromString in Python 3), this function converts a C string
into a either a Unicode object (Py3) or a str object (Py2).  For
Python 2 there is no unicode encoding performed during this function
call, but for Python 3 the input is assumed to be a uft-8 encoding
string for the purpose of the conversion.  And here of course, is the
problem, if the source file contains non utf-8 characters, then it
should not be treated as utf-8, but that's what we do, and that's why
we get an error.

My first thought when looking at this was to spot when the
PyString_FromString call failed with a UnicodeDecodeError and silently
ignore the error.  This would mean that GDB would print the source
without styling, but would also avoid the annoying exception message.

However, I also make use of `pygmentize`, a command line wrapper
around the Python pygments module, which I use to apply syntax
highlighting in the output of `less`.  And this command line wrapper
is quite happy to syntax highlight my source file that contains non
utf-8 characters, so it feels like the problem should be solvable.

It turns out that inside the pygments module there is already support
for guessing the encoding of the incoming file content, if the
incoming content is not already a Unicode string.  This is what
happens for Python 2 where the incoming content is of `str` type.

We could try and make GDB smarter when it comes to converting C
strings into Python Unicode objects; this would probably require us to
just try a couple of different encoding schemes rather than just
giving up after utf-8.

However, I figure, why bother?  The pygments module already does this
for us, and the colorize API is not part of the documented external
API of GDB.  So, why not just change the colorize API, instead of the
content being a Unicode string (for Python 3), lets just make the
content be a bytes object.  The pygments module can then take
responsibility for guessing the encoding.

So, currently, the colorize API receives a utf-8 unicode object, and
returns a host charset unicode object.  I propose that the colorize
API receive a bytes object, and return a bytes object.

This works perfectly, and I now see styled output for my source file
containing non utf-8 characters.

Well, the above is a little bit of a lie.  The actual code I've
written allows the colorize API to return either a bytes object, or a
host charset unicode object.  This just makes things a little more
flexible in the Python code.

Then I added a DejaGNU test, and .... my test failed.

I wasn't seeing styled source lines when the test was run through
DejaGNU, but I could when I ran the same steps at the console.

The test I was using is the exact one that is included with this
commit, you'll notice I did remember to set the TERM environment
variable to 'ansi' as is done in gdb.base/style.exp, which should be
all that's needed to see the styled output (I thought).  The problem
then, must be something in GDB.

It turns out that its the exact same problem that I solved above, but
this time, on the way out of the colorize handler, rather than on the
way in.  Only, interestingly, the problem now effects both Python 2
and Python 3.

The output from pygments is always a Unicode object for both Python 2
and 3.  Once we return from Python code back to gdbpy_colorize in
python.c, we do the following steps:

  if (!gdbpy_is_string (result.get ()))
    return {};

  gdbpy_ref<> unic = python_string_to_unicode (result.get ());
  if (unic == nullpt) return {};

  gdbpy_ref<> host_str (PyUnicode_AsEncodedString (unic.get (),
						   host_charset (),
						   nullptr));
  if (host_str == nullpt) return {};

For both Python 2 and 3 the gdbpy_is_string check passes.

For both Python 2 and 3 the python_string_to_unicode call just returns
its input, as the string will already by a Unicode object.

Now we know we have a Unicode object we convert the Unicode object
into a bytes object using the host_charset.

After that (not shown above) we convert the bytes object into a C++
std::string, and return this as the result.

The problem is that the host_charset is based on the capabilities of
the output terminal, and, when the tests are run through DejaGNU, with
the current environment, the host_charset becomes 'ANSI_X3.4-1968',
which ends up trying to convert the string using the 'ascii' codec.
This codec only supports 7-bit characters, and so we once again get a
Python exception.

Now I can force the host_charset value to change, if I set LC_ALL to
'en_US.utf-8' (like we currently set TERM) then now the host_charset
will be utf-8, and the test passes.  Maybe that's OK, but I notice
that the gdb.base/style.exp test doesn't need to mess with LC_ALL, and
it manages to style other parts of GDB's output just fine.  So, I want
that please.

The problem then is how GDB tries to convert the Unicode object into a
bytes object.  When I run interactively in a terminal the host_charset
is utf-8, and that's great, we convert the output just right for
display in my terminal.  But in the test situation we need to do
better.

So, I wonder if we should consider using something other than nullptr
for the last argument to PyUnicode_AsEncodedString.  I propose that we
make use of 'backslashreplace'.  This will replace any character that
is not encodable in host_charset() with an \xNN escape sequence.

This seems fine for the test case I have - the non utf-8 character is
within a C string in the source file, and replacing it with an escape
sequence seems OK.  This might be more questionable if the non
encodable character were in say, an identifier within the source code,
but at this point I think (a) we'll probably be hitting other problems
with the Python API and its bytes/unicode conversion, and (b) the user
really needs to get their host_charset() set correctly anyway.
---
 gdb/python/python.c                           | 56 ++++++++++++----
 gdb/testsuite/gdb.python/py-source-styling.c  | 29 +++++++++
 .../gdb.python/py-source-styling.exp          | 64 +++++++++++++++++++
 3 files changed, 135 insertions(+), 14 deletions(-)
 create mode 100644 gdb/testsuite/gdb.python/py-source-styling.c
 create mode 100644 gdb/testsuite/gdb.python/py-source-styling.exp

diff --git a/gdb/python/python.c b/gdb/python/python.c
index e05b99c0bec..76871c86eca 100644
--- a/gdb/python/python.c
+++ b/gdb/python/python.c
@@ -1147,7 +1147,25 @@ gdbpy_colorize (const std::string &filename, const std::string &contents)
       gdbpy_print_stack ();
       return {};
     }
-  gdbpy_ref<> contents_arg (PyString_FromString (contents.c_str ()));
+
+  /* The pygments library, which is what we currently use for applying
+     styling, is happy to take input as a bytes object, and to figure out
+     the encoding for itself.  This removes the need for us to figure out
+     (guess?) at how the content is encoded, which is probably a good
+     thing.
+
+     In an ideal world we should be able to use the host_charset here, but
+     the host_charset is most likely to represent the capabilities of the
+     users terminal, rather than the encoding of a particular source file.
+
+     Another possibility would be to use Python's ability to escape
+     unknown characters when converting the bytes into unicode.  However,
+     by passing the raw file bytes to Python we can differ this choice.
+     This gives us (or potentially a user) the option of taking an approach
+     that works for them, rather than hard coding a single fixed choice
+     into GDB.  */
+  gdbpy_ref<> contents_arg (PyBytes_FromStringAndSize (contents.c_str (),
+						       contents.size ()));
   if (contents_arg == nullptr)
     {
       gdbpy_print_stack ();
@@ -1164,25 +1182,35 @@ gdbpy_colorize (const std::string &filename, const std::string &contents)
       return {};
     }
 
-  if (!gdbpy_is_string (result.get ()))
-    return {};
 
-  gdbpy_ref<> unic = python_string_to_unicode (result.get ());
-  if (unic == nullptr)
+  if (PyBytes_Check (result.get ()))
     {
-      gdbpy_print_stack ();
-      return {};
+      /* The Python code has handled us a stream of bytes, so there's no
+	 need to convert this to unicode and back - the only outcome of
+	 which is that we might run into encoding errors if the bytes don't
+	 represent a string in host_charset.  */
+      return std::string (PyBytes_AsString (result.get ()));
     }
-  gdbpy_ref<> host_str (PyUnicode_AsEncodedString (unic.get (),
-						   host_charset (),
-						   nullptr));
-  if (host_str == nullptr)
+  else if (PyUnicode_Check (result.get ()))
     {
-      gdbpy_print_stack ();
-      return {};
+      /* Convert to bytes assuming the host_charset.  Any characters that
+	 can't be converted are replaced with \xNN escape sequences.  We
+	 see this case crop up in the gdb.python/py-source-styling.exp test
+	 when writing the output to the log file.  The host_charset is
+	 ASCII, and the unicode string from Pygments can't be encoded.  */
+      gdbpy_ref<> bytes (PyUnicode_AsEncodedString (result.get (),
+						    host_charset (),
+						    "backslashreplace"));
+      if (bytes == nullptr)
+	{
+	  gdbpy_print_stack ();
+	  return {};
+	}
+
+      return std::string (PyBytes_AsString (bytes.get ()));
     }
 
-  return std::string (PyBytes_AsString (host_str.get ()));
+  return {};
 }
 
 \f
diff --git a/gdb/testsuite/gdb.python/py-source-styling.c b/gdb/testsuite/gdb.python/py-source-styling.c
new file mode 100644
index 00000000000..6a1a9d59a8c
--- /dev/null
+++ b/gdb/testsuite/gdb.python/py-source-styling.c
@@ -0,0 +1,29 @@
+/* This testcase is part of GDB, the GNU debugger.
+
+   Copyright 2022 Free Software Foundation, Inc.
+
+   This program is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
+
+int
+main ()
+{
+  int some_variable = 1234;
+
+  /* The following line contains a character that is non-utf-8.  This is a
+     critical part of the test as Python 3 can't convert this into a string
+     using its default mechanism.  */
+  char c = '�';		/* List this line.  */
+
+  return 0;
+}
diff --git a/gdb/testsuite/gdb.python/py-source-styling.exp b/gdb/testsuite/gdb.python/py-source-styling.exp
new file mode 100644
index 00000000000..68bbc9f9758
--- /dev/null
+++ b/gdb/testsuite/gdb.python/py-source-styling.exp
@@ -0,0 +1,64 @@
+# Copyright (C) 2022 Free Software Foundation, Inc.
+
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+# This file is part of the GDB testsuite.  It checks for memory leaks
+# associated with allocating and deallocation gdb.Inferior objects.
+
+load_lib gdb-python.exp
+
+standard_testfile
+
+save_vars { env(TERM) } {
+    # We need an ANSI-capable terminal to get the output, additionally
+    # we need to set LC_ALL so GDB knows the terminal is UTF-8
+    # capable, otherwise we'll get a UnicodeEncodeError trying to
+    # encode the output.
+    setenv TERM ansi
+
+    if { [prepare_for_testing "failed to prepare" ${testfile} ${srcfile}] } {
+	return -1
+    }
+
+    if { [skip_python_tests] } { continue }
+
+    if { ![gdb_py_module_available "pygments"] } {
+	unsupported "pygments module not available"
+	return -1
+    }
+
+    if ![runto_main] {
+	return
+    }
+
+    gdb_test_no_output "maint set gnu-source-highlight enabled off"
+
+    gdb_test "maint flush source-cache" "Source cache flushed\\."
+
+    set seen_style_escape false
+    set line_number [gdb_get_line_number "List this line."]
+    gdb_test_multiple "list ${line_number}" "" {
+	-re "Python Exception.*" {
+	    fail $gdb_test_name
+	}
+	-re "\033" {
+	    set seen_style_escape true
+	    exp_continue
+	}
+	-re "$gdb_prompt $" {
+	    gdb_assert { $seen_style_escape }
+	    pass $gdb_test_name
+	}
+    }
+}
-- 
2.25.4


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/4] gdb: new 'maint flush source-cache' command
  2022-01-07 14:23 ` [PATCH 1/4] gdb: new 'maint flush source-cache' command Andrew Burgess
@ 2022-01-07 14:49   ` Eli Zaretskii
  2022-01-11 12:13     ` Andrew Burgess
  2022-01-10 15:18   ` Tom Tromey
  1 sibling, 1 reply; 36+ messages in thread
From: Eli Zaretskii @ 2022-01-07 14:49 UTC (permalink / raw)
  To: Andrew Burgess; +Cc: gdb-patches

> Date: Fri,  7 Jan 2022 14:23:11 +0000
> From: Andrew Burgess via Gdb-patches <gdb-patches@sourceware.org>
> Cc: Andrew Burgess <aburgess@redhat.com>
> 
> This commit adds a new 'maint flush source-cache' command, this
> flushes the cache of source file contents.
> 
> After flushing GDB is forced to reread source files the next time any
> source lines are to be displayed.
> 
> I've added a test for this new feature.  The test is a little weird,
> in that it modifies a source file after compilation, and makes use of
> the cache flush so that the changes show up when listing the source
> file.  I'm not sure when such a situation would ever crop up in real
> life, but maybe we can imagine such cases.
> 
> In reality, this command is useful for testing the syntax highlighting
> within GDB, we can adjust the syntax highlighting settings, flush the
> cache, and then get the file contents re-highlighted using the new
> settings.
> ---
>  gdb/NEWS                                      |  3 ++
>  gdb/doc/gdb.texinfo                           |  9 +++++
>  gdb/source-cache.c                            | 15 ++++++++
>  gdb/testsuite/gdb.base/cached-source-file.exp | 38 +++++++++++++++++++
>  4 files changed, 65 insertions(+)
> 
> diff --git a/gdb/NEWS b/gdb/NEWS
> index c26e15b530a..105fcf74ff4 100644
> --- a/gdb/NEWS
> +++ b/gdb/NEWS
> @@ -64,6 +64,9 @@ set debug threads on|off
>  show debug threads
>    Print additional debug messages about thread creation and deletion.
>  
> +maint flush source-cache
> +  Flush the contents of the source code cache.
> +
>  * Changed commands
>  
>  maint packet
> diff --git a/gdb/doc/gdb.texinfo b/gdb/doc/gdb.texinfo
> index 48873aa34fe..e17149afa2a 100644
> --- a/gdb/doc/gdb.texinfo
> +++ b/gdb/doc/gdb.texinfo
> @@ -39473,6 +39473,15 @@
>  register fetching, or frame unwinding.  The command @code{flushregs}
>  is deprecated in favor of @code{maint flush register-cache}.
>  
> +@kindex maint flush source-cache
> +@cindex source code, caching
> +@item maint flush source-cache
> +Flush @value{GDBN}'s cache of source code file contents.  After
> +@value{GDBN} reads a source file, and optionally applies styling
> +(@pxref{Output Styling}), the file contents are cached.  This command
> +clears that cache.  The next time @value{GDBN} wants to show lines
> +from a source file, the content will be re-read.
> +

The documentation parts are okay, but how about including in the
manual the hint about where this feature is useful, like what you say
in the commit log?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/4] gdb: add 'maint set/show gnu-source-highlight enabled' command
  2022-01-07 14:23 ` [PATCH 3/4] gdb: add 'maint set/show gnu-source-highlight enabled' command Andrew Burgess
@ 2022-01-07 14:53   ` Eli Zaretskii
  2022-01-11 13:07     ` Andrew Burgess
  2022-01-10 15:58   ` Tom Tromey
  1 sibling, 1 reply; 36+ messages in thread
From: Eli Zaretskii @ 2022-01-07 14:53 UTC (permalink / raw)
  To: Andrew Burgess; +Cc: gdb-patches

> Date: Fri,  7 Jan 2022 14:23:13 +0000
> From: Andrew Burgess via Gdb-patches <gdb-patches@sourceware.org>
> Cc: Andrew Burgess <aburgess@redhat.com>
> 
> +maint set gnu-source-highlight enabled on|off
> +maint show gnu-source-highlight enabled
> +  When on GDB will use the GNU Source Highlight library for adding

"Whether GDB should use the GNU Source Highlight ..."

> +  styling to source code.  When off the library will not be used, even
                                      ^
Comma missing there.

> +  when available.  When GNU Source Highlight isn't used, or can't add
> +  styling to a particular source file, then the Python Pygments
> +  library will be used instead.
> +@kindex maint set gnu-source-highlight enabled
> +@kindex maint show gnu-source-highlight enabled
> +@item maint set gnu-source-highlight enabled @r{[}on|off@r{]}
> +@itemx maint show gnu-source-highlight enabled
> +Control whether @value{GDBN} should use the GNU Source Highlight
> +library for applying styling to source code (@pxref{Output Styling}).
> +This will be @samp{on} by default if the GNU Source Highlight library
> +is available.  If the GNU Source Highlight library is not available,
> +then this will be @samp{off} by default, and attempting to change this
> +value to @samp{on} will give an error.
> +
> +If the GNU Source Highlight library is not being used, then
> +@value{GDBN} will use the Python Pygments package for source code
> +styling, if it is available.

Here, too, I would suggest saying that this is useful for debugging
Pygments styling when Source Highlight is available.

> +  add_setshow_boolean_cmd ("enabled", class_maintenance,
> +			   &use_gnu_source_highlight, _("\
> +Set whether the GNU Source Highlight library can be used."), _("\
                                                ^^^^^^^^^^^
I think "should be used" is better.

Thanks.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] gdb/python: handle non utf-8 characters when source highlighting
  2022-01-07 14:23 ` [PATCH 4/4] gdb/python: handle non utf-8 characters when source highlighting Andrew Burgess
@ 2022-01-10  3:27   ` Simon Marchi
  2022-01-10 10:41     ` Andrew Burgess
  2022-01-11 19:24   ` Tom Tromey
  1 sibling, 1 reply; 36+ messages in thread
From: Simon Marchi @ 2022-01-10  3:27 UTC (permalink / raw)
  To: Andrew Burgess, gdb-patches

You haven't linked it, but I found this page that you might have read:

  https://pygments.org/docs/unicode/

On 2022-01-07 09:23, Andrew Burgess via Gdb-patches wrote:
> This commit adds support for source files that contain non utf-8
> characters when performing source styling using the Python pygments
> package.  This does not change the behaviour of GDB when the GNU
> Source Highlight library is used.
> 
> For the following problem description, assume that either GDB is built
> without GNU Source Highlight support, of that this has been disabled
> using 'maintenance set gnu-source-highlight enabled off'.
> 
> The initial problem reported was that a source file containing non
> utf-8 characters would cause GDB to print a Python exception, and then
> display the source without styling, e.g.:
> 
>   Python Exception <class 'UnicodeDecodeError'>: 'utf-8' codec can't decode byte 0xc0 in position 142: invalid start byte
>   /* Source code here, without styling...  */
> 
> Further, as the user steps through different source files, each time
> the problematic source file was evicted from the source cache, and
> then later reloaded, the exception would be printed again.
> 
> Finally, this problem is only present when using Python 3, this issue
> is not present for Python 2.
> 
> What makes this especially frustrating is that GDB can clearly print
> the source file contents, they're right there...  If we disable
> styling completely, or make use of the GNU Source Highlight library,
> then everything is fine.  So why is there an error when we try to
> apply styling using Python?
> 
> The problem is the use of PyString_FromString (which is an alias for
> PyUnicode_FromString in Python 3), this function converts a C string
> into a either a Unicode object (Py3) or a str object (Py2).  For
> Python 2 there is no unicode encoding performed during this function

"there is not unicode encoding", I think it would make more sense to say
"there is not utf-8 decoding".

> call, but for Python 3 the input is assumed to be a uft-8 encoding

uft -> utf.

> string for the purpose of the conversion.  And here of course, is the
> problem, if the source file contains non utf-8 characters, then it
> should not be treated as utf-8, but that's what we do, and that's why
> we get an error.
> 
> My first thought when looking at this was to spot when the
> PyString_FromString call failed with a UnicodeDecodeError and silently
> ignore the error.  This would mean that GDB would print the source
> without styling, but would also avoid the annoying exception message.
> 
> However, I also make use of `pygmentize`, a command line wrapper
> around the Python pygments module, which I use to apply syntax
> highlighting in the output of `less`.  And this command line wrapper
> is quite happy to syntax highlight my source file that contains non
> utf-8 characters, so it feels like the problem should be solvable.
> 
> It turns out that inside the pygments module there is already support
> for guessing the encoding of the incoming file content, if the
> incoming content is not already a Unicode string.  This is what
> happens for Python 2 where the incoming content is of `str` type.
> 
> We could try and make GDB smarter when it comes to converting C
> strings into Python Unicode objects; this would probably require us to
> just try a couple of different encoding schemes rather than just
> giving up after utf-8.
> 
> However, I figure, why bother?  The pygments module already does this
> for us, and the colorize API is not part of the documented external
> API of GDB.  So, why not just change the colorize API, instead of the
> content being a Unicode string (for Python 3), lets just make the
> content be a bytes object.  The pygments module can then take
> responsibility for guessing the encoding.
> 
> So, currently, the colorize API receives a utf-8 unicode object, and
> returns a host charset unicode object.  I propose that the colorize
> API receive a bytes object, and return a bytes object.
> 
> This works perfectly, and I now see styled output for my source file
> containing non utf-8 characters.
> 
> Well, the above is a little bit of a lie.  The actual code I've
> written allows the colorize API to return either a bytes object, or a
> host charset unicode object.  This just makes things a little more
> flexible in the Python code.
> 
> Then I added a DejaGNU test, and .... my test failed.
> 
> I wasn't seeing styled source lines when the test was run through
> DejaGNU, but I could when I ran the same steps at the console.
> 
> The test I was using is the exact one that is included with this
> commit, you'll notice I did remember to set the TERM environment
> variable to 'ansi' as is done in gdb.base/style.exp, which should be
> all that's needed to see the styled output (I thought).  The problem
> then, must be something in GDB.
> 
> It turns out that its the exact same problem that I solved above, but
> this time, on the way out of the colorize handler, rather than on the
> way in.  Only, interestingly, the problem now effects both Python 2
> and Python 3.
> 
> The output from pygments is always a Unicode object for both Python 2
> and 3.  Once we return from Python code back to gdbpy_colorize in
> python.c, we do the following steps:
> 
>   if (!gdbpy_is_string (result.get ()))
>     return {};
> 
>   gdbpy_ref<> unic = python_string_to_unicode (result.get ());
>   if (unic == nullpt) return {};
> 
>   gdbpy_ref<> host_str (PyUnicode_AsEncodedString (unic.get (),
> 						   host_charset (),
> 						   nullptr));
>   if (host_str == nullpt) return {};
> 
> For both Python 2 and 3 the gdbpy_is_string check passes.
> 
> For both Python 2 and 3 the python_string_to_unicode call just returns
> its input, as the string will already by a Unicode object.
> 
> Now we know we have a Unicode object we convert the Unicode object
> into a bytes object using the host_charset.
> 
> After that (not shown above) we convert the bytes object into a C++
> std::string, and return this as the result.
> 
> The problem is that the host_charset is based on the capabilities of
> the output terminal, and, when the tests are run through DejaGNU, with
> the current environment, the host_charset becomes 'ANSI_X3.4-1968',
> which ends up trying to convert the string using the 'ascii' codec.
> This codec only supports 7-bit characters, and so we once again get a
> Python exception.

Huh, that's not ideal.  It would seem better to me if the testsuite ran
with host-charset == utf-8.  That would be more representative of the
reality than testing with a random charset.

> Now I can force the host_charset value to change, if I set LC_ALL to
> 'en_US.utf-8' (like we currently set TERM) then now the host_charset
> will be utf-8, and the test passes.  Maybe that's OK, but I notice
> that the gdb.base/style.exp test doesn't need to mess with LC_ALL, and
> it manages to style other parts of GDB's output just fine.  So, I want
> that please.
> 
> The problem then is how GDB tries to convert the Unicode object into a
> bytes object.  When I run interactively in a terminal the host_charset
> is utf-8, and that's great, we convert the output just right for
> display in my terminal.  But in the test situation we need to do
> better.
> 
> So, I wonder if we should consider using something other than nullptr
> for the last argument to PyUnicode_AsEncodedString.  I propose that we
> make use of 'backslashreplace'.  This will replace any character that
> is not encodable in host_charset() with an \xNN escape sequence.
> 
> This seems fine for the test case I have - the non utf-8 character is
> within a C string in the source file, and replacing it with an escape
> sequence seems OK.  This might be more questionable if the non
> encodable character were in say, an identifier within the source code,
> but at this point I think (a) we'll probably be hitting other problems
> with the Python API and its bytes/unicode conversion, and (b) the user
> really needs to get their host_charset() set correctly anyway.

I must admit I did not follow everything (it's getting late here).  But
passing a "bytes" to pygments and receiving a coloured "bytes"
(hopefully encoded the same way as the original "bytes", only with ANSI
sequences added) sounds good.  Then the coloured string goes through the
GDB machine just as the uncoloured string would have gone, had Pygments
not been available.  It seems like GDB never really needs to decode a
source string, it just sends it to the terminal and the terminal deals
with it?  Eexcept maybe in the TUI?

I don't understand when the Python colorize function returns a unicode
string now, though.  If you always send a bytes, don't you always get
back a bytes?

Simon

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] gdb/python: handle non utf-8 characters when source highlighting
  2022-01-10  3:27   ` Simon Marchi
@ 2022-01-10 10:41     ` Andrew Burgess
  2022-01-10 15:32       ` Simon Marchi
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Burgess @ 2022-01-10 10:41 UTC (permalink / raw)
  To: Simon Marchi; +Cc: gdb-patches

* Simon Marchi <simon.marchi@polymtl.ca> [2022-01-09 22:27:10 -0500]:

> You haven't linked it, but I found this page that you might have read:
> 
>   https://pygments.org/docs/unicode/
> 
> On 2022-01-07 09:23, Andrew Burgess via Gdb-patches wrote:
> > This commit adds support for source files that contain non utf-8
> > characters when performing source styling using the Python pygments
> > package.  This does not change the behaviour of GDB when the GNU
> > Source Highlight library is used.
> > 
> > For the following problem description, assume that either GDB is built
> > without GNU Source Highlight support, of that this has been disabled
> > using 'maintenance set gnu-source-highlight enabled off'.
> > 
> > The initial problem reported was that a source file containing non
> > utf-8 characters would cause GDB to print a Python exception, and then
> > display the source without styling, e.g.:
> > 
> >   Python Exception <class 'UnicodeDecodeError'>: 'utf-8' codec can't decode byte 0xc0 in position 142: invalid start byte
> >   /* Source code here, without styling...  */
> > 
> > Further, as the user steps through different source files, each time
> > the problematic source file was evicted from the source cache, and
> > then later reloaded, the exception would be printed again.
> > 
> > Finally, this problem is only present when using Python 3, this issue
> > is not present for Python 2.
> > 
> > What makes this especially frustrating is that GDB can clearly print
> > the source file contents, they're right there...  If we disable
> > styling completely, or make use of the GNU Source Highlight library,
> > then everything is fine.  So why is there an error when we try to
> > apply styling using Python?
> > 
> > The problem is the use of PyString_FromString (which is an alias for
> > PyUnicode_FromString in Python 3), this function converts a C string
> > into a either a Unicode object (Py3) or a str object (Py2).  For
> > Python 2 there is no unicode encoding performed during this function
> 
> "there is not unicode encoding", I think it would make more sense to say
> "there is not utf-8 decoding".
> 
> > call, but for Python 3 the input is assumed to be a uft-8 encoding
> 
> uft -> utf.
> 
> > string for the purpose of the conversion.  And here of course, is the
> > problem, if the source file contains non utf-8 characters, then it
> > should not be treated as utf-8, but that's what we do, and that's why
> > we get an error.
> > 
> > My first thought when looking at this was to spot when the
> > PyString_FromString call failed with a UnicodeDecodeError and silently
> > ignore the error.  This would mean that GDB would print the source
> > without styling, but would also avoid the annoying exception message.
> > 
> > However, I also make use of `pygmentize`, a command line wrapper
> > around the Python pygments module, which I use to apply syntax
> > highlighting in the output of `less`.  And this command line wrapper
> > is quite happy to syntax highlight my source file that contains non
> > utf-8 characters, so it feels like the problem should be solvable.
> > 
> > It turns out that inside the pygments module there is already support
> > for guessing the encoding of the incoming file content, if the
> > incoming content is not already a Unicode string.  This is what
> > happens for Python 2 where the incoming content is of `str` type.
> > 
> > We could try and make GDB smarter when it comes to converting C
> > strings into Python Unicode objects; this would probably require us to
> > just try a couple of different encoding schemes rather than just
> > giving up after utf-8.
> > 
> > However, I figure, why bother?  The pygments module already does this
> > for us, and the colorize API is not part of the documented external
> > API of GDB.  So, why not just change the colorize API, instead of the
> > content being a Unicode string (for Python 3), lets just make the
> > content be a bytes object.  The pygments module can then take
> > responsibility for guessing the encoding.
> > 
> > So, currently, the colorize API receives a utf-8 unicode object, and
> > returns a host charset unicode object.  I propose that the colorize
> > API receive a bytes object, and return a bytes object.
> > 
> > This works perfectly, and I now see styled output for my source file
> > containing non utf-8 characters.
> > 
> > Well, the above is a little bit of a lie.  The actual code I've
> > written allows the colorize API to return either a bytes object, or a
> > host charset unicode object.  This just makes things a little more
> > flexible in the Python code.
> > 
> > Then I added a DejaGNU test, and .... my test failed.
> > 
> > I wasn't seeing styled source lines when the test was run through
> > DejaGNU, but I could when I ran the same steps at the console.
> > 
> > The test I was using is the exact one that is included with this
> > commit, you'll notice I did remember to set the TERM environment
> > variable to 'ansi' as is done in gdb.base/style.exp, which should be
> > all that's needed to see the styled output (I thought).  The problem
> > then, must be something in GDB.
> > 
> > It turns out that its the exact same problem that I solved above, but
> > this time, on the way out of the colorize handler, rather than on the
> > way in.  Only, interestingly, the problem now effects both Python 2
> > and Python 3.
> > 
> > The output from pygments is always a Unicode object for both Python 2
> > and 3.  Once we return from Python code back to gdbpy_colorize in
> > python.c, we do the following steps:
> > 
> >   if (!gdbpy_is_string (result.get ()))
> >     return {};
> > 
> >   gdbpy_ref<> unic = python_string_to_unicode (result.get ());
> >   if (unic == nullpt) return {};
> > 
> >   gdbpy_ref<> host_str (PyUnicode_AsEncodedString (unic.get (),
> > 						   host_charset (),
> > 						   nullptr));
> >   if (host_str == nullpt) return {};
> > 
> > For both Python 2 and 3 the gdbpy_is_string check passes.
> > 
> > For both Python 2 and 3 the python_string_to_unicode call just returns
> > its input, as the string will already by a Unicode object.
> > 
> > Now we know we have a Unicode object we convert the Unicode object
> > into a bytes object using the host_charset.
> > 
> > After that (not shown above) we convert the bytes object into a C++
> > std::string, and return this as the result.
> > 
> > The problem is that the host_charset is based on the capabilities of
> > the output terminal, and, when the tests are run through DejaGNU, with
> > the current environment, the host_charset becomes 'ANSI_X3.4-1968',
> > which ends up trying to convert the string using the 'ascii' codec.
> > This codec only supports 7-bit characters, and so we once again get a
> > Python exception.
> 
> Huh, that's not ideal.  It would seem better to me if the testsuite ran
> with host-charset == utf-8.  That would be more representative of the
> reality than testing with a random charset.
> 
> > Now I can force the host_charset value to change, if I set LC_ALL to
> > 'en_US.utf-8' (like we currently set TERM) then now the host_charset
> > will be utf-8, and the test passes.  Maybe that's OK, but I notice
> > that the gdb.base/style.exp test doesn't need to mess with LC_ALL, and
> > it manages to style other parts of GDB's output just fine.  So, I want
> > that please.
> > 
> > The problem then is how GDB tries to convert the Unicode object into a
> > bytes object.  When I run interactively in a terminal the host_charset
> > is utf-8, and that's great, we convert the output just right for
> > display in my terminal.  But in the test situation we need to do
> > better.
> > 
> > So, I wonder if we should consider using something other than nullptr
> > for the last argument to PyUnicode_AsEncodedString.  I propose that we
> > make use of 'backslashreplace'.  This will replace any character that
> > is not encodable in host_charset() with an \xNN escape sequence.
> > 
> > This seems fine for the test case I have - the non utf-8 character is
> > within a C string in the source file, and replacing it with an escape
> > sequence seems OK.  This might be more questionable if the non
> > encodable character were in say, an identifier within the source code,
> > but at this point I think (a) we'll probably be hitting other problems
> > with the Python API and its bytes/unicode conversion, and (b) the user
> > really needs to get their host_charset() set correctly anyway.
> 
> I must admit I did not follow everything (it's getting late here).  But
> passing a "bytes" to pygments and receiving a coloured "bytes"
> (hopefully encoded the same way as the original "bytes", only with ANSI
> sequences added) sounds good.  Then the coloured string goes through the
> GDB machine just as the uncoloured string would have gone, had Pygments
> not been available.  It seems like GDB never really needs to decode a
> source string, it just sends it to the terminal and the terminal deals
> with it?  Eexcept maybe in the TUI?
> 
> I don't understand when the Python colorize function returns a unicode
> string now, though.  If you always send a bytes, don't you always get
> back a bytes?

Unfortunately it's not as simple as bytes in bytes out.  See:

  https://pygments.org/docs/unicode/?highlight=encoding

In summary, Pygments uses unicode internally, but has some logic for
guessing the encoding of the incoming bytes.  This logic is better (I
claim) than GDB's hard-coded use UTF-8.  The link above outlines how
the guess is done in more detail.

Pygments always returns a unicode object, which is one of the reasons
I have GDB handle both bytes and unicode being returned from the
colorize API.  We could always make the API for restricted, and insist
on a bytes object being returned, this would just require us to
convert the output of Pygments to bytes before returning to GDB.

Hope that makes things a little clearer...

Thanks,
Andrew




^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/4] gdb: new 'maint flush source-cache' command
  2022-01-07 14:23 ` [PATCH 1/4] gdb: new 'maint flush source-cache' command Andrew Burgess
  2022-01-07 14:49   ` Eli Zaretskii
@ 2022-01-10 15:18   ` Tom Tromey
  1 sibling, 0 replies; 36+ messages in thread
From: Tom Tromey @ 2022-01-10 15:18 UTC (permalink / raw)
  To: Andrew Burgess via Gdb-patches; +Cc: Andrew Burgess

>>>>> "Andrew" == Andrew Burgess via Gdb-patches <gdb-patches@sourceware.org> writes:

Andrew> +static void
Andrew> +source_cache_flush_command (const char *command, int from_tty)
Andrew> +{
Andrew> +  forget_cached_source_info ();
Andrew> +  if (from_tty)
Andrew> +    printf_filtered (_("Source cache flushed.\n"));

For a maint command in particular, I'd say unconditionally printing is
fine.

This looks good to me.

Tom

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/4] gdb: erase items from the source_cache::m_offset_cache
  2022-01-07 14:23 ` [PATCH 2/4] gdb: erase items from the source_cache::m_offset_cache Andrew Burgess
@ 2022-01-10 15:24   ` Tom Tromey
  2022-01-11 12:17     ` Andrew Burgess
  0 siblings, 1 reply; 36+ messages in thread
From: Tom Tromey @ 2022-01-10 15:24 UTC (permalink / raw)
  To: Andrew Burgess via Gdb-patches; +Cc: Andrew Burgess

>>>>> "Andrew" == Andrew Burgess via Gdb-patches <gdb-patches@sourceware.org> writes:

Andrew> So, given the current code does the work of reloading the offset data
Andrew> anyway, we may as well save memory by capping m_offset_cache to
Andrew> MAX_ENTRIES just like we do m_source_map.

Andrew> That's what this commit does.

Andrew> There should be no user visible changes after this commit, except for
Andrew> ever so slightly lower memory usage in some cases.

I think it would be good to update the last part of this comment in
source_cache::ensure:

	  /* This should always hold, because we create the file
	     offsets when reading the file, and never free them
	     without also clearing the contents cache.  */

... but otherwise this looks good.

Tom

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] gdb/python: handle non utf-8 characters when source highlighting
  2022-01-10 10:41     ` Andrew Burgess
@ 2022-01-10 15:32       ` Simon Marchi
  2022-01-11 13:10         ` Andrew Burgess
  0 siblings, 1 reply; 36+ messages in thread
From: Simon Marchi @ 2022-01-10 15:32 UTC (permalink / raw)
  To: Andrew Burgess; +Cc: gdb-patches

> Unfortunately it's not as simple as bytes in bytes out.  See:
> 
>   https://pygments.org/docs/unicode/?highlight=encoding
> 
> In summary, Pygments uses unicode internally, but has some logic for
> guessing the encoding of the incoming bytes.  This logic is better (I
> claim) than GDB's hard-coded use UTF-8.  The link above outlines how
> the guess is done in more detail.
> 
> Pygments always returns a unicode object, which is one of the reasons
> I have GDB handle both bytes and unicode being returned from the
> colorize API.  We could always make the API for restricted, and insist
> on a bytes object being returned, this would just require us to
> convert the output of Pygments to bytes before returning to GDB.

Ok, so when does "colorize" returns bytes?

Simon

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/4] gdb: add 'maint set/show gnu-source-highlight enabled' command
  2022-01-07 14:23 ` [PATCH 3/4] gdb: add 'maint set/show gnu-source-highlight enabled' command Andrew Burgess
  2022-01-07 14:53   ` Eli Zaretskii
@ 2022-01-10 15:58   ` Tom Tromey
  1 sibling, 0 replies; 36+ messages in thread
From: Tom Tromey @ 2022-01-10 15:58 UTC (permalink / raw)
  To: Andrew Burgess via Gdb-patches; +Cc: Andrew Burgess

>>>>> "Andrew" == Andrew Burgess via Gdb-patches <gdb-patches@sourceware.org> writes:

Andrew> In a later commit I want to address an issue with the Python pygments
Andrew> based code styling solution.  As this approach is only used when the
Andrew> GNU Source Highlight library is not available, testing bugs in this
Andrew> area can be annoying, as it requires GDB to be rebuilt with use of GNU
Andrew> Source Highlight disabled.

Andrew> This commit adds a pair of new maintenance commands:

Andrew>   maintenance set gnu-source-highlight enabled on|off
Andrew>   maintenance show gnu-source-highlight enabled

Looks good, thank you.

Tom

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/4] gdb: new 'maint flush source-cache' command
  2022-01-07 14:49   ` Eli Zaretskii
@ 2022-01-11 12:13     ` Andrew Burgess
  2022-01-11 13:31       ` Eli Zaretskii
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Burgess @ 2022-01-11 12:13 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: gdb-patches

* Eli Zaretskii <eliz@gnu.org> [2022-01-07 16:49:50 +0200]:

> > Date: Fri,  7 Jan 2022 14:23:11 +0000
> > From: Andrew Burgess via Gdb-patches <gdb-patches@sourceware.org>
> > Cc: Andrew Burgess <aburgess@redhat.com>
> > 
> > This commit adds a new 'maint flush source-cache' command, this
> > flushes the cache of source file contents.
> > 
> > After flushing GDB is forced to reread source files the next time any
> > source lines are to be displayed.
> > 
> > I've added a test for this new feature.  The test is a little weird,
> > in that it modifies a source file after compilation, and makes use of
> > the cache flush so that the changes show up when listing the source
> > file.  I'm not sure when such a situation would ever crop up in real
> > life, but maybe we can imagine such cases.
> > 
> > In reality, this command is useful for testing the syntax highlighting
> > within GDB, we can adjust the syntax highlighting settings, flush the
> > cache, and then get the file contents re-highlighted using the new
> > settings.
> > ---
> >  gdb/NEWS                                      |  3 ++
> >  gdb/doc/gdb.texinfo                           |  9 +++++
> >  gdb/source-cache.c                            | 15 ++++++++
> >  gdb/testsuite/gdb.base/cached-source-file.exp | 38 +++++++++++++++++++
> >  4 files changed, 65 insertions(+)
> > 
> > diff --git a/gdb/NEWS b/gdb/NEWS
> > index c26e15b530a..105fcf74ff4 100644
> > --- a/gdb/NEWS
> > +++ b/gdb/NEWS
> > @@ -64,6 +64,9 @@ set debug threads on|off
> >  show debug threads
> >    Print additional debug messages about thread creation and deletion.
> >  
> > +maint flush source-cache
> > +  Flush the contents of the source code cache.
> > +
> >  * Changed commands
> >  
> >  maint packet
> > diff --git a/gdb/doc/gdb.texinfo b/gdb/doc/gdb.texinfo
> > index 48873aa34fe..e17149afa2a 100644
> > --- a/gdb/doc/gdb.texinfo
> > +++ b/gdb/doc/gdb.texinfo
> > @@ -39473,6 +39473,15 @@
> >  register fetching, or frame unwinding.  The command @code{flushregs}
> >  is deprecated in favor of @code{maint flush register-cache}.
> >  
> > +@kindex maint flush source-cache
> > +@cindex source code, caching
> > +@item maint flush source-cache
> > +Flush @value{GDBN}'s cache of source code file contents.  After
> > +@value{GDBN} reads a source file, and optionally applies styling
> > +(@pxref{Output Styling}), the file contents are cached.  This command
> > +clears that cache.  The next time @value{GDBN} wants to show lines
> > +from a source file, the content will be re-read.
> > +
> 
> The documentation parts are okay, but how about including in the
> manual the hint about where this feature is useful, like what you say
> in the commit log?
>

Thanks for the feedback.  An updated patch is below, here's the new
manual entry for 'maint flush source-cache':

  @kindex maint flush source-cache
  @cindex source code, caching
  @item maint flush source-cache
  Flush @value{GDBN}'s cache of source code file contents.  After
  @value{GDBN} reads a source file, and optionally applies styling
  (@pxref{Output Styling}), the file contents are cached.  This command
  clears that cache.  The next time @value{GDBN} wants to show lines
  from a source file, the content will be re-read.

  This command is useful when debugging issues related to source code
  styling.  After flushing the cache any source code displayed by
  @value{GDBN} will be re-read and re-styled.

Is that OK?

Thanks,
Andrew

---

commit e780b58dfcbe12abc7638c5c808fbc43f1176f08
Author: Andrew Burgess <aburgess@redhat.com>
Date:   Fri Nov 26 13:51:36 2021 +0000

    gdb: new 'maint flush source-cache' command
    
    This commit adds a new 'maint flush source-cache' command, this
    flushes the cache of source file contents.
    
    After flushing GDB is forced to reread source files the next time any
    source lines are to be displayed.
    
    I've added a test for this new feature.  The test is a little weird,
    in that it modifies a source file after compilation, and makes use of
    the cache flush so that the changes show up when listing the source
    file.  I'm not sure when such a situation would ever crop up in real
    life, but maybe we can imagine such cases.
    
    In reality, this command is useful for testing the syntax highlighting
    within GDB, we can adjust the syntax highlighting settings, flush the
    cache, and then get the file contents re-highlighted using the new
    settings.

diff --git a/gdb/NEWS b/gdb/NEWS
index c26e15b530a..105fcf74ff4 100644
--- a/gdb/NEWS
+++ b/gdb/NEWS
@@ -64,6 +64,9 @@ set debug threads on|off
 show debug threads
   Print additional debug messages about thread creation and deletion.
 
+maint flush source-cache
+  Flush the contents of the source code cache.
+
 * Changed commands
 
 maint packet
diff --git a/gdb/doc/gdb.texinfo b/gdb/doc/gdb.texinfo
index 48873aa34fe..222a01023e0 100644
--- a/gdb/doc/gdb.texinfo
+++ b/gdb/doc/gdb.texinfo
@@ -39473,6 +39473,19 @@
 register fetching, or frame unwinding.  The command @code{flushregs}
 is deprecated in favor of @code{maint flush register-cache}.
 
+@kindex maint flush source-cache
+@cindex source code, caching
+@item maint flush source-cache
+Flush @value{GDBN}'s cache of source code file contents.  After
+@value{GDBN} reads a source file, and optionally applies styling
+(@pxref{Output Styling}), the file contents are cached.  This command
+clears that cache.  The next time @value{GDBN} wants to show lines
+from a source file, the content will be re-read.
+
+This command is useful when debugging issues related to source code
+styling.  After flushing the cache any source code displayed by
+@value{GDBN} will be re-read and re-styled.
+
 @kindex maint print objfiles
 @cindex info for known object files
 @item maint print objfiles @r{[}@var{regexp}@r{]}
diff --git a/gdb/source-cache.c b/gdb/source-cache.c
index fc789eef8f9..0650768cc0e 100644
--- a/gdb/source-cache.c
+++ b/gdb/source-cache.c
@@ -25,6 +25,7 @@
 #include "gdbsupport/selftest.h"
 #include "objfiles.h"
 #include "exec.h"
+#include "cli/cli-cmds.h"
 
 #ifdef HAVE_SOURCE_HIGHLIGHT
 /* If Gnulib redirects 'open' and 'close' to its replacements
@@ -323,6 +324,15 @@ source_cache::get_source_lines (struct symtab *s, int first_line,
 			first_line, last_line, lines);
 }
 
+/* Implement 'maint flush source-cache' command.  */
+
+static void
+source_cache_flush_command (const char *command, int from_tty)
+{
+  forget_cached_source_info ();
+  printf_filtered (_("Source cache flushed.\n"));
+}
+
 #if GDB_SELF_TEST
 namespace selftests
 {
@@ -346,6 +356,10 @@ void _initialize_source_cache ();
 void
 _initialize_source_cache ()
 {
+  add_cmd ("source-cache", class_maintenance, source_cache_flush_command,
+	   _("Force gdb to flush its source code cache."),
+	   &maintenanceflushlist);
+
 #if GDB_SELF_TEST
   selftests::register_test ("source-cache", selftests::extract_lines_test);
 #endif
diff --git a/gdb/testsuite/gdb.base/cached-source-file.exp b/gdb/testsuite/gdb.base/cached-source-file.exp
index d4e64f32120..75a13378691 100644
--- a/gdb/testsuite/gdb.base/cached-source-file.exp
+++ b/gdb/testsuite/gdb.base/cached-source-file.exp
@@ -100,3 +100,41 @@ gdb_test "run" $re "rerun program" $q y
 # changed for GDB.
 gdb_test "list" "${bp_line}\[ \t\]+printf \\(\"foo\\\\n\"\\); /\\\* new-marker \\\*/.*" \
     "verify that the source code is properly reloaded"
+
+# Modify the source file again.  As before, this only works locally
+# because of the TCL commands.
+set bkpsrc [standard_output_file $testfile].c.bkp
+set bkpsrcfd [open $bkpsrc w]
+set srcfd [open $srcfile r]
+
+while { [gets $srcfd line] != -1 } {
+    if { [string first "new-marker" $line] != -1 } {
+	# Modify the printf line that we added previously.
+	puts $bkpsrcfd "  printf (\"foo\\n\"); /* new-marker updated */"
+    } else {
+	puts $bkpsrcfd $line
+    }
+}
+
+close $bkpsrcfd
+close $srcfd
+file rename -force -- $bkpsrc $srcfile
+
+# As before, delay so that at least one second has passed.  GDB still
+# will not spot that the source file has changed, as GDB doesn't do a
+# time check unless the binary has also changed, this delay just
+# allows us to confirm this behaviour.
+sleep 1
+
+# List the printf line again, we should not see the file changes yet
+# as the binary is unchanged, so the cached contents will still be
+# used.
+gdb_test "list ${bp_line}" "${bp_line}\[ \t\]+printf \\(\"foo\\\\n\"\\); /\\\* new-marker \\\*/.*" \
+    "verify that the source code change is not seen yet"
+
+gdb_test "maint flush source-cache" "Source cache flushed\\."
+
+# List the printf line again.  After the cache flush GDB will re-read
+# the source file and we should now see the changes.
+gdb_test "list ${bp_line}" "${bp_line}\[ \t\]+printf \\(\"foo\\\\n\"\\); /\\\* new-marker updated \\\*/.*" \
+    "verify that the updated source code change is not seen"


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/4] gdb: erase items from the source_cache::m_offset_cache
  2022-01-10 15:24   ` Tom Tromey
@ 2022-01-11 12:17     ` Andrew Burgess
  2022-01-11 14:54       ` Tom Tromey
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Burgess @ 2022-01-11 12:17 UTC (permalink / raw)
  To: Tom Tromey; +Cc: Andrew Burgess via Gdb-patches

* Tom Tromey <tom@tromey.com> [2022-01-10 08:24:48 -0700]:

> >>>>> "Andrew" == Andrew Burgess via Gdb-patches <gdb-patches@sourceware.org> writes:
> 
> Andrew> So, given the current code does the work of reloading the offset data
> Andrew> anyway, we may as well save memory by capping m_offset_cache to
> Andrew> MAX_ENTRIES just like we do m_source_map.
> 
> Andrew> That's what this commit does.
> 
> Andrew> There should be no user visible changes after this commit, except for
> Andrew> ever so slightly lower memory usage in some cases.
> 
> I think it would be good to update the last part of this comment in
> source_cache::ensure:
> 
> 	  /* This should always hold, because we create the file
> 	     offsets when reading the file, and never free them
> 	     without also clearing the contents cache.  */

What should it say now?  I think this comment was wrong previously, it
should have said:

        /* This should always hold, because we create the file
 	   offsets when reading the file, and never free them.  */

After my change the comment now seems to be correct.  But, maybe I'm
missing something...

Thanks,
Andrew


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/4] gdb: add 'maint set/show gnu-source-highlight enabled' command
  2022-01-07 14:53   ` Eli Zaretskii
@ 2022-01-11 13:07     ` Andrew Burgess
  2022-01-11 13:34       ` Eli Zaretskii
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Burgess @ 2022-01-11 13:07 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: gdb-patches

* Eli Zaretskii via Gdb-patches <gdb-patches@sourceware.org> [2022-01-07 16:53:29 +0200]:

> > Date: Fri,  7 Jan 2022 14:23:13 +0000
> > From: Andrew Burgess via Gdb-patches <gdb-patches@sourceware.org>
> > Cc: Andrew Burgess <aburgess@redhat.com>
> > 
> > +maint set gnu-source-highlight enabled on|off
> > +maint show gnu-source-highlight enabled
> > +  When on GDB will use the GNU Source Highlight library for adding
> 
> "Whether GDB should use the GNU Source Highlight ..."
> 
> > +  styling to source code.  When off the library will not be used, even
>                                       ^
> Comma missing there.
> 
> > +  when available.  When GNU Source Highlight isn't used, or can't add
> > +  styling to a particular source file, then the Python Pygments
> > +  library will be used instead.
> > +@kindex maint set gnu-source-highlight enabled
> > +@kindex maint show gnu-source-highlight enabled
> > +@item maint set gnu-source-highlight enabled @r{[}on|off@r{]}
> > +@itemx maint show gnu-source-highlight enabled
> > +Control whether @value{GDBN} should use the GNU Source Highlight
> > +library for applying styling to source code (@pxref{Output Styling}).
> > +This will be @samp{on} by default if the GNU Source Highlight library
> > +is available.  If the GNU Source Highlight library is not available,
> > +then this will be @samp{off} by default, and attempting to change this
> > +value to @samp{on} will give an error.
> > +
> > +If the GNU Source Highlight library is not being used, then
> > +@value{GDBN} will use the Python Pygments package for source code
> > +styling, if it is available.
> 
> Here, too, I would suggest saying that this is useful for debugging
> Pygments styling when Source Highlight is available.

Thanks for the feedback.  I've made the fixes you suggested.  For the
manual entry the full text is now this:

  @kindex maint set gnu-source-highlight enabled
  @kindex maint show gnu-source-highlight enabled
  @item maint set gnu-source-highlight enabled @r{[}on|off@r{]}
  @itemx maint show gnu-source-highlight enabled
  Control whether @value{GDBN} should use the GNU Source Highlight
  library for applying styling to source code (@pxref{Output Styling}).
  This will be @samp{on} by default if the GNU Source Highlight library
  is available.  If the GNU Source Highlight library is not available,
  then this will be @samp{off} by default, and attempting to change this
  value to @samp{on} will give an error.
  
  If the GNU Source Highlight library is not being used, then
  @value{GDBN} will use the Python Pygments package for source code
  styling, if it is available.
  
  This option is useful for debugging @value{GDBN}'s use of the Pygments
  library when @value{GDBN} is linked against the GNU Source Highlight
  library.

Is this better?

The full patch, including this change is below.

Thanks,
Andrew

---

commit 8346698512836d313cfe91581f2ba43be359090a
Author: Andrew Burgess <aburgess@redhat.com>
Date:   Fri Nov 26 15:13:43 2021 +0000

    gdb: add 'maint set/show gnu-source-highlight enabled' command
    
    In a later commit I want to address an issue with the Python pygments
    based code styling solution.  As this approach is only used when the
    GNU Source Highlight library is not available, testing bugs in this
    area can be annoying, as it requires GDB to be rebuilt with use of GNU
    Source Highlight disabled.
    
    This commit adds a pair of new maintenance commands:
    
      maintenance set gnu-source-highlight enabled on|off
      maintenance show gnu-source-highlight enabled
    
    these commands can be used to disable use of the GNU Source Highlight
    library, allowing me, in a later commit, to easily test bugs that
    would otherwise be masked by GNU Source Highlight being used.
    
    I made this a maintenance command, rather than a general purpose
    command, as it didn't seem like this was something a general user
    would need to adjust.  We can always convert the maintenance command
    to a general command later if needed.
    
    There's no test for this here, but this feature will be used in a
    later commit.

diff --git a/gdb/NEWS b/gdb/NEWS
index 105fcf74ff4..9e8e173c0d8 100644
--- a/gdb/NEWS
+++ b/gdb/NEWS
@@ -67,6 +67,14 @@ show debug threads
 maint flush source-cache
   Flush the contents of the source code cache.
 
+maint set gnu-source-highlight enabled on|off
+maint show gnu-source-highlight enabled
+  Whether GDB should use the GNU Source Highlight library for adding
+  styling to source code.  When off, the library will not be used, even
+  when available.  When GNU Source Highlight isn't used, or can't add
+  styling to a particular source file, then the Python Pygments
+  library will be used instead.
+
 * Changed commands
 
 maint packet
diff --git a/gdb/doc/gdb.texinfo b/gdb/doc/gdb.texinfo
index 222a01023e0..4ed1302302e 100644
--- a/gdb/doc/gdb.texinfo
+++ b/gdb/doc/gdb.texinfo
@@ -39797,6 +39797,25 @@
 described in @ref{set libthread-db-search-path}.  For more information
 about the tests, see @ref{maint check libthread-db}.
 
+@kindex maint set gnu-source-highlight enabled
+@kindex maint show gnu-source-highlight enabled
+@item maint set gnu-source-highlight enabled @r{[}on|off@r{]}
+@itemx maint show gnu-source-highlight enabled
+Control whether @value{GDBN} should use the GNU Source Highlight
+library for applying styling to source code (@pxref{Output Styling}).
+This will be @samp{on} by default if the GNU Source Highlight library
+is available.  If the GNU Source Highlight library is not available,
+then this will be @samp{off} by default, and attempting to change this
+value to @samp{on} will give an error.
+
+If the GNU Source Highlight library is not being used, then
+@value{GDBN} will use the Python Pygments package for source code
+styling, if it is available.
+
+This option is useful for debugging @value{GDBN}'s use of the Pygments
+library when @value{GDBN} is linked against the GNU Source Highlight
+library.
+
 @kindex maint space
 @cindex memory used by commands
 @item maint space @var{value}
diff --git a/gdb/source-cache.c b/gdb/source-cache.c
index 41fcda267b3..608738679e8 100644
--- a/gdb/source-cache.c
+++ b/gdb/source-cache.c
@@ -47,6 +47,46 @@
 
 source_cache g_source_cache;
 
+/* When this is true we will use the GNU Source Highlight to add styling to
+   source code (assuming the library is available).  This is initialized to
+   true (if appropriate) in _initialize_source_cache below.  */
+
+static bool use_gnu_source_highlight;
+
+/* The "maint show gnu-source-highlight enabled" command. */
+
+static void
+show_use_gnu_source_highlight_enabled  (struct ui_file *file, int from_tty,
+					struct cmd_list_element *c,
+					const char *value)
+{
+  fprintf_filtered (file,
+		    _("Use of GNU Source Highlight library is \"%s\".\n"),
+		    value);
+}
+
+/* The "maint set gnu-source-highlight enabled" command.  */
+
+static void
+set_use_gnu_source_highlight_enabled (const char *ignore_args,
+				      int from_tty,
+				      struct cmd_list_element *c)
+{
+#ifndef HAVE_SOURCE_HIGHLIGHT
+  /* If the library is not available and the user tried to enable use of
+     the library, then disable use of the library, and give an error.  */
+  if (use_gnu_source_highlight)
+    {
+      use_gnu_source_highlight = false;
+      error (_("the GNU Source Highlight library is not available"));
+    }
+#else
+  /* We (might) have just changed how we style source code, discard any
+     previously cached contents.  */
+  forget_cached_source_info ();
+#endif
+}
+
 /* See source-cache.h.  */
 
 std::string
@@ -193,7 +233,7 @@ source_cache::ensure (struct symtab *s)
 #ifdef HAVE_SOURCE_HIGHLIGHT
       bool already_styled = false;
       const char *lang_name = get_language_name (SYMTAB_LANGUAGE (s));
-      if (lang_name != nullptr)
+      if (lang_name != nullptr && use_gnu_source_highlight)
 	{
 	  /* The global source highlight object, or null if one was
 	     never constructed.  This is stored here rather than in
@@ -364,6 +404,36 @@ _initialize_source_cache ()
 	   _("Force gdb to flush its source code cache."),
 	   &maintenanceflushlist);
 
+  /* All the 'maint set|show gnu-source-highlight' sub-commands.  */
+  static struct cmd_list_element *maint_set_gnu_source_highlight_cmdlist;
+  static struct cmd_list_element *maint_show_gnu_source_highlight_cmdlist;
+
+  /* Adds 'maint set|show gnu-source-highlight'.  */
+  add_setshow_prefix_cmd ("gnu-source-highlight", class_maintenance,
+			  _("Set gnu-source-highlight specific variables."),
+			  _("Show gnu-source-highlight specific variables."),
+			  &maint_set_gnu_source_highlight_cmdlist,
+			  &maint_show_gnu_source_highlight_cmdlist,
+			  &maintenance_set_cmdlist,
+			  &maintenance_show_cmdlist);
+
+  /* Adds 'maint set|show gnu-source-highlight enabled'.  */
+  add_setshow_boolean_cmd ("enabled", class_maintenance,
+			   &use_gnu_source_highlight, _("\
+Set whether the GNU Source Highlight library should be used."), _("\
+Show whether the GNU Source Highlight library is being used."),_("\
+When enabled, GDB will use the GNU Source Highlight library to apply\n\
+styling to source code lines that are shown."),
+			   set_use_gnu_source_highlight_enabled,
+			   show_use_gnu_source_highlight_enabled,
+			   &maint_set_gnu_source_highlight_cmdlist,
+			   &maint_show_gnu_source_highlight_cmdlist);
+
+  /* Enable use of GNU Source Highlight library, if we have it.  */
+#ifdef HAVE_SOURCE_HIGHLIGHT
+  use_gnu_source_highlight = true;
+#endif
+
 #if GDB_SELF_TEST
   selftests::register_test ("source-cache", selftests::extract_lines_test);
 #endif


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] gdb/python: handle non utf-8 characters when source highlighting
  2022-01-10 15:32       ` Simon Marchi
@ 2022-01-11 13:10         ` Andrew Burgess
  0 siblings, 0 replies; 36+ messages in thread
From: Andrew Burgess @ 2022-01-11 13:10 UTC (permalink / raw)
  To: Simon Marchi; +Cc: gdb-patches

* Simon Marchi via Gdb-patches <gdb-patches@sourceware.org> [2022-01-10 10:32:02 -0500]:

> > Unfortunately it's not as simple as bytes in bytes out.  See:
> > 
> >   https://pygments.org/docs/unicode/?highlight=encoding
> > 
> > In summary, Pygments uses unicode internally, but has some logic for
> > guessing the encoding of the incoming bytes.  This logic is better (I
> > claim) than GDB's hard-coded use UTF-8.  The link above outlines how
> > the guess is done in more detail.
> > 
> > Pygments always returns a unicode object, which is one of the reasons
> > I have GDB handle both bytes and unicode being returned from the
> > colorize API.  We could always make the API for restricted, and insist
> > on a bytes object being returned, this would just require us to
> > convert the output of Pygments to bytes before returning to GDB.
> 
> Ok, so when does "colorize" returns bytes?

  (1) Python 2 (for now), and
  (2) Never, unless a user overrides gdb.colorize.

Thanks,
Andrew


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/4] gdb: new 'maint flush source-cache' command
  2022-01-11 12:13     ` Andrew Burgess
@ 2022-01-11 13:31       ` Eli Zaretskii
  2022-01-12 11:38         ` Andrew Burgess
  0 siblings, 1 reply; 36+ messages in thread
From: Eli Zaretskii @ 2022-01-11 13:31 UTC (permalink / raw)
  To: Andrew Burgess; +Cc: gdb-patches

> Date: Tue, 11 Jan 2022 12:13:19 +0000
> From: Andrew Burgess <aburgess@redhat.com>
> Cc: gdb-patches@sourceware.org
> 
>   @kindex maint flush source-cache
>   @cindex source code, caching
>   @item maint flush source-cache
>   Flush @value{GDBN}'s cache of source code file contents.  After
>   @value{GDBN} reads a source file, and optionally applies styling
>   (@pxref{Output Styling}), the file contents are cached.  This command
>   clears that cache.  The next time @value{GDBN} wants to show lines
>   from a source file, the content will be re-read.
> 
>   This command is useful when debugging issues related to source code
>   styling.  After flushing the cache any source code displayed by
>   @value{GDBN} will be re-read and re-styled.
> 
> Is that OK?

Yes, thanks.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/4] gdb: add 'maint set/show gnu-source-highlight enabled' command
  2022-01-11 13:07     ` Andrew Burgess
@ 2022-01-11 13:34       ` Eli Zaretskii
  2022-01-12 11:37         ` Andrew Burgess
  0 siblings, 1 reply; 36+ messages in thread
From: Eli Zaretskii @ 2022-01-11 13:34 UTC (permalink / raw)
  To: Andrew Burgess; +Cc: gdb-patches

> Date: Tue, 11 Jan 2022 13:07:44 +0000
> From: Andrew Burgess <aburgess@redhat.com>
> Cc: gdb-patches@sourceware.org
> 
>   @kindex maint set gnu-source-highlight enabled
>   @kindex maint show gnu-source-highlight enabled
>   @item maint set gnu-source-highlight enabled @r{[}on|off@r{]}
>   @itemx maint show gnu-source-highlight enabled
>   Control whether @value{GDBN} should use the GNU Source Highlight
>   library for applying styling to source code (@pxref{Output Styling}).
>   This will be @samp{on} by default if the GNU Source Highlight library
>   is available.  If the GNU Source Highlight library is not available,
>   then this will be @samp{off} by default, and attempting to change this
>   value to @samp{on} will give an error.
>   
>   If the GNU Source Highlight library is not being used, then
>   @value{GDBN} will use the Python Pygments package for source code
>   styling, if it is available.
>   
>   This option is useful for debugging @value{GDBN}'s use of the Pygments
>   library when @value{GDBN} is linked against the GNU Source Highlight
>   library.
> 
> Is this better?

Yes, thanks.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/4] gdb: erase items from the source_cache::m_offset_cache
  2022-01-11 12:17     ` Andrew Burgess
@ 2022-01-11 14:54       ` Tom Tromey
  2022-01-12 11:38         ` Andrew Burgess
  0 siblings, 1 reply; 36+ messages in thread
From: Tom Tromey @ 2022-01-11 14:54 UTC (permalink / raw)
  To: Andrew Burgess; +Cc: Tom Tromey, Andrew Burgess via Gdb-patches

>>>>> "Andrew" == Andrew Burgess <aburgess@redhat.com> writes:

>> /* This should always hold, because we create the file
>> offsets when reading the file, and never free them
>> without also clearing the contents cache.  */

Andrew> What should it say now?  I think this comment was wrong previously, it
Andrew> should have said:

Andrew>         /* This should always hold, because we create the file
Andrew>  	   offsets when reading the file, and never free them.  */

I think they are freed sometimes by source_cache::clear.

Anyway how about

  /* This should always hold, because we create the file
     offsets when reading the file.  */

and just remove the mention of the freeing.

Tom

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] gdb/python: handle non utf-8 characters when source highlighting
  2022-01-07 14:23 ` [PATCH 4/4] gdb/python: handle non utf-8 characters when source highlighting Andrew Burgess
  2022-01-10  3:27   ` Simon Marchi
@ 2022-01-11 19:24   ` Tom Tromey
  2022-01-11 19:42     ` Patrick Monnerat
  1 sibling, 1 reply; 36+ messages in thread
From: Tom Tromey @ 2022-01-11 19:24 UTC (permalink / raw)
  To: Andrew Burgess via Gdb-patches; +Cc: Andrew Burgess

>>>>> "Andrew" == Andrew Burgess via Gdb-patches <gdb-patches@sourceware.org> writes:

Andrew> We could try and make GDB smarter when it comes to converting C
Andrew> strings into Python Unicode objects; this would probably require us to
Andrew> just try a couple of different encoding schemes rather than just
Andrew> giving up after utf-8.

Perhaps it should be using the host charset here.

Anyway, FWIW, I think this patch looks reasonable.

Tom

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] gdb/python: handle non utf-8 characters when source highlighting
  2022-01-11 19:24   ` Tom Tromey
@ 2022-01-11 19:42     ` Patrick Monnerat
  0 siblings, 0 replies; 36+ messages in thread
From: Patrick Monnerat @ 2022-01-11 19:42 UTC (permalink / raw)
  To: gdb-patches


On 1/11/22 20:24, Tom Tromey wrote:
>>>>>> "Andrew" == Andrew Burgess via Gdb-patches <gdb-patches@sourceware.org> writes:
> Andrew> We could try and make GDB smarter when it comes to converting C
> Andrew> strings into Python Unicode objects; this would probably require us to
> Andrew> just try a couple of different encoding schemes rather than just
> Andrew> giving up after utf-8.
>
> Perhaps it should be using the host charset here.
>
> Anyway, FWIW, I think this patch looks reasonable.
>
I did not follow all the discussion, but did you consider using 
surrogate escapes 
(https://docs.python.org/3/library/codecs.html#error-handlers) ?

I used that in RabbitCVS with quite good results.

Just my 2 cents,

Patrick


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/4] gdb: add 'maint set/show gnu-source-highlight enabled' command
  2022-01-11 13:34       ` Eli Zaretskii
@ 2022-01-12 11:37         ` Andrew Burgess
  0 siblings, 0 replies; 36+ messages in thread
From: Andrew Burgess @ 2022-01-12 11:37 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: gdb-patches

* Eli Zaretskii via Gdb-patches <gdb-patches@sourceware.org> [2022-01-11 15:34:12 +0200]:

> > Date: Tue, 11 Jan 2022 13:07:44 +0000
> > From: Andrew Burgess <aburgess@redhat.com>
> > Cc: gdb-patches@sourceware.org
> > 
> >   @kindex maint set gnu-source-highlight enabled
> >   @kindex maint show gnu-source-highlight enabled
> >   @item maint set gnu-source-highlight enabled @r{[}on|off@r{]}
> >   @itemx maint show gnu-source-highlight enabled
> >   Control whether @value{GDBN} should use the GNU Source Highlight
> >   library for applying styling to source code (@pxref{Output Styling}).
> >   This will be @samp{on} by default if the GNU Source Highlight library
> >   is available.  If the GNU Source Highlight library is not available,
> >   then this will be @samp{off} by default, and attempting to change this
> >   value to @samp{on} will give an error.
> >   
> >   If the GNU Source Highlight library is not being used, then
> >   @value{GDBN} will use the Python Pygments package for source code
> >   styling, if it is available.
> >   
> >   This option is useful for debugging @value{GDBN}'s use of the Pygments
> >   library when @value{GDBN} is linked against the GNU Source Highlight
> >   library.
> > 
> > Is this better?
> 
> Yes, thanks.

Thanks, I pushed this patch.

Andrew


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/4] gdb: erase items from the source_cache::m_offset_cache
  2022-01-11 14:54       ` Tom Tromey
@ 2022-01-12 11:38         ` Andrew Burgess
  0 siblings, 0 replies; 36+ messages in thread
From: Andrew Burgess @ 2022-01-12 11:38 UTC (permalink / raw)
  To: Tom Tromey; +Cc: Andrew Burgess via Gdb-patches

* Tom Tromey <tom@tromey.com> [2022-01-11 07:54:00 -0700]:

> >>>>> "Andrew" == Andrew Burgess <aburgess@redhat.com> writes:
> 
> >> /* This should always hold, because we create the file
> >> offsets when reading the file, and never free them
> >> without also clearing the contents cache.  */
> 
> Andrew> What should it say now?  I think this comment was wrong previously, it
> Andrew> should have said:
> 
> Andrew>         /* This should always hold, because we create the file
> Andrew>  	   offsets when reading the file, and never free them.  */
> 
> I think they are freed sometimes by source_cache::clear.
> 
> Anyway how about
> 
>   /* This should always hold, because we create the file
>      offsets when reading the file.  */
> 
> and just remove the mention of the freeing.

Thanks, I made this change and pushed the patch below.

Andrew

---

commit 0e42221ac2cea234fe7a35797475f55f3d304b92
Author: Andrew Burgess <aburgess@redhat.com>
Date:   Fri Nov 26 14:34:27 2021 +0000

    gdb: erase items from the source_cache::m_offset_cache
    
    The source_cache class has two member variables m_source_map, which
    stores the file contents, and m_offset_cache, which stores offsets
    into the file contents.
    
    As source files are read the contents of the file, as well as the
    offset data, are stored in the cache using these two member variables.
    
    Whenever GDB needs either the files contents, or the offset data,
    source_cache::ensure is called.  This function looks for the file in
    m_source_map, and if it's found then this implies the file is also in
    m_offset_cache, and we're done.
    
    If the file is not in m_source_map then GDB calls
    source_cache::get_plain_source_lines to open the file and read its
    contents.  ::get_plain_source_lines also calculates the offset data,
    which is then inserted into m_offset_cache.
    
    Back in ::ensure, the file contents are added into m_source_map.  And
    finally, if m_source_map contains more than MAX_ENTRIES, an entry is
    removed from m_source_map.
    
    The problem is entries are not removed from m_offset_cache at the same
    time.
    
    This means that if a program contains enough source files, GDB will
    hold at most MAX_ENTRIES cached source file contents, but can contain
    offsets data for every source file.
    
    Now, the offsets data is going to be smaller than the cached file
    contents, so maybe there's no harm here.  But, when we reload the file
    contents we always recalculate the offsets data.  And, when we
    ::get_line_charpos asking for offset data we still call ::ensure which
    will ends up loading and caching the file contents.
    
    So, given the current code does the work of reloading the offset data
    anyway, we may as well save memory by capping m_offset_cache to
    MAX_ENTRIES just like we do m_source_map.
    
    That's what this commit does.
    
    There should be no user visible changes after this commit, except for
    ever so slightly lower memory usage in some cases.

diff --git a/gdb/source-cache.c b/gdb/source-cache.c
index 0650768cc0e..7016476730a 100644
--- a/gdb/source-cache.c
+++ b/gdb/source-cache.c
@@ -162,9 +162,8 @@ source_cache::ensure (struct symtab *s)
     {
       if (m_source_map[i].fullname == fullname)
 	{
-	  /* This should always hold, because we create the file
-	     offsets when reading the file, and never free them
-	     without also clearing the contents cache.  */
+	  /* This should always hold, because we create the file offsets
+	     when reading the file.  */
 	  gdb_assert (m_offset_cache.find (fullname)
 		      != m_offset_cache.end ());
 	  /* Not strictly LRU, but at least ensure that the most
@@ -240,7 +239,11 @@ source_cache::ensure (struct symtab *s)
   m_source_map.push_back (std::move (result));
 
   if (m_source_map.size () > MAX_ENTRIES)
-    m_source_map.erase (m_source_map.begin ());
+    {
+      auto iter = m_source_map.begin ();
+      m_offset_cache.erase (iter->fullname);
+      m_source_map.erase (iter);
+    }
 
   return true;
 }


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/4] gdb: new 'maint flush source-cache' command
  2022-01-11 13:31       ` Eli Zaretskii
@ 2022-01-12 11:38         ` Andrew Burgess
  0 siblings, 0 replies; 36+ messages in thread
From: Andrew Burgess @ 2022-01-12 11:38 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: gdb-patches

* Eli Zaretskii <eliz@gnu.org> [2022-01-11 15:31:16 +0200]:

> > Date: Tue, 11 Jan 2022 12:13:19 +0000
> > From: Andrew Burgess <aburgess@redhat.com>
> > Cc: gdb-patches@sourceware.org
> > 
> >   @kindex maint flush source-cache
> >   @cindex source code, caching
> >   @item maint flush source-cache
> >   Flush @value{GDBN}'s cache of source code file contents.  After
> >   @value{GDBN} reads a source file, and optionally applies styling
> >   (@pxref{Output Styling}), the file contents are cached.  This command
> >   clears that cache.  The next time @value{GDBN} wants to show lines
> >   from a source file, the content will be re-read.
> > 
> >   This command is useful when debugging issues related to source code
> >   styling.  After flushing the cache any source code displayed by
> >   @value{GDBN} will be re-read and re-styled.
> > 
> > Is that OK?
> 
> Yes, thanks.
> 

Thanks, I pushed this patch.

Andrew


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCHv2 0/2] Source highlight non utf-8 characters using Python
  2022-01-07 14:23 [PATCH 0/4] Source highlight non utf-8 characters using Python Andrew Burgess
                   ` (3 preceding siblings ...)
  2022-01-07 14:23 ` [PATCH 4/4] gdb/python: handle non utf-8 characters when source highlighting Andrew Burgess
@ 2022-01-12 14:30 ` Andrew Burgess
  2022-01-12 14:30   ` [PATCHv2 1/2] gdb/python: add gdb.host_charset function Andrew Burgess
  2022-01-12 14:30   ` [PATCHv2 2/2] gdb/python: handle non utf-8 characters when source highlighting Andrew Burgess
  4 siblings, 2 replies; 36+ messages in thread
From: Andrew Burgess @ 2022-01-12 14:30 UTC (permalink / raw)
  To: gdb-patches; +Cc: Andrew Burgess

Thanks for all the feedback on the v1 series.

Having thought about this patch a bit more, I think that I probably
made the colorize API too complicated without any real good reason.

This new version of the patch aims to make things simpler, see the
changes listed below for details.

Thanks,
Andrew

---

Since v1:

  - First 3 patches from the original series have now been merged,

  - This new series includes a new patch (#1), which adds a
    gdb.host_charset method to the Python API.

  - The gdb.colorize API has been simplified since v1.  The function
    now takes the file contents as a bytes object, and returns the
    modified contents as a bytes object (or None to indicate no
    styling was done).  Anything else returned by colorize will give
    an error.  If the colorize process requires the content to be held
    in a unicode object then it is up to the colorize function itself
    to convert to unicode, and convert back again before returning.

---

Andrew Burgess (2):
  gdb/python: add gdb.host_charset function
  gdb/python: handle non utf-8 characters when source highlighting

 gdb/NEWS                                      |  3 +
 gdb/doc/python.texi                           |  8 +++
 gdb/python/lib/gdb/__init__.py                |  5 +-
 gdb/python/python.c                           | 51 +++++++++++----
 gdb/testsuite/gdb.python/py-charset.exp       | 50 +++++++++++++++
 gdb/testsuite/gdb.python/py-source-styling.c  | 29 +++++++++
 .../gdb.python/py-source-styling.exp          | 64 +++++++++++++++++++
 7 files changed, 195 insertions(+), 15 deletions(-)
 create mode 100644 gdb/testsuite/gdb.python/py-charset.exp
 create mode 100644 gdb/testsuite/gdb.python/py-source-styling.c
 create mode 100644 gdb/testsuite/gdb.python/py-source-styling.exp

-- 
2.25.4


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCHv2 1/2] gdb/python: add gdb.host_charset function
  2022-01-12 14:30 ` [PATCHv2 0/2] Source highlight non utf-8 characters using Python Andrew Burgess
@ 2022-01-12 14:30   ` Andrew Burgess
  2022-01-12 15:02     ` Eli Zaretskii
                       ` (2 more replies)
  2022-01-12 14:30   ` [PATCHv2 2/2] gdb/python: handle non utf-8 characters when source highlighting Andrew Burgess
  1 sibling, 3 replies; 36+ messages in thread
From: Andrew Burgess @ 2022-01-12 14:30 UTC (permalink / raw)
  To: gdb-patches; +Cc: Andrew Burgess

We already have gdb.target_charset and gdb.target_wide_charset.  This
commit adds gdb.host_charset along the same lines.
---
 gdb/NEWS                                |  3 ++
 gdb/doc/python.texi                     |  8 ++++
 gdb/python/python.c                     | 13 +++++++
 gdb/testsuite/gdb.python/py-charset.exp | 50 +++++++++++++++++++++++++
 4 files changed, 74 insertions(+)
 create mode 100644 gdb/testsuite/gdb.python/py-charset.exp

diff --git a/gdb/NEWS b/gdb/NEWS
index c1f30563a93..8c13cefb22f 100644
--- a/gdb/NEWS
+++ b/gdb/NEWS
@@ -143,6 +143,9 @@ show debug lin-lwp
      is equivalent to the existing 'maint packet' CLI command; it
      allows a user specified packet to be sent to the remote target.
 
+  ** New function gdb.host_charset(), returns a string, which is the
+     name of the current host charset.
+
 * New features in the GDB remote stub, GDBserver
 
   ** GDBserver is now supported on OpenRISC GNU/Linux.
diff --git a/gdb/doc/python.texi b/gdb/doc/python.texi
index 6bd5f6b90ac..38fce5b38e3 100644
--- a/gdb/doc/python.texi
+++ b/gdb/doc/python.texi
@@ -518,6 +518,14 @@
 never returned.
 @end defun
 
+@findex gdb.host_charset
+@defun gdb.host_charset ()
+Return a string, the name of the current host character set
+(@pxref{Character Sets}).  This differs from
+@code{gdb.parameter('host-charset')} in that @samp{auto} is never
+returned.
+@end defun
+
 @findex gdb.solib_name
 @defun gdb.solib_name (address)
 Return the name of the shared library holding the given @var{address}
diff --git a/gdb/python/python.c b/gdb/python/python.c
index e05b99c0bec..4dcda53d9ab 100644
--- a/gdb/python/python.c
+++ b/gdb/python/python.c
@@ -571,6 +571,16 @@ gdbpy_target_wide_charset (PyObject *self, PyObject *args)
   return PyUnicode_Decode (cset, strlen (cset), host_charset (), NULL);
 }
 
+/* Implement gdb.host_charset().  */
+
+static PyObject *
+gdbpy_host_charset (PyObject *self, PyObject *args)
+{
+  const char *cset = host_charset ();
+
+  return PyUnicode_Decode (cset, strlen (cset), host_charset (), NULL);
+}
+
 /* A Python function which evaluates a string using the gdb CLI.  */
 
 static PyObject *
@@ -2281,6 +2291,9 @@ Return the name of the current target charset." },
   { "target_wide_charset", gdbpy_target_wide_charset, METH_NOARGS,
     "target_wide_charset () -> string.\n\
 Return the name of the current target wide charset." },
+  { "host_charset", gdbpy_host_charset, METH_NOARGS,
+    "host_charset () -> string.\n\
+Return the name of the current host charset." },
   { "rbreak", (PyCFunction) gdbpy_rbreak, METH_VARARGS | METH_KEYWORDS,
     "rbreak (Regex) -> List.\n\
 Return a Tuple containing gdb.Breakpoint objects that match the given Regex." },
diff --git a/gdb/testsuite/gdb.python/py-charset.exp b/gdb/testsuite/gdb.python/py-charset.exp
new file mode 100644
index 00000000000..e4af0e5b56f
--- /dev/null
+++ b/gdb/testsuite/gdb.python/py-charset.exp
@@ -0,0 +1,50 @@
+# Copyright 2022 Free Software Foundation, Inc.
+
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+load_lib gdb-python.exp
+
+gdb_exit
+gdb_start
+
+# Skip all tests if Python scripting is not enabled.
+if { [skip_python_tests] } { continue }
+
+# Each test data has 4 parts:
+# 1. The string used in 'show XXX-charset' command,
+# 2. The string expected in the output of the command used in #1,
+# 3. The string used is gdb.XXXX_charset() python function call,
+# 4. A string that is a regexp appended to the result of #1, used to
+#    match the output of #3
+foreach test_data { {host host host ""} \
+			{target target target ""} \
+			{target-wide "target wide" \
+			     "target_wide" "(LE|BE)?"} } {
+    with_test_prefix "charset=[lindex $test_data 0]" {
+	set charset "unknown"
+	gdb_test_multiple "show [lindex $test_data 0]-charset" "" {
+	    -re "The [lindex $test_data 1] character set is \"auto; currently (\[^\"\]*)\".*$gdb_prompt $" {
+		set charset $expect_out(1,string)
+		pass $gdb_test_name
+	    }
+	    -re "The [lindex $test_data 1] character set is \"(\[^\"\]*)\".*$gdb_prompt $" {
+		set charset $expect_out(1,string)
+		pass $gdb_test_name
+	    }
+	}
+	set charset "${charset}[lindex $test_data 3]"
+	gdb_test "python print(gdb.[lindex $test_data 2]_charset())" \
+	    "${charset}"
+    }
+}
-- 
2.25.4


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCHv2 2/2] gdb/python: handle non utf-8 characters when source highlighting
  2022-01-12 14:30 ` [PATCHv2 0/2] Source highlight non utf-8 characters using Python Andrew Burgess
  2022-01-12 14:30   ` [PATCHv2 1/2] gdb/python: add gdb.host_charset function Andrew Burgess
@ 2022-01-12 14:30   ` Andrew Burgess
  2022-01-12 15:32     ` Tom Tromey
  1 sibling, 1 reply; 36+ messages in thread
From: Andrew Burgess @ 2022-01-12 14:30 UTC (permalink / raw)
  To: gdb-patches; +Cc: Andrew Burgess

This commit adds support for source files that contain non utf-8
characters when performing source styling using the Python pygments
package.  This does not change the behaviour of GDB when the GNU
Source Highlight library is used.

For the following problem description, assume that either GDB is built
without GNU Source Highlight support, of that this has been disabled
using 'maintenance set gnu-source-highlight enabled off'.

The initial problem reported was that a source file containing non
utf-8 characters would cause GDB to print a Python exception, and then
display the source without styling, e.g.:

  Python Exception <class 'UnicodeDecodeError'>: 'utf-8' codec can't decode byte 0xc0 in position 142: invalid start byte
  /* Source code here, without styling...  */

Further, as the user steps through different source files, each time
the problematic source file was evicted from the source cache, and
then later reloaded, the exception would be printed again.

Finally, this problem is only present when using Python 3, this issue
is not present for Python 2.

What makes this especially frustrating is that GDB can clearly print
the source file contents, they're right there...  If we disable
styling completely, or make use of the GNU Source Highlight library,
then everything is fine.  So why is there an error when we try to
apply styling using Python?

The problem is the use of PyString_FromString (which is an alias for
PyUnicode_FromString in Python 3), this function converts a C string
into a either a Unicode object (Py3) or a str object (Py2).  For
Python 2 there is no unicode encoding performed during this function
call, but for Python 3 the input is assumed to be a uft-8 encoding
string for the purpose of the conversion.  And here of course, is the
problem, if the source file contains non utf-8 characters, then it
should not be treated as utf-8, but that's what we do, and that's why
we get an error.

My first thought when looking at this was to spot when the
PyString_FromString call failed with a UnicodeDecodeError and silently
ignore the error.  This would mean that GDB would print the source
without styling, but would also avoid the annoying exception message.

However, I also make use of `pygmentize`, a command line wrapper
around the Python pygments module, which I use to apply syntax
highlighting in the output of `less`.  And this command line wrapper
is quite happy to syntax highlight my source file that contains non
utf-8 characters, so it feels like the problem should be solvable.

It turns out that inside the pygments module there is already support
for guessing the encoding of the incoming file content, if the
incoming content is not already a Unicode string.  This is what
happens for Python 2 where the incoming content is of `str` type.

We could try and make GDB smarter when it comes to converting C
strings into Python Unicode objects; this would probably require us to
just try a couple of different encoding schemes rather than just
giving up after utf-8.

However, I figure, why bother?  The pygments module already does this
for us, and the colorize API is not part of the documented external
API of GDB.  So, why not just change the colorize API, instead of the
content being a Unicode string (for Python 3), lets just make the
content be a bytes object.  The pygments module can then take
responsibility for guessing the encoding.

So, currently, the colorize API receives a unicode object, and returns
a unicode object.  I propose that the colorize API receive a bytes
object, and return a bytes object.
---
 gdb/python/lib/gdb/__init__.py                |  5 +-
 gdb/python/python.c                           | 38 +++++++----
 gdb/testsuite/gdb.python/py-source-styling.c  | 29 +++++++++
 .../gdb.python/py-source-styling.exp          | 64 +++++++++++++++++++
 4 files changed, 121 insertions(+), 15 deletions(-)
 create mode 100644 gdb/testsuite/gdb.python/py-source-styling.c
 create mode 100644 gdb/testsuite/gdb.python/py-source-styling.exp

diff --git a/gdb/python/lib/gdb/__init__.py b/gdb/python/lib/gdb/__init__.py
index 11a1b444bfd..b43ec506783 100644
--- a/gdb/python/lib/gdb/__init__.py
+++ b/gdb/python/lib/gdb/__init__.py
@@ -239,10 +239,13 @@ try:
         try:
             lexer = lexers.get_lexer_for_filename(filename, stripnl=False)
             formatter = formatters.TerminalFormatter()
-            return highlight(contents, lexer, formatter)
+            return highlight(contents, lexer, formatter).encode(
+                _gdb.host_charset(), "backslashreplace"
+            )
         except:
             return None
 
+
 except:
 
     def colorize(filename, contents):
diff --git a/gdb/python/python.c b/gdb/python/python.c
index 4dcda53d9ab..390a65cd20b 100644
--- a/gdb/python/python.c
+++ b/gdb/python/python.c
@@ -1157,7 +1157,25 @@ gdbpy_colorize (const std::string &filename, const std::string &contents)
       gdbpy_print_stack ();
       return {};
     }
-  gdbpy_ref<> contents_arg (PyString_FromString (contents.c_str ()));
+
+  /* The pygments library, which is what we currently use for applying
+     styling, is happy to take input as a bytes object, and to figure out
+     the encoding for itself.  This removes the need for us to figure out
+     (guess?) at how the content is encoded, which is probably a good
+     thing.
+
+     In an ideal world we should be able to use the host_charset here, but
+     the host_charset is most likely to represent the capabilities of the
+     users terminal, rather than the encoding of a particular source file.
+
+     Another possibility would be to use Python's ability to escape
+     unknown characters when converting the bytes into unicode.  However,
+     by passing the raw file bytes to Python we can differ this choice.
+     This gives us (or potentially a user) the option of taking an approach
+     that works for them, rather than hard coding a single fixed choice
+     into GDB.  */
+  gdbpy_ref<> contents_arg (PyBytes_FromStringAndSize (contents.c_str (),
+						       contents.size ()));
   if (contents_arg == nullptr)
     {
       gdbpy_print_stack ();
@@ -1174,25 +1192,17 @@ gdbpy_colorize (const std::string &filename, const std::string &contents)
       return {};
     }
 
-  if (!gdbpy_is_string (result.get ()))
+  if (result == Py_None)
     return {};
-
-  gdbpy_ref<> unic = python_string_to_unicode (result.get ());
-  if (unic == nullptr)
-    {
-      gdbpy_print_stack ();
-      return {};
-    }
-  gdbpy_ref<> host_str (PyUnicode_AsEncodedString (unic.get (),
-						   host_charset (),
-						   nullptr));
-  if (host_str == nullptr)
+  else if (!PyBytes_Check (result.get ()))
     {
+      PyErr_SetString (PyExc_TypeError,
+		       _("Return value from gdb.colorize should be a bytes object."));
       gdbpy_print_stack ();
       return {};
     }
 
-  return std::string (PyBytes_AsString (host_str.get ()));
+  return std::string (PyBytes_AsString (result.get ()));
 }
 
 \f
diff --git a/gdb/testsuite/gdb.python/py-source-styling.c b/gdb/testsuite/gdb.python/py-source-styling.c
new file mode 100644
index 00000000000..6a1a9d59a8c
--- /dev/null
+++ b/gdb/testsuite/gdb.python/py-source-styling.c
@@ -0,0 +1,29 @@
+/* This testcase is part of GDB, the GNU debugger.
+
+   Copyright 2022 Free Software Foundation, Inc.
+
+   This program is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
+
+int
+main ()
+{
+  int some_variable = 1234;
+
+  /* The following line contains a character that is non-utf-8.  This is a
+     critical part of the test as Python 3 can't convert this into a string
+     using its default mechanism.  */
+  char c = '�';		/* List this line.  */
+
+  return 0;
+}
diff --git a/gdb/testsuite/gdb.python/py-source-styling.exp b/gdb/testsuite/gdb.python/py-source-styling.exp
new file mode 100644
index 00000000000..68bbc9f9758
--- /dev/null
+++ b/gdb/testsuite/gdb.python/py-source-styling.exp
@@ -0,0 +1,64 @@
+# Copyright (C) 2022 Free Software Foundation, Inc.
+
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+# This file is part of the GDB testsuite.  It checks for memory leaks
+# associated with allocating and deallocation gdb.Inferior objects.
+
+load_lib gdb-python.exp
+
+standard_testfile
+
+save_vars { env(TERM) } {
+    # We need an ANSI-capable terminal to get the output, additionally
+    # we need to set LC_ALL so GDB knows the terminal is UTF-8
+    # capable, otherwise we'll get a UnicodeEncodeError trying to
+    # encode the output.
+    setenv TERM ansi
+
+    if { [prepare_for_testing "failed to prepare" ${testfile} ${srcfile}] } {
+	return -1
+    }
+
+    if { [skip_python_tests] } { continue }
+
+    if { ![gdb_py_module_available "pygments"] } {
+	unsupported "pygments module not available"
+	return -1
+    }
+
+    if ![runto_main] {
+	return
+    }
+
+    gdb_test_no_output "maint set gnu-source-highlight enabled off"
+
+    gdb_test "maint flush source-cache" "Source cache flushed\\."
+
+    set seen_style_escape false
+    set line_number [gdb_get_line_number "List this line."]
+    gdb_test_multiple "list ${line_number}" "" {
+	-re "Python Exception.*" {
+	    fail $gdb_test_name
+	}
+	-re "\033" {
+	    set seen_style_escape true
+	    exp_continue
+	}
+	-re "$gdb_prompt $" {
+	    gdb_assert { $seen_style_escape }
+	    pass $gdb_test_name
+	}
+    }
+}
-- 
2.25.4


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 1/2] gdb/python: add gdb.host_charset function
  2022-01-12 14:30   ` [PATCHv2 1/2] gdb/python: add gdb.host_charset function Andrew Burgess
@ 2022-01-12 15:02     ` Eli Zaretskii
  2022-01-12 15:23     ` Tom Tromey
  2022-01-12 16:05     ` Andrew Burgess
  2 siblings, 0 replies; 36+ messages in thread
From: Eli Zaretskii @ 2022-01-12 15:02 UTC (permalink / raw)
  To: Andrew Burgess; +Cc: gdb-patches

> Date: Wed, 12 Jan 2022 14:30:27 +0000
> From: Andrew Burgess via Gdb-patches <gdb-patches@sourceware.org>
> Cc: Andrew Burgess <aburgess@redhat.com>
> 
> We already have gdb.target_charset and gdb.target_wide_charset.  This
> commit adds gdb.host_charset along the same lines.
> ---
>  gdb/NEWS                                |  3 ++
>  gdb/doc/python.texi                     |  8 ++++
>  gdb/python/python.c                     | 13 +++++++
>  gdb/testsuite/gdb.python/py-charset.exp | 50 +++++++++++++++++++++++++
>  4 files changed, 74 insertions(+)
>  create mode 100644 gdb/testsuite/gdb.python/py-charset.exp

OK for the documentation parts.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 1/2] gdb/python: add gdb.host_charset function
  2022-01-12 14:30   ` [PATCHv2 1/2] gdb/python: add gdb.host_charset function Andrew Burgess
  2022-01-12 15:02     ` Eli Zaretskii
@ 2022-01-12 15:23     ` Tom Tromey
  2022-01-12 16:05     ` Andrew Burgess
  2 siblings, 0 replies; 36+ messages in thread
From: Tom Tromey @ 2022-01-12 15:23 UTC (permalink / raw)
  To: Andrew Burgess via Gdb-patches; +Cc: Andrew Burgess

>>>>> "Andrew" == Andrew Burgess via Gdb-patches <gdb-patches@sourceware.org> writes:

Andrew> We already have gdb.target_charset and gdb.target_wide_charset.  This
Andrew> commit adds gdb.host_charset along the same lines.

Looks good.

Tom

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 2/2] gdb/python: handle non utf-8 characters when source highlighting
  2022-01-12 14:30   ` [PATCHv2 2/2] gdb/python: handle non utf-8 characters when source highlighting Andrew Burgess
@ 2022-01-12 15:32     ` Tom Tromey
  2022-01-12 15:59       ` Andrew Burgess
  0 siblings, 1 reply; 36+ messages in thread
From: Tom Tromey @ 2022-01-12 15:32 UTC (permalink / raw)
  To: Andrew Burgess via Gdb-patches; +Cc: Andrew Burgess

>>>>> "Andrew" == Andrew Burgess via Gdb-patches <gdb-patches@sourceware.org> writes:

Andrew> -            return highlight(contents, lexer, formatter)
Andrew> +            return highlight(contents, lexer, formatter).encode(
Andrew> +                _gdb.host_charset(), "backslashreplace"

I think the rest of this file doesn't use the '_gdb.' prefix, because
there's a star import of _gdb.

This approach isn't super (it defeats Python analyzers like flake8) but
it seems more clear to be consistent.

Andrew> +     In an ideal world we should be able to use the host_charset here, but
Andrew> +     the host_charset is most likely to represent the capabilities of the
Andrew> +     users terminal, rather than the encoding of a particular source file.

This comment mentions not using host_charset, but then the code does use
it.

I think host_charset really does represent both the expected encoding of
things -- file names, file contents, etc -- and also the terminal
capabilities.  At least, that's my mental model of how the locale system
works.  It's not a great setup of course, since in reality it's probably
pretty normal to have some files in some encoding and other files in
other ones.

Andrew> +     Another possibility would be to use Python's ability to escape
Andrew> +     unknown characters when converting the bytes into unicode.  However,

This also seems to be done, but the comment makes it sound as though it
is not.

The code changes seem fine to me.

thank you,
Tom

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 2/2] gdb/python: handle non utf-8 characters when source highlighting
  2022-01-12 15:32     ` Tom Tromey
@ 2022-01-12 15:59       ` Andrew Burgess
  2022-01-19 17:44         ` Andrew Burgess
  2022-01-21 16:59         ` Simon Marchi
  0 siblings, 2 replies; 36+ messages in thread
From: Andrew Burgess @ 2022-01-12 15:59 UTC (permalink / raw)
  To: gdb-patches; +Cc: Tom Tromey

* Tom Tromey <tom@tromey.com> [2022-01-12 08:32:18 -0700]:

> >>>>> "Andrew" == Andrew Burgess via Gdb-patches <gdb-patches@sourceware.org> writes:
> 
> Andrew> -            return highlight(contents, lexer, formatter)
> Andrew> +            return highlight(contents, lexer, formatter).encode(
> Andrew> +                _gdb.host_charset(), "backslashreplace"
> 
> I think the rest of this file doesn't use the '_gdb.' prefix, because
> there's a star import of _gdb.
> 
> This approach isn't super (it defeats Python analyzers like flake8) but
> it seems more clear to be consistent.
> 
> Andrew> +     In an ideal world we should be able to use the host_charset here, but
> Andrew> +     the host_charset is most likely to represent the capabilities of the
> Andrew> +     users terminal, rather than the encoding of a particular source file.
> 
> This comment mentions not using host_charset, but then the code does use
> it.
> 
> I think host_charset really does represent both the expected encoding of
> things -- file names, file contents, etc -- and also the terminal
> capabilities.  At least, that's my mental model of how the locale system
> works.  It's not a great setup of course, since in reality it's probably
> pretty normal to have some files in some encoding and other files in
> other ones.
> 
> Andrew> +     Another possibility would be to use Python's ability to escape
> Andrew> +     unknown characters when converting the bytes into unicode.  However,
> 
> This also seems to be done, but the comment makes it sound as though it
> is not.
> 
> The code changes seem fine to me.

Thanks for your feedback.  An updated version of the patch is
included below.  I'm hoping Simon will take a look at this too, as
he's had some really constructive feedback on Python/unicode stuff.

Thanks,
Andrew

---

commit 099d6a5168b43fbfa767ee0992a6975d9e3f651e
Author: Andrew Burgess <aburgess@redhat.com>
Date:   Fri Nov 26 13:15:28 2021 +0000

    gdb/python: handle non utf-8 characters when source highlighting
    
    This commit adds support for source files that contain non utf-8
    characters when performing source styling using the Python pygments
    package.  This does not change the behaviour of GDB when the GNU
    Source Highlight library is used.
    
    For the following problem description, assume that either GDB is built
    without GNU Source Highlight support, of that this has been disabled
    using 'maintenance set gnu-source-highlight enabled off'.
    
    The initial problem reported was that a source file containing non
    utf-8 characters would cause GDB to print a Python exception, and then
    display the source without styling, e.g.:
    
      Python Exception <class 'UnicodeDecodeError'>: 'utf-8' codec can't decode byte 0xc0 in position 142: invalid start byte
      /* Source code here, without styling...  */
    
    Further, as the user steps through different source files, each time
    the problematic source file was evicted from the source cache, and
    then later reloaded, the exception would be printed again.
    
    Finally, this problem is only present when using Python 3, this issue
    is not present for Python 2.
    
    What makes this especially frustrating is that GDB can clearly print
    the source file contents, they're right there...  If we disable
    styling completely, or make use of the GNU Source Highlight library,
    then everything is fine.  So why is there an error when we try to
    apply styling using Python?
    
    The problem is the use of PyString_FromString (which is an alias for
    PyUnicode_FromString in Python 3), this function converts a C string
    into a either a Unicode object (Py3) or a str object (Py2).  For
    Python 2 there is no unicode encoding performed during this function
    call, but for Python 3 the input is assumed to be a uft-8 encoding
    string for the purpose of the conversion.  And here of course, is the
    problem, if the source file contains non utf-8 characters, then it
    should not be treated as utf-8, but that's what we do, and that's why
    we get an error.
    
    My first thought when looking at this was to spot when the
    PyString_FromString call failed with a UnicodeDecodeError and silently
    ignore the error.  This would mean that GDB would print the source
    without styling, but would also avoid the annoying exception message.
    
    However, I also make use of `pygmentize`, a command line wrapper
    around the Python pygments module, which I use to apply syntax
    highlighting in the output of `less`.  And this command line wrapper
    is quite happy to syntax highlight my source file that contains non
    utf-8 characters, so it feels like the problem should be solvable.
    
    It turns out that inside the pygments module there is already support
    for guessing the encoding of the incoming file content, if the
    incoming content is not already a Unicode string.  This is what
    happens for Python 2 where the incoming content is of `str` type.
    
    We could try and make GDB smarter when it comes to converting C
    strings into Python Unicode objects; this would probably require us to
    just try a couple of different encoding schemes rather than just
    giving up after utf-8.
    
    However, I figure, why bother?  The pygments module already does this
    for us, and the colorize API is not part of the documented external
    API of GDB.  So, why not just change the colorize API, instead of the
    content being a Unicode string (for Python 3), lets just make the
    content be a bytes object.  The pygments module can then take
    responsibility for guessing the encoding.
    
    So, currently, the colorize API receives a unicode object, and returns
    a unicode object.  I propose that the colorize API receive a bytes
    object, and return a bytes object.

diff --git a/gdb/python/lib/gdb/__init__.py b/gdb/python/lib/gdb/__init__.py
index 11a1b444bfd..7880ed7564e 100644
--- a/gdb/python/lib/gdb/__init__.py
+++ b/gdb/python/lib/gdb/__init__.py
@@ -239,10 +239,13 @@ try:
         try:
             lexer = lexers.get_lexer_for_filename(filename, stripnl=False)
             formatter = formatters.TerminalFormatter()
-            return highlight(contents, lexer, formatter)
+            return highlight(contents, lexer, formatter).encode(
+                host_charset(), "backslashreplace"
+            )
         except:
             return None
 
+
 except:
 
     def colorize(filename, contents):
diff --git a/gdb/python/python.c b/gdb/python/python.c
index 4dcda53d9ab..b4fa882ab78 100644
--- a/gdb/python/python.c
+++ b/gdb/python/python.c
@@ -1157,13 +1157,24 @@ gdbpy_colorize (const std::string &filename, const std::string &contents)
       gdbpy_print_stack ();
       return {};
     }
-  gdbpy_ref<> contents_arg (PyString_FromString (contents.c_str ()));
+
+  /* The pygments library, which is what we currently use for applying
+     styling, is happy to take input as a bytes object, and to figure out
+     the encoding for itself.  This removes the need for us to figure out
+     (guess?) at how the content is encoded, which is probably a good
+     thing.  */
+  gdbpy_ref<> contents_arg (PyBytes_FromStringAndSize (contents.c_str (),
+						       contents.size ()));
   if (contents_arg == nullptr)
     {
       gdbpy_print_stack ();
       return {};
     }
 
+  /* Calling gdb.colorize passing in the filename (a string), and the file
+     contents (a bytes object).  This function should return either a bytes
+     object, the same contents with styling applied, or None to indicate
+     that no styling should be performed.  */
   gdbpy_ref<> result (PyObject_CallFunctionObjArgs (hook.get (),
 						    fname_arg.get (),
 						    contents_arg.get (),
@@ -1174,25 +1185,17 @@ gdbpy_colorize (const std::string &filename, const std::string &contents)
       return {};
     }
 
-  if (!gdbpy_is_string (result.get ()))
+  if (result == Py_None)
     return {};
-
-  gdbpy_ref<> unic = python_string_to_unicode (result.get ());
-  if (unic == nullptr)
-    {
-      gdbpy_print_stack ();
-      return {};
-    }
-  gdbpy_ref<> host_str (PyUnicode_AsEncodedString (unic.get (),
-						   host_charset (),
-						   nullptr));
-  if (host_str == nullptr)
+  else if (!PyBytes_Check (result.get ()))
     {
+      PyErr_SetString (PyExc_TypeError,
+		       _("Return value from gdb.colorize should be a bytes object or None."));
       gdbpy_print_stack ();
       return {};
     }
 
-  return std::string (PyBytes_AsString (host_str.get ()));
+  return std::string (PyBytes_AsString (result.get ()));
 }
 
 \f
diff --git a/gdb/testsuite/gdb.python/py-source-styling.c b/gdb/testsuite/gdb.python/py-source-styling.c
new file mode 100644
index 00000000000..6a1a9d59a8c
--- /dev/null
+++ b/gdb/testsuite/gdb.python/py-source-styling.c
@@ -0,0 +1,29 @@
+/* This testcase is part of GDB, the GNU debugger.
+
+   Copyright 2022 Free Software Foundation, Inc.
+
+   This program is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
+
+int
+main ()
+{
+  int some_variable = 1234;
+
+  /* The following line contains a character that is non-utf-8.  This is a
+     critical part of the test as Python 3 can't convert this into a string
+     using its default mechanism.  */
+  char c = '�';		/* List this line.  */
+
+  return 0;
+}
diff --git a/gdb/testsuite/gdb.python/py-source-styling.exp b/gdb/testsuite/gdb.python/py-source-styling.exp
new file mode 100644
index 00000000000..68bbc9f9758
--- /dev/null
+++ b/gdb/testsuite/gdb.python/py-source-styling.exp
@@ -0,0 +1,64 @@
+# Copyright (C) 2022 Free Software Foundation, Inc.
+
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+# This file is part of the GDB testsuite.  It checks for memory leaks
+# associated with allocating and deallocation gdb.Inferior objects.
+
+load_lib gdb-python.exp
+
+standard_testfile
+
+save_vars { env(TERM) } {
+    # We need an ANSI-capable terminal to get the output, additionally
+    # we need to set LC_ALL so GDB knows the terminal is UTF-8
+    # capable, otherwise we'll get a UnicodeEncodeError trying to
+    # encode the output.
+    setenv TERM ansi
+
+    if { [prepare_for_testing "failed to prepare" ${testfile} ${srcfile}] } {
+	return -1
+    }
+
+    if { [skip_python_tests] } { continue }
+
+    if { ![gdb_py_module_available "pygments"] } {
+	unsupported "pygments module not available"
+	return -1
+    }
+
+    if ![runto_main] {
+	return
+    }
+
+    gdb_test_no_output "maint set gnu-source-highlight enabled off"
+
+    gdb_test "maint flush source-cache" "Source cache flushed\\."
+
+    set seen_style_escape false
+    set line_number [gdb_get_line_number "List this line."]
+    gdb_test_multiple "list ${line_number}" "" {
+	-re "Python Exception.*" {
+	    fail $gdb_test_name
+	}
+	-re "\033" {
+	    set seen_style_escape true
+	    exp_continue
+	}
+	-re "$gdb_prompt $" {
+	    gdb_assert { $seen_style_escape }
+	    pass $gdb_test_name
+	}
+    }
+}


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 1/2] gdb/python: add gdb.host_charset function
  2022-01-12 14:30   ` [PATCHv2 1/2] gdb/python: add gdb.host_charset function Andrew Burgess
  2022-01-12 15:02     ` Eli Zaretskii
  2022-01-12 15:23     ` Tom Tromey
@ 2022-01-12 16:05     ` Andrew Burgess
  2 siblings, 0 replies; 36+ messages in thread
From: Andrew Burgess @ 2022-01-12 16:05 UTC (permalink / raw)
  To: gdb-patches

Thanks for the reviews.  I pushed this patch.  I'm holding off on
patch 2/2 to give more folk a chance to comment.

Thanks,
Andrew

* Andrew Burgess <aburgess@redhat.com> [2022-01-12 14:30:27 +0000]:

> We already have gdb.target_charset and gdb.target_wide_charset.  This
> commit adds gdb.host_charset along the same lines.
> ---
>  gdb/NEWS                                |  3 ++
>  gdb/doc/python.texi                     |  8 ++++
>  gdb/python/python.c                     | 13 +++++++
>  gdb/testsuite/gdb.python/py-charset.exp | 50 +++++++++++++++++++++++++
>  4 files changed, 74 insertions(+)
>  create mode 100644 gdb/testsuite/gdb.python/py-charset.exp
> 
> diff --git a/gdb/NEWS b/gdb/NEWS
> index c1f30563a93..8c13cefb22f 100644
> --- a/gdb/NEWS
> +++ b/gdb/NEWS
> @@ -143,6 +143,9 @@ show debug lin-lwp
>       is equivalent to the existing 'maint packet' CLI command; it
>       allows a user specified packet to be sent to the remote target.
>  
> +  ** New function gdb.host_charset(), returns a string, which is the
> +     name of the current host charset.
> +
>  * New features in the GDB remote stub, GDBserver
>  
>    ** GDBserver is now supported on OpenRISC GNU/Linux.
> diff --git a/gdb/doc/python.texi b/gdb/doc/python.texi
> index 6bd5f6b90ac..38fce5b38e3 100644
> --- a/gdb/doc/python.texi
> +++ b/gdb/doc/python.texi
> @@ -518,6 +518,14 @@
>  never returned.
>  @end defun
>  
> +@findex gdb.host_charset
> +@defun gdb.host_charset ()
> +Return a string, the name of the current host character set
> +(@pxref{Character Sets}).  This differs from
> +@code{gdb.parameter('host-charset')} in that @samp{auto} is never
> +returned.
> +@end defun
> +
>  @findex gdb.solib_name
>  @defun gdb.solib_name (address)
>  Return the name of the shared library holding the given @var{address}
> diff --git a/gdb/python/python.c b/gdb/python/python.c
> index e05b99c0bec..4dcda53d9ab 100644
> --- a/gdb/python/python.c
> +++ b/gdb/python/python.c
> @@ -571,6 +571,16 @@ gdbpy_target_wide_charset (PyObject *self, PyObject *args)
>    return PyUnicode_Decode (cset, strlen (cset), host_charset (), NULL);
>  }
>  
> +/* Implement gdb.host_charset().  */
> +
> +static PyObject *
> +gdbpy_host_charset (PyObject *self, PyObject *args)
> +{
> +  const char *cset = host_charset ();
> +
> +  return PyUnicode_Decode (cset, strlen (cset), host_charset (), NULL);
> +}
> +
>  /* A Python function which evaluates a string using the gdb CLI.  */
>  
>  static PyObject *
> @@ -2281,6 +2291,9 @@ Return the name of the current target charset." },
>    { "target_wide_charset", gdbpy_target_wide_charset, METH_NOARGS,
>      "target_wide_charset () -> string.\n\
>  Return the name of the current target wide charset." },
> +  { "host_charset", gdbpy_host_charset, METH_NOARGS,
> +    "host_charset () -> string.\n\
> +Return the name of the current host charset." },
>    { "rbreak", (PyCFunction) gdbpy_rbreak, METH_VARARGS | METH_KEYWORDS,
>      "rbreak (Regex) -> List.\n\
>  Return a Tuple containing gdb.Breakpoint objects that match the given Regex." },
> diff --git a/gdb/testsuite/gdb.python/py-charset.exp b/gdb/testsuite/gdb.python/py-charset.exp
> new file mode 100644
> index 00000000000..e4af0e5b56f
> --- /dev/null
> +++ b/gdb/testsuite/gdb.python/py-charset.exp
> @@ -0,0 +1,50 @@
> +# Copyright 2022 Free Software Foundation, Inc.
> +
> +# This program is free software; you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation; either version 3 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program.  If not, see <http://www.gnu.org/licenses/>.
> +
> +load_lib gdb-python.exp
> +
> +gdb_exit
> +gdb_start
> +
> +# Skip all tests if Python scripting is not enabled.
> +if { [skip_python_tests] } { continue }
> +
> +# Each test data has 4 parts:
> +# 1. The string used in 'show XXX-charset' command,
> +# 2. The string expected in the output of the command used in #1,
> +# 3. The string used is gdb.XXXX_charset() python function call,
> +# 4. A string that is a regexp appended to the result of #1, used to
> +#    match the output of #3
> +foreach test_data { {host host host ""} \
> +			{target target target ""} \
> +			{target-wide "target wide" \
> +			     "target_wide" "(LE|BE)?"} } {
> +    with_test_prefix "charset=[lindex $test_data 0]" {
> +	set charset "unknown"
> +	gdb_test_multiple "show [lindex $test_data 0]-charset" "" {
> +	    -re "The [lindex $test_data 1] character set is \"auto; currently (\[^\"\]*)\".*$gdb_prompt $" {
> +		set charset $expect_out(1,string)
> +		pass $gdb_test_name
> +	    }
> +	    -re "The [lindex $test_data 1] character set is \"(\[^\"\]*)\".*$gdb_prompt $" {
> +		set charset $expect_out(1,string)
> +		pass $gdb_test_name
> +	    }
> +	}
> +	set charset "${charset}[lindex $test_data 3]"
> +	gdb_test "python print(gdb.[lindex $test_data 2]_charset())" \
> +	    "${charset}"
> +    }
> +}
> -- 
> 2.25.4
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 2/2] gdb/python: handle non utf-8 characters when source highlighting
  2022-01-12 15:59       ` Andrew Burgess
@ 2022-01-19 17:44         ` Andrew Burgess
  2022-01-21 16:59         ` Simon Marchi
  1 sibling, 0 replies; 36+ messages in thread
From: Andrew Burgess @ 2022-01-19 17:44 UTC (permalink / raw)
  To: Simon Marchi; +Cc: gdb-patches

Simon,

I wonder if you have any thoughts on this latest iteration?

Thanks,
Andrew


* Andrew Burgess <aburgess@redhat.com> [2022-01-12 15:59:38 +0000]:

> * Tom Tromey <tom@tromey.com> [2022-01-12 08:32:18 -0700]:
> 
> > >>>>> "Andrew" == Andrew Burgess via Gdb-patches <gdb-patches@sourceware.org> writes:
> > 
> > Andrew> -            return highlight(contents, lexer, formatter)
> > Andrew> +            return highlight(contents, lexer, formatter).encode(
> > Andrew> +                _gdb.host_charset(), "backslashreplace"
> > 
> > I think the rest of this file doesn't use the '_gdb.' prefix, because
> > there's a star import of _gdb.
> > 
> > This approach isn't super (it defeats Python analyzers like flake8) but
> > it seems more clear to be consistent.
> > 
> > Andrew> +     In an ideal world we should be able to use the host_charset here, but
> > Andrew> +     the host_charset is most likely to represent the capabilities of the
> > Andrew> +     users terminal, rather than the encoding of a particular source file.
> > 
> > This comment mentions not using host_charset, but then the code does use
> > it.
> > 
> > I think host_charset really does represent both the expected encoding of
> > things -- file names, file contents, etc -- and also the terminal
> > capabilities.  At least, that's my mental model of how the locale system
> > works.  It's not a great setup of course, since in reality it's probably
> > pretty normal to have some files in some encoding and other files in
> > other ones.
> > 
> > Andrew> +     Another possibility would be to use Python's ability to escape
> > Andrew> +     unknown characters when converting the bytes into unicode.  However,
> > 
> > This also seems to be done, but the comment makes it sound as though it
> > is not.
> > 
> > The code changes seem fine to me.
> 
> Thanks for your feedback.  An updated version of the patch is
> included below.  I'm hoping Simon will take a look at this too, as
> he's had some really constructive feedback on Python/unicode stuff.
> 
> Thanks,
> Andrew
> 
> ---
> 
> commit 099d6a5168b43fbfa767ee0992a6975d9e3f651e
> Author: Andrew Burgess <aburgess@redhat.com>
> Date:   Fri Nov 26 13:15:28 2021 +0000
> 
>     gdb/python: handle non utf-8 characters when source highlighting
>     
>     This commit adds support for source files that contain non utf-8
>     characters when performing source styling using the Python pygments
>     package.  This does not change the behaviour of GDB when the GNU
>     Source Highlight library is used.
>     
>     For the following problem description, assume that either GDB is built
>     without GNU Source Highlight support, of that this has been disabled
>     using 'maintenance set gnu-source-highlight enabled off'.
>     
>     The initial problem reported was that a source file containing non
>     utf-8 characters would cause GDB to print a Python exception, and then
>     display the source without styling, e.g.:
>     
>       Python Exception <class 'UnicodeDecodeError'>: 'utf-8' codec can't decode byte 0xc0 in position 142: invalid start byte
>       /* Source code here, without styling...  */
>     
>     Further, as the user steps through different source files, each time
>     the problematic source file was evicted from the source cache, and
>     then later reloaded, the exception would be printed again.
>     
>     Finally, this problem is only present when using Python 3, this issue
>     is not present for Python 2.
>     
>     What makes this especially frustrating is that GDB can clearly print
>     the source file contents, they're right there...  If we disable
>     styling completely, or make use of the GNU Source Highlight library,
>     then everything is fine.  So why is there an error when we try to
>     apply styling using Python?
>     
>     The problem is the use of PyString_FromString (which is an alias for
>     PyUnicode_FromString in Python 3), this function converts a C string
>     into a either a Unicode object (Py3) or a str object (Py2).  For
>     Python 2 there is no unicode encoding performed during this function
>     call, but for Python 3 the input is assumed to be a uft-8 encoding
>     string for the purpose of the conversion.  And here of course, is the
>     problem, if the source file contains non utf-8 characters, then it
>     should not be treated as utf-8, but that's what we do, and that's why
>     we get an error.
>     
>     My first thought when looking at this was to spot when the
>     PyString_FromString call failed with a UnicodeDecodeError and silently
>     ignore the error.  This would mean that GDB would print the source
>     without styling, but would also avoid the annoying exception message.
>     
>     However, I also make use of `pygmentize`, a command line wrapper
>     around the Python pygments module, which I use to apply syntax
>     highlighting in the output of `less`.  And this command line wrapper
>     is quite happy to syntax highlight my source file that contains non
>     utf-8 characters, so it feels like the problem should be solvable.
>     
>     It turns out that inside the pygments module there is already support
>     for guessing the encoding of the incoming file content, if the
>     incoming content is not already a Unicode string.  This is what
>     happens for Python 2 where the incoming content is of `str` type.
>     
>     We could try and make GDB smarter when it comes to converting C
>     strings into Python Unicode objects; this would probably require us to
>     just try a couple of different encoding schemes rather than just
>     giving up after utf-8.
>     
>     However, I figure, why bother?  The pygments module already does this
>     for us, and the colorize API is not part of the documented external
>     API of GDB.  So, why not just change the colorize API, instead of the
>     content being a Unicode string (for Python 3), lets just make the
>     content be a bytes object.  The pygments module can then take
>     responsibility for guessing the encoding.
>     
>     So, currently, the colorize API receives a unicode object, and returns
>     a unicode object.  I propose that the colorize API receive a bytes
>     object, and return a bytes object.
> 
> diff --git a/gdb/python/lib/gdb/__init__.py b/gdb/python/lib/gdb/__init__.py
> index 11a1b444bfd..7880ed7564e 100644
> --- a/gdb/python/lib/gdb/__init__.py
> +++ b/gdb/python/lib/gdb/__init__.py
> @@ -239,10 +239,13 @@ try:
>          try:
>              lexer = lexers.get_lexer_for_filename(filename, stripnl=False)
>              formatter = formatters.TerminalFormatter()
> -            return highlight(contents, lexer, formatter)
> +            return highlight(contents, lexer, formatter).encode(
> +                host_charset(), "backslashreplace"
> +            )
>          except:
>              return None
>  
> +
>  except:
>  
>      def colorize(filename, contents):
> diff --git a/gdb/python/python.c b/gdb/python/python.c
> index 4dcda53d9ab..b4fa882ab78 100644
> --- a/gdb/python/python.c
> +++ b/gdb/python/python.c
> @@ -1157,13 +1157,24 @@ gdbpy_colorize (const std::string &filename, const std::string &contents)
>        gdbpy_print_stack ();
>        return {};
>      }
> -  gdbpy_ref<> contents_arg (PyString_FromString (contents.c_str ()));
> +
> +  /* The pygments library, which is what we currently use for applying
> +     styling, is happy to take input as a bytes object, and to figure out
> +     the encoding for itself.  This removes the need for us to figure out
> +     (guess?) at how the content is encoded, which is probably a good
> +     thing.  */
> +  gdbpy_ref<> contents_arg (PyBytes_FromStringAndSize (contents.c_str (),
> +						       contents.size ()));
>    if (contents_arg == nullptr)
>      {
>        gdbpy_print_stack ();
>        return {};
>      }
>  
> +  /* Calling gdb.colorize passing in the filename (a string), and the file
> +     contents (a bytes object).  This function should return either a bytes
> +     object, the same contents with styling applied, or None to indicate
> +     that no styling should be performed.  */
>    gdbpy_ref<> result (PyObject_CallFunctionObjArgs (hook.get (),
>  						    fname_arg.get (),
>  						    contents_arg.get (),
> @@ -1174,25 +1185,17 @@ gdbpy_colorize (const std::string &filename, const std::string &contents)
>        return {};
>      }
>  
> -  if (!gdbpy_is_string (result.get ()))
> +  if (result == Py_None)
>      return {};
> -
> -  gdbpy_ref<> unic = python_string_to_unicode (result.get ());
> -  if (unic == nullptr)
> -    {
> -      gdbpy_print_stack ();
> -      return {};
> -    }
> -  gdbpy_ref<> host_str (PyUnicode_AsEncodedString (unic.get (),
> -						   host_charset (),
> -						   nullptr));
> -  if (host_str == nullptr)
> +  else if (!PyBytes_Check (result.get ()))
>      {
> +      PyErr_SetString (PyExc_TypeError,
> +		       _("Return value from gdb.colorize should be a bytes object or None."));
>        gdbpy_print_stack ();
>        return {};
>      }
>  
> -  return std::string (PyBytes_AsString (host_str.get ()));
> +  return std::string (PyBytes_AsString (result.get ()));
>  }
>  
>  \f
> diff --git a/gdb/testsuite/gdb.python/py-source-styling.c b/gdb/testsuite/gdb.python/py-source-styling.c
> new file mode 100644
> index 00000000000..6a1a9d59a8c
> --- /dev/null
> +++ b/gdb/testsuite/gdb.python/py-source-styling.c
> @@ -0,0 +1,29 @@
> +/* This testcase is part of GDB, the GNU debugger.
> +
> +   Copyright 2022 Free Software Foundation, Inc.
> +
> +   This program is free software; you can redistribute it and/or modify
> +   it under the terms of the GNU General Public License as published by
> +   the Free Software Foundation; either version 3 of the License, or
> +   (at your option) any later version.
> +
> +   This program is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +   GNU General Public License for more details.
> +
> +   You should have received a copy of the GNU General Public License
> +   along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
> +
> +int
> +main ()
> +{
> +  int some_variable = 1234;
> +
> +  /* The following line contains a character that is non-utf-8.  This is a
> +     critical part of the test as Python 3 can't convert this into a string
> +     using its default mechanism.  */
> +  char c = '�';		/* List this line.  */
> +
> +  return 0;
> +}
> diff --git a/gdb/testsuite/gdb.python/py-source-styling.exp b/gdb/testsuite/gdb.python/py-source-styling.exp
> new file mode 100644
> index 00000000000..68bbc9f9758
> --- /dev/null
> +++ b/gdb/testsuite/gdb.python/py-source-styling.exp
> @@ -0,0 +1,64 @@
> +# Copyright (C) 2022 Free Software Foundation, Inc.
> +
> +# This program is free software; you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation; either version 3 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program.  If not, see <http://www.gnu.org/licenses/>.
> +
> +# This file is part of the GDB testsuite.  It checks for memory leaks
> +# associated with allocating and deallocation gdb.Inferior objects.
> +
> +load_lib gdb-python.exp
> +
> +standard_testfile
> +
> +save_vars { env(TERM) } {
> +    # We need an ANSI-capable terminal to get the output, additionally
> +    # we need to set LC_ALL so GDB knows the terminal is UTF-8
> +    # capable, otherwise we'll get a UnicodeEncodeError trying to
> +    # encode the output.
> +    setenv TERM ansi
> +
> +    if { [prepare_for_testing "failed to prepare" ${testfile} ${srcfile}] } {
> +	return -1
> +    }
> +
> +    if { [skip_python_tests] } { continue }
> +
> +    if { ![gdb_py_module_available "pygments"] } {
> +	unsupported "pygments module not available"
> +	return -1
> +    }
> +
> +    if ![runto_main] {
> +	return
> +    }
> +
> +    gdb_test_no_output "maint set gnu-source-highlight enabled off"
> +
> +    gdb_test "maint flush source-cache" "Source cache flushed\\."
> +
> +    set seen_style_escape false
> +    set line_number [gdb_get_line_number "List this line."]
> +    gdb_test_multiple "list ${line_number}" "" {
> +	-re "Python Exception.*" {
> +	    fail $gdb_test_name
> +	}
> +	-re "\033" {
> +	    set seen_style_escape true
> +	    exp_continue
> +	}
> +	-re "$gdb_prompt $" {
> +	    gdb_assert { $seen_style_escape }
> +	    pass $gdb_test_name
> +	}
> +    }
> +}


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 2/2] gdb/python: handle non utf-8 characters when source highlighting
  2022-01-12 15:59       ` Andrew Burgess
  2022-01-19 17:44         ` Andrew Burgess
@ 2022-01-21 16:59         ` Simon Marchi
  2022-01-26 23:13           ` Andrew Burgess
  1 sibling, 1 reply; 36+ messages in thread
From: Simon Marchi @ 2022-01-21 16:59 UTC (permalink / raw)
  To: Andrew Burgess, gdb-patches; +Cc: Tom Tromey

> diff --git a/gdb/python/lib/gdb/__init__.py b/gdb/python/lib/gdb/__init__.py
> index 11a1b444bfd..7880ed7564e 100644
> --- a/gdb/python/lib/gdb/__init__.py
> +++ b/gdb/python/lib/gdb/__init__.py
> @@ -239,10 +239,13 @@ try:
>          try:
>              lexer = lexers.get_lexer_for_filename(filename, stripnl=False)
>              formatter = formatters.TerminalFormatter()
> -            return highlight(contents, lexer, formatter)
> +            return highlight(contents, lexer, formatter).encode(
> +                host_charset(), "backslashreplace"
> +            )
>          except:
>              return None
>  
> +

The latest version of black removed this, see:

https://gitlab.com/gnutools/binutils-gdb/-/commit/37260e0df0772dd8378afa91ec26fc0419e1ca94

> diff --git a/gdb/testsuite/gdb.python/py-source-styling.c b/gdb/testsuite/gdb.python/py-source-styling.c
> new file mode 100644
> index 00000000000..6a1a9d59a8c
> --- /dev/null
> +++ b/gdb/testsuite/gdb.python/py-source-styling.c
> @@ -0,0 +1,29 @@
> +/* This testcase is part of GDB, the GNU debugger.
> +
> +   Copyright 2022 Free Software Foundation, Inc.
> +
> +   This program is free software; you can redistribute it and/or modify
> +   it under the terms of the GNU General Public License as published by
> +   the Free Software Foundation; either version 3 of the License, or
> +   (at your option) any later version.
> +
> +   This program is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +   GNU General Public License for more details.
> +
> +   You should have received a copy of the GNU General Public License
> +   along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
> +
> +int
> +main ()
> +{
> +  int some_variable = 1234;
> +
> +  /* The following line contains a character that is non-utf-8.  This is a
> +     critical part of the test as Python 3 can't convert this into a string
> +     using its default mechanism.  */
> +  char c = '�';		/* List this line.  */
> +
> +  return 0;

Running the test, I get:


    Running /home/simark/src/binutils-gdb/gdb/testsuite/gdb.python/py-source-styling.exp ...
    gdb compile failed, /home/simark/src/binutils-gdb/gdb/testsuite/gdb.python/py-source-styling.c: In function 'main':
    /home/simark/src/binutils-gdb/gdb/testsuite/gdb.python/py-source-styling.c:26:12: warning: multi-character character constant [-Wmultichar]
       26 |   char c = '�';         /* List this line.  */
          |            ^~~
    /home/simark/src/binutils-gdb/gdb/testsuite/gdb.python/py-source-styling.c:26:12: warning: overflow in conversion from 'int' to 'char' changes value from '15712189' to '-67' [-Woverflow]

This works:

    char c[] = "<the multi-byte character>";

Otherwise, LGTM.

Simon

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHv2 2/2] gdb/python: handle non utf-8 characters when source highlighting
  2022-01-21 16:59         ` Simon Marchi
@ 2022-01-26 23:13           ` Andrew Burgess
  0 siblings, 0 replies; 36+ messages in thread
From: Andrew Burgess @ 2022-01-26 23:13 UTC (permalink / raw)
  To: gdb-patches; +Cc: Tom Tromey, Simon Marchi

Simon Marchi via Gdb-patches <gdb-patches@sourceware.org> writes:

>> diff --git a/gdb/python/lib/gdb/__init__.py b/gdb/python/lib/gdb/__init__.py
>> index 11a1b444bfd..7880ed7564e 100644
>> --- a/gdb/python/lib/gdb/__init__.py
>> +++ b/gdb/python/lib/gdb/__init__.py
>> @@ -239,10 +239,13 @@ try:
>>          try:
>>              lexer = lexers.get_lexer_for_filename(filename, stripnl=False)
>>              formatter = formatters.TerminalFormatter()
>> -            return highlight(contents, lexer, formatter)
>> +            return highlight(contents, lexer, formatter).encode(
>> +                host_charset(), "backslashreplace"
>> +            )
>>          except:
>>              return None
>>  
>> +
>
> The latest version of black removed this, see:
>
> https://gitlab.com/gnutools/binutils-gdb/-/commit/37260e0df0772dd8378afa91ec26fc0419e1ca94
>
>> diff --git a/gdb/testsuite/gdb.python/py-source-styling.c b/gdb/testsuite/gdb.python/py-source-styling.c
>> new file mode 100644
>> index 00000000000..6a1a9d59a8c
>> --- /dev/null
>> +++ b/gdb/testsuite/gdb.python/py-source-styling.c
>> @@ -0,0 +1,29 @@
>> +/* This testcase is part of GDB, the GNU debugger.
>> +
>> +   Copyright 2022 Free Software Foundation, Inc.
>> +
>> +   This program is free software; you can redistribute it and/or modify
>> +   it under the terms of the GNU General Public License as published by
>> +   the Free Software Foundation; either version 3 of the License, or
>> +   (at your option) any later version.
>> +
>> +   This program is distributed in the hope that it will be useful,
>> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> +   GNU General Public License for more details.
>> +
>> +   You should have received a copy of the GNU General Public License
>> +   along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
>> +
>> +int
>> +main ()
>> +{
>> +  int some_variable = 1234;
>> +
>> +  /* The following line contains a character that is non-utf-8.  This is a
>> +     critical part of the test as Python 3 can't convert this into a string
>> +     using its default mechanism.  */
>> +  char c = '�';		/* List this line.  */
>> +
>> +  return 0;
>
> Running the test, I get:
>
>
>     Running /home/simark/src/binutils-gdb/gdb/testsuite/gdb.python/py-source-styling.exp ...
>     gdb compile failed, /home/simark/src/binutils-gdb/gdb/testsuite/gdb.python/py-source-styling.c: In function 'main':
>     /home/simark/src/binutils-gdb/gdb/testsuite/gdb.python/py-source-styling.c:26:12: warning: multi-character character constant [-Wmultichar]
>        26 |   char c = '�';         /* List this line.  */
>           |            ^~~
>     /home/simark/src/binutils-gdb/gdb/testsuite/gdb.python/py-source-styling.c:26:12: warning: overflow in conversion from 'int' to 'char' changes value from '15712189' to '-67' [-Woverflow]
>
> This works:
>
>     char c[] = "<the multi-byte character>";
>
> Otherwise, LGTM.

Thanks, I made these fixes and pushed.

Andrew


^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2022-01-26 23:13 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-07 14:23 [PATCH 0/4] Source highlight non utf-8 characters using Python Andrew Burgess
2022-01-07 14:23 ` [PATCH 1/4] gdb: new 'maint flush source-cache' command Andrew Burgess
2022-01-07 14:49   ` Eli Zaretskii
2022-01-11 12:13     ` Andrew Burgess
2022-01-11 13:31       ` Eli Zaretskii
2022-01-12 11:38         ` Andrew Burgess
2022-01-10 15:18   ` Tom Tromey
2022-01-07 14:23 ` [PATCH 2/4] gdb: erase items from the source_cache::m_offset_cache Andrew Burgess
2022-01-10 15:24   ` Tom Tromey
2022-01-11 12:17     ` Andrew Burgess
2022-01-11 14:54       ` Tom Tromey
2022-01-12 11:38         ` Andrew Burgess
2022-01-07 14:23 ` [PATCH 3/4] gdb: add 'maint set/show gnu-source-highlight enabled' command Andrew Burgess
2022-01-07 14:53   ` Eli Zaretskii
2022-01-11 13:07     ` Andrew Burgess
2022-01-11 13:34       ` Eli Zaretskii
2022-01-12 11:37         ` Andrew Burgess
2022-01-10 15:58   ` Tom Tromey
2022-01-07 14:23 ` [PATCH 4/4] gdb/python: handle non utf-8 characters when source highlighting Andrew Burgess
2022-01-10  3:27   ` Simon Marchi
2022-01-10 10:41     ` Andrew Burgess
2022-01-10 15:32       ` Simon Marchi
2022-01-11 13:10         ` Andrew Burgess
2022-01-11 19:24   ` Tom Tromey
2022-01-11 19:42     ` Patrick Monnerat
2022-01-12 14:30 ` [PATCHv2 0/2] Source highlight non utf-8 characters using Python Andrew Burgess
2022-01-12 14:30   ` [PATCHv2 1/2] gdb/python: add gdb.host_charset function Andrew Burgess
2022-01-12 15:02     ` Eli Zaretskii
2022-01-12 15:23     ` Tom Tromey
2022-01-12 16:05     ` Andrew Burgess
2022-01-12 14:30   ` [PATCHv2 2/2] gdb/python: handle non utf-8 characters when source highlighting Andrew Burgess
2022-01-12 15:32     ` Tom Tromey
2022-01-12 15:59       ` Andrew Burgess
2022-01-19 17:44         ` Andrew Burgess
2022-01-21 16:59         ` Simon Marchi
2022-01-26 23:13           ` Andrew Burgess

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).