From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <aburgess@sourceware.org>
Received: by sourceware.org (Postfix, from userid 1726)
	id 7020F3858297; Sun,  2 Oct 2022 16:36:37 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 7020F3858297
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org;
	s=default; t=1664728597;
	bh=jZtB1FozB0c/SL0ZXTVbazNpC853B7Zi5sTObRzjTZI=;
	h=From:To:Subject:Date:From;
	b=PuzrlrELSvAeW43U4WJOIhLTr7TSrED85lPVteHXVk+6ev51H/Z5MxgTTiXdJcROi
	 rfizAiS75sYmLT6WAPCz3jAqOfAUA6cP/HL/6DsMMZeQo29bvIrgwPrgORSc3mH+qb
	 Fv2VI8va+mwN38UpUhAIHrgJk6kncWd9Ld4Kg05w=
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
From: Andrew Burgess <aburgess@sourceware.org>
To: gdb-cvs@sourceware.org
Subject: [binutils-gdb] gdb/disasm: better intel flavour disassembly styling with Pygments
X-Act-Checkin: binutils-gdb
X-Git-Author: Andrew Burgess <aburgess@redhat.com>
X-Git-Refname: refs/heads/master
X-Git-Oldrev: f22c50c22cbff208e3b024e344c0cb7d88dc0835
X-Git-Newrev: 6deb7a8185b9f359289bad3e86d7be248ac75550
Message-Id: <20221002163637.7020F3858297@sourceware.org>
Date: Sun,  2 Oct 2022 16:36:37 +0000 (GMT)
List-Id: <gdb-cvs.sourceware.org>

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=6deb7a8185b9f359289bad3e86d7be248ac75550

commit 6deb7a8185b9f359289bad3e86d7be248ac75550
Author: Andrew Burgess <aburgess@redhat.com>
Date:   Sat Aug 27 16:15:31 2022 +0100

    gdb/disasm: better intel flavour disassembly styling with Pygments
   =20
    This commit was inspired by this stackoverflow post:
   =20
      https://stackoverflow.com/questions/73491793/why-is-there-a-%C2%B1-in-lea-rax-rip-%C2%B1-0xeb3
   =20
    One of the comments helpfully links to this Python test case:
   =20
      from pygments import formatters, lexers, highlight
   =20
      def colorize_disasm(content, gdbarch):
          try:
              lexer = lexers.get_lexer_by_name("asm")
              formatter = formatters.TerminalFormatter()
              return highlight(content, lexer, formatter).rstrip().encode()
          except:
              return None
   =20
      print(colorize_disasm("lea [rip+0x211]  # COMMENT", None).decode())
   =20
    Run the test case and you should see that the '+' character is
    underlined, and could be confused with a combined +/- symbol.
   =20
    What's happening is that Pygments is failing to parse the input text,
    and the '+' is actually being marked in the error style.  The error
    style is red and underlined.
   =20
    It is worth noting that the assembly instruction being disassembled
    here is an x86-64 instruction in the 'intel' disassembly style, rather
    than the default att style.  Clearly the Pygments module expects the
    att syntax by default.
   =20
    If we change the test case to this:
   =20
      from pygments import formatters, lexers, highlight
   =20
      def colorize_disasm(content, gdbarch):
          try:
              lexer = lexers.get_lexer_by_name("asm")
              lexer.add_filter('raiseonerror')
              formatter = formatters.TerminalFormatter()
              return highlight(content, lexer, formatter).rstrip().encode()
          except:
              return None
   =20
      res = colorize_disasm("lea rax,[rip+0xeb3] # COMMENT", None)
      if res:
          print(res.decode())
      else:
          print("No result!")
   =20
    Here I've added the call lexer.add_filter('raiseonerror'), and I now
    check whether the result is None.  Running this, the test prints
    'No result!': instead of styling the '+' in the error style, Pygments
    raises an exception and we give up on the styling attempt.
   =20
    There are two things we need to fix relating to this disassembly
    text.  First, Pygments is expecting att style disassembly, not the
    intel style that this example uses.  Fortunately, Pygments also
    supports the intel style; all we need to do is use the 'nasm' lexer
    instead of the 'asm' lexer.
   =20
    However, this leads to the second problem: in our disassembler line we
    have '# COMMENT'.  The "official" Intel disassembler style uses ';'
    as its comment character.  However, gas and libopcodes use '#' as the
    comment character, because gas uses ';' as an instruction separator.
   =20
    Unfortunately, Pygments expects ';' as the comment character, and
    treats '#' as an error, which means that, with the addition of the
    'raiseonerror' filter, any line containing a '#' comment will not be
    styled at all.
   =20
    However, as the i386 disassembler never produces a '#' character other
    than for comments, we can easily "fix" Pygments' parsing of the
    disassembly line.  This is done by creating a filter.  The filter
    looks for an Error token with the value '#' and changes it into a
    comment token.  Every token after this (until the end of the line)
    is also converted into a comment.
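
    The retagging pass described above can be sketched without Pygments
    installed.  This is only an illustration: the token types here are
    plain strings standing in for pygments.token.Error, Comment.Single
    and Text, and the sample token stream is made up for the example
    rather than taken from a real lexer run.

```python
# Stand-in token types (assumption for this sketch); real code would use
# pygments.token.Error, Comment.Single and Text instead of strings.
ERROR, COMMENT, TEXT, NAME = "Error", "Comment.Single", "Text", "Name"


def fix_comments(stream):
    """Retag '#' and everything after it on the same line as a comment."""
    in_comment = False
    for ttype, value in stream:
        # An Error token holding '#' marks the start of a comment.
        if ttype == ERROR and value == "#":
            in_comment = True
        if in_comment:
            if ttype == TEXT and value == "\n":
                # The newline ends the comment and keeps its own type.
                in_comment = False
            else:
                ttype = COMMENT
        yield ttype, value


# Hypothetical token stream for two lines of disassembly.
tokens = [
    (NAME, "lea"), (TEXT, " "), (NAME, "rax"),
    (ERROR, "#"), (TEXT, " "), (NAME, "COMMENT"), (TEXT, "\n"),
    (NAME, "ret"),
]
fixed = list(fix_comments(tokens))
```

    Tokens before the '#' and on the following line keep their original
    types, so only the comment text changes style.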
   =20
    In this commit I do the following:
   =20
      1. Check the 'disassembly-flavor' setting and select between the
      'asm' and 'nasm' lexers based on the setting.  If the setting is not
      available then the 'asm' lexer is used by default,
   =20
      2. Use "add_filter('raiseonerror')" to ensure that the formatted
      output will not include any error text, which would be underlined,
      and might be confusing,
   =20
      3. If the 'nasm' lexer is selected, then add an additional filter
      that will format '#' and all other text on the line, as a comment,
      and
   =20
      4. If Pygments throws an exception, instead of returning None,
      return the original, unmodified content.  This will mean that this
      one instruction is printed without styling, but GDB will continue to
      call into the Python code to style later instructions.
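
    Steps 1 and 4 above can be sketched in isolation as follows.  The
    helper names pick_lexer_name and style_or_passthrough are
    hypothetical and exist only for this illustration; the flavor and
    architecture name are passed in as plain strings instead of being
    read from GDB, and the styler is any callable.

```python
def pick_lexer_name(flavor, arch_name):
    """Return 'nasm' for i386-family architectures in intel mode, else 'asm'."""
    if flavor == "intel" and arch_name.startswith("i386"):
        return "nasm"
    # Also covers the case where no flavor setting is available (None).
    return "asm"


def style_or_passthrough(content, styler):
    """Apply styler to content; on any failure return content unchanged."""
    try:
        return styler(content)
    except Exception:
        # Give up on styling this one instruction only.
        return content


def failing_styler(text):
    """Hypothetical styler that fails, as a lexer with 'raiseonerror' might."""
    raise ValueError("lexer reported an error token")


unstyled = style_or_passthrough("lea rax,[rip+0xeb3]", failing_styler)
```

    Returning the original content, rather than None, is what lets later
    instructions still be passed to the styling code.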
   =20
    I haven't included a test specifically for the above error case,
    though I have manually checked that the above case now styles
    correctly (with no underline).  The existing style tests check that
    the disassembler styling still works though, so I know I've not
    generally broken things.
   =20
    One final thought I have after looking at this issue is that I wonder
    now if using Pygments for styling disassembly from every architecture
    is actually a good idea?
   =20
    Clearly, the 'asm' lexer is OK with att style x86-64, but not OK with
    intel style x86-64, so who knows how well it will handle other random
    architectures?
   =20
    When I first added this feature I tested it against some random
    RISC-V, ARM, and X86-64 (att style) code, and it seemed fine, but I
    never tried to make an exhaustive check of all instructions, so it's
    quite possible that there are corner cases where things are styled
    incorrectly.
   =20
    With the above changes I think that things should be a bit better
    now.  If a particular instruction doesn't parse correctly then our
    Pygments-based styling code will just not style that one instruction.
    This is combined with the fact that many architectures are now moving
    to libopcodes-based styling, which is much more reliable.
   =20
    So, I think it is fine to keep using Pygments as a fallback mechanism
    for styling all architectures, even if we know it might not be perfect
    in all cases.

Diff:
---
 gdb/python/lib/gdb/styling.py | 59 ++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 55 insertions(+), 4 deletions(-)

diff --git a/gdb/python/lib/gdb/styling.py b/gdb/python/lib/gdb/styling.py
index aef39c6857c..b97f1dd7fb8 100644
--- a/gdb/python/lib/gdb/styling.py
+++ b/gdb/python/lib/gdb/styling.py
@@ -20,26 +20,77 @@ import gdb
=20
 try:
     from pygments import formatters, lexers, highlight
+    from pygments.token import Error, Comment, Text
+    from pygments.filters import TokenMergeFilter
+
+    _formatter = None
+
+    def get_formatter():
+        global _formatter
+        if _formatter is None:
+            _formatter = formatters.TerminalFormatter()
+        return _formatter
=20
     def colorize(filename, contents):
         # Don't want any errors.
         try:
             lexer = lexers.get_lexer_for_filename(filename, stripnl=False)
-            formatter = formatters.TerminalFormatter()
+            formatter = get_formatter()
             return highlight(contents, lexer, formatter).encode(
                 gdb.host_charset(), "backslashreplace"
             )
         except:
             return None
=20
+    class HandleNasmComments(TokenMergeFilter):
+        @staticmethod
+        def fix_comments(lexer, stream):
+            in_comment = False
+            for ttype, value in stream:
+                if ttype is Error and value == "#":
+                    in_comment = True
+                if in_comment:
+                    if ttype is Text and value == "\n":
+                        in_comment = False
+                    else:
+                        ttype = Comment.Single
+                yield ttype, value
+
+        def filter(self, lexer, stream):
+            f = HandleNasmComments.fix_comments
+            return super().filter(lexer, f(lexer, stream))
+
+    _asm_lexers = {}
+
+    def __get_asm_lexer(gdbarch):
+        lexer_type = "asm"
+        try:
+            # For an i386 based architecture, in 'intel' mode, use the nasm
+            # lexer.
+            flavor = gdb.parameter("disassembly-flavor")
+            if flavor == "intel" and gdbarch.name()[:4] == "i386":
+                lexer_type = "nasm"
+        except:
+            # If GDB is built without i386 support then attempting to fetch
+            # the 'disassembly-flavor' parameter will throw an error, which we
+            # ignore.
+            pass
+
+        global _asm_lexers
+        if lexer_type not in _asm_lexers:
+            _asm_lexers[lexer_type] = lexers.get_lexer_by_name(lexer_type)
+            _asm_lexers[lexer_type].add_filter(HandleNasmComments())
+            _asm_lexers[lexer_type].add_filter("raiseonerror")
+        return _asm_lexers[lexer_type]
+
     def colorize_disasm(content, gdbarch):
         # Don't want any errors.
         try:
-            lexer = lexers.get_lexer_by_name("asm")
-            formatter = formatters.TerminalFormatter()
+            lexer = __get_asm_lexer(gdbarch)
+            formatter = get_formatter()
             return highlight(content, lexer, formatter).rstrip().encode()
         except:
-            return None
+            return content
=20
 except: