From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <aburgess@redhat.com>
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.129.124])
 by sourceware.org (Postfix) with ESMTPS id 00CB7382F090
 for <gdb-patches@sourceware.org>; Tue, 30 Aug 2022 14:17:23 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 00CB7382F090
Received: from mail-wr1-f71.google.com (mail-wr1-f71.google.com
 [209.85.221.71]) by relay.mimecast.com with ESMTP with STARTTLS
 (version=TLSv1.3, cipher=TLS_AES_128_GCM_SHA256) id
 us-mta-277-Tr6ed7b1OauOQw05yme55g-1; Tue, 30 Aug 2022 10:17:10 -0400
X-MC-Unique: Tr6ed7b1OauOQw05yme55g-1
Received: by mail-wr1-f71.google.com with SMTP id
 g5-20020adfbc85000000b00226badbd7d9so1732464wrh.10
 for <gdb-patches@sourceware.org>; Tue, 30 Aug 2022 07:17:06 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20210112;
 h=content-transfer-encoding:mime-version:references:in-reply-to
 :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc;
 bh=ZdGOAD6C5CEQuLTnx9/FkIPyyZonNmnNMcjMIL/pBAU=;
 b=0kKXg8b/Ap05eMjEM/3grqFiOlJQz1HQU+kLioviCeSlJnJEYeNr60Nxm9U30HlTof
 b07TX1bHrc21JjRdXyv8cOHnmVMN805xaUhWqa4QSPl9SOuJCN1xjQWFJ2kx8DLOAGbq
 oLL2H3VUW2JAJqNawRVeFR6hpI0JuSTLf8ypJsheWM2vLMdh/1VQhr6YteUKnsuodjXr
 fkh9Uo7jr572SXM0TK/T/N3di2nH2+SIXkmkTMiH4z0Z8a3hxGIR8kY1FRENOA044Ttx
 fMOKHdgsUUONqF1Pjk6earqTUcHOLrPZa3DvgLD8jioffWA8o9sVdNyKiGoHF6dk5c/0
 Yktg==
X-Gm-Message-State: ACgBeo3t2dnddEw4lFWBc+bEmcUhkTW3esupyVrokUpq7KCtB9dItpNN
 h4i9e7cYk/c8VQiarP194vAmnO9iZPY0MMToTbpGx2DV1Te0WDxstLqNz3e6xaX6O2rUicBC46Z
 moExr9mqbiINvRocajPw+2E6X/UavzEhcH67GCWMNeD/jgRQ92LYavZcQA8rCyUr5GT9ZgYrZ+g
 ==
X-Received: by 2002:a05:600c:a4c:b0:39c:34d0:fd25 with SMTP id
 c12-20020a05600c0a4c00b0039c34d0fd25mr10058311wmq.172.1661869023840; 
 Tue, 30 Aug 2022 07:17:03 -0700 (PDT)
X-Google-Smtp-Source: AA6agR4yr00K+0tafmrjKh+zq+6xS3QgOaOWbTJbBYPS+HjSxmMRDkum1IAWfI1XdvbyjHHMMFGFwg==
X-Received: by 2002:a05:600c:a4c:b0:39c:34d0:fd25 with SMTP id
 c12-20020a05600c0a4c00b0039c34d0fd25mr10058280wmq.172.1661869023356; 
 Tue, 30 Aug 2022 07:17:03 -0700 (PDT)
Received: from localhost ([31.111.84.229]) by smtp.gmail.com with ESMTPSA id
 bn5-20020a056000060500b00226d217c3e6sm9994286wrb.64.2022.08.30.07.17.02
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Tue, 30 Aug 2022 07:17:03 -0700 (PDT)
From: Andrew Burgess <aburgess@redhat.com>
To: gdb-patches@sourceware.org
Subject: [PATCH 3/3] gdb/disasm: better intel flavour disassembly styling with
 Pygments
Date: Tue, 30 Aug 2022 15:16:53 +0100
Message-Id: <d7e10bbc7fd9e0264038056ef2034d838af6b36f.1661868744.git.aburgess@redhat.com>
X-Mailer: git-send-email 2.25.4
In-Reply-To: <cover.1661868744.git.aburgess@redhat.com>
References: <cover.1661868744.git.aburgess@redhat.com>
MIME-Version: 1.0
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset="US-ASCII"; x-default=true
X-Spam-Status: No, score=-12.2 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH,
 DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0,
 RCVD_IN_DNSWL_LOW, SPF_HELO_NONE, SPF_NONE, TXREP,
 T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gdb-patches@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gdb-patches mailing list <gdb-patches.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/gdb-patches>,
 <mailto:gdb-patches-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/gdb-patches/>
List-Post: <mailto:gdb-patches@sourceware.org>
List-Help: <mailto:gdb-patches-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/gdb-patches>,
 <mailto:gdb-patches-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Tue, 30 Aug 2022 14:17:28 -0000

This commit was inspired by this stackoverflow post:

  https://stackoverflow.com/questions/73491793/why-is-there-a-%C2%B1-in-lea-rax-rip-%C2%B1-0xeb3

One of the comments helpfully links to this Python test case:

  from pygments import formatters, lexers, highlight

  def colorize_disasm(content, gdbarch):
      try:
          lexer = lexers.get_lexer_by_name("asm")
          formatter = formatters.TerminalFormatter()
          return highlight(content, lexer, formatter).rstrip().encode()
      except:
          return None

  print(colorize_disasm("lea [rip+0x211]  # COMMENT", None).decode())

Run the test case and you should see that the '+' character is
underlined, and could be confused with a combined +/- symbol.

What's happening is that Pygments is failing to parse the input text,
and the '+' is actually being marked in the error style.  The error
style is red and underlined.

It is worth noting that the assembly instruction being disassembled
here is an x86-64 instruction in the 'intel' disassembly style, rather
than the default att style.  Clearly the Pygments module expects the
att syntax by default.

If we change the test case to this:

  from pygments import formatters, lexers, highlight

  def colorize_disasm(content, gdbarch):
      try:
          lexer = lexers.get_lexer_by_name("asm")
          lexer.add_filter('raiseonerror')
          formatter = formatters.TerminalFormatter()
          return highlight(content, lexer, formatter).rstrip().encode()
      except:
          return None

  res = colorize_disasm("lea rax,[rip+0xeb3] # COMMENT", None)
  if res:
      print(res.decode())
  else:
      print("No result!")

Here I've added the call: lexer.add_filter('raiseonerror'), and I am
now checking to see if the result is None or not.  Running this and
the test now print 'No result!' - instead of styling the '+' in the
error style, we instead give up on the styling attempt.

There are two things we need to fix relating to this disassembly
text.  First, Pygments is expecting att style disassembly, not the
intel style that this example uses.  Fortunately, Pygments also
supports the intel style, all we need to do is use the 'nasm' lexer
instead of the 'asm' lexer.

However, this leads to the second problem; in our disassembler line we
have '# COMMENT'.  The "official" Intel disassembler style uses ';'
for its comment character, however, gas and libopcodes use '#' as the
comment character, as gas uses ';' for an instruction separator.

Unfortunately, Pygments expects ';' as the comment character, and
treats '#' as an error, which means, with the addition of the
'raiseonerror' filter, that any line containing a '#' comment, will
not get styled correctly.

However, as the i386 disassembler never produces a '#' character other
than for comments, we can easily "fix" Pygments parsing of the
disassembly line.  This is done by creating a filter.  This filter
looks for an Error token with the value '#', we then change this into
a comment token.  Every token after this (until the end of the line)
is also converted into a comment.

In this commit I do the following:

  1. Check the 'disassembly-flavor' setting and select between the
  'asm' and 'nasm' lexers based on the setting.  If the setting is not
  available then the 'asm' lexer is used by default,

  2. Use "add_filter('raiseonerror')" to ensure that the formatted
  output will not include any error text, which would be underlined,
  and might be confusing,

  3. If the 'nasm' lexer is selected, then add an additional filter
  that will format '#' and all other text on the line, as a comment,
  and

  4. If Pygments throws an exception, instead of returning None,
  return the original, unmodified content.  This will mean that this
  one instruction is printed without styling, but GDB will continue to
  call into the Python code to style later instructions.

I haven't included a test specifically for the above error case,
though I have manually check that the above case now styles
correctly (with no underline).  The existing style tests check that
the disassembler styling still works though, so I know I've not
generally broken things.

One final thought I have after looking at this issue is that I wonder
now if using Pygments for styling disassembly from every architecture
is actually a good idea?

Clearly, the 'asm' lexer is OK with att style x86-64, but not OK with
intel style x86-64, so who knows how well it will handle other random
architectures?

When I first added this feature I tested it against some random
RISC-V, ARM, and X86-64 (att style) code, and it seemed fine, but I
never tried to make an exhaustive check of all instructions, so its
quite possible that there are corner cases where things are styled
incorrectly.

With the above changes I think that things should be a bit better
now.  If a particular instruction doesn't parse correctly then our
Pygments based styling code will just not style that one instruction.
This is combined with the fact that many architectures are now moving
to libopcodes based styling, which is much more reliable.

So, I think it is fine to keep using Pygments as a fallback mechanism
for styling all architectures, even if we know it might not be perfect
in all cases.
---
 gdb/python/lib/gdb/styling.py | 59 ++++++++++++++++++++++++++++++++---
 1 file changed, 55 insertions(+), 4 deletions(-)

diff --git a/gdb/python/lib/gdb/styling.py b/gdb/python/lib/gdb/styling.py
index aef39c6857c..b97f1dd7fb8 100644
--- a/gdb/python/lib/gdb/styling.py
+++ b/gdb/python/lib/gdb/styling.py
@@ -20,26 +20,77 @@ import gdb
 
 try:
     from pygments import formatters, lexers, highlight
+    from pygments.token import Error, Comment, Text
+    from pygments.filters import TokenMergeFilter
+
+    _formatter = None
+
+    def get_formatter():
+        global _formatter
+        if _formatter is None:
+            _formatter = formatters.TerminalFormatter()
+        return _formatter
 
     def colorize(filename, contents):
         # Don't want any errors.
         try:
             lexer = lexers.get_lexer_for_filename(filename, stripnl=False)
-            formatter = formatters.TerminalFormatter()
+            formatter = get_formatter()
             return highlight(contents, lexer, formatter).encode(
                 gdb.host_charset(), "backslashreplace"
             )
         except:
             return None
 
+    class HandleNasmComments(TokenMergeFilter):
+        @staticmethod
+        def fix_comments(lexer, stream):
+            in_comment = False
+            for ttype, value in stream:
+                if ttype is Error and value == "#":
+                    in_comment = True
+                if in_comment:
+                    if ttype is Text and value == "\n":
+                        in_comment = False
+                    else:
+                        ttype = Comment.Single
+                yield ttype, value
+
+        def filter(self, lexer, stream):
+            f = HandleNasmComments.fix_comments
+            return super().filter(lexer, f(lexer, stream))
+
+    _asm_lexers = {}
+
+    def __get_asm_lexer(gdbarch):
+        lexer_type = "asm"
+        try:
+            # For an i386 based architecture, in 'intel' mode, use the nasm
+            # lexer.
+            flavor = gdb.parameter("disassembly-flavor")
+            if flavor == "intel" and gdbarch.name()[:4] == "i386":
+                lexer_type = "nasm"
+        except:
+            # If GDB is built without i386 support then attempting to fetch
+            # the 'disassembly-flavor' parameter will throw an error, which we
+            # ignore.
+            pass
+
+        global _asm_lexers
+        if lexer_type not in _asm_lexers:
+            _asm_lexers[lexer_type] = lexers.get_lexer_by_name(lexer_type)
+            _asm_lexers[lexer_type].add_filter(HandleNasmComments())
+            _asm_lexers[lexer_type].add_filter("raiseonerror")
+        return _asm_lexers[lexer_type]
+
     def colorize_disasm(content, gdbarch):
         # Don't want any errors.
         try:
-            lexer = lexers.get_lexer_by_name("asm")
-            formatter = formatters.TerminalFormatter()
+            lexer = __get_asm_lexer(gdbarch)
+            formatter = get_formatter()
             return highlight(content, lexer, formatter).rstrip().encode()
         except:
-            return None
+            return content
 
 except:
 
-- 
2.25.4