From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf1-x42c.google.com (mail-pf1-x42c.google.com [IPv6:2607:f8b0:4864:20::42c]) by sourceware.org (Postfix) with ESMTPS id F036E3858D3C for ; Tue, 3 May 2022 15:47:59 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org F036E3858D3C Received: by mail-pf1-x42c.google.com with SMTP id p12so15089866pfn.0 for ; Tue, 03 May 2022 08:47:59 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=os7gCEwP4lb4+3JFMgxL64R5Au0xOb/03ZsySDk5a/s=; b=eByU2RojJ6dewy0ipSwUA7TKsHD8FBMUudkmq43klI3x5vsEJbsSYBqt8nqtEJgP2t HMTmUCuvCcviCVPQIfNdroZlNC0DeuWDQ7VsJMUAJdF/VFyGOhG21XS4OQaT6XAsbBxz SgNhkB/icxyC0fhAOoH11e+hsFNMRQknJu8IlEK+mvgnEZhDxsLJrfQzbw5JxakCARdP 9N/BLor5LP0nPZ7QD7cJu88SajCAsJh1KVp+aBd53cjt4EvGBNnCwKS5ehUvwCaxzdLu GOJdSOvzWaYGg3HLqSeITWQlmrA69KI9EQ4jD6Za0pVq/zWLPgboXbc+vCuslJiyUy4c Vdcg== X-Gm-Message-State: AOAM533/5trP3zA2mx65+7oyd1/0szlsSBDLkOg/eYpjPeEBV4nqQu7N m+ZLVsH/uMLkW0lHntePoaLtC+FZYNsElq7y7+4p38vl X-Google-Smtp-Source: ABdhPJywsAVeccG/pb0PP6qyHXygi8lC/Q63LpwHqckH3cpr5ipJulTOR3ccVEWgCmGOCDteH36UkOR4d9vV/x+ZKAE= X-Received: by 2002:a63:2114:0:b0:3c4:995c:344a with SMTP id h20-20020a632114000000b003c4995c344amr506167pgh.125.1651592878741; Tue, 03 May 2022 08:47:58 -0700 (PDT) MIME-Version: 1.0 References: <388c1dd1235a3c95aefc7caee5726b869b6894e0.1651239378.git.aburgess@redhat.com> <87zgjyn4k1.fsf@redhat.com> In-Reply-To: <87zgjyn4k1.fsf@redhat.com> From: "H.J. Lu" Date: Tue, 3 May 2022 08:47:22 -0700 Message-ID: Subject: Re: [PATCH 2/2] libopcodes: extend the styling within the i386 disassembler To: Andrew Burgess Cc: Jan Beulich , binutils@sourceware.org Content-Type: text/plain; charset="UTF-8" X-Spam-Status: No, score=-3020.3 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: binutils@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Binutils mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 03 May 2022 15:48:02 -0000 On Tue, May 3, 2022 at 6:14 AM Andrew Burgess via Binutils wrote: > > Jan Beulich via Binutils writes: > > > On 29.04.2022 15:42, Andrew Burgess via Binutils wrote: > >> The i386 disassembler is pretty complex. Most disassembly is done > >> indirectly; operands are built into buffers within a struct instr_info > >> instance, before finally being printed later in the disassembly > >> process. > >> > >> Sometimes the operand buffers are built in a different order to the > >> order in which they will eventually be printed. > >> > >> Each operand can contain multiple components, e.g. multiple registers, > >> immediates, other textual elements (commas, brackets, etc). > >> > >> When looking for how to apply styling I guess the ideal solution would > >> be to move away from the operands being a single string that is built > >> up, and instead have each operand be a list of "parts", where each > >> part is some text and a style. Then, when we eventually print the > >> operand we would loop over the parts and print each part with the > >> correct style. > >> > >> But it feels like a huge amount of work to move from where we are > >> now to that potentially ideal solution. Plus, the above solution > >> would be pretty complex. > >> > >> So, instead I propose a .... different solution here, one that works > >> with the existing infrastructure. > >> > >> As each operand is built up, piece be piece, we pass through style > >> information. This style information is then encoded into the operand > >> buffer (see below for details). After this the code can continue to > >> operate as it does right now in order to manage the set of operand > >> buffers. > >> > >> Then, as each operand is printed we can split the operand buffer into > >> chunks at the style marker boundaries, with each chunk being printed > >> in the correct style. > >> > >> For encoding the style information I use the format "~%x~". As far as > >> I can tell the '~' is not otherwise used in the i386 disassembler, so > >> this should serve as a unique marker. To speed up writing and then > >> reading the style markers, I take advantage of the fact that there are > >> less than 16 styles so I know the '%x' will only ever be a single hex > >> character. > > > > Like H.J. I'd like to ask that you avoid ~ here (I actually have plans > > to use it to make at least some 64-bit constants better recognizable); > > I'm not sure about using non-ASCII though, as that may cause issues with > > compilers treating non-ASCII wrong. I'd soften this to non-alnum, non- > > operator characters (perhaps more generally non-printable). Otoh I guess > > about _any_ character could be used in symbol names, so I'm not > > convinced such an escaping model can be generally conflict free. > > Hi Jan, > > I've addressed all the simple feedback from H.J. and Vladimir, and I > just need to figure out something for the escaping mechanism. > > I'm still keen to try and go with an escaping based solution, my > reasoning is that I think that this is the solution least likely to > introduce latent disassembler bugs. > > However, that position is based on my belief that there's no exhaustive > test for the i386 based disassembler, i.e. one that tests every single > valid instruction disassembles correctly. If there was such a test then > I might be more tempted to try something more radical... > > That said, if I was going to stick with an escaping scheme, then I have > some ideas for moving forward. > > The current scheme relies on the fact that symbols are not printed > directly from the i386 disassembler, instead the i386 disassembler calls > back into the driver application (objdump, gdb) to print the symbol. As > a result, symbols don't go through the instr_info::obuf buffer. This > means that we never try to interpret a symbol name for escape > characters. > > This means we avoid one of the issues that you raised, what if the > escape character appears in a symbol name; the answer is, I just don't > need to worry about this! > > So, I only need to ensure that the escape character is: > > (a) not a character that the disassembler currently tries to directly > print itself, and > > (b) not something that will ever be printed as part of an immediate. > > Clearly my choice passes both right now, but looks like it will not pass > (b) forever. > > One possible solution would be to replace all the remaining places where > we directly write to instr_info::obuf with calls to oappend_char. I > could then extend the oappend API such that we do "real" escaping, that > is (assuming the continued use of '~' for now): '~X' would indicate a > style marker, with X being the style number, and '~~' would indicate a > literal '~' character. In this was we really wouldn't care which > character we used (though we'd probably pick one that didn't crop up too > ofter just for ease of parsing the buffers). > > An alternative solution would be to pick a non-printable character, > e.g. \001, and use this as the escape character in place of the current > '~'. This seems to pass the (a) and (b) tests above, and if such a > character does ever appear in a symbol name, then, as I've said above, I > don't believe this would cause us any problems. I like \001. We can always change it later. Let's wait for input from Jan. > Here's a session that demonstrates a symbol containing '~' with the > current patch (obviously the final disassembler call actually has > colour in the output, which all looks correct to me): > > $ cat /tmp/weird.s > .text > .global "foo~bar" > "foo~bar": > nop > nop > nop > call "foo~bar" > $ as -o /tmp/weird.o /tmp/weird.s > $ ./binutils/objdump --disassembler-color=extended-color -d /tmp/weird.o > > /tmp/weird.o: file format elf64-x86-64 > > > Disassembly of section .text: > > 0000000000000000 : > 0: 90 nop > 1: 90 nop > 2: 90 nop > 3: e8 00 00 00 00 call 8 > > > Thanks, > Andrew > -- H.J.