From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by sourceware.org (Postfix) with ESMTPS id C50E03858C51 for ; Wed, 18 May 2022 10:41:52 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org C50E03858C51 Received: from mail-wr1-f69.google.com (mail-wr1-f69.google.com [209.85.221.69]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-647-RWPyHktjN8yE6Ujm9xehzQ-1; Wed, 18 May 2022 06:41:51 -0400 X-MC-Unique: RWPyHktjN8yE6Ujm9xehzQ-1 Received: by mail-wr1-f69.google.com with SMTP id s16-20020adfeb10000000b0020cc4e5e683so464821wrn.6 for ; Wed, 18 May 2022 03:41:51 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:in-reply-to:references:date :message-id:mime-version; bh=UWnpf4m5xBwJWuu2FtmsCvh21uKCqw6sLZYR4BUDx1c=; b=kN8Jzv/ChbLSyQ+cJ02sW5RLkA9vbHhTXjzM+fdbovSE8fIKcyTPJpI8vN1FhVaQjl muQrFKsDp0zYns7yGl1TF8bwfhDJ7/YQu8g5qPk19uvJWqfnfSDrnDr7yTUnkYNQEozD 1K3LjOpZhxIKt9hDgTBoTQNgK94vtQuEileDlFMjEsOkzlQ1MmYN/ZofdBV8oHD1SxWM 6bn2f4R7PpUdOfGwz4vDDZVBQA9AmIVP4hybMMhlRZp/lLjTSyVwSzuoBAX60QnBNmNP NgolcQE91nq51I9x7SLk2J+99hvVXGxK2I373Ck8k3JtrcPLLaM/R3lMmOEpNzREQO0m +i1A== X-Gm-Message-State: AOAM533rbx198ZSGmEYx6Zod5JC3sX26P3WquwECicIlkuMyZK9Zlq0S kuK596X9OkqBqLwu1iwFFYj+EQOjj4+ccCR7anvMq8UwIb+rOqt3YQExGRD4VsN9moOLYsyhifx fLxTrJSMWOXIgfGcj4A== X-Received: by 2002:a05:600c:1986:b0:394:867f:984c with SMTP id t6-20020a05600c198600b00394867f984cmr35895687wmq.20.1652870510083; Wed, 18 May 2022 03:41:50 -0700 (PDT) X-Google-Smtp-Source: ABdhPJwGxZ7wZQHXEiBT0vCIlHrvwNNlDeoWpVSMMDyXrLtSCPQyqc9BnIp8fIr/OW/MCo9IIvb25w== X-Received: by 2002:a05:600c:1986:b0:394:867f:984c with SMTP id t6-20020a05600c198600b00394867f984cmr35895667wmq.20.1652870509766; Wed, 18 May 2022 03:41:49 -0700 (PDT) Received: from localhost (host81-136-113-48.range81-136.btcentralplus.com. [81.136.113.48]) by smtp.gmail.com with ESMTPSA id c22-20020a05600c0a5600b00396fbf6f524sm4102686wmq.1.2022.05.18.03.41.48 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 18 May 2022 03:41:49 -0700 (PDT) From: Andrew Burgess To: Jan Beulich Cc: binutils@sourceware.org Subject: Re: [PATCH 2/2] libopcodes: extend the styling within the i386 disassembler In-Reply-To: References: <388c1dd1235a3c95aefc7caee5726b869b6894e0.1651239378.git.aburgess@redhat.com> <87zgjyn4k1.fsf@redhat.com> <559aaea5-bf16-f48e-fc0e-e750a2795b99@suse.com> <87tu9zkpdy.fsf@redhat.com> Date: Wed, 18 May 2022 11:41:48 +0100 Message-ID: <871qwrjf6b.fsf@redhat.com> MIME-Version: 1.0 X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain X-Spam-Status: No, score=-6.3 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, RCVD_IN_DNSWL_LOW, SPF_HELO_NONE, SPF_NONE, TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org X-BeenThere: binutils@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Binutils mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 18 May 2022 10:41:54 -0000 Jan Beulich via Binutils writes: > On 09.05.2022 11:48, Andrew Burgess wrote: >> Jan Beulich via Binutils writes: >>> On 03.05.2022 15:12, Andrew Burgess wrote: >>>> I'm still keen to try and go with an escaping based solution, my >>>> reasoning is that I think that this is the solution least likely to >>>> introduce latent disassembler bugs. >>>> >>>> However, that position is based on my belief that there's no exhaustive >>>> test for the i386 based disassembler, i.e. one that tests every single >>>> valid instruction disassembles correctly. If there was such a test then >>>> I might be more tempted to try something more radical... >>>> >>>> That said, if I was going to stick with an escaping scheme, then I have >>>> some ideas for moving forward. >>>> >>>> The current scheme relies on the fact that symbols are not printed >>>> directly from the i386 disassembler, instead the i386 disassembler calls >>>> back into the driver application (objdump, gdb) to print the symbol. As >>>> a result, symbols don't go through the instr_info::obuf buffer. This >>>> means that we never try to interpret a symbol name for escape >>>> characters. >>> >>> Hmm, indeed. I have to admit that I view it as a significant shortcoming >>> of the disassembler that it doesn't resolve addresses in the output. So >>> I'd like to at least not see the road being closed towards improving this. >>> >>>> This means we avoid one of the issues that you raised, what if the >>>> escape character appears in a symbol name; the answer is, I just don't >>>> need to worry about this! >>>> >>>> So, I only need to ensure that the escape character is: >>>> >>>> (a) not a character that the disassembler currently tries to directly >>>> print itself, and >>>> >>>> (b) not something that will ever be printed as part of an immediate. >>> >>> Or, more generally, as part of any kind of operand. >> >> Sure, but the reason I single out immedates here is I think these are >> the only operand whose content is not statically know within the >> disassembler. >> >> For example, register operands, every possible register operand value is >> enumerated within the i386-dis.c source file, right? So when I proposed >> using '~' I could simply search the source file, find no uses, and know >> that character is not (currently) used within a register name. > > Indeed. Yet present state is only part of it. See the uses of { and } > that AVX512 has added. Prior to that one could have thought these > characters could easily be used for some special purpose (like your > escaping), too. Hence my pointing out of possible future uses of ~, > with the more general implication that all printable characters would > better be avoided. But you've switched to \002 already anyway afaics. > >> Immediates are different though, for them we rely on libc to generate >> the textual representation. > > Yet even then we know the set of characters libc might use. > >> The only other operand type that might contain "unknown" characters >> would be a field that contains an address and potentially a symbol name, >> but as was already discussed, these are not printed through the >> disassembler. > > Hmm, yes. This behavior is so extremely counterintuitive to me that > I keep forgetting. Not the least because in many cases a symbol > name isn't printed at all even when one could be known. So yes, if > ->print_address_func() doesn't look for escapes, then indeed all > should be fine right now. > >> My question then, other than the exceptions I've already listed, are >> there other types of operand where the content doesn't already exit >> within i386-dis.c? > > I don't think there is right now. > >>>> Clearly my choice passes both right now, but looks like it will not pass >>>> (b) forever. >>>> >>>> One possible solution would be to replace all the remaining places where >>>> we directly write to instr_info::obuf with calls to oappend_char. >>> >>> I guess this might be troublesome. The way the disassembler works is a >>> little quirky here and there, and hence one needs to play tricks every >>> now and then to half-way reasonably deal with certain special cases. >>> >>>> I >>>> could then extend the oappend API such that we do "real" escaping, that >>>> is (assuming the continued use of '~' for now): '~X' would indicate a >>>> style marker, with X being the style number, and '~~' would indicate a >>>> literal '~' character. In this was we really wouldn't care which >>>> character we used (though we'd probably pick one that didn't crop up too >>>> ofter just for ease of parsing the buffers). >>>> >>>> An alternative solution would be to pick a non-printable character, >>>> e.g. \001, and use this as the escape character in place of the current >>>> '~'. This seems to pass the (a) and (b) tests above, and if such a >>>> character does ever appear in a symbol name, then, as I've said above, I >>>> don't believe this would cause us any problems. >>> >>> I suppose \001 (or a character very close to this, as iirc \001 has >>> some meaning internally in gas, and I'm not entirely certain none of >>> these uses can ever "escape" gas) is good to start with. Provided it >>> is properly abstracted so it can, if necessary, be _very_ easily >>> changed (by modifying exactly one line, or - if you need both a >>> single-quoted and a double-quoted instance - two adjacent ones). >>> >>> Albeit, thinking of this last aspect, maybe it would be better to >>> only have a double-quoted instance in the first place, and allow >>> for the escape to be more than a single character if need be ... >>> >>> And yes - if a symbol name was possible to hit and if that symbol >>> name contained such an escape sequence, aiui the worst that would >>> happen is bogus coloring? IOW the escape would not be looked for and >>> replaced / processed when coloring is disabled? >> >> Unfortunately this is not correct. The disassembler always sends >> styling information to the user (objdump, gdb, etc), its the user that >> decides if the output should be styled or not. >> >> What this means is that if the disassembler encountered a random symbol >> (which would be a pretty big change to the disassembler), and the symbol >> did include something like ~a~ (using the current character to make it >> more readable here), then the whole '~a~' part would disappear from the >> symbol name, this would be seen as a style marker, the next up to the >> start of '~a~' sould take the previous style, and the text after '~a~' >> would take the '0xa' style, but the '~a~' itself would always be >> stripped out. >> >> One relatively easy solution here would be to say that, when we add the >> ability to include symbol names in the disassembler output buffers, at >> that point we can add "true" escaping. So if your symbol name is >> 'foo~a~bar' then as this is added to the disassebmler buffer we would >> actually add 'foo~~a~~bar', and we'd extend the code that parses out >> styling information so that it could handle this case. This feels like >> it should be easy enough to do. >> >> All we then have to do is convince ourselves that there's no way for the >> escape character to make it into the disassembler output from any other >> source, and we should be fine. >> >> For example, your concern about \001 escaping from gas. Other than >> within a symbol name, how might the disassembler end up trying to print >> this byte? > > As per above, I was wrong, simply because I find the disassembler behavior > here rather bogus. Jan, Thanks for your feedback, this all sounds really positive now. Is there anything else you'd like me to change with this patch before it can be merged? Thanks, Andrew