public inbox for binutils@sourceware.org
From: LIU Hao <lh_mouse@126.com>
To: Jan Beulich <jbeulich@suse.com>
Cc: binutils@sourceware.org, GCC Development <gcc@gcc.gnu.org>
Subject: Re: RFC: Formalization of the Intel assembly syntax (PR53929)
Date: Fri, 19 Jan 2024 00:40:51 +0800
Message-ID: <40ae7cb2-c094-4594-859d-470e7a7fce49@126.com>
In-Reply-To: <95e373fb-24f3-4b10-9ad1-948597ed9d67@suse.com>


On 2024-01-18 20:54, Jan Beulich wrote:
> I'm sorry, but most of your proposal may even be considered for being
> acceptable only if you would gain buy-off from the MASM guys. Anything
> MASM treats as valid ought to be permitted by gas as well (within the
> scope of certain divergence that cannot be changed in gas without
> risking to break people's code). It could probably be considered to
> introduce a "strict" mode of Intel syntax, following some / most of
> what you propose; making this the default cannot be an option.

Thanks for your reply.

I have attached the Markdown source for that page, modified a few hours ago. I am planning to make 
some updates according to your advice tomorrow.

And yes, I am proposing a 'strict' mode, but only for compiler output, not for hand-written code.

My first message references a GCC bug report, where the problematic symbol `bx` comes from C source.
I am aware of the `/APP` and `/NO_APP` markers in generated assembly, so I suspect that GAS should
be able to tell which parts are generated by a compiler and which parts are written by hand. The
proposed strict mode may apply only to the output from GCC, which is much more likely to contain bad
symbols, but is also more controllable on the GCC side.
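
As a rough illustration of that idea, the following Python sketch tags each line of a listing as
compiler output or inline assembly by tracking the markers. The function name `split_regions`, the
sample listing, and the choice to accept both the `/APP` and `#APP` spellings are assumptions made
for this example; it does not describe how GAS itself treats the markers.

```python
# Markers that GCC places around inline assembly. Accepting both comment
# spellings is an assumption made for this sketch.
APP_ON = {"/APP", "#APP"}
APP_OFF = {"/NO_APP", "#NO_APP"}


def split_regions(lines):
    """Yield (origin, line) pairs, where origin is "compiler" or "inline"."""
    origin = "compiler"
    for line in lines:
        marker = line.strip()
        if marker in APP_ON:
            origin = "inline"      # hand-written code starts here
        elif marker in APP_OFF:
            origin = "compiler"    # back to compiler-generated code
        else:
            yield origin, line


# Hypothetical listing, not real GCC output.
SAMPLE = """\
    mov eax, DWORD PTR rdx[rip]
/APP
    nop
/NO_APP
    ret
"""

if __name__ == "__main__":
    for origin, line in split_regions(SAMPLE.splitlines()):
        print(f"{origin:8} | {line}")
```

A strict parser would then only need to be applied to the lines tagged `compiler`.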

I believe that people skilled in writing x86 assembly already know that `offset`, `shr`, `si` etc.
are 'bad' names for symbols. Therefore, this is unlikely to be an issue in hand-written code.


> Commenting on individual aspects of your proposal is a little difficult,
> as you didn't provide the proposal inline (and hence it cannot be easily
> used as context in a reply). But to mention the imo worst aspect:
> Declaring
> 
> 	mov	eax, [rcx]
> 
> as invalid is a no-go.

I agree. I am considering treating the absence of a symbol as a special case.


> I also don't see how this would be related to the
> issue at hand. What's in the square brackets may as well be a symbol
> name, so requiring the "mode specifier" doesn't disambiguate things at
> all.

If someone declares a variable called `rcx` in C, it has to be translated to

    mov eax, DWORD PTR rcx      # `movl rcx, %eax`

instead of

    mov eax, DWORD PTR [rcx]    # `movl (%rcx), %eax`


> One remark regarding the underlying pattern leading to the issue:
> Personally I view it as questionable practice to have extern or static
> variables in C code with names as short as register names are. Avoiding
> them does not only avoid the issue here, but also is quite likely going
> to improve the code (by having more descriptive variable names). And
> automatic variables aren't affected aiui, so can remain short (after
> all, commonly automatic variable names are as short as a single char).

Yes, we agree that longer, more descriptive names increase maintainability.

However, there are scenarios where maintainability doesn't matter much. For instance, test cases
(sometimes machine-generated ones) are usually short programs created to expose issues in something
else, and they are likely to contain variables with very short names. The register names `si` and
`es` look especially risky to me.
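
A test-case generator could guard against such names with a check along these lines. This is only a
sketch: `RESERVED_NAMES` is a hypothetical and deliberately incomplete list built from names
mentioned in this thread plus a few obvious relatives, and the case-insensitive comparison is an
assumption about how the assembler matches them.

```python
# Deliberately incomplete, illustrative list of names that clash with
# registers or operators in Intel-syntax assembly.
RESERVED_NAMES = {
    "ax", "bx", "cx", "dx", "si", "di",        # 16-bit registers
    "cs", "ds", "es", "fs", "gs", "ss",        # segment registers
    "rax", "rcx", "rdx", "rbx", "rsi", "rdi",  # 64-bit registers
    "offset", "shr", "shl", "ptr",             # operators and keywords
}


def is_risky_symbol(name: str) -> bool:
    """Return True if a C identifier would be ambiguous as a bare symbol."""
    # Assumption: the assembler matches these names case-insensitively.
    return name.lower() in RESERVED_NAMES


if __name__ == "__main__":
    for candidate in ("si", "es", "rdx", "counter"):
        verdict = "risky" if is_risky_symbol(candidate) else "fine"
        print(f"{candidate}: {verdict}")
```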


> That said, I can certainly also see how the introduction of new
> registers can lead to new conflicts, which isn't nice. Iirc old 32-bit
> MASM escaped this problem by requiring architecture extensions to be
> explicitly enabled (may have changed in newer MASM). Gas, otoh, enables
> everything by default (and I don't see how we could change that).

I confess that I haven't investigated these compilers much, and everything below is my own
presumption.


Given this C source:

    extern int rdx;
    int get_value() { return rdx;  }


Compiling it directly to an object file with MSVC, Clang and GCC gives:

    > cl /nologo /c test.c && echo Success
    test.c
    Success

    > clang -masm=intel -c test.c && echo Success
    Success

    > gcc -masm=intel -c test.c && echo Success
    C:\Users\lh_mouse\AppData\Local\Temp\ccjcy1Qj.s: Assembler messages:
    C:\Users\lh_mouse\AppData\Local\Temp\ccjcy1Qj.s:23: Error: invalid use of register
    C:\Users\lh_mouse\AppData\Local\Temp\ccjcy1Qj.s:23: Warning: register value used as expression


Compiling it to assembly first, then assembling the result into an object file, gives:

    > cl /nologo /c test.c /Fatest.asm && ml64 /nologo /c test.asm && echo success
    test.c
     Assembling: test.asm
    test.asm(9) : error A2008:syntax error : rdx
    test.asm(15) : error A2032:invalid use of register

    > clang -masm=intel -S test.c -o test.s && clang -masm=intel test.s && echo Success
    test.s:26:8: error: expected relocatable expression
            .quad   rdx
                    ^

    > gcc -masm=intel -S test.c -o test.s && gcc -masm=intel test.s && echo Success
    test.s: Assembler messages:
    test.s:23: Error: invalid use of register
    test.s:23: Warning: register value used as expression


It looks to me as if both MSVC and Clang have integrated assemblers, so their output never passes
through textual assembly on its way to object code. That approach is not subject to the ambiguity.

As GCC still relies on GAS to produce object files, it might make sense (as stated in the first
paragraph) to implement a strict mode for the output from GCC to resolve the potential ambiguity,
while still providing a permissive mode for inline or handwritten assembly.


-- 
Best regards,
LIU Hao


[-- Attachment #1.1.2: Formalized-Intel-Syntax-for-x86.md.txt --]

# The Motivation

The assembly language for x86 and x86-64 comes in two major syntax variants: the _Microsoft assembler (MASM) syntax_ and the _GNU assembler (GAS) syntax_. The MASM syntax, also known as the _Intel syntax_, is the one used in the Intel Software Developer's Manual, and is used extensively by many non-GNU tools. The GNU syntax, also known as the _AT&T syntax_, derives from the PDP-11 assembly language that was used to create Unix, and is the default and dominant syntax in the Unix-like world.

The advantages of the MASM syntax are:
1. It looks more modern, closer to many other assembly languages, such as ARM, MIPS and RISC-V.
2. It is the syntax in Intel and AMD documentation.

The disadvantages of the MASM syntax are:
1. MASM is proprietary software.
2. The syntax has not been formally defined, and is sometimes ambiguous.

For instance, the Intel Software Developer Manual contains this line:
```asm
MOV EBX, RAM_START
```

This is ambiguous in two ways. First, it could be interpreted as either of
```asm
MOV EBX, OFFSET RAM_START         ; `movl $RAM_START, %ebx`
MOV EBX, DWORD PTR [RAM_START]    ; `movl RAM_START, %ebx`
```

Second, on x86-64 the address might be RIP-relative or absolute, as in
```asm
MOV EBX, DWORD PTR [RAM_START]
          ; x86    absolute       ; 8B 1D    RAM_START   ; `movl RAM_START, %ebx`
          ; x86-64 RIP-relative   ; 8B 1D    RAM_START   ; `movl RAM_START(%rip), %ebx`
          ; x86-64 absolute       ; 8B 1C 25 RAM_START   ; `movl RAM_START, %ebx`
```

The first issue here is solved by interpreting it as a memory reference, but the ambiguity may still arise if the symbol results from a high-level language, such as C. When targeting x86, the Microsoft compiler decorates C identifiers: external names that denote objects or functions with the `__cdecl` or `__stdcall` calling convention are prefixed with an underscore `_`; external names that denote functions with the `__fastcall` calling convention are prefixed with an at sign `@`; and external names that denote functions with the `__vectorcall` calling convention are suffixed with `@@` and the size of their arguments. This technique prevents symbols from conflicting with keywords in assembly.

But this is no longer the case for x86-64 (nor for ARM and ARM64), where names are not decorated. If a user declares an external variable with the name `RSI`, the compiler may generate the ambiguous and incorrect
```asm
MOV EAX, DWORD PTR [RSI]    ; parsed as `movl (%rsi), %eax`
                            ; should have been `movl rsi, %eax`
```
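
The effect of decoration can be sketched in a few lines of Python. The function `decorate` and its parameters are hypothetical names chosen for this illustration, and the `@N` argument-size suffixes for `__stdcall` and `__fastcall` are details of the Microsoft scheme that the text above does not spell out.

```python
def decorate(name, target="x86", convention="cdecl", arg_bytes=0):
    """Return the assembly-level symbol for a C identifier (sketch only)."""
    if target != "x86":
        # x86-64, ARM and ARM64: the C name is used unchanged, so it can
        # collide with register names and other keywords.
        return name
    if convention == "cdecl":
        return f"_{name}"                # also used for objects
    if convention == "stdcall":
        return f"_{name}@{arg_bytes}"
    if convention == "fastcall":
        return f"@{name}@{arg_bytes}"
    raise ValueError(f"unsupported convention: {convention!r}")


if __name__ == "__main__":
    print(decorate("rsi", target="x86"))     # decorated, no longer a register name
    print(decorate("rsi", target="x86-64"))  # undecorated, clashes with the register
```

On x86 the decorated symbol can never be mistaken for a register, whereas on x86-64 the assembler sees the bare name.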

This RFC proposes formalizing the Intel syntax by disallowing certain constructions, in order to resolve the ambiguity.

# The Proposal

1. Indirect references shall always contain a mode specifier. Plain brackets are no longer allowed.
    ```asm
    MOV EAX, [RCX]                         ; invalid: operand size and mode specifier are required
    MOV EAX, DWORD [RCX]                   ; invalid: mode specifier is required
    MOV EAX, DWORD PTR [RCX]               ; valid: `movl (%rcx), %eax`
    VMULPD ZMM0, ZMM1, QWORD BCST [RCX]    ; valid: `vmulpd (%rcx){1to8}, %zmm1, %zmm0`
    LEA RAX, bx[RIP]                       ; invalid: operand size and mode specifier are required
    LEA RAX, BYTE PTR bx[RIP]              ; valid: `leaq bx(%rip), %rax`
    ```

2. Overriding segment registers shall occur before the operand size and mode specifier.
    ```asm
    MOV EAX, DWORD PTR CS:[RCX]            ; maybe invalid: symbol name cannot contain `:`
    MOV EAX, CS:DWORD PTR [RCX]            ; valid: `movl %cs:(%rcx), %eax`
    ```

3. If an identifier follows `PTR`, `BCST` or `OFFSET`, then it is always treated as a symbol, even when it is a keyword. In other words, only registers are enclosed within brackets. This idea is shared with GAS syntax.
    ```asm
    MOV EAX, printf                        ; invalid: `printf` is not a known register
    MOV EAX, OFFSET printf                 ; valid: `movl $printf, %eax`
    MOV EAX, RCX                           ; invalid: operand size mismatch
    MOV EAX, OFFSET RCX                    ; valid: `movl $RCX, %eax`
    MOV EAX, DWORD PTR [RCX]               ; valid: `movl (%rcx), %eax`
    MOV EAX, DWORD PTR RCX                 ; valid: `movl RCX, %eax`
    MOV EAX, DWORD PTR RCX[RIP+10]         ; valid: `movl RCX+10(%rip), %eax`
    ```

4. For instructions with a dummy memory operand (`LEA`, `NOP`, etc.) and those with an uncommon size (`FXSAVE`/`FXRSTOR`, `FNSAVE`/`FNRSTOR`, etc.), `BYTE PTR` shall be used.
    ```asm
    NOP DWORD PTR [RAX], EAX               ; invalid: `BYTE PTR` is required
    NOP BYTE PTR [RAX], EAX                ; valid: 0F 1F 00
    ```

5. RIP-relative operands must have `RIP` as the base register.
    ```asm
    MOV EBX, DWORD PTR foo                 ; valid: `movl foo, %ebx`
                                           ; note: might cause linker errors on x86-64
    MOV EBX, DWORD PTR foo[RIP]            ; valid: `movl foo(%rip), %ebx`
    ```

6. The base, index, scale and displacement parts of a memory operand shall appear in a uniform order. The displacement comes first, immediately following the mode specifier. If there is a base or index register, all registers are placed within a single pair of square brackets. This idea is also shared with GAS syntax.
    ```asm
    MOV ECX, DWORD PTR [RSI+RDI*4+field]   ; invalid: `field` is not a known register
    MOV ECX, DWORD PTR field[RSI+RDI*4]    ; valid: `movl field(%rsi,%rdi,4), %ecx`
    ```
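
To show how rules 1, 2, 3 and 6 remove the ambiguity, here is a minimal Python sketch that translates a strict-mode memory operand into its AT&T form. It is an illustration under stated assumptions, not an implementation of GAS: `to_att`, `OPERAND`, `REGISTERS` and `SEGMENTS` are hypothetical names, the register list is a small subset, only `PTR` operands are handled (no `BCST`), and the size keyword is not validated.

```python
import re

# Illustrative subsets only; a real assembler knows many more names.
REGISTERS = {"rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi", "rip"}
SEGMENTS = {"cs", "ds", "es", "fs", "gs", "ss"}

OPERAND = re.compile(
    r"(?:(?P<seg>\w+):)?"            # rule 2: segment override comes first
    r"(?P<size>\w+)\s+PTR\s+"        # rule 1: size and mode specifier are mandatory
    r"(?P<disp>[^\s\[\]]+)?"         # rules 3 and 6: symbol or displacement first
    r"(?:\[(?P<regs>[^\[\]]*)\])?",  # rule 3: brackets hold registers
    re.IGNORECASE,
)


def to_att(operand: str) -> str:
    """Translate e.g. 'DWORD PTR field[RSI+RDI*4]' to 'field(%rsi,%rdi,4)'."""
    m = OPERAND.fullmatch(operand.strip())
    if m is None:
        raise ValueError(f"not a strict-mode memory operand: {operand!r}")
    seg, disp, regs = m["seg"], m["disp"], m["regs"]
    if seg is not None and seg.lower() not in SEGMENTS:
        raise ValueError(f"{seg!r} is not a segment register")
    if disp is not None and ":" in disp:
        raise ValueError("a symbol name cannot contain ':'")

    base = index = None
    scale = "1"
    for term in ([] if regs is None else regs.split("+")):
        name, star, factor = (part.strip() for part in term.partition("*"))
        if name.isdigit() and not star:
            # A plain number such as the 10 in [RIP+10] joins the displacement.
            disp = name if disp is None else f"{disp}+{name}"
        elif name.lower() not in REGISTERS:
            raise ValueError(f"{name!r} is not a known register")
        elif star or base is not None:
            index, scale = name.lower(), factor or "1"
            if scale not in {"1", "2", "4", "8"}:
                raise ValueError(f"invalid scale: {scale!r}")
        else:
            base = name.lower()

    text = disp or ""
    if base or index:
        inner = f"%{base}" if base else ""
        if index:
            inner += f",%{index},{scale}"
        text += f"({inner})"
    return f"%{seg.lower()}:{text}" if seg else text


if __name__ == "__main__":
    for example in ("DWORD PTR [RCX]", "DWORD PTR RCX", "DWORD PTR field[RSI+RDI*4]"):
        print(f"{example:30} -> {to_att(example)}")
```

Because the position of a name decides its meaning, `DWORD PTR RCX` and `DWORD PTR [RCX]` translate to different operands even though both spell `RCX`.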

# External Links

1. GCC [Bug 53929 - [meta-bug] -masm=intel with global symbol](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53929)
