public inbox for gcc-bugs@sourceware.org
* [Bug target/96846] New: [x86] Prefer xor/test/setcc over test/setcc/movzx sequence
From: andysem at mail dot ru @ 2020-08-29 17:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96846

            Bug ID: 96846
           Summary: [x86] Prefer xor/test/setcc over test/setcc/movzx
                    sequence
           Product: gcc
           Version: 10.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: andysem at mail dot ru
  Target Milestone: ---

This has already been reported in bug 86352, but that bug also describes a few
other issues, so I decided to create a separate bug focused on this one
particular issue.

When performing boolean operations, gcc often prefers the following pattern:

1. A test instruction (test/cmp).
2. "setcc %al" (where al can be any 8-bit register).
3. If the bool needs to be processed further, "movzx %al, %eax" (where al and
eax are the 8-bit and 32-bit versions of the register picked in #2).

The example code can be seen here:

https://gcc.godbolt.org/z/z3WGbq

For convenience I'm posting the code below:

bool narrow_boolean(int a) { return a!=5; }

unsigned int countmatch(unsigned int arr[], unsigned int key)
{
    unsigned count = 0;
    for (int i=0; i<1024 ; i++) {
        count += ((arr[i] & key) != 5);
    }
    return count;
}

narrow_boolean(int):
        cmp     edi, 5
        setne   al
        ret

countmatch(unsigned int*, unsigned int):
        lea     rcx, [rdi+4096]
        xor     eax, eax
.L4:
        mov     edx, DWORD PTR [rdi]
        and     edx, esi
        cmp     edx, 5
        setne   dl
        movzx   edx, dl
        add     rdi, 4
        add     eax, edx
        cmp     rcx, rdi
        jne     .L4
        ret

Command line parameters: -Wall -O3 -mtune=skylake -fno-tree-vectorize

The problems with the generated code are as follows:

- The setcc instruction only modifies the lower 8 bits of the full
architectural register. The upper bits remain unmodified and potentially
"dirty", meaning that subsequent instructions taking this register as input
may require merging the full register value, with a performance penalty.
- Since setcc preserves the upper bits, subsequent instructions consuming the
output register become dependent on the previous instructions that wrote that
register. This creates an unnecessary dependency, which is especially a
problem in the case of narrow_boolean above.
- On Haswell and later, "movzx %al, %eax" is not eliminated at register
rename; it consumes an ALU uop and has non-zero latency. "movzx %al, %ebx"
(i.e. when the output register is different from the input) is eliminated at
the rename stage, but requires an additional register. Still, gcc seems to
prefer the former form unless -frename-registers is specified.
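
For illustration, here is how the two forms differ (a hand-written sketch,
not compiler output):

        # same input and output register: not eliminated at rename
        setne   al
        movzx   eax, al

        # different output register: eliminated at rename, but occupies ebx
        setne   al
        movzx   ebx, al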

See this excellent StackOverflow question and answers for details:

https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to

(BTW, the code in this bug originated from that question.)

A better instruction pattern would be this:

1. Zero the target register beforehand with "xor %eax, %eax".
2. Perform the test.
3. "setcc %al" to set the flag.

If this pattern is in a loop body, the initial xor can be hoisted out of the
loop. The important part is that the xor eliminates the dependency and doesn't
cost an ALU uop. Arguably, it also makes for smaller code, since xor is only 2
bytes vs. 3 bytes for movzx.
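
For illustration, with the xor hoisted the countmatch loop could look like
this (a hand-written sketch with my own register choices, not actual gcc
output):

countmatch(unsigned int*, unsigned int):
        lea     rcx, [rdi+4096]
        xor     eax, eax
        xor     edx, edx        # hoisted; upper bits of edx stay zero for the whole loop
.L4:
        mov     r8d, DWORD PTR [rdi]
        and     r8d, esi
        cmp     r8d, 5
        setne   dl              # writes only the low byte; no movzx needed
        add     rdi, 4
        add     eax, edx
        cmp     rcx, rdi
        jne     .L4
        ret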

The initial zeroing requires an additional register, meaning that it cannot
reuse the register involved in the test in #2. However, that is often not a
problem, as in the examples above. I guess the compiler could estimate whether
there are spare registers and use the xor/test/setcc sequence if there are,
and test/setcc/movzx if not.


* [Bug target/96846] [x86] Prefer xor/test/setcc over test/setcc/movzx sequence
From: jakub at gcc dot gnu.org @ 2020-08-29 19:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96846

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #1 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
This is something GCC is well aware of and attempts to optimize; it has
several peephole2s, starting with the
;; Convert setcc + movzbl to xor + setcc if operands don't overlap.
comment in i386.md.
The reason no xor is added in your first function is that there is no
extension in that case; callers will usually just use the 8-bit register, and
then the xor would be a waste of time.
The second case isn't handled because there is no place to insert the xor:
xor modifies the flags, so it can't go after the comparison that sets them,
and in this case it can't go before either, because the register is live there
(it is an operand of the comparison).  So I guess the only way around would be
to look for some currently unused register that could be used instead of the
one chosen by register allocation, but such a register is not always
available.
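
For reference, the transform those peephole2s perform looks roughly like
this on narrow_boolean-style code (a hand-written illustration, not the
actual i386.md pattern):

        # before: setcc followed by zero-extension
        cmp     edi, 5
        setne   al
        movzx   eax, al

        # after: valid because eax is not an operand of the cmp
        xor     eax, eax
        cmp     edi, 5
        setne   al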


* [Bug target/96846] [x86] Prefer xor/test/setcc over test/setcc/movzx sequence
From: andysem at mail dot ru @ 2020-08-29 20:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96846

--- Comment #2 from andysem at mail dot ru ---
(In reply to Jakub Jelinek from comment #1)

In the godbolt link there is also a third case, which is similar to the second
one, but does not reuse the source register for comparison results.

unsigned int doubletmatch(unsigned int arr[], unsigned int key)
{
    unsigned count = 0;
    for (int i=0; i<1024 ; i++) {
        count += (arr[i] == key) | (arr[i] == ~key);
    }
    return count;
}

doubletmatch(unsigned int*, unsigned int):
        mov     r9d, esi
        not     r9d
        lea     rcx, [rdi+4096]
        xor     r8d, r8d
.L7:
        mov     edx, DWORD PTR [rdi]
        cmp     edx, esi
        sete    al
        cmp     edx, r9d
        sete    dl
        or      eax, edx
        movzx   eax, al
        add     rdi, 4
        add     r8d, eax
        cmp     rcx, rdi
        jne     .L7
        mov     eax, r8d
        ret

Note that in this case eax is not cleared either, so the "or eax, edx" has a
dependency on whatever prior instruction wrote to eax (before sete).
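
For comparison, a variant that avoids both the movzx and the false
dependency needs two zeroed registers; here is a hand-written sketch of the
inner loop, assuming r10d happens to be free as a spare:

.L7:
        mov     edx, DWORD PTR [rdi]
        xor     eax, eax        # dependency-breaking zeroing, placed before the
        xor     r10d, r10d      # comparisons so their flags are not clobbered
        cmp     edx, esi
        sete    al
        cmp     edx, r9d
        sete    r10b
        or      eax, r10d
        add     rdi, 4
        add     r8d, eax
        cmp     rcx, rdi
        jne     .L7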


* [Bug target/96846] [x86] Prefer xor/test/setcc over test/setcc/movzx sequence
From: jakub at gcc dot gnu.org @ 2020-08-29 21:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96846

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |uros at gcc dot gnu.org

--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
        mov     edx, DWORD PTR [rdi]
        cmp     edx, esi
        sete    al
        cmp     edx, r9d
        sete    dl
        or      eax, edx
        movzx   eax, al
This isn't what the peepholes are looking for: there are several other insns
in between, and peephole2s only work on exact insn sequences; doing anything
more complex would require a machine-specific pass.
Note that while in theory it could add xor eax, eax before the cmp edx, esi
insn, it can't add xor edx, edx, because the second comparison uses that
register.


* [Bug target/96846] [x86] Prefer xor/test/setcc over test/setcc/movzx sequence
From: andysem at mail dot ru @ 2020-08-29 21:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96846

--- Comment #4 from andysem at mail dot ru ---
(In reply to Jakub Jelinek from comment #3)
>         mov     edx, DWORD PTR [rdi]
>         cmp     edx, esi
>         sete    al
>         cmp     edx, r9d
>         sete    dl
>         or      eax, edx
>         movzx   eax, al
> This isn't what the peepholes are looking for: there are several other
> insns in between, and peephole2s only work on exact insn sequences; doing
> anything more complex would require a machine-specific pass.

Yes, I think this optimization needs to happen at an earlier stage. Rewriting
fixed instruction sequences doesn't allow for further optimizations, such as
hoisting the xor out of the loop body.

> Note that while in theory it could add xor eax, eax before the cmp edx, esi
> insn, it can't add xor edx, edx, because the second comparison uses that
> register.

I don't think it should generate "xor edx, edx". I think the logic would have
to be roughly something like this:

1. Check if there is a spare register that we can use for the test result. If
there is, allocate it.
2. If we have such a register, clear it with a xor before the test. Ideally,
move that xor out of the loop.
3. If not, decide whether we are going to reuse one of the source registers or
spill some other register.
4. In the former case, keep the test/setcc/movzx sequence. In the latter, we
can still use xor/test/setcc after spilling the victim register.

I.e. the main point is that it shouldn't be so eager to reuse the source
register; only reuse it when you have to. Maybe this requires some help from
the register allocator.

I admit I have little knowledge of how gcc works internally, so I may be
talking nonsense. These are just my naive thoughts about it.

