public inbox for gcc-bugs@sourceware.org
* [Bug target/96846] New: [x86] Prefer xor/test/setcc over test/setcc/movzx sequence
From: andysem at mail dot ru @ 2020-08-29 17:47 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96846
Bug ID: 96846
Summary: [x86] Prefer xor/test/setcc over test/setcc/movzx
sequence
Product: gcc
Version: 10.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: andysem at mail dot ru
Target Milestone: ---
This has already been reported in bug 86352, but that bug also describes a few
other issues, so I decided to create a separate bug focused on this one
particular issue.
When performing boolean operations, gcc often prefers the following pattern:
1. Test instruction (test/cmp).
2. "setcc %al" (where al can be any 8-bit register).
3. If the bool needs to be further processed, "movzx %al, %eax" (where al and
eax are the 8- and 32-bit versions of the register picked in #2).
The example code can be seen here:
https://gcc.godbolt.org/z/z3WGbq
For convenience I'm posting the code below:
bool narrow_boolean(int a) { return a!=5; }
unsigned int countmatch(unsigned int arr[], unsigned int key)
{
unsigned count = 0;
for (int i=0; i<1024 ; i++) {
count += ((arr[i] & key) != 5);
}
return count;
}
narrow_boolean(int):
cmp edi, 5
setne al
ret
countmatch(unsigned int*, unsigned int):
lea rcx, [rdi+4096]
xor eax, eax
.L4:
mov edx, DWORD PTR [rdi]
and edx, esi
cmp edx, 5
setne dl
movzx edx, dl
add rdi, 4
add eax, edx
cmp rcx, rdi
jne .L4
ret
Command line parameters: -Wall -O3 -mtune=skylake -fno-tree-vectorize
The problem with the generated code is as follows:
- The setcc instruction only modifies the lower 8 bits of the full
architectural register. The upper bits remain unmodified and potentially
"dirty", meaning that subsequent instructions taking this register as input
may require merging the full register value, with a performance penalty.
- Since setcc preserves the upper bits, subsequent instructions consuming the
output register become dependent on whatever previously wrote that register.
This creates an unnecessary dependency, which is especially a problem in the
case of narrow_boolean above.
- On Haswell and later, "movzx %al, %eax" is not eliminated at register rename:
it consumes an ALU uop and has non-zero latency. "movzx %al, %ebx" (i.e. when
the output register is different from the input) is eliminated at the rename
stage, but requires an additional register. gcc seems to prefer the former
form unless -frename-registers is specified.
See this excellent StackOverflow question and answers for details:
https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to
(BTW, the code in this bug originated from that question.)
A better instruction pattern would be this:
1. Zero the target flag register beforehand with "xor %eax, %eax".
2. Perform the test.
3. "setcc %al" to set the flag.
If this pattern is in a loop body, the initial xor can be hoisted out
of the loop. The important part is that the xor eliminates the dependency and
doesn't cost an ALU uop. Arguably, it also makes for smaller code, since xor is
only 2 bytes vs. 3 bytes for movzx.
The initial zeroing requires an additional register, meaning that it cannot
reuse the register involved in the test in #2. However, that is often not a
problem, as in the examples above. I guess the compiler could estimate whether
there are spare registers, and use the xor/test/setcc sequence if there are and
test/setcc/movzx if not.
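To illustrate, here is a hand-written sketch (not actual compiler output) of the
countmatch loop using the proposed pattern, with r8d assumed as the spare
register holding the flag:

```asm
countmatch(unsigned int*, unsigned int):
        lea     rcx, [rdi+4096]
        xor     eax, eax              ; count = 0
        xor     r8d, r8d              ; hoisted zeroing of the setcc target
.L4:
        mov     edx, DWORD PTR [rdi]
        and     edx, esi
        cmp     edx, 5
        setne   r8b                   ; upper bits of r8d are already known zero
        add     rdi, 4
        add     eax, r8d              ; no movzx, no merge of dirty upper bits
        cmp     rcx, rdi
        jne     .L4
        ret
```

The loop body shrinks by one instruction, at the cost of one extra register and
one xor outside the loop.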
^ permalink raw reply [flat|nested] 5+ messages in thread
* [Bug target/96846] [x86] Prefer xor/test/setcc over test/setcc/movzx sequence
From: jakub at gcc dot gnu.org @ 2020-08-29 19:39 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96846
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |jakub at gcc dot gnu.org
--- Comment #1 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
This is something GCC is well aware of and attempts to optimize; it has several
peephole2s, starting with the
;; Convert setcc + movzbl to xor + setcc if operands don't overlap.
comment in i386.md.
The reason why a xor isn't added in your first function is that there is no
extension in that case; usually callers will just use the 8-bit register, and
then the xor would be a waste of time.
The second case isn't handled because there is no place to insert the xor:
xor modifies flags, so it can't go after the comparison that sets the flags,
and in this case it can't go before either, because the register is live there
(it is an operand of the comparison). So I guess the only way around would be
to look for some currently unused register that could be used instead of the
one chosen by register allocation, but such a register is not always available.
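For reference, a sketch of the transformation that peephole2 performs when the
pattern does match (a hand-written Intel-syntax illustration of the i386.md
comment quoted above, not the actual RTL pattern):

```asm
; before: setcc + movzx, where the movzx destination (eax after
; zero-extension) does not overlap the comparison operands
        cmp     edi, 5
        setne   al
        movzx   eax, al

; after: the zeroing xor is placed before the flag-setting comparison,
; because xor itself clobbers the flags
        xor     eax, eax
        cmp     edi, 5
        setne   al
```

This only fires when the insns are adjacent, which is why the loops above are
not caught.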
* [Bug target/96846] [x86] Prefer xor/test/setcc over test/setcc/movzx sequence
From: andysem at mail dot ru @ 2020-08-29 20:53 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96846
--- Comment #2 from andysem at mail dot ru ---
(In reply to Jakub Jelinek from comment #1)
In the godbolt link there is also a third case, which is similar to the second
one, but does not reuse the source register for comparison results.
unsigned int doubletmatch(unsigned int arr[], unsigned int key)
{
unsigned count = 0;
for (int i=0; i<1024 ; i++) {
count += (arr[i] == key) | (arr[i] == ~key);
}
return count;
}
doubletmatch(unsigned int*, unsigned int):
mov r9d, esi
not r9d
lea rcx, [rdi+4096]
xor r8d, r8d
.L7:
mov edx, DWORD PTR [rdi]
cmp edx, esi
sete al
cmp edx, r9d
sete dl
or eax, edx
movzx eax, al
add rdi, 4
add r8d, eax
cmp rcx, rdi
jne .L7
mov eax, r8d
ret
Note that in this case eax is not cleared either, so the "or eax, edx" has a
dependency on whatever prior instruction wrote to eax (before sete).
* [Bug target/96846] [x86] Prefer xor/test/setcc over test/setcc/movzx sequence
From: jakub at gcc dot gnu.org @ 2020-08-29 21:10 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96846
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |uros at gcc dot gnu.org
--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
mov edx, DWORD PTR [rdi]
cmp edx, esi
sete al
cmp edx, r9d
sete dl
or eax, edx
movzx eax, al
This isn't what the peepholes are looking for: there are several other insns in
between, and peephole2s only work on exact insn sequences; doing anything more
complex would require a machine-specific pass.
Note, while in theory it could add xor eax, eax before the cmp edx, esi insn,
it can't add xor edx, edx because the second comparison uses that register.
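A hand-written sketch (not compiler output) of what that theoretical partial
fix would look like: the xor for eax fits before the first comparison, while
edx must keep its movzx because it still holds the loaded value at that point:

```asm
.L7:
        mov     edx, DWORD PTR [rdi]
        xor     eax, eax              ; eax is dead here; breaks the dependency
        cmp     edx, esi
        sete    al                    ; upper bits of eax are already zero
        cmp     edx, r9d
        sete    dl
        movzx   edx, dl               ; still required: edx was live as the
                                      ; comparison operand, so no xor fits
        or      eax, edx
        add     rdi, 4
        add     r8d, eax
        cmp     rcx, rdi
        jne     .L7
```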
* [Bug target/96846] [x86] Prefer xor/test/setcc over test/setcc/movzx sequence
From: andysem at mail dot ru @ 2020-08-29 21:34 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96846
--- Comment #4 from andysem at mail dot ru ---
(In reply to Jakub Jelinek from comment #3)
> mov edx, DWORD PTR [rdi]
> cmp edx, esi
> sete al
> cmp edx, r9d
> sete dl
> or eax, edx
> movzx eax, al
> This isn't what the peepholes are looking for, there are several other insns
> in between, and peephole2s only work on exact insn sequences, doing anything
> more complex would require doing it in some machine specific pass.
Yes, I think this optimization needs to happen at an earlier stage. Rewriting
fixed instruction sequences doesn't allow for further optimizations like
hoisting the xor out of the loop body.
> Note, while in theory it could add xor eax, eax before the cmp edx, esi
> insn, it can't add xor edx, edx because the second comparison uses that
> register.
I don't think it should generate "xor edx, edx". I think the logic has to be
roughly something like this:
1. Check if there is a spare register that we can use for the test result. If
there is, allocate it.
2. If we have a register, clear it with a xor before the test. Ideally, move
that xor out of the loop.
3. If not, decide if we are going to reuse one of the source registers or spill
some other register.
4. In the former case, keep the test/setcc/movzx sequence. In the latter, we
can still use xor/test/setcc after spilling the victim register.
I.e. the main point is that it shouldn't try so hard to reuse the source
register; only reuse when you have to. Maybe this requires some help from the
register allocator.
I admit I have little knowledge of how gcc works internally, so I may be
talking nonsense. These are just my naive thoughts about it.