public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug middle-end/106277] New: missed-optimization: redundant movzx
@ 2022-07-13  2:16 jl1184 at duke dot edu
  2022-07-13 15:49 ` [Bug target/106277] " pinskia at gcc dot gnu.org
  2022-07-13 16:06 ` amonakov at gcc dot gnu.org
  0 siblings, 2 replies; 3+ messages in thread
From: jl1184 at duke dot edu @ 2022-07-13  2:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106277

            Bug ID: 106277
           Summary: missed-optimization: redundant movzx
           Product: gcc
           Version: 12.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jl1184 at duke dot edu
  Target Milestone: ---

I came across this when examining a loop that runs slower than I expected. It
involves explicit and implicit conversions between 8-bit and 32/64-bit values,
and as I looked through the generated assembly using Godbolt compiler explorer,
I found lots of movzx instructions that don't seem to break dependency or play
a role in correctness, not to mention many use the same register like "movzx
eax al", which cannot be eliminated.

I then tried some simple examples on Godbolt with X86-64 GCC 12.1, and found
that this behavior is persistent and easily reproducible, even when I specify
"-march=skylake". Here's an example:

#include <stdint.h>

int add2bytes(uint8_t* a, uint8_t* b) {
    return uint8_t(*a + *b);
}
gcc -O3 gives:
add2bytes(unsigned char*, unsigned char*):
        movzx   eax, BYTE PTR [rsi]
        add     al, BYTE PTR [rdi]
        movzx   eax, al
        ret

The first movzx here breaks dependency on old eax value, but what is the second
movzx doing? I don't think there's any dependency it can break, and it
shouldn't affect the result either.

I also asked this on Stack Overflow and [Peter Cordes] has a great response
(https://stackoverflow.com/a/72953035/14730360) explaining how this extra movzx
is bad for the vast majority of X86-64 processors. IMHO newer versions of GCC
should give newer processors more weight in performance tradeoff. Probably
-mtune=generic in a later GCC shouldn't care about P6-family partial-register
stalls. Practically there should be so few still using those CPUs to run latest
compiled softwares.

Godbolt link with code for examples: https://godbolt.org/z/4n6ezaav7
Here's another example closer to what I was originally examining:

int foo(uint8_t* a, uint8_t i, uint8_t j) {
    return a[a[i] | a[j]];
}
gcc -O3 gives:
foo(unsigned char*, unsigned char, unsigned char):
        movzx   esi, sil
        movzx   edx, dl
        movzx   eax, BYTE PTR [rdi+rsi]
        or      al, BYTE PTR [rdi+rdx]
        movzx   eax, al
        movzx   eax, BYTE PTR [rdi+rax]
        ret

As was discussed in the Stack Overflow post, the first 2 movzx should be
changed to use different registers so that some CPUs can have the benefit from
mov elimination.

The "movzx   eax, al" just seems unnecessary. The upper bits of RAX should
already be cleared, and the dependency of RAX on the "or" is not something that
"movzx eax al" can break. So I think it's better to just do "movzx   eax, byte
ptr [rdi + rax]" after the "or". Or maybe even better, just use "mov   eax,
byte ptr [rdi + rax]" since EAX should already be free and cleaned in upper
bits at this point.

^ permalink raw reply	[flat|nested] 3+ messages in thread

* [Bug target/106277] missed-optimization: redundant movzx
  2022-07-13  2:16 [Bug middle-end/106277] New: missed-optimization: redundant movzx jl1184 at duke dot edu
@ 2022-07-13 15:49 ` pinskia at gcc dot gnu.org
  2022-07-13 16:06 ` amonakov at gcc dot gnu.org
  1 sibling, 0 replies; 3+ messages in thread
From: pinskia at gcc dot gnu.org @ 2022-07-13 15:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106277

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2022-07-13
     Ever confirmed|0                           |1

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Because the RTL does not describe the load is zero-extended to SI:
(insn 9 8 10 2 (set (reg:QI 91 [ *b_6(D) ])
        (mem:QI (reg:DI 93) [0 *b_6(D)+0 S1 A8])) "/app/example.cpp":4:31 85
{*movqi_internal}
     (expr_list:REG_DEAD (reg:DI 93)
        (nil)))
(insn 10 9 11 2 (parallel [
            (set (reg:QI 89)
                (plus:QI (mem:QI (reg:DI 92) [0 *a_5(D)+0 S1 A8])
                    (reg:QI 91 [ *b_6(D) ])))
            (clobber (reg:CC 17 flags))
        ]) "/app/example.cpp":4:31 235 {*addqi_1}
     (expr_list:REG_DEAD (reg:DI 92)
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (expr_list:REG_DEAD (reg:QI 91 [ *b_6(D) ])
                (nil)))))
(insn 11 10 16 2 (set (reg:SI 88)
        (zero_extend:SI (reg:QI 89))) "/app/example.cpp":4:35 152
{*zero_extendqisi2}
     (expr_list:REG_DEAD (reg:QI 89)
        (nil)))

That is why.

^ permalink raw reply	[flat|nested] 3+ messages in thread

* [Bug target/106277] missed-optimization: redundant movzx
  2022-07-13  2:16 [Bug middle-end/106277] New: missed-optimization: redundant movzx jl1184 at duke dot edu
  2022-07-13 15:49 ` [Bug target/106277] " pinskia at gcc dot gnu.org
@ 2022-07-13 16:06 ` amonakov at gcc dot gnu.org
  1 sibling, 0 replies; 3+ messages in thread
From: amonakov at gcc dot gnu.org @ 2022-07-13 16:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106277

Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #2 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
You probably mean the addition, not the load.

It cannot: it really is an 8-bit addition, and if pseudo 91 is allocated to
e.g. AH or another "high" 8-bit register, there really will need to be an
explicit zero-extend.

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2022-07-13 16:06 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-13  2:16 [Bug middle-end/106277] New: missed-optimization: redundant movzx jl1184 at duke dot edu
2022-07-13 15:49 ` [Bug target/106277] " pinskia at gcc dot gnu.org
2022-07-13 16:06 ` amonakov at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).