From: "jl1184 at duke dot edu"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug middle-end/106277] New: missed-optimization: redundant movzx
Date: Wed, 13 Jul 2022 02:16:52 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106277

            Bug ID: 106277
           Summary: missed-optimization: redundant movzx
           Product: gcc
           Version: 12.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jl1184 at duke dot edu
  Target Milestone: ---

I came across this when examining a loop that runs slower than I expected. It
involves explicit and implicit conversions between 8-bit and 32/64-bit values,
and as I looked through the generated assembly using the Godbolt compiler
explorer, I found lots of movzx instructions that don't seem to break a
dependency or play a role in correctness. Many of them even use the same
register, like "movzx eax, al", which cannot be mov-eliminated.

I then tried some simple examples on Godbolt with x86-64 GCC 12.1, and found
that this behavior is persistent and easily reproducible, even when I specify
"-march=skylake". Here's an example:

#include <stdint.h>

int add2bytes(uint8_t* a, uint8_t* b) {
    return uint8_t(*a + *b);
}

gcc -O3 gives:

add2bytes(unsigned char*, unsigned char*):
        movzx   eax, BYTE PTR [rsi]
        add     al, BYTE PTR [rdi]
        movzx   eax, al
        ret

The first movzx here breaks the dependency on the old eax value, but what is
the second movzx doing? I don't think there's any dependency it can break, and
it shouldn't affect the result either.

I also asked this on Stack Overflow, and Peter Cordes has a great response
(https://stackoverflow.com/a/72953035/14730360) explaining how this extra
movzx is bad for the vast majority of x86-64 processors. IMHO newer versions
of GCC should give newer processors more weight in the performance tradeoff:
-mtune=generic in a later GCC probably shouldn't care about P6-family
partial-register stalls, since in practice very few people still run freshly
compiled software on those CPUs.
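For reference, a hand-written sketch (not actual compiler output) of the
shorter sequence this report argues should already be enough: the movzx load
zeroes bits 8-31 of eax, and the byte-sized add cannot disturb them, so the
value returned in eax is the same without the trailing movzx:

add2bytes(unsigned char*, unsigned char*):
        movzx   eax, BYTE PTR [rsi]   ; load *b, zeroing bits 8-63 of rax
        add     al, BYTE PTR [rdi]    ; byte add wraps mod 256; bits 8-31 of eax stay 0
        ret                           ; eax already holds uint8_t(*a + *b) zero-extended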
Godbolt link with code for examples: https://godbolt.org/z/4n6ezaav7

Here's another example, closer to what I was originally examining:

int foo(uint8_t* a, uint8_t i, uint8_t j) {
    return a[a[i] | a[j]];
}

gcc -O3 gives:

foo(unsigned char*, unsigned char, unsigned char):
        movzx   esi, sil
        movzx   edx, dl
        movzx   eax, BYTE PTR [rdi+rsi]
        or      al, BYTE PTR [rdi+rdx]
        movzx   eax, al
        movzx   eax, BYTE PTR [rdi+rax]
        ret

As was discussed in the Stack Overflow post, the first two movzx should be
changed to use different destination registers so that some CPUs can benefit
from mov-elimination.

The "movzx eax, al" just seems unnecessary. The upper bits of RAX should
already be cleared, and the dependency of RAX on the "or" is not something
that "movzx eax, al" can break. So I think it's better to just do "movzx eax,
byte ptr [rdi + rax]" after the "or". Or maybe even better, just use "mov al,
byte ptr [rdi + rax]", since EAX should already be free and clean in its upper
bits at this point.
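Putting those suggestions together, a hand-written sketch might look like the
following (again not compiler output; the choice of eax/ecx as scratch
registers is mine, and the final load keeps movzx as in the first suggestion):

foo(unsigned char*, unsigned char, unsigned char):
        movzx   eax, sil                  ; different dest reg, so mov-elimination is possible
        movzx   ecx, dl                   ; likewise
        movzx   eax, BYTE PTR [rdi+rax]   ; a[i], zero-extended
        or      al, BYTE PTR [rdi+rcx]    ; a[i] | a[j]; bits 8-31 of eax are still 0
        movzx   eax, BYTE PTR [rdi+rax]   ; a[a[i] | a[j]], zero-extended for the int return
        ret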