From: "jl1184 at duke dot edu"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug middle-end/106277] New: missed-optimization: redundant movzx
Date: Wed, 13 Jul 2022 02:16:52 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106277

            Bug ID: 106277
           Summary: missed-optimization: redundant movzx
           Product: gcc
           Version: 12.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jl1184 at duke dot edu
  Target Milestone: ---

I came across this when examining a loop that runs slower than I expected. It
involves explicit and implicit conversions between 8-bit and 32/64-bit values,
and as I looked through the generated assembly using the Godbolt compiler
explorer, I found lots of movzx instructions that don't seem to break a
dependency or play a role in correctness. Many of them even use the same
register, like "movzx eax, al", which cannot be mov-eliminated.

I then tried some simple examples on Godbolt with x86-64 GCC 12.1, and found
that this behavior is persistent and easily reproducible, even when I specify
"-march=skylake". Here's an example:

#include <stdint.h>

int add2bytes(uint8_t* a, uint8_t* b) {
    return uint8_t(*a + *b);
}

gcc -O3 gives:

add2bytes(unsigned char*, unsigned char*):
        movzx   eax, BYTE PTR [rsi]
        add     al, BYTE PTR [rdi]
        movzx   eax, al
        ret

The first movzx here breaks the dependency on the old eax value, but what is
the second movzx doing? I don't think there's any dependency it can break, and
it shouldn't affect the result either.

I also asked this on Stack Overflow, and Peter Cordes has a great response
(https://stackoverflow.com/a/72953035/14730360) explaining how this extra
movzx is bad for the vast majority of x86-64 processors. IMHO newer versions
of GCC should give newer processors more weight in the performance tradeoff:
-mtune=generic in a later GCC probably shouldn't care about P6-family
partial-register stalls, since in practice very few people still run freshly
compiled software on those CPUs.
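For reference, a hand-written sketch (not actual compiler output) of the
shorter sequence this report argues should already be enough: the movzx load
zeroes bits 8-31 of eax, and the byte-sized add cannot disturb them, so the
value returned in eax is the same without the trailing movzx:

add2bytes(unsigned char*, unsigned char*):
        movzx   eax, BYTE PTR [rsi]   ; load *b, zeroing bits 8-63 of rax
        add     al, BYTE PTR [rdi]    ; byte add wraps mod 256; bits 8-31 of eax stay 0
        ret                           ; eax already holds uint8_t(*a + *b) zero-extended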
Godbolt link with code for examples: https://godbolt.org/z/4n6ezaav7

Here's another example, closer to what I was originally examining:

int foo(uint8_t* a, uint8_t i, uint8_t j) {
    return a[a[i] | a[j]];
}

gcc -O3 gives:

foo(unsigned char*, unsigned char, unsigned char):
        movzx   esi, sil
        movzx   edx, dl
        movzx   eax, BYTE PTR [rdi+rsi]
        or      al, BYTE PTR [rdi+rdx]
        movzx   eax, al
        movzx   eax, BYTE PTR [rdi+rax]
        ret

As was discussed in the Stack Overflow post, the first two movzx should be
changed to use different destination registers so that some CPUs can benefit
from mov-elimination.

The "movzx eax, al" just seems unnecessary. The upper bits of RAX should
already be cleared, and the dependency of RAX on the "or" is not something
that "movzx eax, al" can break. So I think it's better to just do "movzx eax,
byte ptr [rdi + rax]" after the "or". Or maybe even better, just use "mov al,
byte ptr [rdi + rax]", since EAX should already be free and clean in its upper
bits at this point.
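Putting those suggestions together, a hand-written sketch might look like the
following (again not compiler output; the choice of eax/ecx as scratch
registers is mine, and the final load keeps movzx as in the first suggestion):

foo(unsigned char*, unsigned char, unsigned char):
        movzx   eax, sil                  ; different dest reg, so mov-elimination is possible
        movzx   ecx, dl                   ; likewise
        movzx   eax, BYTE PTR [rdi+rax]   ; a[i], zero-extended
        or      al, BYTE PTR [rdi+rcx]    ; a[i] | a[j]; bits 8-31 of eax are still 0
        movzx   eax, BYTE PTR [rdi+rax]   ; a[a[i] | a[j]], zero-extended for the int return
        ret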