From: "janschultke at googlemail dot com"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug c++/113543] New: Poor codegen for bit-counting functions (countl_zero, countl_one, countr_zero, countr_one)
Date: Mon, 22 Jan 2024 16:46:40 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113543

            Bug ID: 113543
           Summary: Poor codegen for bit-counting functions (countl_zero,
                    countl_one, countr_zero, countr_one)
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: janschultke at googlemail dot com
  Target Milestone: ---

## Code to Reproduce (https://godbolt.org/z/qPeszhaPv)

#include <bit>

template <typename T>
T countr_zero(T x) {
    return std::countr_zero(x);
}

template unsigned char countr_zero(unsigned char);
template unsigned short countr_zero(unsigned short);
template unsigned int countr_zero(unsigned int);
template unsigned long countr_zero(unsigned long);
template unsigned long long countr_zero(unsigned long long);

template <typename T>
T countr_one(T x) {
    return std::countr_one(x);
}

template unsigned char countr_one(unsigned char);
template unsigned short countr_one(unsigned short);
template unsigned int countr_one(unsigned int);
template unsigned long countr_one(unsigned long);
template unsigned long long countr_one(unsigned long long);

template <typename T>
T countl_zero(T x) {
    return std::countl_zero(x);
}

template unsigned char countl_zero(unsigned char);
template unsigned short countl_zero(unsigned short);
template unsigned int countl_zero(unsigned int);
template unsigned long countl_zero(unsigned long);
template unsigned long long countl_zero(unsigned long long);

template <typename T>
T countl_one(T x) {
    return std::countl_zero(x);
}

template unsigned char countl_one(unsigned char);
template unsigned short countl_one(unsigned short);
template unsigned int countl_one(unsigned int);
template unsigned long countl_one(unsigned long);
template unsigned long long countl_one(unsigned long long);

## Summary

GCC consistently emits much more code for these functions than clang. For
example, GCC emits the following for countl_one (which, as written above,
forwards to std::countl_zero):

> unsigned int countl_one(unsigned int):
>   xor eax, eax
>   lzcnt eax, edi
>   ret

Clang does not emit the extra xor instruction. I don't really know why GCC
does. LZCNT has a wide contract and should be equivalent to std::countl_zero.

It gets a lot worse though:

> unsigned short countl_zero(unsigned short):
>   mov eax, 16
>   test di, di
>   je .L23
>   movzx edi, di
>   lzcnt edi, edi
>   lea eax, [rdi-16]
> .L23:
>   ret

I don't really know what all of this schmutz is. Clang emits lzcnt and ret in
this case.

Another bit of disappointing codegen is this:

> unsigned char countr_zero(unsigned char):
>   movzx eax, dil
>   xor edx, edx
>   tzcnt edx, eax
>   test dil, dil
>   mov eax, 8
>   cmovne eax, edx
>   ret

Clang emits:

>   or edi, 256
>   tzcnt eax, edi
>   ret

This clang codegen is very clever. It sets bit 8, one position above the 8-bit
input, so that tzcnt returns 8 for a zero input and the 32-bit routine can be
re-used with only one additional instruction.