From: "roger at nextmovesoftware dot com"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/113764] [X86] Generates lzcnt when bsr is sufficient
Date: Fri, 09 Feb 2024 18:35:25 +0000
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 13.2.1
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Status: NEW
X-Bugzilla-Priority: P3

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113764

--- Comment #2 from Roger Sayle ---
Investigating further, the thinking behind GCC's current behaviour can be
found in Agner Fog's instruction tables; on many architectures BSR is much
slower than LZCNT.

Legacy AMD:       BSR=4 cycles,  LZCNT=2 cycles
AMD BOBCAT:       BSR=6 cycles,  LZCNT=5 cycles
AMD JAGUAR:       BSR=4 cycles,  LZCNT=1 cycle
AMD ZEN[1-3]:     BSR=4 cycles,  LZCNT=1 cycle
AMD ZEN4:         BSR=1 cycle,   LZCNT=1 cycle
INTEL:            BSR=3 cycles,  LZCNT=3 cycles
KNIGHTS LANDING:  BSR=11 cycles, LZCNT=3 cycles

Hence using bsr is only "better" in some (but not all) contexts, and a
reasonable default (for generic tuning) is to ignore BSR when LZCNT is
available, as it's only one extra cycle of latency to perform the XOR.

The correct solution is to add a tuning parameter to the x86 backend, to
control whether it's beneficial to use BSR when LZCNT is available, for
example when optimizing for size with -Os or -Oz. This is more reasonable
now that current Intel and AMD architectures have the same latency for BSR
and LZCNT, than when LZCNT first appeared (explaining !TARGET_LZCNT in
i386.md).
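
For readers following along, the XOR mentioned above comes from the identity BSR(x) == 31 ^ LZCNT(x) for any non-zero 32-bit x. The sketch below is illustrative only and is not the test case from this report: bsr_ref, lzcnt_ref and floor_log2 are hypothetical helper names, the two reference functions are portable loops rather than the actual instructions, and unsigned int is assumed to be 32 bits wide, as it is on x86.

```c
#include <assert.h>
#include <stdint.h>

/* Portable model of BSR: index of the highest set bit.
   Like the instruction, it has no meaningful result for x == 0. */
static unsigned bsr_ref(uint32_t x)
{
    assert(x != 0);
    unsigned idx = 0;
    while (x >>= 1)
        idx++;
    return idx;
}

/* Portable model of 32-bit LZCNT: number of leading zero bits.
   Returns 32 when x == 0. */
static unsigned lzcnt_ref(uint32_t x)
{
    unsigned n = 0;
    for (uint32_t mask = UINT32_C(1) << 31;
         mask != 0 && (x & mask) == 0;
         mask >>= 1)
        n++;
    return n;
}

/* floor(log2(x)) written with the GCC/Clang builtin.
   Assumes 32-bit unsigned int; __builtin_clz(0) is undefined. */
static unsigned floor_log2(uint32_t x)
{
    assert(x != 0);
    return 31u ^ (unsigned)__builtin_clz(x);
}
```

An expression of this shape is where the choice matters: a single BSR yields the result directly, whereas going through LZCNT needs the additional XOR with 31.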