From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 374383858D3C; Fri,  9 Feb 2024 18:35:26 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 374383858D3C
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1707503726;
	bh=TxGwlbBRo0A7EuEOfDW0hrVPyMmjbjFOVfT9Db9WHOw=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=loKjiU92ENAecKMv9cZuODP0O30x1+3K7lzC9LgL2CwFF85omczB8lbjDCrqLXNMc
	 d7kqpn4j9OSzjgyvYXmEnaNxkhFN1s27BuYN+Ogqcd4ccJZyaySzWnWAKXSqMV22BA
	 ZxCNwIMMknEAYRAQGwnQ1HCfdHZq7mhKW597Dur4=
From: "roger at nextmovesoftware dot com" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/113764] [X86] Generates lzcnt when bsr is sufficient
Date: Fri, 09 Feb 2024 18:35:25 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 13.2.1
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: roger at nextmovesoftware dot com
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-113764-4-MTuqbcVwTQ@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-113764-4@http.gcc.gnu.org/bugzilla/>
References: <bug-113764-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113764
--- Comment #2 from Roger Sayle <roger at nextmovesoftware dot com> ---
Investigating further, the thinking behind GCC's current behaviour can be
found in Agner Fog's instruction tables; on many architectures BSR is much
slower than LZCNT.

Legacy AMD:      BSR=4 cycles,  LZCNT=2 cycles
AMD BOBCAT:      BSR=6 cycles,  LZCNT=5 cycles
AMD JAGUAR:      BSR=4 cycles,  LZCNT=1 cycle
AMD ZEN[1-3]:    BSR=4 cycles,  LZCNT=1 cycle
AMD ZEN4:        BSR=1 cycle,   LZCNT=1 cycle
INTEL:           BSR=3 cycles,  LZCNT=3 cycles
KNIGHTS LANDING: BSR=11 cycles, LZCNT=3 cycles

Hence using BSR is only "better" in some (but not all) contexts, and a
reasonable default (for generic tuning) is to ignore BSR when LZCNT is
available, as it's only one extra cycle of latency to perform the XOR that
turns LZCNT's leading-zero count into the bit index BSR would have returned.
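
To make the XOR concrete, here is a minimal C sketch of the identity involved.
It assumes a 32-bit unsigned and a source expression of the form
31 - __builtin_clz(x); the function names are illustrative only and are not
taken from the bug's test case.  For a count n in 0..31, 31 - n equals 31 ^ n,
so the bit index can be formed from a leading-zero count with a single XOR.

```c
#include <assert.h>
#include <limits.h>

/* This sketch assumes a 32-bit unsigned int. */
_Static_assert(sizeof(unsigned) * CHAR_BIT == 32, "sketch assumes 32-bit unsigned");

/* Index of the highest set bit, written as a subtraction.
   __builtin_clz is undefined for 0, so x must be nonzero. */
unsigned highest_bit_sub(unsigned x)
{
    assert(x != 0);
    return 31u - (unsigned)__builtin_clz(x);
}

/* Same value formed with XOR: for n in 0..31, 31 - n == 31 ^ n. */
unsigned highest_bit_xor(unsigned x)
{
    assert(x != 0);
    return (unsigned)__builtin_clz(x) ^ 31u;
}

/* Check both forms on every single-bit value and on the value with all
   lower bits set as well.  Returns 1 if they always agree, 0 otherwise. */
int forms_agree(void)
{
    for (unsigned bit = 0; bit < 32; bit++) {
        unsigned x = 1u << bit;
        unsigned y = x | (x - 1u);
        if (highest_bit_sub(x) != bit || highest_bit_xor(x) != bit)
            return 0;
        if (highest_bit_sub(y) != bit || highest_bit_xor(y) != bit)
            return 0;
    }
    return 1;
}
```

Which instruction sequence a given GCC version emits for either function
depends on the target flags and tuning in effect, so compare the output of
-S with and without -mlzcnt rather than relying on this sketch.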

The correct solution is to add a tuning parameter to the x86 backend, to
control whether it's beneficial to use BSR when LZCNT is available, for
example when optimizing for size with -Os or -Oz.  This is more reasonable
now that current Intel and AMD architectures have the same latency for BSR
and LZCNT than when LZCNT first appeared (explaining !TARGET_LZCNT in
i386.md).
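
As a rough illustration of what such a tuning decision could look like, the
following C sketch applies the latencies listed above.  Everything in it is
hypothetical: the struct, the table name and prefer_bsr() are invented for
this example and do not correspond to anything in the x86 backend.  It
assumes the XOR costs one cycle, as stated above, and that BSR is always
preferred when optimizing for size.

```c
#include <stddef.h>

/* Hypothetical record of the latencies quoted above, in cycles. */
struct clz_latency {
    const char *name;
    int bsr;
    int lzcnt;
};

/* Figures copied from the list above (Agner Fog's instruction tables). */
const struct clz_latency latency_table[] = {
    { "Legacy AMD",      4, 2 },
    { "AMD Bobcat",      6, 5 },
    { "AMD Jaguar",      4, 1 },
    { "AMD Zen 1-3",     4, 1 },
    { "AMD Zen 4",       1, 1 },
    { "Intel",           3, 3 },
    { "Knights Landing", 11, 3 },
};

const size_t latency_table_len =
    sizeof latency_table / sizeof latency_table[0];

/* Hypothetical predicate: prefer BSR when optimizing for size, or when
   BSR alone is strictly faster than LZCNT plus a one-cycle XOR. */
int prefer_bsr(const struct clz_latency *l, int optimize_size)
{
    if (optimize_size)
        return 1;
    return l->bsr < l->lzcnt + 1;
}

/* Count the table entries for which the predicate picks BSR. */
size_t count_prefer_bsr(int optimize_size)
{
    size_t n = 0;
    for (size_t i = 0; i < latency_table_len; i++)
        n += (size_t)prefer_bsr(&latency_table[i], optimize_size);
    return n;
}
```

Under these assumptions only the Zen 4 and Intel rows favour BSR on latency,
while Bobcat is a tie (6 cycles either way) and stays with LZCNT.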