From: "wilco at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/90838] Detect table-based ctz implementation
Date: Fri, 17 Feb 2023 14:27:19 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

--- Comment #17 from Wilco ---
(In reply to Jakub Jelinek from comment #16)
> (In reply to Wilco from comment #15)
> > It would make more sense to move x86 backends to CTZ_DEFINED_VALUE_AT_ZERO
> > == 2 so that you always get the same result even when you don't have tzcnt.
> > A conditional move would be possible, so it adds an extra 2 instructions at
> > worst (ie. still significantly faster than doing the table lookup, multiply
> > etc).
> > And it could be optimized when you know the CLZ/CTZ input is non-zero.
>
> Conditional moves are a lottery on x86, in many cases a very bad idea. And
> when people actually use __builtin_clz*, they state that they don't care
> about the 0 value, so emitting terribly performing code for it just in case
> would be wrong.
> If forwprop emits the conditional in separate blocks for the CTZ_DVAZ != 2
> case, on targets where conditional moves are beneficial for it it can also
> emit them, or emit the jump, which say on x86 will most likely be faster than
> cmov.

Well, GCC emits a cmov for this (-O2 -march=x86-64-v2):

int ctz(long a) { return (a == 0) ? 64 : __builtin_ctzl (a); }

ctz:
        xor     edx, edx
        mov     eax, 64
        rep bsf rdx, rdi
        test    rdi, rdi
        cmovne  eax, edx
        ret

Note the extra 'test' seems redundant, since IIRC bsf sets Z=1 if the input is
zero.

On Zen 2 this has performance identical to the plain builtin when you loop it
as res = ctz (res) + 1; (ie. measuring the latency of the non-zero case). So I
find it hard to believe cmov is expensive on modern cores.