public inbox for gcc-bugs@sourceware.org
* [Bug target/113764] New: [X86] Generates lzcnt when bsr is sufficient
@ 2024-02-05 11:38 chfast at gmail dot com
  2024-02-08  0:35 ` [Bug target/113764] " roger at nextmovesoftware dot com
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: chfast at gmail dot com @ 2024-02-05 11:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113764

            Bug ID: 113764
           Summary: [X86] Generates lzcnt when bsr is sufficient
           Product: gcc
           Version: 13.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: chfast at gmail dot com
  Target Milestone: ---

When the lzcnt instruction is enabled (-mlzcnt), the compiler generates lzcnt for
__builtin_clz() in a context where the bsr instruction is sufficient and
better.

unsigned bsr(unsigned x)
{
    return __builtin_clz(x) ^ 31;
}

bsr:
  xor eax, eax
  lzcnt eax, edi
  xor eax, 31
  ret


Without -mlzcnt the generated code is optimal.

bsr:
  bsr eax, edi
  ret


https://godbolt.org/z/5qcTq18nr


* [Bug target/113764] [X86] Generates lzcnt when bsr is sufficient
  2024-02-05 11:38 [Bug target/113764] New: [X86] Generates lzcnt when bsr is sufficient chfast at gmail dot com
@ 2024-02-08  0:35 ` roger at nextmovesoftware dot com
  2024-02-09 18:35 ` roger at nextmovesoftware dot com
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: roger at nextmovesoftware dot com @ 2024-02-08  0:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113764

Roger Sayle <roger at nextmovesoftware dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |roger at nextmovesoftware dot com
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2024-02-08

--- Comment #1 from Roger Sayle <roger at nextmovesoftware dot com> ---
Confirmed.  This issue has two parts.  The first is that the bsr_1 pattern and
its variants are conditional on !TARGET_LZCNT, so the bsrl instruction isn't
currently available with -mlzcnt.  The second is that the middle-end doesn't
have a preferred canonical RTL representation for this idiom, but all three of
the following equivalent functions should generate identical code:

unsigned bsr1(unsigned x) { return __builtin_clz(x) ^ 31; }
unsigned bsr2(unsigned x) { return 31 - __builtin_clz(x); }
unsigned bsr3(unsigned x) { return ~__builtin_clz(x) & 31; }

[Note that the tree-ssa optimizers do transform bsr3 into bsr1].
A suitable fix would be to add the equivalent clz(x)^31 variant pattern to
i386.md as a "synonymous" define_insn pattern.
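
The claimed equivalence of the three functions can be checked with a minimal
sketch like the one below.  It assumes a 32-bit unsigned and a compiler that
provides __builtin_clz (GCC or Clang); check_equivalence is a hypothetical
helper, not part of the report.  Zero is deliberately not tested, because
__builtin_clz(0) is undefined.

```c
#include <assert.h>

/* The three spellings from the comment; all are undefined for x == 0. */
static unsigned bsr1(unsigned x) { return __builtin_clz(x) ^ 31; }
static unsigned bsr2(unsigned x) { return 31 - __builtin_clz(x); }
static unsigned bsr3(unsigned x) { return ~__builtin_clz(x) & 31; }

/* Hypothetical helper: for every bit position, check two nonzero inputs
   whose highest set bit is 'bit' and require that all three functions
   return that bit index. */
static void check_equivalence(void)
{
    for (unsigned bit = 0; bit < 32; bit++) {
        unsigned lo = 1u << bit;        /* only the top bit set */
        unsigned hi = lo | (lo - 1);    /* top bit and every lower bit set */
        const unsigned inputs[2] = { lo, hi };
        for (int i = 0; i < 2; i++) {
            unsigned x = inputs[i];
            assert(bsr1(x) == bit);
            assert(bsr2(x) == bit);
            assert(bsr3(x) == bit);
        }
    }
}
```

The identity holds because __builtin_clz returns a value in 0..31 for nonzero
input, and for such values c ^ 31, 31 - c and ~c & 31 are all the same number.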


* [Bug target/113764] [X86] Generates lzcnt when bsr is sufficient
  2024-02-05 11:38 [Bug target/113764] New: [X86] Generates lzcnt when bsr is sufficient chfast at gmail dot com
  2024-02-08  0:35 ` [Bug target/113764] " roger at nextmovesoftware dot com
@ 2024-02-09 18:35 ` roger at nextmovesoftware dot com
  2024-02-09 21:58 ` jakub at gcc dot gnu.org
  2024-02-11 11:34 ` [Bug target/113764] [X86] __builtin_clz generates " roger at nextmovesoftware dot com
  3 siblings, 0 replies; 5+ messages in thread
From: roger at nextmovesoftware dot com @ 2024-02-09 18:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113764

--- Comment #2 from Roger Sayle <roger at nextmovesoftware dot com> ---
Investigating further, the thinking behind GCC's current behaviour can be found
in Agner Fog's instruction tables; on many architectures BSR is much slower
than LZCNT.

Legacy AMD:      BSR=4 cycles,  LZCNT=2 cycles
AMD BOBCAT:      BSR=6 cycles,  LZCNT=5 cycles
AMD JAGUAR:      BSR=4 cycles,  LZCNT=1 cycle
AMD ZEN[1-3]:    BSR=4 cycles,  LZCNT=1 cycle
AMD ZEN4:        BSR=1 cycle,   LZCNT=1 cycle
INTEL:           BSR=3 cycles,  LZCNT=3 cycles
KNIGHTS LANDING: BSR=11 cycles, LZCNT=3 cycles

Hence using bsr is only "better" in some (but not all) contexts, and a
reasonable default (for generic tuning) is to ignore BSR when LZCNT is
available, as it's only one extra cycle of latency to perform the XOR.
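
The trade-off can be worked through as simple arithmetic.  The sketch below
copies the latencies from the table above and assumes, as the comment does,
that the trailing XOR costs one cycle; the struct and function names are
hypothetical and only illustrate the comparison.

```c
#include <assert.h>

/* Latencies in cycles, copied from the table above. */
struct latency { const char *cpu; int bsr; int lzcnt; };

static const struct latency table[] = {
    { "Legacy AMD",      4,  2 },
    { "AMD BOBCAT",      6,  5 },
    { "AMD JAGUAR",      4,  1 },
    { "AMD ZEN[1-3]",    4,  1 },
    { "AMD ZEN4",        1,  1 },
    { "INTEL",           3,  3 },
    { "KNIGHTS LANDING", 11, 3 },
};

/* Latency of "lzcnt; xor" minus latency of "bsr".  Assumption: the XOR
   adds exactly one cycle.  A negative result means the lzcnt sequence is
   faster; a positive result means bsr alone would be faster. */
static int lzcnt_xor_penalty(const struct latency *l)
{
    return (l->lzcnt + 1) - l->bsr;
}

/* Spot checks against the table. */
static void check_penalties(void)
{
    assert(lzcnt_xor_penalty(&table[3]) == -2);  /* Zen 1-3: lzcnt wins  */
    assert(lzcnt_xor_penalty(&table[4]) == 1);   /* Zen 4: bsr wins by 1 */
    assert(lzcnt_xor_penalty(&table[5]) == 1);   /* Intel: bsr wins by 1 */
    assert(lzcnt_xor_penalty(&table[6]) == -7);  /* KNL: lzcnt wins      */
}
```

On this model bsr is never more than one cycle ahead, while on the older AMD
parts and Knights Landing it is behind by up to seven cycles.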

The correct solution is to add a tuning parameter to the x86 backend, to
control whether it's beneficial to use BSR when LZCNT is available, for example
when optimizing for size with -Os or -Oz.  This is more reasonable now that
current Intel and AMD architectures have the same latency for BSR and LZCNT,
than when LZCNT first appeared (explaining !TARGET_LZCNT in i386.md).


* [Bug target/113764] [X86] Generates lzcnt when bsr is sufficient
  2024-02-05 11:38 [Bug target/113764] New: [X86] Generates lzcnt when bsr is sufficient chfast at gmail dot com
  2024-02-08  0:35 ` [Bug target/113764] " roger at nextmovesoftware dot com
  2024-02-09 18:35 ` roger at nextmovesoftware dot com
@ 2024-02-09 21:58 ` jakub at gcc dot gnu.org
  2024-02-11 11:34 ` [Bug target/113764] [X86] __builtin_clz generates " roger at nextmovesoftware dot com
  3 siblings, 0 replies; 5+ messages in thread
From: jakub at gcc dot gnu.org @ 2024-02-09 21:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113764

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
It is far more complicated than this.
When TARGET_LZCNT is on, CLZ_DEFINED_VALUE_AT_ZERO is 2, and the GIMPLE
optimizers can already use the fact that it has particular behavior on a zero
argument.
Before my _BitInt changes for clz/ctz etc., there was no way to differentiate
it in GIMPLE except for builtin (which had UB at zero) vs. ifn (which had it
depending on C?Z_DEFINED_VALUE_AT_ZERO).  Now even ifn can be UB at zero
(single argument) or well defined (two).  But still on RTL we have just one
thing, CLZ or CTZ rtxes which honor
C?Z_DEFINED_VALUE_AT_ZERO for the particular mode.
So I think having some lzcnt and some bsr insns, at least within one function,
wouldn't be possible.


* [Bug target/113764] [X86] __builtin_clz generates lzcnt when bsr is sufficient
  2024-02-05 11:38 [Bug target/113764] New: [X86] Generates lzcnt when bsr is sufficient chfast at gmail dot com
                   ` (2 preceding siblings ...)
  2024-02-09 21:58 ` jakub at gcc dot gnu.org
@ 2024-02-11 11:34 ` roger at nextmovesoftware dot com
  3 siblings, 0 replies; 5+ messages in thread
From: roger at nextmovesoftware dot com @ 2024-02-11 11:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113764

Roger Sayle <roger at nextmovesoftware dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|[X86] Generates lzcnt when  |[X86] __builtin_clz
                   |bsr is sufficient           |generates lzcnt when bsr is
                   |                            |sufficient

--- Comment #4 from Roger Sayle <roger at nextmovesoftware dot com> ---
Yep, CLZ_DEFINED_VALUE_AT_ZERO really complicates things.  With a single
"global" macro it's currently impossible for a backend to support two different
CLZ instructions; one with defined behavior at zero, and the other with
undefined behavior at zero.

It might just be possible to do something encoding LZCNT patterns in RTL using:
(if_then_else:SI (ne:SI (reg:SI x) (const_int 0))
                 (clz:SI (reg:SI x))
                 (const_int VALUE))
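
In C terms, the RTL form above corresponds roughly to the sketch below.  It
assumes a 32-bit unsigned and takes 32 for VALUE, which is what LZCNT returns
for a zero 32-bit input; clz_or_value, lzcnt_model and check_model are
hypothetical names used only for illustration.

```c
#include <assert.h>

/* Assumption: LZCNT on a zero 32-bit input yields the operand size, 32. */
#define LZCNT_VALUE_AT_ZERO 32u

/* C analogue of
   (if_then_else (ne x 0) (clz x) (const_int VALUE)):
   the clz is only evaluated when x is nonzero, so it may stay undefined
   at zero while the whole expression is defined everywhere. */
static unsigned clz_or_value(unsigned x, unsigned value)
{
    return x != 0 ? (unsigned)__builtin_clz(x) : value;
}

/* Hypothetical model of LZCNT expressed through the form above. */
static unsigned lzcnt_model(unsigned x)
{
    return clz_or_value(x, LZCNT_VALUE_AT_ZERO);
}

/* Spot checks of the model. */
static void check_model(void)
{
    assert(lzcnt_model(0u) == 32u);
    assert(lzcnt_model(1u) == 31u);
    assert(lzcnt_model(0x80000000u) == 0u);
}
```

With this split, a bare clz keeps its undefined-at-zero meaning and could map
to BSR, while only the conditional form would need to map to LZCNT.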

Additionally on x86_64, the BSR instruction sets the zero flag if its input is
zero, in which case the destination register is undefined; this can be useful
with CMOV, i.e. it's possible to get defined behavior without an additional
test and branch.  But for Pawel's original testcase, __builtin_clz is undefined
at zero, so this really is a missed optimization, with either -Os or a modern
-march such as cascadelake or znver4.

I agree with Jakub, this is a can of worms; potentially a lot of effort for a
marginal improvement.

