public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug libstdc++/97759] New: Could std::has_single_bit implementation be faster?
@ 2020-11-09  0:01 gcc-bugs at marehr dot dialup.fu-berlin.de
  2020-11-09  5:41 ` [Bug libstdc++/97759] Could std::has_single_bit " tkoenig at gcc dot gnu.org
                   ` (14 more replies)
  0 siblings, 15 replies; 16+ messages in thread
From: gcc-bugs at marehr dot dialup.fu-berlin.de @ 2020-11-09  0:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97759

            Bug ID: 97759
           Summary: Could std::has_single_bit implementation be faster?
           Product: gcc
           Version: 10.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libstdc++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: gcc-bugs at marehr dot dialup.fu-berlin.de
  Target Milestone: ---

Hello gcc-team,

we are thrilled that C++20 offers some efficient bit implementation and that we
could exchange some of our own implementation with the standardized ones,
making the code more accessible.

I replaced our implementation and noticed that `std::has_single_bit` was slower
than what we had before by around 30%. (The other functions matched our
timings.)

Additionally, we have a (micro-)benchmark that compares the standard arithmetic
bit trick
(https://graphics.stanford.edu/~seander/bithacks.html#DetermineIfPowerOf2) with
the implementation where popcount == 1. We decided to use the arithmetic
version, because we measured that it was faster than popcount on our machines
(mostly intel processors).

Interestingly, it seems that the popcount benchmark matches the
std::has_single_bit time-wise, so I guess that std::has_single_bit is
implemented via popcount.

Those timings could be reproduced at an unknown location
https://quick-bench.com/q/Y28keu_mSh25WwhO05T4SKrbHpk

I don't know how to fix this, but I would expect that the optimizer would
recognize popcount=1 and knows that there is a more efficient version. Or
change the implementation to arithmetic, where again the optimizer could decide
to replace that by a popcount if that is more efficient on some architecture?

Thank you!

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug libstdc++/97759] Could std::has_single_bit be faster?
  2020-11-09  0:01 [Bug libstdc++/97759] New: Could std::has_single_bit implementation be faster? gcc-bugs at marehr dot dialup.fu-berlin.de
@ 2020-11-09  5:41 ` tkoenig at gcc dot gnu.org
  2020-11-09  8:16 ` rguenth at gcc dot gnu.org
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2020-11-09  5:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97759

Thomas Koenig <tkoenig at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
           Severity|normal                      |enhancement
                 CC|                            |tkoenig at gcc dot gnu.org

--- Comment #1 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
Could you post the benchmark and the exact architecture where the arithmetic
version is faster?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug libstdc++/97759] Could std::has_single_bit be faster?
  2020-11-09  0:01 [Bug libstdc++/97759] New: Could std::has_single_bit implementation be faster? gcc-bugs at marehr dot dialup.fu-berlin.de
  2020-11-09  5:41 ` [Bug libstdc++/97759] Could std::has_single_bit " tkoenig at gcc dot gnu.org
@ 2020-11-09  8:16 ` rguenth at gcc dot gnu.org
  2020-11-09  9:13 ` crazylht at gmail dot com
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-11-09  8:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97759

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |crazylht at gmail dot com
             Target|                            |x86_64-*-* i?86-*-*

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
I think using popcount () in the standard library is reasonable and instead
GCC should eventually optimize popcount () == 1 during RTL expansion.  An
alternative would be to add an explicit __builtin_is_pow2 ().  Note
matching popcount (x) == 1 || x == 0 for the simple arithmetic trick would
be always good IMHO.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug libstdc++/97759] Could std::has_single_bit be faster?
  2020-11-09  0:01 [Bug libstdc++/97759] New: Could std::has_single_bit implementation be faster? gcc-bugs at marehr dot dialup.fu-berlin.de
  2020-11-09  5:41 ` [Bug libstdc++/97759] Could std::has_single_bit " tkoenig at gcc dot gnu.org
  2020-11-09  8:16 ` rguenth at gcc dot gnu.org
@ 2020-11-09  9:13 ` crazylht at gmail dot com
  2020-11-09 10:44 ` jakub at gcc dot gnu.org
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: crazylht at gmail dot com @ 2020-11-09  9:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97759

--- Comment #3 from Hongtao.liu <crazylht at gmail dot com> ---
for testcase:
---
#include<stdbool.h>

bool
is_power2_popcnt (int a)
{
  return __builtin_popcount (a) == 1;
}

bool
is_power2_arithmetic (int a)
{
  return !(a & (a - 1)) && a;
}
---

gcc -O2 -mavx2 -S got

---
        .file   "test.c"
        .text
        .p2align 4
        .globl  is_power2_popcnt
        .type   is_power2_popcnt, @function
is_power2_popcnt:
.LFB0:
        .cfi_startproc
        popcntl %edi, %edi
        cmpl    $1, %edi
        sete    %al
        ret
        .cfi_endproc
.LFE0:
        .size   is_power2_popcnt, .-is_power2_popcnt
        .p2align 4
        .globl  is_power2_arithmetic
        .type   is_power2_arithmetic, @function
is_power2_arithmetic:
.LFB1:
        .cfi_startproc
        leal    -1(%rdi), %eax
        testl   %edi, %eax
        sete    %al
        testl   %edi, %edi
        setne   %dl
        andl    %edx, %eax
        ret
        .cfi_endproc
---

Latency of popcnt is 3, others is 1.

Can't tell which version is better from static observation.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug libstdc++/97759] Could std::has_single_bit be faster?
  2020-11-09  0:01 [Bug libstdc++/97759] New: Could std::has_single_bit implementation be faster? gcc-bugs at marehr dot dialup.fu-berlin.de
                   ` (2 preceding siblings ...)
  2020-11-09  9:13 ` crazylht at gmail dot com
@ 2020-11-09 10:44 ` jakub at gcc dot gnu.org
  2020-11-09 11:00 ` jakub at gcc dot gnu.org
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: jakub at gcc dot gnu.org @ 2020-11-09 10:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97759

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #4 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Looking at the referenced benchmark, options that would result in popcnt*
instruction aren't there though and in the assembly it calls __popcountdi2.
So, I guess we want at least:
1) for POPCOUNT (x) == 1 use x && (x & (x - 1)) == 0 if optab_handler
(popcount_optab, mode) == CODE_FOR_nothing (a fuzzy case are the double-word
cases where we'd end up with popcount (x >> wordbits) + popcount ((word) x) ==
1
vs. double-word x && (x & (x - 1)) = 0)
2) the case Richi wrote about, optimize POPCOUNT (x) <= 1 or POPCOUNT (x) == 1
|| x == 0 to (x & (x - 1)) == 0 always

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug libstdc++/97759] Could std::has_single_bit be faster?
  2020-11-09  0:01 [Bug libstdc++/97759] New: Could std::has_single_bit implementation be faster? gcc-bugs at marehr dot dialup.fu-berlin.de
                   ` (3 preceding siblings ...)
  2020-11-09 10:44 ` jakub at gcc dot gnu.org
@ 2020-11-09 11:00 ` jakub at gcc dot gnu.org
  2020-11-09 11:13 ` redi at gcc dot gnu.org
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: jakub at gcc dot gnu.org @ 2020-11-09 11:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97759

--- Comment #5 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Unfortunately we don't TER calls, so the expander doesn't see POPCOUNT (x) == 1
or POPCOUNT (x) <= 1
and the isel pass seems to be (at least so far) extremely vector specific;
another option is do it in match.pd or in tree-ssa-math-opts.c

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug libstdc++/97759] Could std::has_single_bit be faster?
  2020-11-09  0:01 [Bug libstdc++/97759] New: Could std::has_single_bit implementation be faster? gcc-bugs at marehr dot dialup.fu-berlin.de
                   ` (4 preceding siblings ...)
  2020-11-09 11:00 ` jakub at gcc dot gnu.org
@ 2020-11-09 11:13 ` redi at gcc dot gnu.org
  2020-11-09 23:28 ` gcc-bugs at marehr dot dialup.fu-berlin.de
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: redi at gcc dot gnu.org @ 2020-11-09 11:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97759

--- Comment #6 from Jonathan Wakely <redi at gcc dot gnu.org> ---
As an aside, libstdc++ does already use the ((x-1) & x) == 0 idiom in
<bits/uniform_int_dist.h> where we are happy for zero to be treated as a power
of two (because we call _Power_of_2(n+1) and we want the result to be true for
n==numeric_limits<decltype(n)>::max()). We could replace _Power_of_2(n+1) with
std::__has_single_bit(n+1) || (n+1)==0 but would need to define
__has_single_bit for C++11 as well as C++14.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug libstdc++/97759] Could std::has_single_bit be faster?
  2020-11-09  0:01 [Bug libstdc++/97759] New: Could std::has_single_bit implementation be faster? gcc-bugs at marehr dot dialup.fu-berlin.de
                   ` (5 preceding siblings ...)
  2020-11-09 11:13 ` redi at gcc dot gnu.org
@ 2020-11-09 23:28 ` gcc-bugs at marehr dot dialup.fu-berlin.de
  2020-11-09 23:28 ` gcc-bugs at marehr dot dialup.fu-berlin.de
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: gcc-bugs at marehr dot dialup.fu-berlin.de @ 2020-11-09 23:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97759

--- Comment #7 from gcc-bugs at marehr dot dialup.fu-berlin.de ---
Created attachment 49530
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49530&action=edit
CMakeLists.txt

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug libstdc++/97759] Could std::has_single_bit be faster?
  2020-11-09  0:01 [Bug libstdc++/97759] New: Could std::has_single_bit implementation be faster? gcc-bugs at marehr dot dialup.fu-berlin.de
                   ` (6 preceding siblings ...)
  2020-11-09 23:28 ` gcc-bugs at marehr dot dialup.fu-berlin.de
@ 2020-11-09 23:28 ` gcc-bugs at marehr dot dialup.fu-berlin.de
  2020-11-09 23:55 ` gcc-bugs at marehr dot dialup.fu-berlin.de
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: gcc-bugs at marehr dot dialup.fu-berlin.de @ 2020-11-09 23:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97759

--- Comment #8 from gcc-bugs at marehr dot dialup.fu-berlin.de ---
Created attachment 49531
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49531&action=edit
has_single_bit_benchmark.cpp

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug libstdc++/97759] Could std::has_single_bit be faster?
  2020-11-09  0:01 [Bug libstdc++/97759] New: Could std::has_single_bit implementation be faster? gcc-bugs at marehr dot dialup.fu-berlin.de
                   ` (7 preceding siblings ...)
  2020-11-09 23:28 ` gcc-bugs at marehr dot dialup.fu-berlin.de
@ 2020-11-09 23:55 ` gcc-bugs at marehr dot dialup.fu-berlin.de
  2020-11-10  0:16 ` gcc-bugs at marehr dot dialup.fu-berlin.de
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: gcc-bugs at marehr dot dialup.fu-berlin.de @ 2020-11-09 23:55 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97759

--- Comment #9 from gcc-bugs at marehr dot dialup.fu-berlin.de ---
Thank you for so many responses

(In reply to Thomas Koenig from comment #1)
> Could you post the benchmark and the exact architecture where the arithmetic
> version is faster?

```
> cat /proc/cpuinfo

rocessor        : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 158
model name      : Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
stepping        : 9
microcode       : 0xd6
cpu MHz         : 3489.761
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb
rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology
nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est
tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt
tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch
cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi
flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms
invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1
xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear
flush_l1d
vmx flags       : vnmi preemption_timer invvpid ept_x_only ept_ad ept_1gb
flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest ple
shadow_vmcs pml ept_mode_based_exec
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds
swapgs taa itlb_multihit srbds
bogomips        : 5802.42
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:
```

I added the sources.

You can execute it by putting it into one folder and execute

```
> cmake -DCMAKE_BUILD_TYPE=Release ./
> make VERBOSE=1

/usr/bin/c++  -I/benchmark/_deps/gbenchmark_fetch_content-src/src/../include
-O3 -DNDEBUG -std=gnu++2a -o
CMakeFiles/has_single_bit_benchmark.dir/has_single_bit_benchmark.cpp.o -c
/benchmark/has_single_bit_benchmark.cpp
```

```
> ./has_single_bit_benchmark --benchmark_repetitions=9 --benchmark_min_time=2

2020-11-10T00:32:52+01:00
Running ./has_single_bit_benchmark
Run on (8 X 3900 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 8192 KiB (x1)
Load Average: 0.57, 0.82, 1.35
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may
be noisy and will incur extra overhead.
---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
has_single_bit_arithmetic_mean         6.45 ns         6.44 ns            9
has_single_bit_arithmetic_median       6.44 ns         6.44 ns            9
has_single_bit_arithmetic_stddev      0.113 ns        0.111 ns            9

has_single_bit_popcount_mean           8.84 ns         8.82 ns            9
has_single_bit_popcount_median         8.84 ns         8.82 ns            9
has_single_bit_popcount_stddev        0.060 ns        0.061 ns            9

has_single_bit_std_mean                9.23 ns         9.21 ns            9
has_single_bit_std_median              9.23 ns         9.20 ns            9
has_single_bit_std_stddev             0.046 ns        0.048 ns            9
```

----

I thought about it and tried the same again with `-march=native` and noticed
that popcount was now (slightly) more efficient. Some more testing showed that
`-mpopcnt` is everything needed to make this test fly.

---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
has_single_bit_arithmetic_mean         6.81 ns         6.81 ns            9
has_single_bit_arithmetic_median       6.81 ns         6.80 ns            9
has_single_bit_arithmetic_stddev      0.201 ns        0.200 ns            9

has_single_bit_popcount_mean           6.47 ns         6.47 ns            9
has_single_bit_popcount_median         6.46 ns         6.46 ns            9
has_single_bit_popcount_stddev        0.043 ns        0.042 ns            9

has_single_bit_std_mean                6.50 ns         6.50 ns            9
has_single_bit_std_median              6.51 ns         6.50 ns            9
has_single_bit_std_stddev             0.031 ns        0.031 ns            9


So the use case would be generic builds like debian binaries.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug libstdc++/97759] Could std::has_single_bit be faster?
  2020-11-09  0:01 [Bug libstdc++/97759] New: Could std::has_single_bit implementation be faster? gcc-bugs at marehr dot dialup.fu-berlin.de
                   ` (8 preceding siblings ...)
  2020-11-09 23:55 ` gcc-bugs at marehr dot dialup.fu-berlin.de
@ 2020-11-10  0:16 ` gcc-bugs at marehr dot dialup.fu-berlin.de
  2020-11-10  2:15 ` crazylht at gmail dot com
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: gcc-bugs at marehr dot dialup.fu-berlin.de @ 2020-11-10  0:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97759

--- Comment #10 from gcc-bugs at marehr dot dialup.fu-berlin.de ---
And maybe a related question:

I know that an arithmetic implementation might auto-vectorize, but would a
popcount implementation do that too?

Since AVX512_BITALG
(https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=popcnt&expand=4350)
retro-introduced popcount for smaller vector lengths, what is with instruction
sets like AVX2 and lower?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug libstdc++/97759] Could std::has_single_bit be faster?
  2020-11-09  0:01 [Bug libstdc++/97759] New: Could std::has_single_bit implementation be faster? gcc-bugs at marehr dot dialup.fu-berlin.de
                   ` (9 preceding siblings ...)
  2020-11-10  0:16 ` gcc-bugs at marehr dot dialup.fu-berlin.de
@ 2020-11-10  2:15 ` crazylht at gmail dot com
  2020-11-10  8:57 ` redi at gcc dot gnu.org
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: crazylht at gmail dot com @ 2020-11-10  2:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97759

--- Comment #11 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to gcc-bugs from comment #10)
> And maybe a related question:
> 
> I know that an arithmetic implementation might auto-vectorize, but would a
> popcount implementation do that too?
> 
> Since AVX512_BITALG
> (https://software.intel.com/sites/landingpage/IntrinsicsGuide/
> #text=popcnt&expand=4350) retro-introduced popcount for smaller vector
> lengths, what is with instruction sets like AVX2 and lower?

I open a bugzilla for it in 97770.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug libstdc++/97759] Could std::has_single_bit be faster?
  2020-11-09  0:01 [Bug libstdc++/97759] New: Could std::has_single_bit implementation be faster? gcc-bugs at marehr dot dialup.fu-berlin.de
                   ` (10 preceding siblings ...)
  2020-11-10  2:15 ` crazylht at gmail dot com
@ 2020-11-10  8:57 ` redi at gcc dot gnu.org
  2021-08-03  5:51 ` pinskia at gcc dot gnu.org
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: redi at gcc dot gnu.org @ 2020-11-10  8:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97759

--- Comment #12 from Jonathan Wakely <redi at gcc dot gnu.org> ---
r11-4843 should have removed the small difference between the std and popcount
benchmarks.

I still see a small advantage to arithmetic for skylake i7-6700 CPU @ 3.40GHz
and i7-8650U CPU @ 1.90GHz though.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug libstdc++/97759] Could std::has_single_bit be faster?
  2020-11-09  0:01 [Bug libstdc++/97759] New: Could std::has_single_bit implementation be faster? gcc-bugs at marehr dot dialup.fu-berlin.de
                   ` (11 preceding siblings ...)
  2020-11-10  8:57 ` redi at gcc dot gnu.org
@ 2021-08-03  5:51 ` pinskia at gcc dot gnu.org
  2022-03-03 11:08 ` peter at cordes dot ca
  2024-03-04 22:56 ` pinskia at gcc dot gnu.org
  14 siblings, 0 replies; 16+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-08-03  5:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97759

--- Comment #13 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
For aarch64, we get:
is_power2_popcnt(int):
        fmov    s0, w0
        cnt     v0.8b, v0.8b
        addv    b0, v0.8b
        fmov    w0, s0
        cmp     w0, 1
        cset    w0, eq
        ret
is_power2_arithmetic(int):
        sub     w1, w0, #1
        tst     w1, w0
        ccmp    w0, 0, 4, eq
        cset    w0, ne
        ret

Obviously is_power2_arithmetic is better.
Note clang uses popcntl for both (on both x86 and arm64) if popcntl can be done
inline and uses test/set/and if popcntl does not exist.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug libstdc++/97759] Could std::has_single_bit be faster?
  2020-11-09  0:01 [Bug libstdc++/97759] New: Could std::has_single_bit implementation be faster? gcc-bugs at marehr dot dialup.fu-berlin.de
                   ` (12 preceding siblings ...)
  2021-08-03  5:51 ` pinskia at gcc dot gnu.org
@ 2022-03-03 11:08 ` peter at cordes dot ca
  2024-03-04 22:56 ` pinskia at gcc dot gnu.org
  14 siblings, 0 replies; 16+ messages in thread
From: peter at cordes dot ca @ 2022-03-03 11:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97759

Peter Cordes <peter at cordes dot ca> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |peter at cordes dot ca

--- Comment #14 from Peter Cordes <peter at cordes dot ca> ---
Agreed with the idea of expanding doing the  popcount(x) == 1  peephole
replacement in the compiler, not forcing the header to figure out whether we
have efficient popcount or not.

If we have BMI2, it's BLSR (Bit Lowest-Set Reset) sets CF=1 if the input was
zero, and ZF=1 if the output is zero.  Unfortunately none of the standard
jcc/setcc conditions check ZF=1 & CF=0, even with CMC to invert CF first.  
(If Intel had designed it to produce ZF and CF inverted, it would be
non-intuitive for all other uses but would have allowed blsr / jcc to implement
if(has_single_bit(x)).)

CF=1, ZF=0  impossible: input was zero, output was non-zero
CF=1, ZF=1  input was zero
CF=0, ZF=0  input had multiple bits set
CF=0, ZF=1  input had a single bit set.

If we're going to branch on it anyway after inlining, a branchy strategy is
probably good:

singlebit_bmi2_branchy:
   xor    %eax, %eax
   blsr   %edi, %edi    #  ZF=1 means we cleared the last bit, or the input was
zero
   jc     .Linput_zero  # input was zero, return 0 regardless of ZF
   setz   %al
 .Linput_zero:
   ret


And when we want a boolean in a register, a combination of setz and cmovc can
materialize one.  With clever choice of registers, we can even avoid giving
setcc a false dependency on a register that isn't already part of its dep chain

singlebit_bmi2_cmov:
   blsr    %edi, %eax
   setz    %al         # false dep, but it's ready if FLAGS are ready because
we wrote it with BLSR
   cmovc   %edi, %eax  # return 1 only if ZF=1 (setz produces 1) and CF=0
(cmovc doesn't overwrite it with the input 0)
   ret

With xor-zeroing first, we could produce the boolean zero-extended to 32-bit,
instead of here where only the low 8 bits are actually 0 / 1.  (Which is fine
for returning a bool in all the mainstream calling conventions)

(This is the same kind of idea as ARM64 sub/tst / ccmp / cset, where ccmp can
conditionally update flags.)

An evil variation on this uses setnz / dec to invert ZF without affecting CF,
allowing JA:

   blsr   %edi,%eax
   setnz  %al             # AL = !ZF
   dec    %al             # 1-1 -> ZF=1,  0-1 -> ZF=0.  ZF=!ZF without
affecting CF
   # seta   %al             # set on CF=0 and ZF=0
   ja     was_single_bit    # only actually useful for branching after inlining

dec/ja can't macro-fuse into a single uop, but on Skylake and later Intel it
doesn't cost any extra partial-FLAGS merging uops, because JA simply has both
parts of FLAGS as separate inputs.  (This is why CMOVA / CMOVBE are still 2
uops on Skylake, unlike all other forms: they need 2 integer inputs and 2 parts
of FLAGS, while others need either CF or SPAZO not both.  Interestingly, Zen1/2
have that effect but not Zen3)

I don't know how AMD handles dec / ja partial-flags shenanigans.  Intel Haswell
would I think have a flags-merging uop; older Intel doesn't support BMI1 so
P6-family is irrelevant.  https://stackoverflow.com/a/49868149/224132

I haven't benchmarked them because they have different use-cases (materializing
a boolean vs. creating a FLAGS condition to branch on, being branchless
itself), so any single benchmark would make one of them look good.  If your
data almost never (or always) has an all-zero input, the JC in the first
version will predict well.  After inlining, if the caller branches on the bool
result, you might want to just branch on both conditions separately.

I don't think this setnz/dec/ja version is ever useful.  Unlike branching
separately on ZF and CF, it's not bad if both 0 and multi-bit inputs are common
while single-bit inputs are rare.  But blsr/setz/cmovc + test/jnz is only 4
uops, same as this on Skylake. (test can macro-fuse with jnz).

The uops are all dependent on each other, so it also has the same latency (to
detect a branch miss) as popcnt / macro-fused cmp/je which is 2 uops.  The only
thing this has going for it is avoiding a port-1-only uop, I think.

It's also possible to blsr / lahf / and  ah, (1<<6) | (1<<0) / cmp  ah, 1<<6   
to directly check that ZF=1 and CF=0.  I doubt that's useful.  Or hmm, can we
branch directly on PF after AND with that 2-bit mask?  CF=1 ZF=0 is impossible,
so the only other odd-parity case is CF=0 ZF=1.  AMD and Intel can macro-fuse
test/jp.

   blsr  %edi, %eax
   lahf
   test  $(1<<6) | (1<<0), %ah        # check ZF and CF.
   jpo   was_single_bit               # ZF != CF means CF=0, ZF=1 because the
other way is impossible.

Also possible of course is the straightforward 2x setcc and AND to materialize
a boolean in the bottom byte of EAX.  Good ILP, only 3 cycle latency from input
to result on Intel, but that's the same as setz/cmovc which is fewer uops and
can avoid false dependencies without destroying the input.

  blsr   %edi, %eax
  setnc  %al
  setz   %dil
  and    %edi, %eax     # or test %dil, %al

----

If BLSR is available, POPCNT is also definitely available, so these need to be
compared against POPCNT / cmp / je (or sete).  With that in mind, BLSR / 2x JCC
is actually worse for the front-end, 3 uops because it can't macro-fuse.  It
has slightly lower latency before a branch miss could be discovered, but it has
2 separate branches instead of just 1.

Potentially we could take on of the JCCs off the fast-path with BLSR/JCC/JCC,
at the expense of branch-prediction throughput when jumping to another branch
on some CPUs.

Low-power cores like Silvermont-family don't do macro-fusion, and Alder Lake
suddenly made Gracemont significantly more relevant for tune=generic than Intel
low-power chips were before.  They have single-uop 3c latency popcnt, and 1 uop
3c latency for BMI1 instructions like blsr.


---

Also note that AMD Zen has 1-cycle latency popcnt as a single uop (down from 4c
in Bulldozer-family), but BLSR is a 2-uop instruction on AMD even including
Zen3.  (Perhaps due to setting CF from the input instead of output?  Or they
just decode it to sub/and uops.)

So with a -mtune that indicates AMD, we should favour popcnt/cmp/setz to
materialize a boolean.  (And write the setz target with popcnt to break *its*
false dependency, because unlike Intel before IceLake, popcnt itself doesn't
have a false output dependency on AMD.)

But with Intel tuning and BMI2 available, we should consider blsr/setz/cmovc if
not branching on the result.  That might be too niche to be worth
adding/maintaining code to look for it; probably most use-cases of single-bit
checks *do* branch on it instead of assigning a boolean result somewhere.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug libstdc++/97759] Could std::has_single_bit be faster?
  2020-11-09  0:01 [Bug libstdc++/97759] New: Could std::has_single_bit implementation be faster? gcc-bugs at marehr dot dialup.fu-berlin.de
                   ` (13 preceding siblings ...)
  2022-03-03 11:08 ` peter at cordes dot ca
@ 2024-03-04 22:56 ` pinskia at gcc dot gnu.org
  14 siblings, 0 replies; 16+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-03-04 22:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97759

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
           Assignee|unassigned at gcc dot gnu.org      |pinskia at gcc dot gnu.org
   Last reconfirmed|                            |2024-03-04
             Status|UNCONFIRMED                 |ASSIGNED

--- Comment #15 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
>popcount (x) == 1 || x == 0

That could be optimized to just `popcount (x) <= 1`.

I am going to look to see what is left in GCC 15.

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2024-03-04 22:56 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-09  0:01 [Bug libstdc++/97759] New: Could std::has_single_bit implementation be faster? gcc-bugs at marehr dot dialup.fu-berlin.de
2020-11-09  5:41 ` [Bug libstdc++/97759] Could std::has_single_bit " tkoenig at gcc dot gnu.org
2020-11-09  8:16 ` rguenth at gcc dot gnu.org
2020-11-09  9:13 ` crazylht at gmail dot com
2020-11-09 10:44 ` jakub at gcc dot gnu.org
2020-11-09 11:00 ` jakub at gcc dot gnu.org
2020-11-09 11:13 ` redi at gcc dot gnu.org
2020-11-09 23:28 ` gcc-bugs at marehr dot dialup.fu-berlin.de
2020-11-09 23:28 ` gcc-bugs at marehr dot dialup.fu-berlin.de
2020-11-09 23:55 ` gcc-bugs at marehr dot dialup.fu-berlin.de
2020-11-10  0:16 ` gcc-bugs at marehr dot dialup.fu-berlin.de
2020-11-10  2:15 ` crazylht at gmail dot com
2020-11-10  8:57 ` redi at gcc dot gnu.org
2021-08-03  5:51 ` pinskia at gcc dot gnu.org
2022-03-03 11:08 ` peter at cordes dot ca
2024-03-04 22:56 ` pinskia at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).