* [Bug libstdc++/98226] Slow std::countr_one
2020-12-10 16:51 [Bug libstdc++/98226] New: Slow std::countr_one zaikin.icc at gmail dot com
@ 2020-12-10 17:31 ` redi at gcc dot gnu.org
2020-12-10 17:33 ` redi at gcc dot gnu.org
` (12 subsequent siblings)
13 siblings, 0 replies; 15+ messages in thread
From: redi at gcc dot gnu.org @ 2020-12-10 17:31 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226
--- Comment #1 from Jonathan Wakely <redi at gcc dot gnu.org> ---
This seems like an optimizer bug. There is no way I'm going to repeat the
entire body of countr_zero in countr_one.
^ permalink raw reply [flat|nested] 15+ messages in thread
* [Bug libstdc++/98226] Slow std::countr_one
2020-12-10 16:51 [Bug libstdc++/98226] New: Slow std::countr_one zaikin.icc at gmail dot com
2020-12-10 17:31 ` [Bug libstdc++/98226] " redi at gcc dot gnu.org
@ 2020-12-10 17:33 ` redi at gcc dot gnu.org
2020-12-10 18:08 ` pinskia at gcc dot gnu.org
` (11 subsequent siblings)
13 siblings, 0 replies; 15+ messages in thread
From: redi at gcc dot gnu.org @ 2020-12-10 17:33 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226
--- Comment #2 from Jonathan Wakely <redi at gcc dot gnu.org> ---
Oh, but you didn't enable any optimization at all, so who cares about the
performance?
^ permalink raw reply [flat|nested] 15+ messages in thread
* [Bug libstdc++/98226] Slow std::countr_one
2020-12-10 16:51 [Bug libstdc++/98226] New: Slow std::countr_one zaikin.icc at gmail dot com
2020-12-10 17:31 ` [Bug libstdc++/98226] " redi at gcc dot gnu.org
2020-12-10 17:33 ` redi at gcc dot gnu.org
@ 2020-12-10 18:08 ` pinskia at gcc dot gnu.org
2020-12-10 22:00 ` cvs-commit at gcc dot gnu.org
` (10 subsequent siblings)
13 siblings, 0 replies; 15+ messages in thread
From: pinskia at gcc dot gnu.org @ 2020-12-10 18:08 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226
--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Jonathan Wakely from comment #2)
> Oh, but you didn't enable any optimization at all, so who cares about the
> performance?
I was thinking the same.
^ permalink raw reply [flat|nested] 15+ messages in thread
* [Bug libstdc++/98226] Slow std::countr_one
2020-12-10 16:51 [Bug libstdc++/98226] New: Slow std::countr_one zaikin.icc at gmail dot com
` (2 preceding siblings ...)
2020-12-10 18:08 ` pinskia at gcc dot gnu.org
@ 2020-12-10 22:00 ` cvs-commit at gcc dot gnu.org
2020-12-10 22:08 ` redi at gcc dot gnu.org
` (9 subsequent siblings)
13 siblings, 0 replies; 15+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2020-12-10 22:00 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226
--- Comment #4 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Jonathan Wakely <redi@gcc.gnu.org>:
https://gcc.gnu.org/g:2ea62857a3fbdf091ba38cbb62e98dc76b198e2e
commit r11-5922-g2ea62857a3fbdf091ba38cbb62e98dc76b198e2e
Author: Jonathan Wakely <jwakely@redhat.com>
Date: Thu Dec 10 21:57:42 2020 +0000
libstdc++: Remove redundant branches in countl_one and countr_one [PR
98226]
There's no need to explicitly check for the maximum value, because the
function we call handles it correctly anyway.
libstdc++-v3/ChangeLog:
PR libstdc++/98226
* include/std/bit (__countl_one, __countr_one): Remove redundant
branches.
^ permalink raw reply [flat|nested] 15+ messages in thread
* [Bug libstdc++/98226] Slow std::countr_one
2020-12-10 16:51 [Bug libstdc++/98226] New: Slow std::countr_one zaikin.icc at gmail dot com
` (3 preceding siblings ...)
2020-12-10 22:00 ` cvs-commit at gcc dot gnu.org
@ 2020-12-10 22:08 ` redi at gcc dot gnu.org
2020-12-11 9:17 ` zaikin.icc at gmail dot com
` (8 subsequent siblings)
13 siblings, 0 replies; 15+ messages in thread
From: redi at gcc dot gnu.org @ 2020-12-10 22:08 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226
--- Comment #5 from Jonathan Wakely <redi at gcc dot gnu.org> ---
I've removed some redundant code from them, but not changed the indirection
that this PR complains about. I don't plan to change that.
^ permalink raw reply [flat|nested] 15+ messages in thread
* [Bug libstdc++/98226] Slow std::countr_one
2020-12-10 16:51 [Bug libstdc++/98226] New: Slow std::countr_one zaikin.icc at gmail dot com
` (4 preceding siblings ...)
2020-12-10 22:08 ` redi at gcc dot gnu.org
@ 2020-12-11 9:17 ` zaikin.icc at gmail dot com
2020-12-11 9:18 ` zaikin.icc at gmail dot com
` (7 subsequent siblings)
13 siblings, 0 replies; 15+ messages in thread
From: zaikin.icc at gmail dot com @ 2020-12-11 9:17 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226
--- Comment #6 from Oleg Zaikin <zaikin.icc at gmail dot com> ---
(In reply to Jonathan Wakely from comment #2)
> Oh, but you didn't enable any optimization at all, so who cares about the
> performance?
Let me give the whole picture. The issue is very close to that from
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97759 where a C-style
implementation was replaced by C++20 std::has_single_bit. We have a complex
project that of course is compiled with all proper optimization flags. There we
have a function firstzero(unsigned x) that returns 2^i, with i the position of
the first 0 of x, and 0 iff there is no 0. Its implementation is:
unsigned firstzero(const unsigned x) noexcept {
#if __cplusplus > 201703L
return x == unsigned(-1) ? 0 : unsigned(1) << std::countr_one(x);
#else
const unsigned y = x+1; return (y ^ x) & y;
#endif
}
When we switched from C++17-based g++ to C++20-based g++, the performance of
the whole program decreased by about 7 %. It turned out that the main reason is
the firstzero function.
^ permalink raw reply [flat|nested] 15+ messages in thread
* [Bug libstdc++/98226] Slow std::countr_one
2020-12-10 16:51 [Bug libstdc++/98226] New: Slow std::countr_one zaikin.icc at gmail dot com
` (5 preceding siblings ...)
2020-12-11 9:17 ` zaikin.icc at gmail dot com
@ 2020-12-11 9:18 ` zaikin.icc at gmail dot com
2020-12-11 10:59 ` redi at gcc dot gnu.org
` (6 subsequent siblings)
13 siblings, 0 replies; 15+ messages in thread
From: zaikin.icc at gmail dot com @ 2020-12-11 9:18 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226
--- Comment #7 from Oleg Zaikin <zaikin.icc at gmail dot com> ---
(In reply to Jonathan Wakely from comment #5)
> I've removed some redundant code from them, but not changed the indirection
> that this PR complains about. I don't plan to change that.
Thank you! I've got your point regarding the indirection.
^ permalink raw reply [flat|nested] 15+ messages in thread
* [Bug libstdc++/98226] Slow std::countr_one
2020-12-10 16:51 [Bug libstdc++/98226] New: Slow std::countr_one zaikin.icc at gmail dot com
` (6 preceding siblings ...)
2020-12-11 9:18 ` zaikin.icc at gmail dot com
@ 2020-12-11 10:59 ` redi at gcc dot gnu.org
2020-12-11 11:08 ` redi at gcc dot gnu.org
` (5 subsequent siblings)
13 siblings, 0 replies; 15+ messages in thread
From: redi at gcc dot gnu.org @ 2020-12-11 10:59 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226
--- Comment #8 from Jonathan Wakely <redi at gcc dot gnu.org> ---
(In reply to Oleg Zaikin from comment #6)
> When we switched from C++17-based g++ to C++20-based g++, the performance of
> the whole program decreased by about 7 %. It turned out that the main reason
> is the firstzero function.
But if I create a microbenchmark for this function and replace countr_one(x) by
__builtin_ctz(~x) I don't see any improvement. The problem is not that
countr_one(x) calls countr_zero(~x). The problem is that the two versions of
firstzero are completely different.
If the (y ^ x) & y version is more efficient, just use that.
In other words, this is not a problem with countr_one, it's that you chose to
use countr_one here.
For example, using google benchmark:
#include <benchmark/benchmark.h>
#include <bit>
#include <vector>
#include <iostream>
#include <stdlib.h>
unsigned total = 0;
unsigned firstzero(const unsigned x) noexcept {
const unsigned y = x+1;
return (y ^ x) & y;
}
unsigned firstzero_cxx20(const unsigned x) noexcept {
return x == unsigned(-1) ? 0 : unsigned(1) << std::countr_one(x);
}
unsigned firstzero_ctz(const unsigned x) noexcept {
return x == unsigned(-1) ? 0 : unsigned(1) << __builtin_ctz(~x);
}
static void BM_orig(benchmark::State& state) {
srand(0);
std::vector<unsigned> v(1000000);
for (auto& u : v)
u = rand();
for (auto _ : state)
{
total = 0;
for (auto& u : v)
total += firstzero(u);
}
}
static void BM_cxx20(benchmark::State& state) {
unsigned total;
srand(0);
std::vector<unsigned> v(1000000);
for (auto& u : v)
u = rand();
for (auto _ : state)
{
total = 0;
for (auto& u : v)
total += firstzero_cxx20(u);
}
if (total != ::total) // check for correct result
throw 1;
}
static void BM_ctz(benchmark::State& state) {
unsigned total;
srand(0);
std::vector<unsigned> v(1000000);
for (auto& u : v)
u = rand();
for (auto _ : state)
{
total = 0;
for (auto& u : v)
total += firstzero_ctz(u);
}
if (total != ::total) // check for correct result
throw 1;
}
BENCHMARK(BM_orig);
BENCHMARK(BM_cxx20);
BENCHMARK(BM_ctz);
BENCHMARK_MAIN();
With GCC 10 I get:
-----------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------
BM_orig 179283 ns 178739 ns 4388
BM_cxx20 888982 ns 886885 ns 770
BM_ctz 883696 ns 881765 ns 783
I think you've just chosen a bad algorithm.
However, I do see a regression using GCC 11:
-----------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------
BM_orig 174122 ns 172492 ns 4466
BM_cxx20 1104081 ns 1097202 ns 608
BM_ctz 1119849 ns 1112550 ns 634
That needs to be investigated, but it's a problem with the compiler. It has
nothing to do with countr_one being implemented using countr_zero (as shown by
the same problem being present when calling __builtin_ctz directly).
^ permalink raw reply [flat|nested] 15+ messages in thread
* [Bug libstdc++/98226] Slow std::countr_one
2020-12-10 16:51 [Bug libstdc++/98226] New: Slow std::countr_one zaikin.icc at gmail dot com
` (7 preceding siblings ...)
2020-12-11 10:59 ` redi at gcc dot gnu.org
@ 2020-12-11 11:08 ` redi at gcc dot gnu.org
2020-12-11 11:28 ` amonakov at gcc dot gnu.org
` (4 subsequent siblings)
13 siblings, 0 replies; 15+ messages in thread
From: redi at gcc dot gnu.org @ 2020-12-11 11:08 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226
--- Comment #9 from Jonathan Wakely <redi at gcc dot gnu.org> ---
The generated code hasn't changed between gcc-10 and gcc-11 though, so the
difference must be in the code used to run the benchmarks, not the code under
test.
See https://godbolt.org/z/bWeMen
^ permalink raw reply [flat|nested] 15+ messages in thread
* [Bug libstdc++/98226] Slow std::countr_one
2020-12-10 16:51 [Bug libstdc++/98226] New: Slow std::countr_one zaikin.icc at gmail dot com
` (8 preceding siblings ...)
2020-12-11 11:08 ` redi at gcc dot gnu.org
@ 2020-12-11 11:28 ` amonakov at gcc dot gnu.org
2020-12-11 12:16 ` zaikin.icc at gmail dot com
` (3 subsequent siblings)
13 siblings, 0 replies; 15+ messages in thread
From: amonakov at gcc dot gnu.org @ 2020-12-11 11:28 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226
Alexander Monakov <amonakov at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |amonakov at gcc dot gnu.org
--- Comment #10 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
(In reply to Oleg Zaikin from comment #6)
> There
> we have a function firstzero(unsigned x) that returns 2^i, with i the
> position of the first 0 of x, and 0 iff there is no 0. Its implementation is:
> unsigned firstzero(const unsigned x) noexcept {
> #if __cplusplus > 201703L
> return x == unsigned(-1) ? 0 : unsigned(1) << std::countr_one(x);
> #else
> const unsigned y = x+1; return (y ^ x) & y;
> #endif
> }
But why you are trying to use a more complex branchy expression in C++17 mode
when you already have a more efficient expression as a "fallback"?
Note that a cheaper way is available:
return (x+1) & ~x;
(though gcc can optimize '(y ^ x) & y' you have to the same machine code)
^ permalink raw reply [flat|nested] 15+ messages in thread
* [Bug libstdc++/98226] Slow std::countr_one
2020-12-10 16:51 [Bug libstdc++/98226] New: Slow std::countr_one zaikin.icc at gmail dot com
` (9 preceding siblings ...)
2020-12-11 11:28 ` amonakov at gcc dot gnu.org
@ 2020-12-11 12:16 ` zaikin.icc at gmail dot com
2020-12-11 12:21 ` zaikin.icc at gmail dot com
` (2 subsequent siblings)
13 siblings, 0 replies; 15+ messages in thread
From: zaikin.icc at gmail dot com @ 2020-12-11 12:16 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226
--- Comment #11 from Oleg Zaikin <zaikin.icc at gmail dot com> ---
(In reply to Jonathan Wakely from comment #8)
> That needs to be investigated, but it's a problem with the compiler. It has
> nothing to do with countr_one being implemented using countr_zero (as shown
> by the same problem being present when calling __builtin_ctz directly).
Thank you for the detailed clarification!
^ permalink raw reply [flat|nested] 15+ messages in thread
* [Bug libstdc++/98226] Slow std::countr_one
2020-12-10 16:51 [Bug libstdc++/98226] New: Slow std::countr_one zaikin.icc at gmail dot com
` (10 preceding siblings ...)
2020-12-11 12:16 ` zaikin.icc at gmail dot com
@ 2020-12-11 12:21 ` zaikin.icc at gmail dot com
2020-12-11 12:33 ` zaikin.icc at gmail dot com
2021-03-29 20:01 ` cvs-commit at gcc dot gnu.org
13 siblings, 0 replies; 15+ messages in thread
From: zaikin.icc at gmail dot com @ 2020-12-11 12:21 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226
--- Comment #12 from Oleg Zaikin <zaikin.icc at gmail dot com> ---
(In reply to Alexander Monakov from comment #10)
> But why you are trying to use a more complex branchy expression in C++17
> mode when you already have a more efficient expression as a "fallback"?
>
> Note that a cheaper way is available:
>
> return (x+1) & ~x;
>
> (though gcc can optimize '(y ^ x) & y' you have to the same machine code)
Thank you! We will try it.
^ permalink raw reply [flat|nested] 15+ messages in thread
* [Bug libstdc++/98226] Slow std::countr_one
2020-12-10 16:51 [Bug libstdc++/98226] New: Slow std::countr_one zaikin.icc at gmail dot com
` (11 preceding siblings ...)
2020-12-11 12:21 ` zaikin.icc at gmail dot com
@ 2020-12-11 12:33 ` zaikin.icc at gmail dot com
2021-03-29 20:01 ` cvs-commit at gcc dot gnu.org
13 siblings, 0 replies; 15+ messages in thread
From: zaikin.icc at gmail dot com @ 2020-12-11 12:33 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226
Oleg Zaikin <zaikin.icc at gmail dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |RESOLVED
Resolution|--- |INVALID
--- Comment #13 from Oleg Zaikin <zaikin.icc at gmail dot com> ---
countr_one() should be used where it suits better.
^ permalink raw reply [flat|nested] 15+ messages in thread
* [Bug libstdc++/98226] Slow std::countr_one
2020-12-10 16:51 [Bug libstdc++/98226] New: Slow std::countr_one zaikin.icc at gmail dot com
` (12 preceding siblings ...)
2020-12-11 12:33 ` zaikin.icc at gmail dot com
@ 2021-03-29 20:01 ` cvs-commit at gcc dot gnu.org
13 siblings, 0 replies; 15+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2021-03-29 20:01 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226
--- Comment #14 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The releases/gcc-10 branch has been updated by Jonathan Wakely
<redi@gcc.gnu.org>:
https://gcc.gnu.org/g:d7216ea6c0cd6c4fef06e9501bd630c3161b14fd
commit r10-9576-gd7216ea6c0cd6c4fef06e9501bd630c3161b14fd
Author: Jonathan Wakely <jwakely@redhat.com>
Date: Thu Dec 10 21:57:42 2020 +0000
libstdc++: Remove redundant branches in countl_one and countr_one [PR
98226]
There's no need to explicitly check for the maximum value, because the
function we call handles it correctly anyway.
libstdc++-v3/ChangeLog:
PR libstdc++/98226
* include/std/bit (__countl_one, __countr_one): Remove redundant
branches.
(cherry picked from commit 2ea62857a3fbdf091ba38cbb62e98dc76b198e2e)
^ permalink raw reply [flat|nested] 15+ messages in thread