From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: by sourceware.org (Postfix, from userid 48) id 87F97385DC14; Fri, 11 Dec 2020 10:59:07 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 87F97385DC14 From: "redi at gcc dot gnu.org" To: gcc-bugs@gcc.gnu.org Subject: [Bug libstdc++/98226] Slow std::countr_one Date: Fri, 11 Dec 2020 10:59:07 +0000 X-Bugzilla-Reason: CC X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: gcc X-Bugzilla-Component: libstdc++ X-Bugzilla-Version: 10.2.0 X-Bugzilla-Keywords: X-Bugzilla-Severity: normal X-Bugzilla-Who: redi at gcc dot gnu.org X-Bugzilla-Status: UNCONFIRMED X-Bugzilla-Resolution: X-Bugzilla-Priority: P3 X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org X-Bugzilla-Target-Milestone: --- X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-BeenThere: gcc-bugs@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-bugs mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 11 Dec 2020 10:59:07 -0000 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=3D98226 --- Comment #8 from Jonathan Wakely --- (In reply to Oleg Zaikin from comment #6) > When we switched from C++17-based g++ to C++20-based g++, the performance= of > the whole program decreased by about 7 %. It turned out that the main rea= son > is the firstzero function. But if I create a microbenchmark for this function and replace countr_one(x= ) by __builtin_ctz(~x) I don't see any improvement. The problem is not that countr_one(x) calls countr_zero(~x). The problem is that the two versions of firstzero are completely different. If the (y ^ x) & y version is more efficient, just use that. In other words, this is not a problem with countr_one, it's that you chose = to use countr_one here. For example, using google benchmark: #include #include #include #include #include unsigned total =3D 0; unsigned firstzero(const unsigned x) noexcept { const unsigned y =3D x+1; return (y ^ x) & y; } unsigned firstzero_cxx20(const unsigned x) noexcept { return x =3D=3D unsigned(-1) ? 0 : unsigned(1) << std::countr_one(x); } unsigned firstzero_ctz(const unsigned x) noexcept { return x =3D=3D unsigned(-1) ? 0 : unsigned(1) << __builtin_ctz(~x); } static void BM_orig(benchmark::State& state) { srand(0); std::vector v(1000000); for (auto& u : v) u =3D rand(); for (auto _ : state) { total =3D 0; for (auto& u : v) total +=3D firstzero(u); } } static void BM_cxx20(benchmark::State& state) { unsigned total; srand(0); std::vector v(1000000); for (auto& u : v) u =3D rand(); for (auto _ : state) { total =3D 0; for (auto& u : v) total +=3D firstzero_cxx20(u); } if (total !=3D ::total) // check for correct result throw 1; } static void BM_ctz(benchmark::State& state) { unsigned total; srand(0); std::vector v(1000000); for (auto& u : v) u =3D rand(); for (auto _ : state) { total =3D 0; for (auto& u : v) total +=3D firstzero_ctz(u); } if (total !=3D ::total) // check for correct result throw 1; } BENCHMARK(BM_orig); BENCHMARK(BM_cxx20); BENCHMARK(BM_ctz); BENCHMARK_MAIN(); With GCC 10 I get: ----------------------------------------------------- Benchmark Time CPU Iterations ----------------------------------------------------- BM_orig 179283 ns 178739 ns 4388 BM_cxx20 888982 ns 886885 ns 770 BM_ctz 883696 ns 881765 ns 783 I think you've just chosen a bad algorithm. However, I do see a regression using GCC 11: ----------------------------------------------------- Benchmark Time CPU Iterations ----------------------------------------------------- BM_orig 174122 ns 172492 ns 4466 BM_cxx20 1104081 ns 1097202 ns 608 BM_ctz 1119849 ns 1112550 ns 634 That needs to be investigated, but it's a problem with the compiler. It has nothing to do with countr_one being implemented using countr_zero (as shown= by the same problem being present when calling __builtin_ctz directly).=