On 06/10/20 00:25 +0100, Jonathan Wakely wrote:
>I'm sorry it's taken a year to review this properly. Comments below ...
>
>On 27/09/19 14:18 -0400, Daniel Lemire wrote:
>>(This is a revised patch proposal. I am revising both the description
>>and the code itself.)
>>
>>Even on recent processors, integer division is relatively expensive.
>>The current implementation of std::uniform_int_distribution typically
>>requires two divisions per invocation:
>>
>>    // downscaling
>>    const __uctype __uerange = __urange + 1; // __urange can be zero
>>    const __uctype __scaling = __urngrange / __uerange;
>>    const __uctype __past = __uerange * __scaling;
>>    do
>>      __ret = __uctype(__urng()) - __urngmin;
>>    while (__ret >= __past);
>>    __ret /= __scaling;
>>
>>We can achieve the same algorithmic result with at most one division,
>>and typically no division at all, without requiring more calls to the
>>random number generator. This was recently added to Swift
>>(https://github.com/apple/swift/pull/25286).
>>
>>The main challenge is that we need to be able to compute the "full"
>>product. E.g., given two 64-bit integers, we want the 128-bit result;
>>given two 32-bit integers we want the 64-bit result. This is fast on
>>common processors. The 128-bit product is not natively supported in
>>C/C++ but can be achieved with the __int128 extension when it is
>>available. The patch checks for __int128 support; when support is
>>lacking, we fall back on the existing approach which uses two
>>divisions per call.
>>
>>For example, if we replace the above code by the following, we get a
>>substantial performance boost on Skylake microarchitectures. E.g., it
>>can be twice as fast to shuffle arrays of 1 million elements (e.g.,
>>using the following benchmark:
>>https://github.com/lemire/simple_cpp_shuffle_benchmark )
>>
>>    unsigned __int128 __product
>>      = (unsigned __int128)(__uctype(__urng()) - __urngmin) * uint64_t(__uerange);
>>    uint64_t __lsb = uint64_t(__product);
>>    if (__lsb < __uerange)
>>      {
>>        uint64_t __threshold = -uint64_t(__uerange) % uint64_t(__uerange);
>>        while (__lsb < __threshold)
>>          {
>>            __product = (unsigned __int128)(__uctype(__urng()) - __urngmin)
>>                          * (unsigned __int128)(__uerange);
>>            __lsb = uint64_t(__product);
>>          }
>>      }
>>    __ret = __product >> 64;
>>
>>Included is a patch that would bring better performance (e.g., a 2x
>>gain) to std::uniform_int_distribution in some cases. Here are some
>>actual numbers...
>>
>>With this patch:
>>
>>std::shuffle(testvalues, testvalues + size, g) : 7952091 ns total,
>>7.95 ns per input key
>>
>>Before this patch:
>>
>>std::shuffle(testvalues, testvalues + size, g) : 14954058 ns total,
>>14.95 ns per input key
>>
>>Compiler: GNU GCC 8.3 with -O3, hardware: Skylake (i7-6700).
>>
>>Furthermore, the new algorithm is unbiased, so the randomness of the
>>result is not affected.
>>
>>I ran both performance and bias tests with the proposed patch.
>>
>>This patch proposal was improved following feedback by Jonathan
>>Wakely. An earlier version used the __uint128_t type, which is widely
>>supported but not used in the C++ library; instead we now use
>>unsigned __int128. Furthermore, the previous patch was accidentally
>>broken: it was not computing the full product since a cast was
>>missing on the right-hand side. These issues are fixed and verified.
>
>After looking at GCC's internals, it looks like __uint128_t is
>actually fine to use, even though we never currently use it in the
>library. I didn't even know it was supported for C++ mode, sorry!
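[In standalone form, the 64-bit case described above amounts to something
like the following sketch. The function name and the choice of
std::mt19937_64 as the URBG are illustrative, not from the patch, and
unsigned __int128 is assumed to be available (GCC/Clang on 64-bit
targets):

    #include <cstdint>
    #include <random>

    // Unbiased integer in [0, range), for range >= 1, using at most one
    // % per call. Assumes gen() is uniform over all of uint64_t.
    std::uint64_t
    nearly_divisionless_u64(std::mt19937_64& gen, std::uint64_t range)
    {
      unsigned __int128 product = (unsigned __int128)gen() * range;
      std::uint64_t low = (std::uint64_t)product;    // low 64 bits
      if (low < range)                               // only then can bias occur
        {
          std::uint64_t threshold = -range % range;  // 2^64 mod range
          while (low < threshold)                    // rejection loop, rarely taken
            {
              product = (unsigned __int128)gen() * range;
              low = (std::uint64_t)product;
            }
        }
      return (std::uint64_t)(product >> 64);         // high 64 bits: the sample
    }

The key design point is that the expensive % is only reached when the low
half of the product falls below the range, which happens with probability
range/2^64.]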
>
>>Reference: Fast Random Integer Generation in an Interval, ACM
>>Transactions on Modeling and Computer Simulation 29 (1), 2019
>>https://arxiv.org/abs/1805.10941
>
>>Index: libstdc++-v3/include/bits/uniform_int_dist.h
>>===================================================================
>>--- libstdc++-v3/include/bits/uniform_int_dist.h	(revision 276183)
>>+++ libstdc++-v3/include/bits/uniform_int_dist.h	(working copy)
>>@@ -33,7 +33,8 @@
>>
>> #include <type_traits>
>> #include <limits>
>>-
>>+#include <cstdint>
>>+#include <cstdio>
>> namespace std _GLIBCXX_VISIBILITY(default)
>> {
>> _GLIBCXX_BEGIN_NAMESPACE_VERSION
>>@@ -239,18 +240,61 @@
>>	  = __uctype(__param.b()) - __uctype(__param.a());
>>
>>	__uctype __ret;
>>-
>>-	if (__urngrange > __urange)
>>+	if (__urngrange > __urange)
>>	  {
>>-	    // downscaling
>>-	    const __uctype __uerange = __urange + 1; // __urange can be zero
>>-	    const __uctype __scaling = __urngrange / __uerange;
>>-	    const __uctype __past = __uerange * __scaling;
>>-	    do
>>-	      __ret = __uctype(__urng()) - __urngmin;
>>-	    while (__ret >= __past);
>>-	    __ret /= __scaling;
>>-	  }
>>+	    const __uctype __uerange = __urange + 1; // __urange can be zero
>>+#if _GLIBCXX_USE_INT128 == 1
>>+	    if (sizeof(__uctype) == sizeof(uint64_t) and
>>+	        (__urngrange == numeric_limits<uint64_t>::max()))
>>+	      {
>>+	        // 64-bit case
>>+	        // reference: Fast Random Integer Generation in an Interval
>>+	        // ACM Transactions on Modeling and Computer Simulation 29 (1), 2019
>>+	        // https://arxiv.org/abs/1805.10941
>>+	        unsigned __int128 __product = (unsigned __int128)(__uctype(__urng()) - __urngmin) * uint64_t(__uerange);
>
>Is subtracting __urngmin necessary here?
>
>The condition above checks that __urngrange == 2**64-1 which means
>that U::max() - U::min() is the maximum 64-bit value, which means
>U::max()==2**64-1 and U::min()==0. So if U::min() is 0 we don't
>need to subtract it.
>
>Also, I think the casts to uint64_t are unnecessary. We know that
>__uctype is an unsigned integral type, and we've checked that it has
>exactly 64 bits, so I think we can just use __uctype. It's got the
>same width and signedness as uint64_t anyway.
>
>That said, the uint64_t(__uerange) above isn't redundant, because it
>should be (unsigned __int128)__uerange, I think.
>
>i.e.
>
>  unsigned __int128 __product
>    = (unsigned __int128)__urng() * (unsigned __int128)__uerange;
>
>
>>+	        uint64_t __lsb = uint64_t(__product);
>>+	        if (__lsb < __uerange)
>>+	          {
>>+	            uint64_t __threshold = -uint64_t(__uerange) % uint64_t(__uerange);
>>+	            while (__lsb < __threshold)
>>+	              {
>>+	                __product = (unsigned __int128)(__uctype(__urng()) - __urngmin) * (unsigned __int128)(__uerange);
>>+	                __lsb = uint64_t(__product);
>>+	              }
>>+	          }
>>+	        __ret = __product >> 64;
>>+	      }
>>+	    else
>>+#endif // _GLIBCXX_USE_INT128 == 1
>>+	    if (sizeof(__uctype) == sizeof(uint32_t)
>>+	        and (__urngrange == numeric_limits<uint32_t>::max()))
>>+	      {
>>+	        // 32-bit case
>>+	        // reference: Fast Random Integer Generation in an Interval
>>+	        // ACM Transactions on Modeling and Computer Simulation 29 (1), 2019
>>+	        // https://arxiv.org/abs/1805.10941
>>+	        uint64_t __product = uint64_t(__uctype(__urng()) - __urngmin) * uint64_t(__uerange);
>>+	        uint32_t __lsb = uint32_t(__product);
>>+	        if (__lsb < __uerange) {
>>+	          uint64_t __threshold = -uint32_t(__uerange) % uint32_t(__uerange);
>
>This __threshold should be uint32_t, right?
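[The point can be checked numerically: -r % r computed in uint32_t
arithmetic equals 2^32 mod r, because -r wraps to 2^32 - r, which is
congruent to 2^32 modulo r; the result is always less than r and so fits
in 32 bits. A throwaway check, not part of the thread:

    #include <cassert>
    #include <cstdint>
    #include <initializer_list>

    int main()
    {
      for (std::uint32_t r : {1u, 2u, 3u, 1000u, 0x80000000u, 0xffffffffu})
        {
          std::uint32_t threshold = -r % r;                      // 32-bit arithmetic
          std::uint64_t expected = (std::uint64_t{1} << 32) % r; // 2^32 mod r
          assert(threshold == expected);                         // uint32_t suffices
        }
    }
]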
>
>>+	          while (__lsb < __threshold) {
>>+	            __product = uint64_t(__uctype(__urng()) - __urngmin) * uint64_t(__uerange);
>>+	            __lsb = uint32_t(__product);
>>+	          }
>>+	        }
>>+	        __ret = __product >> 32;
>>+	      }
>>+	    else
>>+	      {
>>+	        // fallback case (2 divisions)
>>+	        const __uctype __scaling = __urngrange / __uerange;
>>+	        const __uctype __past = __uerange * __scaling;
>>+	        do
>>+	          __ret = __uctype(__urng()) - __urngmin;
>>+	        while (__ret >= __past);
>>+	        __ret /= __scaling;
>>+	      }
>>+	  }
>>	else if (__urngrange < __urange)
>>	  {
>>	    // upscaling
>
>I've attached a revised patch which hoists the clever bit into a
>separate function template so we don't need to repeat the code. I
>think it's still right and I haven't broken your logic.
>
>Does it look OK to you?
>
>
>commit b6e2027df22d9f3837b906c5b3de5cfe1fc5f0d7
>Author: Daniel Lemire
>Date:   Tue Oct 6 00:05:52 2020
>
>    libstdc++: Optimize uniform_int_distribution using Lemire's algorithm
>
>    Co-authored-by: Jonathan Wakely
>
>    libstdc++-v3/ChangeLog:
>
>        * include/bits/uniform_int_dist.h (uniform_int_distribution::_S_nd):
>        New member function implementing Lemire's "nearly divisionless"
>        algorithm.
>        (uniform_int_distribution::operator()): Use _S_nd when the range
>        of the URBG is the full width of the result type.
>
>diff --git a/libstdc++-v3/include/bits/uniform_int_dist.h b/libstdc++-v3/include/bits/uniform_int_dist.h
>index 6e1e3d5fc5f..9c6927e6923 100644
>--- a/libstdc++-v3/include/bits/uniform_int_dist.h
>+++ b/libstdc++-v3/include/bits/uniform_int_dist.h
>@@ -234,6 +234,33 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>	  const param_type& __p);
>
>	param_type _M_param;
>+
>+	// Lemire's nearly divisionless algorithm
>+	template<typename _Wp, typename _Urbg, typename _Up>
>+	  static _Up
>+	  _S_nd(_Urbg& __g, _Up __range)
>+	  {
>+	    using __gnu_cxx::__int_traits;
>+	    constexpr int __digits = __int_traits<_Up>::__digits;
>+	    static_assert(__int_traits<_Wp>::__digits == (2 * __digits),
>+			  "_Wp must be twice as wide as _Up");

Gah, this static_assert needs to be removed (the function is
instantiated for arguments that cause it to fail, even though that
instantiation isn't called). I hadn't committed that fix to git before
generating the patch though. Correct patch attached.
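[For reference, a standalone sketch of what the hoisted helper does, with
the static_assert dropped as described above and std::numeric_limits
standing in for the internal __gnu_cxx::__int_traits; the names outside
the quoted patch are illustrative:

    #include <cstdint>
    #include <limits>
    #include <random>

    // Wp must be an unsigned type exactly twice as wide as the unsigned
    // type Up, and gen() must cover the full range of Up.
    template<typename Wp, typename Urbg, typename Up>
    Up
    nearly_divisionless(Urbg& gen, Up range)   // unbiased value in [0, range)
    {
      constexpr int digits = std::numeric_limits<Up>::digits;
      Wp product = Wp(gen()) * Wp(range);      // full double-width product
      Up low = Up(product);                    // low half drives rejection
      if (low < range)                         // bias possible: check threshold
        {
          Up threshold = -range % range;       // 2^digits mod range
          while (low < threshold)              // reject and redraw
            {
              product = Wp(gen()) * Wp(range);
              low = Up(product);
            }
        }
      return Up(product >> digits);            // high half is the sample
    }

    // e.g. std::mt19937_64 g;
    //      auto roll = nearly_divisionless<unsigned __int128>(g, std::uint64_t(6)) + 1;

Instantiating it as nearly_divisionless<unsigned __int128>(g, uint64_t)
or nearly_divisionless<uint64_t>(g, uint32_t) corresponds to the 64-bit
and 32-bit branches of the original patch.]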