public inbox for gcc-patches@gcc.gnu.org
* [PATCH, libstdc++] Improve the performance of std::uniform_int_distribution (fewer divisions)
@ 2019-09-08 16:10 Daniel Lemire
  2019-09-27 18:19 ` Daniel Lemire
  0 siblings, 1 reply; 9+ messages in thread
From: Daniel Lemire @ 2019-09-08 16:10 UTC (permalink / raw)
  To: libstdc++, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 2364 bytes --]

Even on recent processors, integer division is relatively expensive.
The current implementation of std::uniform_int_distribution typically
requires two divisions per invocation:

        // downscaling
        const __uctype __uerange = __urange + 1; // __urange can be zero
        const __uctype __scaling = __urngrange / __uerange;
        const __uctype __past = __uerange * __scaling;
        do
          __ret = __uctype(__urng()) - __urngmin;
        while (__ret >= __past);
        __ret /= __scaling;

We can achieve the same algorithmic result with at most one division,
and typically no division at all, without requiring more calls to the
random number generator. A similar approach was recently added to Swift
(https://github.com/apple/swift/pull/25286).

The main challenge is that we need to be able to compute the "full"
product. E.g., given two 64-bit integers, we want the 128-bit result;
given two 32-bit integers, we want the 64-bit result. This is fast on
common processors. The 128-bit product is not natively supported in
C/C++ but can be achieved with the __uint128_t extension, which is
widely supported by GCC. The patch checks for __uint128_t support;
when support is lacking, we fall back on the existing approach.
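
For illustration, here is a minimal self-contained sketch of the
full-width multiply (assuming a compiler that provides __uint128_t;
the helper name is only for exposition):

      #include <cstdint>

      // Multiply a random 64-bit word by the range and keep the high
      // 64 bits: this maps the word into [0, range). A rejection step
      // (shown below) is still needed to make the mapping unbiased.
      uint64_t downscale(uint64_t word, uint64_t range)
      {
        __uint128_t product = (__uint128_t)word * range;
        return (uint64_t)(product >> 64);
      }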

For example, if we replace the above code by the following, we get a
substantial performance boost on Skylake microarchitectures. E.g., it
can be twice as fast to shuffle arrays of 1 million elements (e.g.,
using the following benchmark:
https://github.com/lemire/simple_cpp_shuffle_benchmark )


      const __uctype __uerange = __urange + 1; // __urange can be zero
      uint64_t __product = (__uctype(__urng()) - __urngmin) * __uerange;
      uint32_t __lsb = uint32_t(__product);
      if(__lsb < __uerange) {
        uint64_t __threshold = -__uerange % __uerange;
        while (__lsb < __threshold) {
          __product = (__uctype(__urng()) - __urngmin) * __uerange;
          __lsb = uint32_t(__product);
        }
      }
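
The expression -__uerange % __uerange may look odd. In N-bit unsigned
arithmetic, -x wraps around to 2^N - x, so -x % x equals 2^N mod x:
exactly the number of raw words that must be rejected so that every
outcome in [0, __uerange) is produced by the same number of words.
This division is also only reached when __lsb < __uerange, which
happens with probability __uerange / 2^N; hence "typically no
division at all". A quick sanity check (illustration only, not part
of the patch):

      #include <cassert>
      #include <cstdint>

      int main()
      {
        uint64_t x = 10;
        // -x wraps to 2^64 - 10, and (2^64 - 10) % 10 == 2^64 % 10 == 6
        assert(-x % x == 6);
        return 0;
      }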

Included is a patch that would bring better performance (e.g., a 2x gain)
to std::uniform_int_distribution in some cases.

This patch proposal was previously submitted solely to the libstdc++ list
and improved following feedback from Jonathan Wakely.

Reference: Fast Random Integer Generation in an Interval, ACM Transactions on
Modeling and Computer Simulation 29 (1), 2019 https://arxiv.org/abs/1805.10941

[-- Attachment #2: patch_uniform_int_dist.txt --]
[-- Type: text/plain, Size: 2804 bytes --]

Index: trunk/libstdc++-v3/include/bits/uniform_int_dist.h
===================================================================
--- trunk/libstdc++-v3/include/bits/uniform_int_dist.h	(revision 275324)
+++ trunk/libstdc++-v3/include/bits/uniform_int_dist.h	(working copy)
@@ -33,6 +33,7 @@

 #include <type_traits>
 #include <limits>
+#include <cstdint>

 namespace std _GLIBCXX_VISIBILITY(default)
 {
@@ -242,15 +243,59 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION

 	if (__urngrange > __urange)
 	  {
-	    // downscaling
-	    const __uctype __uerange = __urange + 1; // __urange can be zero
-	    const __uctype __scaling = __urngrange / __uerange;
-	    const __uctype __past = __uerange * __scaling;
-	    do
-	      __ret = __uctype(__urng()) - __urngmin;
-	    while (__ret >= __past);
-	    __ret /= __scaling;
-	  }
+		const __uctype __uerange = __urange + 1; // __urange can be zero
+#if _GLIBCXX_USE_INT128 == 1
+    if(sizeof(__uctype) == sizeof(uint64_t) and
+      (__urngrange == numeric_limits<uint64_t>::max()))
+    {
+      // 64-bit case
+      // reference: Fast Random Integer Generation in an Interval
+      // ACM Transactions on Modeling and Computer Simulation 29 (1), 2019
+      // https://arxiv.org/abs/1805.10941
+      __uint128_t __product = (__uctype(__urng()) - __urngmin) * __uerange;
+      uint64_t __lsb = uint64_t(__product);
+      if(__lsb < __uerange)
+      {
+        uint64_t __threshold = -__uerange % __uerange;
+        while (__lsb < __threshold)
+        {
+          __product = (__uctype(__urng()) - __urngmin) * __uerange;
+          __lsb = uint64_t(__product);
+        }
+      }
+      __ret = __product >> 64;
+    }
+    else
+#endif // _GLIBCXX_USE_INT128 == 1
+    if(sizeof(__uctype) == sizeof(uint32_t)
+      and (__urngrange == numeric_limits<uint32_t>::max()) )
+    {
+      // 32-bit case
+      // reference: Fast Random Integer Generation in an Interval
+      // ACM Transactions on Modeling and Computer Simulation 29 (1), 2019
+      // https://arxiv.org/abs/1805.10941
+      uint64_t __product = (__uctype(__urng()) - __urngmin) * __uerange;
+      uint32_t __lsb = uint32_t(__product);
+      if(__lsb < __uerange) {
+        uint64_t __threshold = -__uerange % __uerange;
+        while (__lsb < __threshold) {
+          __product = (__uctype(__urng()) - __urngmin) * __uerange;
+          __lsb = uint32_t(__product);
+        }
+      }
+      __ret = __product >> 32;
+    }
+    else
+    {
+      // fallback case (2 divisions)
+      const __uctype __scaling = __urngrange / __uerange;
+      const __uctype __past = __uerange * __scaling;
+      do
+        __ret = __uctype(__urng()) - __urngmin;
+      while (__ret >= __past);
+      __ret /= __scaling;
+    }
+  }
 	else if (__urngrange < __urange)
 	  {
 	    // upscaling


* Re: [PATCH, libstdc++] Improve the performance of std::uniform_int_distribution (fewer divisions)
  2019-09-08 16:10 [PATCH, libstdc++] Improve the performance of std::uniform_int_distribution (fewer divisions) Daniel Lemire
@ 2019-09-27 18:19 ` Daniel Lemire
  2020-10-05 23:25   ` Jonathan Wakely
  0 siblings, 1 reply; 9+ messages in thread
From: Daniel Lemire @ 2019-09-27 18:19 UTC (permalink / raw)
  To: libstdc++, gcc-patches, Jonathan Wakely

[-- Attachment #1: Type: text/plain, Size: 3313 bytes --]

(This is a revised patch proposal. I am revising both the description
and the code itself.)

Even on recent processors, integer division is relatively expensive.
The current implementation of std::uniform_int_distribution typically
requires two divisions per invocation:

        // downscaling
         const __uctype __uerange = __urange + 1; // __urange can be zero
         const __uctype __scaling = __urngrange / __uerange;
         const __uctype __past = __uerange * __scaling;
         do
           __ret = __uctype(__urng()) - __urngmin;
         while (__ret >= __past);
         __ret /= __scaling;

We can achieve the same algorithmic result with at most one division,
and typically no division at all, without requiring more calls to the
random number generator. A similar approach was recently added to Swift
(https://github.com/apple/swift/pull/25286).

The main challenge is that we need to be able to compute the "full"
product. E.g., given two 64-bit integers, we want the 128-bit result;
given two 32-bit integers, we want the 64-bit result. This is fast on
common processors. The 128-bit product is not natively supported in
C/C++ but can be achieved with the __int128 extension when it is
available. The patch checks for __int128 support; when support is
lacking, we fall back on the existing approach, which uses two
divisions per call.

For example, if we replace the above code by the following, we get a
substantial performance boost on Skylake microarchitectures. E.g., it
can be twice as fast to shuffle arrays of 1 million elements (e.g.,
using the following benchmark:
https://github.com/lemire/simple_cpp_shuffle_benchmark )


      unsigned __int128 __product
        = (unsigned __int128)(__uctype(__urng()) - __urngmin)
          * uint64_t(__uerange);
      uint64_t __lsb = uint64_t(__product);
      if(__lsb < __uerange)
      {
        uint64_t __threshold = -uint64_t(__uerange) % uint64_t(__uerange);
        while (__lsb < __threshold)
        {
          __product = (unsigned __int128)(__uctype(__urng()) - __urngmin)
                      * (unsigned __int128)(__uerange);
          __lsb = uint64_t(__product);
        }
      }
      __ret = __product >> 64;

Included is a patch that would bring better performance (e.g., a 2x gain)
to std::uniform_int_distribution in some cases. Here are some actual
numbers...

With this patch:

std::shuffle(testvalues, testvalues + size, g) :  7952091 ns total,  7.95 ns per input key

Before this patch:

std::shuffle(testvalues, testvalues + size, g) : 14954058 ns total, 14.95 ns per input key

Compiler: GNU GCC 8.3 with -O3, hardware: Skylake (i7-6700).

Furthermore, the new algorithm is unbiased, so the randomness of the
result is not affected.
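
One way to see this concretely: a toy exhaustive check where 8-bit
words and a 16-bit product stand in for the 64-bit word and 128-bit
product (illustration only, not part of the patch):

      #include <cassert>
      #include <cstdint>

      // For every range, each outcome in [0, range) must be hit by
      // exactly the same number of accepted 8-bit words.
      int main()
      {
        for (unsigned range = 1; range <= 256; ++range)
        {
          unsigned counts[256] = {};
          unsigned accepted = 0;
          unsigned threshold = (256 - range % 256) % range; // 2^8 mod range
          for (unsigned word = 0; word < 256; ++word)
          {
            unsigned product = word * range;  // full 16-bit product
            unsigned lsb = product & 0xFF;    // low 8 bits
            if (lsb >= threshold)             // accept (no rejection)
            {
              ++counts[product >> 8];         // high 8 bits: the outcome
              ++accepted;
            }
          }
          for (unsigned v = 0; v < range; ++v)
            assert(counts[v] * range == accepted);
        }
        return 0;
      }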

I ran both performance and bias tests with the proposed patch.


This patch proposal was improved following feedback from Jonathan
Wakely. An earlier version used the __uint128_t type, which is widely
supported but not used in the C++ library; instead, we now use unsigned
__int128. Furthermore, the previous patch was accidentally broken: it
was not computing the full product since a right-hand-side cast was
missing. These issues are fixed and verified.

Reference: Fast Random Integer Generation in an Interval, ACM Transactions on
Modeling and Computer Simulation 29 (1), 2019 https://arxiv.org/abs/1805.10941

[-- Attachment #2: patch_uniform_int_dist.txt --]
[-- Type: text/plain, Size: 3111 bytes --]

Index: libstdc++-v3/include/bits/uniform_int_dist.h
===================================================================
--- libstdc++-v3/include/bits/uniform_int_dist.h	(revision 276183)
+++ libstdc++-v3/include/bits/uniform_int_dist.h	(working copy)
@@ -33,7 +33,8 @@

 #include <type_traits>
 #include <limits>
-
+#include <cstdint>
+#include <cstdio>
 namespace std _GLIBCXX_VISIBILITY(default)
 {
 _GLIBCXX_BEGIN_NAMESPACE_VERSION
@@ -239,18 +240,61 @@
 	  = __uctype(__param.b()) - __uctype(__param.a());

 	__uctype __ret;
-
-	if (__urngrange > __urange)
+ 	if (__urngrange > __urange)
 	  {
-	    // downscaling
-	    const __uctype __uerange = __urange + 1; // __urange can be zero
-	    const __uctype __scaling = __urngrange / __uerange;
-	    const __uctype __past = __uerange * __scaling;
-	    do
-	      __ret = __uctype(__urng()) - __urngmin;
-	    while (__ret >= __past);
-	    __ret /= __scaling;
-	  }
+		const __uctype __uerange = __urange + 1; // __urange can be zero
+#if _GLIBCXX_USE_INT128 == 1
+    if(sizeof(__uctype) == sizeof(uint64_t) and
+      (__urngrange == numeric_limits<uint64_t>::max()))
+    {
+      // 64-bit case
+      // reference: Fast Random Integer Generation in an Interval
+      // ACM Transactions on Modeling and Computer Simulation 29 (1), 2019
+      // https://arxiv.org/abs/1805.10941
+      unsigned __int128 __product = (unsigned __int128)(__uctype(__urng()) - __urngmin) * uint64_t(__uerange);
+      uint64_t __lsb = uint64_t(__product);
+      if(__lsb < __uerange)
+      {
+        uint64_t __threshold = -uint64_t(__uerange) % uint64_t(__uerange);
+        while (__lsb < __threshold)
+        {
+          __product = (unsigned __int128)(__uctype(__urng()) - __urngmin) * (unsigned __int128)(__uerange);
+          __lsb = uint64_t(__product);
+        }
+      }
+      __ret = __product >> 64;
+    }
+    else
+#endif // _GLIBCXX_USE_INT128 == 1
+    if(sizeof(__uctype) == sizeof(uint32_t)
+      and (__urngrange == numeric_limits<uint32_t>::max()) )
+    {
+      // 32-bit case
+      // reference: Fast Random Integer Generation in an Interval
+      // ACM Transactions on Modeling and Computer Simulation 29 (1), 2019
+      // https://arxiv.org/abs/1805.10941
+      uint64_t __product = uint64_t(__uctype(__urng()) - __urngmin) * uint64_t(__uerange);
+      uint32_t __lsb = uint32_t(__product);
+      if(__lsb < __uerange) {
+        uint64_t __threshold = -uint32_t(__uerange) % uint32_t(__uerange);
+        while (__lsb < __threshold) {
+          __product = uint64_t(__uctype(__urng()) - __urngmin) * uint64_t(__uerange);
+          __lsb = uint32_t(__product);
+        }
+      }
+      __ret = __product >> 32;
+    }
+    else
+    {
+      // fallback case (2 divisions)
+      const __uctype __scaling = __urngrange / __uerange;
+      const __uctype __past = __uerange * __scaling;
+      do
+        __ret = __uctype(__urng()) - __urngmin;
+      while (__ret >= __past);
+      __ret /= __scaling;
+    }
+  }
 	else if (__urngrange < __urange)
 	  {
 	    // upscaling
@@ -377,3 +421,4 @@
 } // namespace std

 #endif


* Re: [PATCH, libstdc++] Improve the performance of std::uniform_int_distribution (fewer divisions)
  2019-09-27 18:19 ` Daniel Lemire
@ 2020-10-05 23:25   ` Jonathan Wakely
  2020-10-05 23:38     ` Jonathan Wakely
  2020-10-05 23:40     ` Jonathan Wakely
  0 siblings, 2 replies; 9+ messages in thread
From: Jonathan Wakely @ 2020-10-05 23:25 UTC (permalink / raw)
  To: Daniel Lemire; +Cc: libstdc++, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 7862 bytes --]

I'm sorry it's taken a year to review this properly. Comments below ...

On 27/09/19 14:18 -0400, Daniel Lemire wrote:
>[...]
>
>This patch proposal was improved following feedback from Jonathan
>Wakely. An earlier version used the __uint128_t type, which is widely
>supported but not used in the C++ library; instead, we now use unsigned
>__int128. Furthermore, the previous patch was accidentally broken: it
>was not computing the full product since a right-hand-side cast was
>missing. These issues are fixed and verified.

After looking at GCC's internals, it looks like __uint128_t is
actually fine to use, even though we never currently use it in the
library. I didn't even know it was supported for C++ mode, sorry!

>Reference: Fast Random Integer Generation in an Interval, ACM Transactions on
>Modeling and Computer Simulation 29 (1), 2019 https://arxiv.org/abs/1805.10941

>Index: libstdc++-v3/include/bits/uniform_int_dist.h
>===================================================================
>--- libstdc++-v3/include/bits/uniform_int_dist.h	(revision 276183)
>+++ libstdc++-v3/include/bits/uniform_int_dist.h	(working copy)
>@@ -33,7 +33,8 @@
>
> #include <type_traits>
> #include <limits>
>-
>+#include <cstdint>
>+#include <cstdio>
> namespace std _GLIBCXX_VISIBILITY(default)
> {
> _GLIBCXX_BEGIN_NAMESPACE_VERSION
>@@ -239,18 +240,61 @@
> 	  = __uctype(__param.b()) - __uctype(__param.a());
>
> 	__uctype __ret;
>-
>-	if (__urngrange > __urange)
>+ 	if (__urngrange > __urange)
> 	  {
>-	    // downscaling
>-	    const __uctype __uerange = __urange + 1; // __urange can be zero
>-	    const __uctype __scaling = __urngrange / __uerange;
>-	    const __uctype __past = __uerange * __scaling;
>-	    do
>-	      __ret = __uctype(__urng()) - __urngmin;
>-	    while (__ret >= __past);
>-	    __ret /= __scaling;
>-	  }
>+		const __uctype __uerange = __urange + 1; // __urange can be zero
>+#if _GLIBCXX_USE_INT128 == 1
>+    if(sizeof(__uctype) == sizeof(uint64_t) and
>+      (__urngrange == numeric_limits<uint64_t>::max()))
>+    {
>+      // 64-bit case
>+      // reference: Fast Random Integer Generation in an Interval
>+      // ACM Transactions on Modeling and Computer Simulation 29 (1), 2019
>+      // https://arxiv.org/abs/1805.10941
>+      unsigned __int128 __product = (unsigned __int128)(__uctype(__urng()) - __urngmin) * uint64_t(__uerange);

Is subtracting __urngmin necessary here?

The condition above checks that __urngrange == 2**64-1, which means
that U::max() - U::min() is the maximum 64-bit value, which means
U::max()==2**64-1 and U::min()==0. So if U::min() is 0 we don't
need to subtract it.

Also, I think the casts to uint64_t are unnecessary. We know that
__uctype is an unsigned integral type, and we've checked that it has
exactly 64 bits, so I think we can just use __uctype. It's got the
same width and signedness as uint64_t anyway.

That said, the uint64_t(__uerange) above isn't redundant, because it
should be (unsigned __int128)__uerange, I think.

i.e.

   unsigned __int128 __product = (unsigned __int128)__urng() * (unsigned __int128)__uerange;


>+      uint64_t __lsb = uint64_t(__product);
>+      if(__lsb < __uerange)
>+      {
>+        uint64_t __threshold = -uint64_t(__uerange) % uint64_t(__uerange);
>+        while (__lsb < __threshold)
>+        {
>+          __product = (unsigned __int128)(__uctype(__urng()) - __urngmin) * (unsigned __int128)(__uerange);
>+          __lsb = uint64_t(__product);
>+        }
>+      }
>+      __ret = __product >> 64;
>+    }
>+    else
>+#endif // _GLIBCXX_USE_INT128 == 1
>+    if(sizeof(__uctype) == sizeof(uint32_t)
>+      and (__urngrange == numeric_limits<uint32_t>::max()) )
>+    {
>+      // 32-bit case
>+      // reference: Fast Random Integer Generation in an Interval
>+      // ACM Transactions on Modeling and Computer Simulation 29 (1), 2019
>+      // https://arxiv.org/abs/1805.10941
>+      uint64_t __product = uint64_t(__uctype(__urng()) - __urngmin) * uint64_t(__uerange);
>+      uint32_t __lsb = uint32_t(__product);
>+      if(__lsb < __uerange) {
>+        uint64_t __threshold = -uint32_t(__uerange) % uint32_t(__uerange);

This __threshold should be uint32_t, right?

>+        while (__lsb < __threshold) {
>+          __product = uint64_t(__uctype(__urng()) - __urngmin) * uint64_t(__uerange);
>+          __lsb = uint32_t(__product);
>+        }
>+      }
>+      __ret = __product >> 32;
>+    }
>+    else
>+    {
>+      // fallback case (2 divisions)
>+      const __uctype __scaling = __urngrange / __uerange;
>+      const __uctype __past = __uerange * __scaling;
>+      do
>+        __ret = __uctype(__urng()) - __urngmin;
>+      while (__ret >= __past);
>+      __ret /= __scaling;
>+    }
>+  }
> 	else if (__urngrange < __urange)
> 	  {
> 	    // upscaling

I've attached a revised patch which hoists the clever bit into a
separate function template so we don't need to repeat the code. I
think it's still right and I haven't broken your logic.

Does it look OK to you?



[-- Attachment #2: patch.txt --]
[-- Type: text/x-patch, Size: 3077 bytes --]

commit b6e2027df22d9f3837b906c5b3de5cfe1fc5f0d7
Author: Daniel Lemire <lemire@gmail.com>
Date:   Tue Oct 6 00:05:52 2020

    libstdc++: Optimize uniform_int_distribution using Lemire's algorithm
    
    Co-authored-by: Jonathan Wakely <jwakely@redhat.com>
    
    libstdc++-v3/ChangeLog:
    
            * include/bits/uniform_int_dist.h (uniform_int_distribution::_S_nd):
            New member function implementing Lemire's "nearly divisionless"
            algorithm.
            (uniform_int_distribution::operator()): Use _S_nd when the range
            of the URBG is the full width of the result type.

diff --git a/libstdc++-v3/include/bits/uniform_int_dist.h b/libstdc++-v3/include/bits/uniform_int_dist.h
index 6e1e3d5fc5f..9c6927e6923 100644
--- a/libstdc++-v3/include/bits/uniform_int_dist.h
+++ b/libstdc++-v3/include/bits/uniform_int_dist.h
@@ -234,6 +234,33 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 			const param_type& __p);
 
       param_type _M_param;
+
+      // Lemire's nearly divisionless algorithm
+      template<typename _Wp, typename _Urbg, typename _Up>
+	static _Up
+	_S_nd(_Urbg& __g, _Up __range)
+	{
+	  using __gnu_cxx::__int_traits;
+	  constexpr int __digits = __int_traits<_Up>::__digits;
+	  static_assert(__int_traits<_Wp>::__digits == (2 * __digits),
+			"_Wp must be twice as wide as _Up");
+
+	  // reference: Fast Random Integer Generation in an Interval
+	  // ACM Transactions on Modeling and Computer Simulation 29 (1), 2019
+	  // https://arxiv.org/abs/1805.10941
+	  _Wp __product = _Wp(__g()) * _Wp(__range);
+	  _Up __lsb = _Up(__product);
+	  if (__lsb < __range)
+	    {
+	      _Up __threshold = -__range % __range;
+	      while (__lsb < __threshold)
+		{
+		  __product = _Wp(__g()) * _Wp(__range);
+		  __lsb = _Up(__product);
+		}
+	    }
+	  return __product >> __digits;
+	}
     };
 
   template<typename _IntType>
@@ -256,17 +283,30 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 	  = __uctype(__param.b()) - __uctype(__param.a());
 
 	__uctype __ret;
-
 	if (__urngrange > __urange)
 	  {
-	    // downscaling
 	    const __uctype __uerange = __urange + 1; // __urange can be zero
-	    const __uctype __scaling = __urngrange / __uerange;
-	    const __uctype __past = __uerange * __scaling;
-	    do
-	      __ret = __uctype(__urng()) - __urngmin;
-	    while (__ret >= __past);
-	    __ret /= __scaling;
+
+	    using __gnu_cxx::__int_traits;
+#if __SIZEOF_INT128__
+	    if (__int_traits<__uctype>::__digits == 64
+		&& __urngrange == __int_traits<__uctype>::__max)
+	      __ret = _S_nd<unsigned __int128>(__urng, __uerange);
+	    else
+#endif
+	    if (__int_traits<__uctype>::__digits == 32
+		&& __urngrange == __int_traits<__uctype>::__max)
+	      __ret = _S_nd<__UINT64_TYPE__>(__urng, __uerange);
+	    else
+	      {
+		// fallback case (2 divisions)
+		const __uctype __scaling = __urngrange / __uerange;
+		const __uctype __past = __uerange * __scaling;
+		do
+		  __ret = __uctype(__urng()) - __urngmin;
+		while (__ret >= __past);
+		__ret /= __scaling;
+	      }
 	  }
 	else if (__urngrange < __urange)
 	  {


* Re: [PATCH, libstdc++] Improve the performance of std::uniform_int_distribution (fewer divisions)
  2020-10-05 23:25   ` Jonathan Wakely
@ 2020-10-05 23:38     ` Jonathan Wakely
  2020-10-05 23:40     ` Jonathan Wakely
  1 sibling, 0 replies; 9+ messages in thread
From: Jonathan Wakely @ 2020-10-05 23:38 UTC (permalink / raw)
  To: Daniel Lemire; +Cc: libstdc++, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 9709 bytes --]

On 06/10/20 00:25 +0100, Jonathan Wakely wrote:
>I'm sorry it's taken a year to review this properly. Comments below ...
>
>[...]
>
>+	  using __gnu_cxx::__int_traits;
>+	  constexpr int __digits = __int_traits<_Up>::__digits;

>+	  static_assert(__int_traits<_Wp>::__digits == (2 * __digits),
>+			"_Wp must be twice as wide as _Up");

Gah, this static_assert needs to be removed (the function is
instantiated for arguments that cause it to fail, even though that
instantiation isn't called).  I hadn't committed that fix to git
before generating the patch though.

Correct patch attached.
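
For the curious, a minimal stand-alone reproduction of the problem
(names are illustrative, not from the patch): an ordinary runtime
'if' does not prevent the untaken branch from being instantiated,
so the static_assert fires at compile time even though the failing
call would never be executed.

      #include <cstdint>

      template<typename W, typename U>
      U scale(U range)
      {
        static_assert(sizeof(W) == 2 * sizeof(U),
                      "W must be twice as wide as U");
        return range; // algorithm elided
      }

      template<typename U>
      U pick(U range)
      {
        if (sizeof(U) == 4)                   // runtime 'if': both branches
          return scale<std::uint64_t>(range); // are instantiated for every U
        else
          return scale<std::uint32_t>(range);
      }

      // pick<std::uint32_t> would take the first branch at run time, but
      // the second branch still instantiates scale<uint32_t, uint32_t>,
      // whose static_assert fails: the program is rejected at compile time.
      int main() { return (int)pick<std::uint32_t>(5u); }

(With C++17, 'if constexpr' would avoid instantiating the untaken
branch; dropping the width assertion sidesteps the issue for older
standards.)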



[-- Attachment #2: patch.txt --]
[-- Type: text/x-patch, Size: 2910 bytes --]

commit 5905c99cabca1433ff56befde5003fede50a5dc3
Author: Daniel Lemire <lemire@gmail.com>
Date:   Tue Oct 6 00:34:39 2020

    libstdc++: Optimize uniform_int_distribution using Lemire's algorithm
    
    Co-authored-by: Jonathan Wakely <jwakely@redhat.com>
    
    libstdc++-v3/ChangeLog:
    
            * include/bits/uniform_int_dist.h (uniform_int_distribution::_S_nd):
            New member function implementing Lemire's "nearly divisionless"
            algorithm.
            (uniform_int_distribution::operator()): Use _S_nd when the range
            of the URBG is the full width of the result type.

diff --git a/libstdc++-v3/include/bits/uniform_int_dist.h b/libstdc++-v3/include/bits/uniform_int_dist.h
index 6e1e3d5fc5f..18c88ca506d 100644
--- a/libstdc++-v3/include/bits/uniform_int_dist.h
+++ b/libstdc++-v3/include/bits/uniform_int_dist.h
@@ -234,6 +234,28 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 			const param_type& __p);
 
       param_type _M_param;
+
+      // Lemire's nearly divisionless algorithm
+      template<typename _Wp, typename _Urbg, typename _Up>
+	static _Up
+	_S_nd(_Urbg& __g, _Up __range)
+	{
+	  // reference: Fast Random Integer Generation in an Interval
+	  // ACM Transactions on Modeling and Computer Simulation 29 (1), 2019
+	  // https://arxiv.org/abs/1805.10941
+	  _Wp __product = _Wp(__g()) * _Wp(__range);
+	  _Up __lsb = _Up(__product);
+	  if (__lsb < __range)
+	    {
+	      _Up __threshold = -__range % __range;
+	      while (__lsb < __threshold)
+		{
+		  __product = _Wp(__g()) * _Wp(__range);
+		  __lsb = _Up(__product);
+		}
+	    }
+	  return __product >> __gnu_cxx::__int_traits<_Up>::__digits;
+	}
     };
 
   template<typename _IntType>
@@ -256,17 +278,30 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 	  = __uctype(__param.b()) - __uctype(__param.a());
 
 	__uctype __ret;
-
 	if (__urngrange > __urange)
 	  {
-	    // downscaling
 	    const __uctype __uerange = __urange + 1; // __urange can be zero
-	    const __uctype __scaling = __urngrange / __uerange;
-	    const __uctype __past = __uerange * __scaling;
-	    do
-	      __ret = __uctype(__urng()) - __urngmin;
-	    while (__ret >= __past);
-	    __ret /= __scaling;
+
+	    using __gnu_cxx::__int_traits;
+#if __SIZEOF_INT128__
+	    if (__int_traits<__uctype>::__digits == 64
+		&& __urngrange == __int_traits<__uctype>::__max)
+	      __ret = _S_nd<unsigned __int128>(__urng, __uerange);
+	    else
+#endif
+	    if (__int_traits<__uctype>::__digits == 32
+		&& __urngrange == __int_traits<__uctype>::__max)
+	      __ret = _S_nd<__UINT64_TYPE__>(__urng, __uerange);
+	    else
+	      {
+		// fallback case (2 divisions)
+		const __uctype __scaling = __urngrange / __uerange;
+		const __uctype __past = __uerange * __scaling;
+		do
+		  __ret = __uctype(__urng()) - __urngmin;
+		while (__ret >= __past);
+		__ret /= __scaling;
+	      }
 	  }
 	else if (__urngrange < __urange)
 	  {


* Re: [PATCH, libstdc++] Improve the performance of std::uniform_int_distribution (fewer divisions)
  2020-10-05 23:25   ` Jonathan Wakely
  2020-10-05 23:38     ` Jonathan Wakely
@ 2020-10-05 23:40     ` Jonathan Wakely
  2020-10-06 19:09       ` Daniel Lemire
  2020-10-06 19:55       ` Daniel Lemire
  1 sibling, 2 replies; 9+ messages in thread
From: Jonathan Wakely @ 2020-10-05 23:40 UTC (permalink / raw)
  To: Daniel Lemire; +Cc: libstdc++, gcc-patches

On 06/10/20 00:25 +0100, Jonathan Wakely wrote:
>I'm sorry it's taken a year to review this properly. Comments below ...
>
>[...]
>
>Also, I think the casts to uint64_t are unnecessary. We know that
>__uctype is an unsigned integral type, and we've checked that it has
>exactly 64 bits, so I think we can just use __uctype. It's got the
>same width and signedness as uint64_t anyway.
>
>That said, the uint64_t(__uerange) above isn't redundant, because it
>should be (unsigned __int128)__uerange, I think.

Ah yes, you pointed out that last bit in your Sept 28 2019 email.



* Re: [PATCH, libstdc++] Improve the performance of std::uniform_int_distribution (fewer divisions)
  2020-10-05 23:40     ` Jonathan Wakely
@ 2020-10-06 19:09       ` Daniel Lemire
  2020-10-06 19:55       ` Daniel Lemire
  1 sibling, 0 replies; 9+ messages in thread
From: Daniel Lemire @ 2020-10-06 19:09 UTC (permalink / raw)
  To: Jonathan Wakely; +Cc: libstdc++, gcc-patches

>
> >The condition above checks that __urngrange == 2**64-1, which means
> >that U::max() - U::min() is the maximum 64-bit value, which means
> >U::max()==2**64-1 and U::min()==0. So if U::min() is 0 we don't
> >need to subtract it.
>

That sounds correct.


* Re: [PATCH, libstdc++] Improve the performance of std::uniform_int_distribution (fewer divisions)
  2020-10-05 23:40     ` Jonathan Wakely
  2020-10-06 19:09       ` Daniel Lemire
@ 2020-10-06 19:55       ` Daniel Lemire
  2020-10-06 20:04         ` Jonathan Wakely
  2020-10-09 13:17         ` Jonathan Wakely
  1 sibling, 2 replies; 9+ messages in thread
From: Daniel Lemire @ 2020-10-06 19:55 UTC (permalink / raw)
  To: Jonathan Wakely; +Cc: libstdc++, gcc-patches

The updated patch looks good to me. It is indeed cleaner to have a separate
(static) function.

It might be nice to add a comment explaining the _S_nd function, maybe
something like "returns a random value in [0,__range) without any
bias" (or words to that effect).

Otherwise, it is algorithmically correct.


On Mon, Oct 5, 2020 at 7:40 PM Jonathan Wakely <jwakely@redhat.com> wrote:

> [...]


* Re: [PATCH, libstdc++] Improve the performance of std::uniform_int_distribution (fewer divisions)
  2020-10-06 19:55       ` Daniel Lemire
@ 2020-10-06 20:04         ` Jonathan Wakely
  2020-10-09 13:17         ` Jonathan Wakely
  1 sibling, 0 replies; 9+ messages in thread
From: Jonathan Wakely @ 2020-10-06 20:04 UTC (permalink / raw)
  To: lemire; +Cc: libstdc++, gcc-patches

On 06/10/20 15:55 -0400, Daniel Lemire via Libstdc++ wrote:
>The updated patch looks good to me. It is indeed cleaner to have a separate
>(static) function.
>
>It might be nice to add a comment explaining the _S_nd function, maybe
>something like "returns a random value in [0,__range) without any
>bias" (or words to that effect).

Indeed. My current local branch has this comment on _S_nd:

+      // Lemire's nearly divisionless algorithm.
+      // Returns a random number from __g downscaled to [0,__range)
+      // using an unsigned type _Wp twice as wide as unsigned type _Up.

I think "Returns an unbiased random number from ..." would be an
improvement.

>Otherwise, it is algorithmically correct.

Great, thanks for the review.

I'll get it committed tomorrow then. Thanks for the patch, and sorry
it took so long.



* Re: [PATCH, libstdc++] Improve the performance of std::uniform_int_distribution (fewer divisions)
  2020-10-06 19:55       ` Daniel Lemire
  2020-10-06 20:04         ` Jonathan Wakely
@ 2020-10-09 13:17         ` Jonathan Wakely
  1 sibling, 0 replies; 9+ messages in thread
From: Jonathan Wakely @ 2020-10-09 13:17 UTC (permalink / raw)
  To: lemire; +Cc: libstdc++, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 656 bytes --]

On 06/10/20 15:55 -0400, Daniel Lemire via Libstdc++ wrote:
>The updated patch looks good to me. It is indeed cleaner to have a separate
>(static) function.
>
>It might be nice to add a comment explaining the _S_nd function, maybe
>something like "returns a random value in [0,__range) without any
>bias" (or words to that effect).
>
>Otherwise, it is algorithmically correct.

Here's what I've just committed and pushed to the master branch.

As expected, this shows significant improvements for some (but not
all) of the cases in the new test I added earlier today,
testsuite/performance/26_numerics/random_dist.cc

Thanks again for the patch.
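
For reference, the fast path is taken when the generator's range is
the full width of the distribution's unsigned result type; a small
usage sketch (illustration only):

      #include <cstdint>
      #include <iostream>
      #include <random>

      int main()
      {
        // mt19937_64 produces values over the full 64-bit range, so
        // uniform_int_distribution<uint64_t> can use the nearly
        // divisionless path (on targets that provide __int128).
        std::mt19937_64 g(42);
        std::uniform_int_distribution<std::uint64_t> d(0, 999);
        std::cout << d(g) << '\n';
        return 0;
      }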



[-- Attachment #2: patch.txt --]
[-- Type: text/x-patch, Size: 3280 bytes --]

commit 98c37d3bacbb2f8bbbe56ed53a9547d3be01b66b
Author: Daniel Lemire <lemire@gmail.com>
Date:   Fri Oct 9 14:09:36 2020

    libstdc++: Optimize uniform_int_distribution using Lemire's algorithm
    
    Co-authored-by: Jonathan Wakely <jwakely@redhat.com>
    
    libstdc++-v3/ChangeLog:
    
            * include/bits/uniform_int_dist.h (uniform_int_distribution::_S_nd):
            New member function implementing Lemire's "nearly divisionless"
            algorithm.
            (uniform_int_distribution::operator()): Use _S_nd when the range
            of the URBG is the full width of the result type.

diff --git a/libstdc++-v3/include/bits/uniform_int_dist.h b/libstdc++-v3/include/bits/uniform_int_dist.h
index 6e1e3d5fc5f..ecb8574864a 100644
--- a/libstdc++-v3/include/bits/uniform_int_dist.h
+++ b/libstdc++-v3/include/bits/uniform_int_dist.h
@@ -234,6 +234,34 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 			const param_type& __p);
 
       param_type _M_param;
+
+      // Lemire's nearly divisionless algorithm.
+      // Returns an unbiased random number from __g downscaled to [0,__range)
+      // using an unsigned type _Wp twice as wide as unsigned type _Up.
+      template<typename _Wp, typename _Urbg, typename _Up>
+	static _Up
+	_S_nd(_Urbg& __g, _Up __range)
+	{
+	  using __gnu_cxx::__int_traits;
+	  static_assert(!__int_traits<_Up>::__is_signed, "U must be unsigned");
+	  static_assert(!__int_traits<_Wp>::__is_signed, "W must be unsigned");
+
+	  // reference: Fast Random Integer Generation in an Interval
+	  // ACM Transactions on Modeling and Computer Simulation 29 (1), 2019
+	  // https://arxiv.org/abs/1805.10941
+	  _Wp __product = _Wp(__g()) * _Wp(__range);
+	  _Up __low = _Up(__product);
+	  if (__low < __range)
+	    {
+	      _Up __threshold = -__range % __range;
+	      while (__low < __threshold)
+		{
+		  __product = _Wp(__g()) * _Wp(__range);
+		  __low = _Up(__product);
+		}
+	    }
+	  return __product >> __gnu_cxx::__int_traits<_Up>::__digits;
+	}
     };
 
   template<typename _IntType>
@@ -256,17 +284,36 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 	  = __uctype(__param.b()) - __uctype(__param.a());
 
 	__uctype __ret;
-
 	if (__urngrange > __urange)
 	  {
 	    // downscaling
+
 	    const __uctype __uerange = __urange + 1; // __urange can be zero
-	    const __uctype __scaling = __urngrange / __uerange;
-	    const __uctype __past = __uerange * __scaling;
-	    do
-	      __ret = __uctype(__urng()) - __urngmin;
-	    while (__ret >= __past);
-	    __ret /= __scaling;
+
+	    using __gnu_cxx::__int_traits;
+#if __SIZEOF_INT128__
+	    if (__int_traits<__uctype>::__digits == 64
+		&& __urngrange == __int_traits<__uctype>::__max)
+	      {
+		__ret = _S_nd<unsigned __int128>(__urng, __uerange);
+	      }
+	    else
+#endif
+	    if (__int_traits<__uctype>::__digits == 32
+		&& __urngrange == __int_traits<__uctype>::__max)
+	      {
+		__ret = _S_nd<__UINT64_TYPE__>(__urng, __uerange);
+	      }
+	    else
+	      {
+		// fallback case (2 divisions)
+		const __uctype __scaling = __urngrange / __uerange;
+		const __uctype __past = __uerange * __scaling;
+		do
+		  __ret = __uctype(__urng()) - __urngmin;
+		while (__ret >= __past);
+		__ret /= __scaling;
+	      }
 	  }
 	else if (__urngrange < __urange)
 	  {

