Re: [PATCH, libstdc++] Improve the performance of std::uniform_int_distribution (fewer divisions)

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: Jonathan Wakely <jwakely@redhat.com>
To: Daniel Lemire <lemire@gmail.com>
Cc: libstdc++@gcc.gnu.org, gcc-patches@gcc.gnu.org
Subject: Re: [PATCH, libstdc++] Improve the performance of std::uniform_int_distribution (fewer divisions)
Date: Tue, 6 Oct 2020 00:38:32 +0100	[thread overview]
Message-ID: <20201005233832.GE7004@redhat.com> (raw)
In-Reply-To: <20201005232515.GD7004@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 9709 bytes --]

On 06/10/20 00:25 +0100, Jonathan Wakely wrote:
>I'm sorry it's taken a year to review this properly. Comments below ...
>
>On 27/09/19 14:18 -0400, Daniel Lemire wrote:
>>(This is a revised patch proposal. I am revising both the description
>>and the code itself.)
>>
>>Even on recent processors, integer division is relatively expensive.
>>The current implementation of  std::uniform_int_distribution typically
>>requires two divisions by invocation:
>>
>>       // downscaling
>>        const __uctype __uerange = __urange + 1; // __urange can be zero
>>        const __uctype __scaling = __urngrange / __uerange;
>>        const __uctype __past = __uerange * __scaling;
>>        do
>>          __ret = __uctype(__urng()) - __urngmin;
>>        while (__ret >= __past);
>>        __ret /= __scaling;
>>
>>We can achieve the same algorithmic result with at most one division,
>>and typically no division at all without requiring more calls to the
>>random number generator.
>>This was recently added to Swift (https://github.com/apple/swift/pull/25286)
>>
>>The main challenge is that we need to be able to compute the "full"
>>product. E.g., given two 64-bit integers, we want the 128-bit result;
>>given two 32-bit integers we want the 64-bit result. This is fast on
>>common processors.
>>The 128-bit product is not natively supported in C/C++ but can be
>>achieved with the
>>__int128 extension when it is available. The patch checks for
>>__int128 support; when
>>support is lacking, we fallback on the existing approach which uses
>>two divisions per
>>call.
>>
>>For example, if we replace the above code by the following, we get a substantial
>>performance boost on skylake microarchitectures. E.g., it can
>>be twice as fast to shuffle arrays of 1 million elements (e.g., using
>>the followingbenchmark: https://github.com/lemire/simple_cpp_shuffle_benchmark )
>>
>>
>>     unsigned __int128 __product = (unsigned
>>__int128)(__uctype(__urng()) - __urngmin) * uint64_t(__uerange);
>>     uint64_t __lsb = uint64_t(__product);
>>     if(__lsb < __uerange)
>>     {
>>       uint64_t __threshold = -uint64_t(__uerange) % uint64_t(__uerange);
>>       while (__lsb < __threshold)
>>       {
>>         __product = (unsigned __int128)(__uctype(__urng()) -
>>__urngmin) * (unsigned __int128)(__uerange);
>>         __lsb = uint64_t(__product);
>>       }
>>     }
>>     __ret = __product >> 64;
>>
>>Included is a patch that would bring better performance (e.g., 2x gain) to
>>std::uniform_int_distribution  in some cases. Here are some actual numbers...
>>
>>With this patch:
>>
>>std::shuffle(testvalues, testvalues + size, g)              :  7952091
>>ns total,  7.95 ns per input key
>>
>>Before this patch:
>>
>>std::shuffle(testvalues, testvalues + size, g)              :
>>14954058 ns total,  14.95 ns per input key
>>
>>
>>Compiler: GNU GCC 8.3 with -O3, hardware: Skylake (i7-6700).
>>
>>Furthermore, the new algorithm is unbiased, so the randomness of the
>>result is not affected.
>>
>>I ran both performance and biases tests with the proposed patch.
>>
>>
>>This patch proposal was improved following feedback by Jonathan
>>Wakely. An earlier version used the __uint128_t type, which is widely
>>supported but not used in the C++ library, instead we now use unsigned
>>__int128. Furthermore, the previous patch was accidentally broken: it
>>was not computing the full product since a rhs cast was missing. These
>>issues are fixed and verified.
>
>After looking at GCC's internals, it looks like __uint128_t is
>actually fine to use, even though we never currently use it in the
>library. I didn't even know it was supported for C++ mode, sorry!
>
>>Reference: Fast Random Integer Generation in an Interval, ACM Transactions on
>>Modeling and Computer Simulation 29 (1), 2019 https://arxiv.org/abs/1805.10941
>
>>Index: libstdc++-v3/include/bits/uniform_int_dist.h
>>===================================================================
>>--- libstdc++-v3/include/bits/uniform_int_dist.h	(revision 276183)
>>+++ libstdc++-v3/include/bits/uniform_int_dist.h	(working copy)
>>@@ -33,7 +33,8 @@
>>
>>#include <type_traits>
>>#include <limits>
>>-
>>+#include <cstdint>
>>+#include <cstdio>
>>namespace std _GLIBCXX_VISIBILITY(default)
>>{
>>_GLIBCXX_BEGIN_NAMESPACE_VERSION
>>@@ -239,18 +240,61 @@
>>	  = __uctype(__param.b()) - __uctype(__param.a());
>>
>>	__uctype __ret;
>>-
>>-	if (__urngrange > __urange)
>>+ 	if (__urngrange > __urange)
>>	  {
>>-	    // downscaling
>>-	    const __uctype __uerange = __urange + 1; // __urange can be zero
>>-	    const __uctype __scaling = __urngrange / __uerange;
>>-	    const __uctype __past = __uerange * __scaling;
>>-	    do
>>-	      __ret = __uctype(__urng()) - __urngmin;
>>-	    while (__ret >= __past);
>>-	    __ret /= __scaling;
>>-	  }
>>+		const __uctype __uerange = __urange + 1; // __urange can be zero
>>+#if _GLIBCXX_USE_INT128 == 1
>>+    if(sizeof(__uctype) == sizeof(uint64_t) and
>>+      (__urngrange == numeric_limits<uint64_t>::max()))
>>+    {
>>+      // 64-bit case
>>+      // reference: Fast Random Integer Generation in an Interval
>>+      // ACM Transactions on Modeling and Computer Simulation 29 (1), 2019
>>+      // https://arxiv.org/abs/1805.10941
>>+      unsigned __int128 __product = (unsigned __int128)(__uctype(__urng()) - __urngmin) * uint64_t(__uerange);
>
>Is subtracting  __urngmin necessary here?
>
>The condition above checks that __urngrange == 2**64-1 which means
>that U::max() - U::min() is the maximum 64-bit value, which means
>means U::max()==2**64-1 and U::min()==0. So if U::min() is 0 we don't
>need to subtract it.
>
>Also, I think the casts to uint64_t are unnecessary. We know that
>__uctype is an unsigned integral type, and we've checked that it has
>exactly 64-bits, so I think we can just use __uctype. It's got the
>same width and signedness as uint64_t anyway.
>
>That said, the uint64_t(__uerange) above isn't redundant, because it
>should be (unsigned __int128)__uerange, I think.
>
>i.e.
>
>  unsigned __int128 __product = (unsigned __int128)__urng() * (unsigned __int128)__uerange;
>
>
>>+      uint64_t __lsb = uint64_t(__product);
>>+      if(__lsb < __uerange)
>>+      {
>>+        uint64_t __threshold = -uint64_t(__uerange) % uint64_t(__uerange);
>>+        while (__lsb < __threshold)
>>+        {
>>+          __product = (unsigned __int128)(__uctype(__urng()) - __urngmin) * (unsigned __int128)(__uerange);
>>+          __lsb = uint64_t(__product);
>>+        }
>>+      }
>>+      __ret = __product >> 64;
>>+    }
>>+    else
>>+#endif // _GLIBCXX_USE_INT128 == 1
>>+    if(sizeof(__uctype) == sizeof(uint32_t)
>>+      and (__urngrange == numeric_limits<uint32_t>::max()) )
>>+    {
>>+      // 32-bit case
>>+      // reference: Fast Random Integer Generation in an Interval
>>+      // ACM Transactions on Modeling and Computer Simulation 29 (1), 2019
>>+      // https://arxiv.org/abs/1805.10941
>>+      uint64_t __product = uint64_t(__uctype(__urng()) - __urngmin) * uint64_t(__uerange);
>>+      uint32_t __lsb = uint32_t(__product);
>>+      if(__lsb < __uerange) {
>>+        uint64_t __threshold = -uint32_t(__uerange) % uint32_t(__uerange);
>
>This __threshold should be uint32_t, right?
>
>>+        while (__lsb < __threshold) {
>>+          __product = uint64_t(__uctype(__urng()) - __urngmin) * uint64_t(__uerange);
>>+          __lsb = uint32_t(__product);
>>+        }
>>+      }
>>+      __ret = __product >> 32;
>>+    }
>>+    else
>>+    {
>>+      // fallback case (2 divisions)
>>+      const __uctype __scaling = __urngrange / __uerange;
>>+      const __uctype __past = __uerange * __scaling;
>>+      do
>>+        __ret = __uctype(__urng()) - __urngmin;
>>+      while (__ret >= __past);
>>+      __ret /= __scaling;
>>+    }
>>+  }
>>	else if (__urngrange < __urange)
>>	  {
>>	    // upscaling
>
>I've attached a revised patch which hoists the clever bit into a
>separate function template so we don't need to repeat the code. I
>think it's still right and I haven't broken your logic.
>
>Does it look OK to you?
>
>

>commit b6e2027df22d9f3837b906c5b3de5cfe1fc5f0d7
>Author: Daniel Lemire <lemire@gmail.com>
>Date:   Tue Oct 6 00:05:52 2020
>
>    libstdc++: Optimize uniform_int_distribution using Lemire's algorithm
>    
>    Co-authored-by: Jonathan Wakely <jwakely@redhat.com>
>    
>    libstdc++-v3/ChangeLog:
>    
>            * include/bits/uniform_int_dist.h (uniform_int_distribution::_S_nd):
>            New member function implementing Lemire's "nearly divisionless"
>            algorithm.
>            (uniform_int_distribution::operator()): Use _S_nd when the range
>            of the URBG is the full width of the result type.
>
>diff --git a/libstdc++-v3/include/bits/uniform_int_dist.h b/libstdc++-v3/include/bits/uniform_int_dist.h
>index 6e1e3d5fc5f..9c6927e6923 100644
>--- a/libstdc++-v3/include/bits/uniform_int_dist.h
>+++ b/libstdc++-v3/include/bits/uniform_int_dist.h
>@@ -234,6 +234,33 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> 			const param_type& __p);
> 
>       param_type _M_param;
>+
>+      // Lemire's nearly divisionless algorithm
>+      template<typename _Wp, typename _Urbg, typename _Up>
>+	static _Up
>+	_S_nd(_Urbg& __g, _Up __range)
>+	{
>+	  using __gnu_cxx::__int_traits;
>+	  constexpr __digits = __int_traits<_Up>::__digits;

>+	  static_assert(__int_traits<_Wp>::__digits == (2 * __digits),
>+			"_Wp must be twice as wide as _Up");

Gah, this static_assert needs to be removed (the function is
instantiated for arguments that cause it to fail, even though that
instantiation isn't called).  I hadn't committed that fix to git
before generating the patch though.

Correct patch attached.



[-- Attachment #2: patch.txt --]
[-- Type: text/x-patch, Size: 2910 bytes --]

commit 5905c99cabca1433ff56befde5003fede50a5dc3
Author: Daniel Lemire <lemire@gmail.com>
Date:   Tue Oct 6 00:34:39 2020

    libstdc++: Optimize uniform_int_distribution using Lemire's algorithm
    
    Co-authored-by: Jonathan Wakely <jwakely@redhat.com>
    
    libstdc++-v3/ChangeLog:
    
            * include/bits/uniform_int_dist.h (uniform_int_distribution::_S_nd):
            New member function implementing Lemire's "nearly divisionless"
            algorithm.
            (uniform_int_distribution::operator()): Use _S_nd when the range
            of the URBG is the full width of the result type.

diff --git a/libstdc++-v3/include/bits/uniform_int_dist.h b/libstdc++-v3/include/bits/uniform_int_dist.h
index 6e1e3d5fc5f..18c88ca506d 100644
--- a/libstdc++-v3/include/bits/uniform_int_dist.h
+++ b/libstdc++-v3/include/bits/uniform_int_dist.h
@@ -234,6 +234,28 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 			const param_type& __p);
 
       param_type _M_param;
+
+      // Lemire's nearly divisionless algorithm
+      template<typename _Wp, typename _Urbg, typename _Up>
+	static _Up
+	_S_nd(_Urbg& __g, _Up __range)
+	{
+	  // reference: Fast Random Integer Generation in an Interval
+	  // ACM Transactions on Modeling and Computer Simulation 29 (1), 2019
+	  // https://arxiv.org/abs/1805.10941
+	  _Wp __product = _Wp(__g()) * _Wp(__range);
+	  _Up __lsb = _Up(__product);
+	  if (__lsb < __range)
+	    {
+	      _Up __threshold = -__range % __range;
+	      while (__lsb < __threshold)
+		{
+		  __product = _Wp(__g()) * _Wp(__range);
+		  __lsb = _Up(__product);
+		}
+	    }
+	  return __product >> __gnu_cxx::__int_traits<_Up>::__digits;
+	}
     };
 
   template<typename _IntType>
@@ -256,17 +278,30 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 	  = __uctype(__param.b()) - __uctype(__param.a());
 
 	__uctype __ret;
-
 	if (__urngrange > __urange)
 	  {
-	    // downscaling
 	    const __uctype __uerange = __urange + 1; // __urange can be zero
-	    const __uctype __scaling = __urngrange / __uerange;
-	    const __uctype __past = __uerange * __scaling;
-	    do
-	      __ret = __uctype(__urng()) - __urngmin;
-	    while (__ret >= __past);
-	    __ret /= __scaling;
+
+	    using __gnu_cxx::__int_traits;
+#if __SIZEOF_INT128__
+	    if (__int_traits<__uctype>::__digits == 64
+		&& __urngrange == __int_traits<__uctype>::__max)
+	      __ret = _S_nd<unsigned __int128>(__urng, __uerange);
+	    else
+#endif
+	    if (__int_traits<__uctype>::__digits == 32
+		&& __urngrange == __int_traits<__uctype>::__max)
+	      __ret = _S_nd<__UINT64_TYPE__>(__urng, __uerange);
+	    else
+	      {
+		// fallback case (2 divisions)
+		const __uctype __scaling = __urngrange / __uerange;
+		const __uctype __past = __uerange * __scaling;
+		do
+		  __ret = __uctype(__urng()) - __urngmin;
+		while (__ret >= __past);
+		__ret /= __scaling;
+	      }
 	  }
 	else if (__urngrange < __urange)
 	  {

next prev parent reply	other threads:[~2020-10-05 23:38 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-08 16:10 Daniel Lemire
2019-09-27 18:19 ` Daniel Lemire
2020-10-05 23:25   ` Jonathan Wakely
2020-10-05 23:38     ` Jonathan Wakely [this message]
2020-10-05 23:40     ` Jonathan Wakely
2020-10-06 19:09       ` Daniel Lemire
2020-10-06 19:55       ` Daniel Lemire
2020-10-06 20:04         ` Jonathan Wakely
2020-10-09 13:17         ` Jonathan Wakely

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20201005233832.GE7004@redhat.com \
    --to=jwakely@redhat.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=lemire@gmail.com \
    --cc=libstdc++@gcc.gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).