Re: [PATCH, libstdc++] Improve the performance of std::uniform_int_distribution (fewer divisions)

public inbox for libstdc++@gcc.gnu.org
 help / color / mirror / Atom feed

From: Daniel Lemire <lemire@gmail.com>
To: libstdc++@gcc.gnu.org, gcc-patches@gcc.gnu.org,
		Jonathan Wakely <jwakely@redhat.com>
Subject: Re: [PATCH, libstdc++] Improve the performance of std::uniform_int_distribution (fewer divisions)
Date: Fri, 27 Sep 2019 18:19:00 -0000	[thread overview]
Message-ID: <CAJ0XVj2xcKmDppxQa6Lht758TWQpykJ1XX2+b4Z9fN2TFcZKzA@mail.gmail.com> (raw)
In-Reply-To: <CAJ0XVj1YOGT-pnzwnge7Wr8rC-0DxaONw20YWasgHhDgL02ATw@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 3313 bytes --]

(This is a revised patch proposal. I am revising both the description
and the code itself.)

Even on recent processors, integer division is relatively expensive.
The current implementation of  std::uniform_int_distribution typically
requires two divisions by invocation:

        // downscaling
         const __uctype __uerange = __urange + 1; // __urange can be zero
         const __uctype __scaling = __urngrange / __uerange;
         const __uctype __past = __uerange * __scaling;
         do
           __ret = __uctype(__urng()) - __urngmin;
         while (__ret >= __past);
         __ret /= __scaling;

We can achieve the same algorithmic result with at most one division,
and typically no division at all without requiring more calls to the
random number generator.
 This was recently added to Swift (https://github.com/apple/swift/pull/25286)

The main challenge is that we need to be able to compute the "full"
product. E.g., given two 64-bit integers, we want the 128-bit result;
given two 32-bit integers we want the 64-bit result. This is fast on
common processors.
The 128-bit product is not natively supported in C/C++ but can be
achieved with the
 __int128 extension when it is available. The patch checks for
__int128 support; when
support is lacking, we fallback on the existing approach which uses
two divisions per
call.

For example, if we replace the above code by the following, we get a substantial
performance boost on skylake microarchitectures. E.g., it can
be twice as fast to shuffle arrays of 1 million elements (e.g., using
the followingbenchmark: https://github.com/lemire/simple_cpp_shuffle_benchmark )


      unsigned __int128 __product = (unsigned
__int128)(__uctype(__urng()) - __urngmin) * uint64_t(__uerange);
      uint64_t __lsb = uint64_t(__product);
      if(__lsb < __uerange)
      {
        uint64_t __threshold = -uint64_t(__uerange) % uint64_t(__uerange);
        while (__lsb < __threshold)
        {
          __product = (unsigned __int128)(__uctype(__urng()) -
__urngmin) * (unsigned __int128)(__uerange);
          __lsb = uint64_t(__product);
        }
      }
      __ret = __product >> 64;

Included is a patch that would bring better performance (e.g., 2x gain) to
 std::uniform_int_distribution  in some cases. Here are some actual numbers...

With this patch:

std::shuffle(testvalues, testvalues + size, g)              :  7952091
ns total,  7.95 ns per input key

Before this patch:

std::shuffle(testvalues, testvalues + size, g)              :
14954058 ns total,  14.95 ns per input key


Compiler: GNU GCC 8.3 with -O3, hardware: Skylake (i7-6700).

Furthermore, the new algorithm is unbiased, so the randomness of the
result is not affected.

I ran both performance and biases tests with the proposed patch.


This patch proposal was improved following feedback by Jonathan
Wakely. An earlier version used the __uint128_t type, which is widely
supported but not used in the C++ library, instead we now use unsigned
__int128. Furthermore, the previous patch was accidentally broken: it
was not computing the full product since a rhs cast was missing. These
issues are fixed and verified.

Reference: Fast Random Integer Generation in an Interval, ACM Transactions on
Modeling and Computer Simulation 29 (1), 2019 https://arxiv.org/abs/1805.10941

[-- Attachment #2: patch_uniform_int_dist.txt --]
[-- Type: text/plain, Size: 3111 bytes --]

Index: libstdc++-v3/include/bits/uniform_int_dist.h
===================================================================
--- libstdc++-v3/include/bits/uniform_int_dist.h	(revision 276183)
+++ libstdc++-v3/include/bits/uniform_int_dist.h	(working copy)
@@ -33,7 +33,8 @@

 #include <type_traits>
 #include <limits>
-
+#include <cstdint>
+#include <cstdio>
 namespace std _GLIBCXX_VISIBILITY(default)
 {
 _GLIBCXX_BEGIN_NAMESPACE_VERSION
@@ -239,18 +240,61 @@
 	  = __uctype(__param.b()) - __uctype(__param.a());

 	__uctype __ret;
-
-	if (__urngrange > __urange)
+ 	if (__urngrange > __urange)
 	  {
-	    // downscaling
-	    const __uctype __uerange = __urange + 1; // __urange can be zero
-	    const __uctype __scaling = __urngrange / __uerange;
-	    const __uctype __past = __uerange * __scaling;
-	    do
-	      __ret = __uctype(__urng()) - __urngmin;
-	    while (__ret >= __past);
-	    __ret /= __scaling;
-	  }
+		const __uctype __uerange = __urange + 1; // __urange can be zero
+#if _GLIBCXX_USE_INT128 == 1
+    if(sizeof(__uctype) == sizeof(uint64_t) and
+      (__urngrange == numeric_limits<uint64_t>::max()))
+    {
+      // 64-bit case
+      // reference: Fast Random Integer Generation in an Interval
+      // ACM Transactions on Modeling and Computer Simulation 29 (1), 2019
+      // https://arxiv.org/abs/1805.10941
+      unsigned __int128 __product = (unsigned __int128)(__uctype(__urng()) - __urngmin) * uint64_t(__uerange);
+      uint64_t __lsb = uint64_t(__product);
+      if(__lsb < __uerange)
+      {
+        uint64_t __threshold = -uint64_t(__uerange) % uint64_t(__uerange);
+        while (__lsb < __threshold)
+        {
+          __product = (unsigned __int128)(__uctype(__urng()) - __urngmin) * (unsigned __int128)(__uerange);
+          __lsb = uint64_t(__product);
+        }
+      }
+      __ret = __product >> 64;
+    }
+    else
+#endif // _GLIBCXX_USE_INT128 == 1
+    if(sizeof(__uctype) == sizeof(uint32_t)
+      and (__urngrange == numeric_limits<uint32_t>::max()) )
+    {
+      // 32-bit case
+      // reference: Fast Random Integer Generation in an Interval
+      // ACM Transactions on Modeling and Computer Simulation 29 (1), 2019
+      // https://arxiv.org/abs/1805.10941
+      uint64_t __product = uint64_t(__uctype(__urng()) - __urngmin) * uint64_t(__uerange);
+      uint32_t __lsb = uint32_t(__product);
+      if(__lsb < __uerange) {
+        uint64_t __threshold = -uint32_t(__uerange) % uint32_t(__uerange);
+        while (__lsb < __threshold) {
+          __product = uint64_t(__uctype(__urng()) - __urngmin) * uint64_t(__uerange);
+          __lsb = uint32_t(__product);
+        }
+      }
+      __ret = __product >> 32;
+    }
+    else
+    {
+      // fallback case (2 divisions)
+      const __uctype __scaling = __urngrange / __uerange;
+      const __uctype __past = __uerange * __scaling;
+      do
+        __ret = __uctype(__urng()) - __urngmin;
+      while (__ret >= __past);
+      __ret /= __scaling;
+    }
+  }
 	else if (__urngrange < __urange)
 	  {
 	    // upscaling
@@ -377,3 +421,4 @@
 } // namespace std

 #endif

next prev parent reply	other threads:[~2019-09-27 18:19 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-08 16:10 Daniel Lemire
2019-09-27 18:19 ` Daniel Lemire [this message]
2020-10-05 23:25   ` Jonathan Wakely
2020-10-05 23:38     ` Jonathan Wakely
2020-10-05 23:40     ` Jonathan Wakely
2020-10-06 19:09       ` Daniel Lemire
2020-10-06 19:55       ` Daniel Lemire
2020-10-06 20:04         ` Jonathan Wakely
2020-10-09 13:17         ` Jonathan Wakely

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAJ0XVj2xcKmDppxQa6Lht758TWQpykJ1XX2+b4Z9fN2TFcZKzA@mail.gmail.com \
    --to=lemire@gmail.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=jwakely@redhat.com \
    --cc=libstdc++@gcc.gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).