From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <jwakely@redhat.com>
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [63.128.21.124])
 by sourceware.org (Postfix) with ESMTP id DB96D3857C4E
 for <libstdc++@gcc.gnu.org>; Mon,  5 Oct 2020 23:40:37 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org DB96D3857C4E
Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com
 [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id
 us-mta-308-OekykyMxMkWGk0V8_J-WzA-1; Mon, 05 Oct 2020 19:40:33 -0400
X-MC-Unique: OekykyMxMkWGk0V8_J-WzA-1
Received: from smtp.corp.redhat.com (int-mx03.intmail.prod.int.phx2.redhat.com
 [10.5.11.13])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mimecast-mx01.redhat.com (Postfix) with ESMTPS id A2325101FFA5;
 Mon,  5 Oct 2020 23:40:32 +0000 (UTC)
Received: from localhost (unknown [10.33.37.1])
 by smtp.corp.redhat.com (Postfix) with ESMTP id 520677AEC4;
 Mon,  5 Oct 2020 23:40:32 +0000 (UTC)
Date: Tue, 6 Oct 2020 00:40:31 +0100
From: Jonathan Wakely <jwakely@redhat.com>
To: Daniel Lemire <lemire@gmail.com>
Cc: libstdc++@gcc.gnu.org, gcc-patches@gcc.gnu.org
Subject: Re: [PATCH, libstdc++] Improve the performance of
 std::uniform_int_distribution (fewer divisions)
Message-ID: <20201005234031.GF7004@redhat.com>
References: <CAJ0XVj1YOGT-pnzwnge7Wr8rC-0DxaONw20YWasgHhDgL02ATw@mail.gmail.com>
 <CAJ0XVj2xcKmDppxQa6Lht758TWQpykJ1XX2+b4Z9fN2TFcZKzA@mail.gmail.com>
 <20201005232515.GD7004@redhat.com>
MIME-Version: 1.0
In-Reply-To: <20201005232515.GD7004@redhat.com>
X-Clacks-Overhead: GNU Terry Pratchett
X-Scanned-By: MIMEDefang 2.79 on 10.5.11.13
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Disposition: inline
X-Spam-Status: No, score=-8.7 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH,
 DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, RCVD_IN_DNSWL_NONE,
 RCVD_IN_MSPIKE_H5, RCVD_IN_MSPIKE_WL, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=unavailable autolearn_force=no version=3.4.2
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on
 server2.sourceware.org
X-BeenThere: libstdc++@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libstdc++ mailing list <libstdc++.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/libstdc++>,
 <mailto:libstdc++-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/libstdc++/>
List-Post: <mailto:libstdc++@gcc.gnu.org>
List-Help: <mailto:libstdc++-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/libstdc++>,
 <mailto:libstdc++-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Mon, 05 Oct 2020 23:40:39 -0000

On 06/10/20 00:25 +0100, Jonathan Wakely wrote:
>I'm sorry it's taken a year to review this properly. Comments below ...
>
>On 27/09/19 14:18 -0400, Daniel Lemire wrote:
>>(This is a revised patch proposal. I am revising both the description
>>and the code itself.)
>>
>>Even on recent processors, integer division is relatively expensive.
>>The current implementation of std::uniform_int_distribution typically
>>requires two divisions per invocation:
>>
>>       // downscaling
>>        const __uctype __uerange = __urange + 1; // __urange can be zero
>>        const __uctype __scaling = __urngrange / __uerange;
>>        const __uctype __past = __uerange * __scaling;
>>        do
>>          __ret = __uctype(__urng()) - __urngmin;
>>        while (__ret >= __past);
>>        __ret /= __scaling;
>>
>>We can achieve the same algorithmic result with at most one division,
>>and typically no division at all, without requiring more calls to the
>>random number generator.
>>This was recently added to Swift (https://github.com/apple/swift/pull/25286).
>>
>>The main challenge is that we need to be able to compute the "full"
>>product. E.g., given two 64-bit integers, we want the 128-bit result;
>>given two 32-bit integers we want the 64-bit result. This is fast on
>>common processors.
>>The 128-bit product is not natively supported in C/C++ but can be
>>achieved with the __int128 extension when it is available. The patch
>>checks for __int128 support; when support is lacking, we fall back on
>>the existing approach, which uses two divisions per call.
>>
>>For example, if we replace the above code by the following, we get a substantial
>>performance boost on Skylake microarchitectures. E.g., it can
>>be twice as fast to shuffle arrays of 1 million elements (e.g., using
>>the following benchmark: https://github.com/lemire/simple_cpp_shuffle_benchmark )
>>
>>
>>     unsigned __int128 __product = (unsigned __int128)(__uctype(__urng()) - __urngmin) * uint64_t(__uerange);
>>     uint64_t __lsb = uint64_t(__product);
>>     if(__lsb < __uerange)
>>     {
>>       uint64_t __threshold = -uint64_t(__uerange) % uint64_t(__uerange);
>>       while (__lsb < __threshold)
>>       {
>>         __product = (unsigned __int128)(__uctype(__urng()) - __urngmin) * (unsigned __int128)(__uerange);
>>         __lsb = uint64_t(__product);
>>       }
>>     }
>>     __ret = __product >> 64;
>>
>>Included is a patch that would bring better performance (e.g., 2x gain) to
>>std::uniform_int_distribution  in some cases. Here are some actual numbers...
>>
>>With this patch:
>>
>>std::shuffle(testvalues, testvalues + size, g)              :  7952091 ns total,  7.95 ns per input key
>>
>>Before this patch:
>>
>>std::shuffle(testvalues, testvalues + size, g)              : 14954058 ns total,  14.95 ns per input key
>>
>>
>>Compiler: GNU GCC 8.3 with -O3, hardware: Skylake (i7-6700).
>>
>>Furthermore, the new algorithm is unbiased, so the randomness of the
>>result is not affected.
>>
>>I ran both performance and bias tests with the proposed patch.
>>
>>
>>This patch proposal was improved following feedback by Jonathan
>>Wakely. An earlier version used the __uint128_t type, which is widely
>>supported but not used in the C++ library; we now use unsigned
>>__int128 instead. Furthermore, the previous patch was accidentally
>>broken: it was not computing the full product because a cast on the
>>right-hand side was missing. These issues are fixed and verified.
>
>After looking at GCC's internals, it looks like __uint128_t is
>actually fine to use, even though we never currently use it in the
>library. I didn't even know it was supported for C++ mode, sorry!
>
>>Reference: Fast Random Integer Generation in an Interval, ACM Transactions on
>>Modeling and Computer Simulation 29 (1), 2019 https://arxiv.org/abs/1805.10941
>
>>Index: libstdc++-v3/include/bits/uniform_int_dist.h
>>===================================================================
>>--- libstdc++-v3/include/bits/uniform_int_dist.h	(revision 276183)
>>+++ libstdc++-v3/include/bits/uniform_int_dist.h	(working copy)
>>@@ -33,7 +33,8 @@
>>
>>#include <type_traits>
>>#include <limits>
>>-
>>+#include <cstdint>
>>+#include <cstdio>
>>namespace std _GLIBCXX_VISIBILITY(default)
>>{
>>_GLIBCXX_BEGIN_NAMESPACE_VERSION
>>@@ -239,18 +240,61 @@
>>	  = __uctype(__param.b()) - __uctype(__param.a());
>>
>>	__uctype __ret;
>>-
>>-	if (__urngrange > __urange)
>>+ 	if (__urngrange > __urange)
>>	  {
>>-	    // downscaling
>>-	    const __uctype __uerange = __urange + 1; // __urange can be zero
>>-	    const __uctype __scaling = __urngrange / __uerange;
>>-	    const __uctype __past = __uerange * __scaling;
>>-	    do
>>-	      __ret = __uctype(__urng()) - __urngmin;
>>-	    while (__ret >= __past);
>>-	    __ret /= __scaling;
>>-	  }
>>+		const __uctype __uerange = __urange + 1; // __urange can be zero
>>+#if _GLIBCXX_USE_INT128 == 1
>>+    if(sizeof(__uctype) == sizeof(uint64_t) and
>>+      (__urngrange == numeric_limits<uint64_t>::max()))
>>+    {
>>+      // 64-bit case
>>+      // reference: Fast Random Integer Generation in an Interval
>>+      // ACM Transactions on Modeling and Computer Simulation 29 (1), 2019
>>+      // https://arxiv.org/abs/1805.10941
>>+      unsigned __int128 __product = (unsigned __int128)(__uctype(__urng()) - __urngmin) * uint64_t(__uerange);
>
>Is subtracting  __urngmin necessary here?
>
>The condition above checks that __urngrange == 2**64-1 which means
>that U::max() - U::min() is the maximum 64-bit value, which
>means U::max()==2**64-1 and U::min()==0. So if U::min() is 0 we don't
>need to subtract it.
>
>Also, I think the casts to uint64_t are unnecessary. We know that
>__uctype is an unsigned integral type, and we've checked that it has
>exactly 64 bits, so I think we can just use __uctype. It's got the
>same width and signedness as uint64_t anyway.
>
>That said, the uint64_t(__uerange) above isn't redundant, because it
>should be (unsigned __int128)__uerange, I think.

Ah yes, you pointed out that last bit in your Sept 28 2019 email.
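
To make the review comments above concrete, here is a minimal standalone
sketch of the 64-bit case with those points applied: no subtraction of the
generator's minimum (it is asserted to be 0), no uint64_t casts on the
range, and one operand widened to unsigned __int128 so that the full
128-bit product is computed. This is not the patch itself. The name
bounded_rand and its signature are hypothetical, chosen only for
illustration, and the sketch assumes a compiler that provides the
unsigned __int128 extension (e.g. GCC or Clang on a 64-bit target).

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <random>

// Hypothetical helper (not libstdc++ code): returns an unbiased value in
// [0, range) from a generator whose output covers all 64-bit values.
template<typename Urbg>
std::uint64_t bounded_rand(Urbg& g, std::uint64_t range)
{
  // With min() == 0 and max() == 2**64-1 there is nothing to subtract
  // from the generator's output.
  static_assert(Urbg::min() == 0
                && Urbg::max() == std::numeric_limits<std::uint64_t>::max(),
                "sketch assumes a full-range 64-bit generator");
  assert(range != 0);

  // Widen one operand so the multiplication yields the full 128-bit product.
  unsigned __int128 product = (unsigned __int128)g() * range;
  std::uint64_t low = (std::uint64_t)product;
  if (low < range)
    {
      // -range is 2**64 - range in unsigned arithmetic, so this is
      // 2**64 % range. This is the only division, and it is rarely reached.
      const std::uint64_t threshold = -range % range;
      while (low < threshold)
        {
          product = (unsigned __int128)g() * range;
          low = (std::uint64_t)product;
        }
    }
  // The high 64 bits of the product are the result.
  return (std::uint64_t)(product >> 64);
}
```

For example, with std::mt19937_64 g(42), the call bounded_rand(g, 10)
returns a value between 0 and 9. In the actual patch the range would be
__uerange and the result would still need __param.a() added to it.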