From: Daniel Lemire <lemire@gmail.com>
Date: Tue, 6 Oct 2020 15:55:54 -0400
Subject: Re: [PATCH, libstdc++] Improve the performance of std::uniform_int_distribution (fewer divisions)
To: Jonathan Wakely
Cc: libstdc++@gcc.gnu.org, gcc-patches@gcc.gnu.org
In-Reply-To: <20201005234031.GF7004@redhat.com>

The updated patch looks good to me. It is indeed cleaner to have a
separate (static) function. It might be nice to add a comment explaining
the _S_nd function, something like "returns a random value in
[0, __range) without any bias". Otherwise, it is algorithmically correct.
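For illustration, here is a standalone sketch of the sort of helper and
comment under discussion. It is not the patch's _S_nd (which is a static
member, generic over the generator): the name bounded_rand, the
free-function form, and the fixed std::mt19937_64 generator are all
assumptions made for the example, and it requires a nonzero range and
GCC/Clang __int128 support.

    #include <cstdint>
    #include <random>

    // Returns a random value in [0, range) without any bias.
    // Requires range != 0 and a generator spanning the full 64-bit
    // range, e.g. std::mt19937_64.
    inline std::uint64_t
    bounded_rand(std::mt19937_64& rng, std::uint64_t range)
    {
      // Full 64x64 -> 128-bit product; its high 64 bits fall in
      // [0, range), but a small set of low-half values must be
      // rejected to remove the bias.
      unsigned __int128 product = (unsigned __int128)rng() * range;
      std::uint64_t low = (std::uint64_t)product;
      if (low < range)
        {
          // 2^64 mod range, computed without a 128-bit division:
          // in 64-bit unsigned arithmetic, -range is 2^64 - range.
          const std::uint64_t threshold = (-range) % range;
          while (low < threshold)
            {
              product = (unsigned __int128)rng() * range;
              low = (std::uint64_t)product;
            }
        }
      return (std::uint64_t)(product >> 64); // unbiased result
    }

For instance, bounded_rand(g, 6) yields an unbiased value in {0,...,5};
the rejection loop runs rarely, since fewer than range out of every
2^64 draws are discarded.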
On Mon, Oct 5, 2020 at 7:40 PM Jonathan Wakely wrote:

> On 06/10/20 00:25 +0100, Jonathan Wakely wrote:
> >I'm sorry it's taken a year to review this properly. Comments below ...
> >
> >On 27/09/19 14:18 -0400, Daniel Lemire wrote:
> >>(This is a revised patch proposal. I am revising both the description
> >>and the code itself.)
> >>
> >>Even on recent processors, integer division is relatively expensive.
> >>The current implementation of std::uniform_int_distribution typically
> >>requires two divisions per invocation:
> >>
> >>    // downscaling
> >>    const __uctype __uerange = __urange + 1; // __urange can be zero
> >>    const __uctype __scaling = __urngrange / __uerange;
> >>    const __uctype __past = __uerange * __scaling;
> >>    do
> >>      __ret = __uctype(__urng()) - __urngmin;
> >>    while (__ret >= __past);
> >>    __ret /= __scaling;
> >>
> >>We can achieve the same algorithmic result with at most one division,
> >>and typically no division at all, without requiring more calls to the
> >>random number generator. This was recently added to Swift
> >>(https://github.com/apple/swift/pull/25286).
> >>
> >>The main challenge is that we need to be able to compute the "full"
> >>product. E.g., given two 64-bit integers, we want the 128-bit result;
> >>given two 32-bit integers we want the 64-bit result. This is fast on
> >>common processors. The 128-bit product is not natively supported in
> >>C/C++ but can be achieved with the __int128 extension when it is
> >>available. The patch checks for __int128 support; when support is
> >>lacking, we fall back on the existing approach, which uses two
> >>divisions per call.
> >>
> >>For example, if we replace the above code by the following, we get a
> >>substantial performance boost on Skylake microarchitectures. E.g., it
> >>can be twice as fast to shuffle arrays of 1 million elements (e.g.,
> >>using the following benchmark:
> >>https://github.com/lemire/simple_cpp_shuffle_benchmark ).
> >>
> >>    unsigned __int128 __product = (unsigned __int128)(__uctype(__urng())
> >>        - __urngmin) * uint64_t(__uerange);
> >>    uint64_t __lsb = uint64_t(__product);
> >>    if (__lsb < __uerange)
> >>      {
> >>        uint64_t __threshold = -uint64_t(__uerange) % uint64_t(__uerange);
> >>        while (__lsb < __threshold)
> >>          {
> >>            __product = (unsigned __int128)(__uctype(__urng())
> >>                - __urngmin) * (unsigned __int128)(__uerange);
> >>            __lsb = uint64_t(__product);
> >>          }
> >>      }
> >>    __ret = __product >> 64;
> >>
> >>Included is a patch that would bring better performance (e.g., 2x gain)
> >>to std::uniform_int_distribution in some cases. Here are some actual
> >>numbers...
> >>
> >>With this patch:
> >>
> >>std::shuffle(testvalues, testvalues + size, g) : 7952091 ns total,
> >>7.95 ns per input key
> >>
> >>Before this patch:
> >>
> >>std::shuffle(testvalues, testvalues + size, g) : 14954058 ns total,
> >>14.95 ns per input key
> >>
> >>Compiler: GNU GCC 8.3 with -O3, hardware: Skylake (i7-6700).
> >>
> >>Furthermore, the new algorithm is unbiased, so the randomness of the
> >>result is not affected.
> >>
> >>I ran both performance and bias tests with the proposed patch.
> >>
> >>This patch proposal was improved following feedback by Jonathan
> >>Wakely. An earlier version used the __uint128_t type, which is widely
> >>supported but not used in the C++ library; instead we now use unsigned
> >>__int128. Furthermore, the previous patch was accidentally broken: it
> >>was not computing the full product, since a rhs cast was missing.
> >>These issues are fixed and verified.
> >
> >After looking at GCC's internals, it looks like __uint128_t is
> >actually fine to use, even though we never currently use it in the
> >library. I didn't even know it was supported for C++ mode, sorry!
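An aside that may help readers here: on targets where GCC provides the
extension, the two spellings name the same type, and getting the full
64x64 -> 128-bit product requires widening an operand before the
multiply, the kind of missing-cast mistake the revision notes above
mention. The mul_high helper below is an illustration only, not library
code:

    #include <cstdint>
    #include <type_traits>

    // Both spellings denote one and the same 128-bit type.
    static_assert(std::is_same<__uint128_t, unsigned __int128>::value,
                  "__uint128_t is unsigned __int128");

    inline std::uint64_t
    mul_high(std::uint64_t a, std::uint64_t b)
    {
      // Widen an operand first, then multiply. Casting the result
      // instead -- (unsigned __int128)(a * b) -- would multiply in
      // 64 bits and silently discard the high half.
      unsigned __int128 product = (unsigned __int128)a * b;
      return (std::uint64_t)(product >> 64); // low half is (uint64_t)product
    }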
> >
> >>Reference: Fast Random Integer Generation in an Interval,
> >>ACM Transactions on Modeling and Computer Simulation 29 (1), 2019
> >>https://arxiv.org/abs/1805.10941
> >
> >>Index: libstdc++-v3/include/bits/uniform_int_dist.h
> >>===================================================================
> >>--- libstdc++-v3/include/bits/uniform_int_dist.h	(revision 276183)
> >>+++ libstdc++-v3/include/bits/uniform_int_dist.h	(working copy)
> >>@@ -33,7 +33,8 @@
> >>
> >> #include
> >> #include
> >>-
> >>+#include
> >>+#include
> >> namespace std _GLIBCXX_VISIBILITY(default)
> >> {
> >> _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >>@@ -239,18 +240,61 @@
> >>	  = __uctype(__param.b()) - __uctype(__param.a());
> >>
> >>	__uctype __ret;
> >>-
> >>-	if (__urngrange > __urange)
> >>+	if (__urngrange > __urange)
> >>	  {
> >>-	    // downscaling
> >>-	    const __uctype __uerange = __urange + 1; // __urange can be zero
> >>-	    const __uctype __scaling = __urngrange / __uerange;
> >>-	    const __uctype __past = __uerange * __scaling;
> >>-	    do
> >>-	      __ret = __uctype(__urng()) - __urngmin;
> >>-	    while (__ret >= __past);
> >>-	    __ret /= __scaling;
> >>-	  }
> >>+	    const __uctype __uerange = __urange + 1; // __urange can be zero
> >>+#if _GLIBCXX_USE_INT128 == 1
> >>+	    if (sizeof(__uctype) == sizeof(uint64_t) and
> >>+	        (__urngrange == numeric_limits<uint64_t>::max()))
> >>+	      {
> >>+	        // 64-bit case
> >>+	        // reference: Fast Random Integer Generation in an Interval
> >>+	        // ACM Transactions on Modeling and Computer Simulation
> >>+	        // 29 (1), 2019, https://arxiv.org/abs/1805.10941
> >>+	        unsigned __int128 __product = (unsigned __int128)(__uctype(__urng()) - __urngmin) * uint64_t(__uerange);
> >
> >Is subtracting __urngmin necessary here?
> >
> >The condition above checks that __urngrange == 2**64-1, which means
> >that U::max() - U::min() is the maximum 64-bit value, which means
> >U::max() == 2**64-1 and U::min() == 0. So if U::min() is 0 we don't
> >need to subtract it.
> >
> >Also, I think the casts to uint64_t are unnecessary. We know that
> >__uctype is an unsigned integral type, and we've checked that it has
> >exactly 64 bits, so I think we can just use __uctype. It's got the
> >same width and signedness as uint64_t anyway.
> >
> >That said, the uint64_t(__uerange) above isn't redundant, because it
> >should be (unsigned __int128)__uerange, I think.
>
> Ah yes, you pointed out that last bit in your Sept 28 2019 email.
>
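A closing aside for readers checking the rejection step: the threshold
expression -__uerange % __uerange relies on the identity
(2^64 - r) mod r == 2^64 mod r. The throwaway program below verifies it
on a few values, assuming GCC/Clang __int128 support; it is not part of
the patch:

    #include <cassert>
    #include <cstdint>

    int main()
    {
      // In 64-bit unsigned arithmetic, -range is 2^64 - range, so
      // (-range) % range equals 2^64 mod range: the count of "biased"
      // low-half values the algorithm must reject. Check it against a
      // direct 128-bit computation of 2^64 mod range.
      const std::uint64_t samples[] = { 1, 2, 3, 1000003,
                                        std::uint64_t(1) << 33,
                                        UINT64_MAX / 2, UINT64_MAX };
      const unsigned __int128 two64 = (unsigned __int128)1 << 64;
      for (std::uint64_t range : samples)
        assert((-range) % range == (std::uint64_t)(two64 % range));
      return 0;
    }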