From: Daniel Lemire <lemire@gmail.com>
Date: Tue, 6 Oct 2020 15:55:54 -0400
Subject: Re: [PATCH, libstdc++] Improve the performance of std::uniform_int_distribution (fewer divisions)
To: Jonathan Wakely
Cc: libstdc++@gcc.gnu.org, gcc-patches@gcc.gnu.org
In-Reply-To: <20201005234031.GF7004@redhat.com>

The updated patch looks good to me. It is indeed cleaner to have a
separate (static) function. It might be nice to add a comment explaining
the _S_nd function, something like "returns a random value in
[0, __range) without any bias". Otherwise, it is algorithmically correct.
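For illustration, here is a standalone sketch of the sort of helper and
comment under discussion. It is not the patch's _S_nd (which is a static
member, generic over the generator): the name bounded_rand, the
free-function form, and the fixed std::mt19937_64 generator are all
assumptions made for the example, and it requires a nonzero range and
GCC/Clang __int128 support.

    #include <cstdint>
    #include <random>

    // Returns a random value in [0, range) without any bias.
    // Requires range != 0 and a generator spanning the full 64-bit
    // range, e.g. std::mt19937_64.
    inline std::uint64_t
    bounded_rand(std::mt19937_64& rng, std::uint64_t range)
    {
      // Full 64x64 -> 128-bit product; its high 64 bits fall in
      // [0, range), but a small set of low-half values must be
      // rejected to remove the bias.
      unsigned __int128 product = (unsigned __int128)rng() * range;
      std::uint64_t low = (std::uint64_t)product;
      if (low < range)
        {
          // 2^64 mod range, computed without a 128-bit division:
          // in 64-bit unsigned arithmetic, -range is 2^64 - range.
          const std::uint64_t threshold = (-range) % range;
          while (low < threshold)
            {
              product = (unsigned __int128)rng() * range;
              low = (std::uint64_t)product;
            }
        }
      return (std::uint64_t)(product >> 64); // unbiased result
    }

For instance, bounded_rand(g, 6) yields an unbiased value in {0,...,5};
the rejection loop runs rarely, since fewer than range out of every
2^64 draws are discarded.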
On Mon, Oct 5, 2020 at 7:40 PM Jonathan Wakely wrote:

> On 06/10/20 00:25 +0100, Jonathan Wakely wrote:
> >I'm sorry it's taken a year to review this properly. Comments below ...
> >
> >On 27/09/19 14:18 -0400, Daniel Lemire wrote:
> >>(This is a revised patch proposal. I am revising both the description
> >>and the code itself.)
> >>
> >>Even on recent processors, integer division is relatively expensive.
> >>The current implementation of std::uniform_int_distribution typically
> >>requires two divisions per invocation:
> >>
> >>    // downscaling
> >>    const __uctype __uerange = __urange + 1; // __urange can be zero
> >>    const __uctype __scaling = __urngrange / __uerange;
> >>    const __uctype __past = __uerange * __scaling;
> >>    do
> >>      __ret = __uctype(__urng()) - __urngmin;
> >>    while (__ret >= __past);
> >>    __ret /= __scaling;
> >>
> >>We can achieve the same algorithmic result with at most one division,
> >>and typically no division at all, without requiring more calls to the
> >>random number generator. This was recently added to Swift
> >>(https://github.com/apple/swift/pull/25286).
> >>
> >>The main challenge is that we need to be able to compute the "full"
> >>product. E.g., given two 64-bit integers, we want the 128-bit result;
> >>given two 32-bit integers we want the 64-bit result. This is fast on
> >>common processors. The 128-bit product is not natively supported in
> >>C/C++ but can be achieved with the __int128 extension when it is
> >>available. The patch checks for __int128 support; when support is
> >>lacking, we fall back on the existing approach, which uses two
> >>divisions per call.
> >>
> >>For example, if we replace the above code by the following, we get a
> >>substantial performance boost on Skylake microarchitectures. E.g., it
> >>can be twice as fast to shuffle arrays of 1 million elements (e.g.,
> >>using the following benchmark:
> >>https://github.com/lemire/simple_cpp_shuffle_benchmark ).
> >>
> >>    unsigned __int128 __product = (unsigned __int128)(__uctype(__urng())
> >>        - __urngmin) * uint64_t(__uerange);
> >>    uint64_t __lsb = uint64_t(__product);
> >>    if (__lsb < __uerange)
> >>      {
> >>        uint64_t __threshold = -uint64_t(__uerange) % uint64_t(__uerange);
> >>        while (__lsb < __threshold)
> >>          {
> >>            __product = (unsigned __int128)(__uctype(__urng())
> >>                - __urngmin) * (unsigned __int128)(__uerange);
> >>            __lsb = uint64_t(__product);
> >>          }
> >>      }
> >>    __ret = __product >> 64;
> >>
> >>Included is a patch that would bring better performance (e.g., 2x gain)
> >>to std::uniform_int_distribution in some cases. Here are some actual
> >>numbers...
> >>
> >>With this patch:
> >>
> >>std::shuffle(testvalues, testvalues + size, g) : 7952091 ns total,
> >>7.95 ns per input key
> >>
> >>Before this patch:
> >>
> >>std::shuffle(testvalues, testvalues + size, g) : 14954058 ns total,
> >>14.95 ns per input key
> >>
> >>Compiler: GNU GCC 8.3 with -O3, hardware: Skylake (i7-6700).
> >>
> >>Furthermore, the new algorithm is unbiased, so the randomness of the
> >>result is not affected.
> >>
> >>I ran both performance and bias tests with the proposed patch.
> >>
> >>This patch proposal was improved following feedback by Jonathan
> >>Wakely. An earlier version used the __uint128_t type, which is widely
> >>supported but not used in the C++ library; instead we now use unsigned
> >>__int128. Furthermore, the previous patch was accidentally broken: it
> >>was not computing the full product, since a rhs cast was missing.
> >>These issues are fixed and verified.
> >
> >After looking at GCC's internals, it looks like __uint128_t is
> >actually fine to use, even though we never currently use it in the
> >library. I didn't even know it was supported for C++ mode, sorry!
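An aside that may help readers here: on targets where GCC provides the
extension, the two spellings name the same type, and getting the full
64x64 -> 128-bit product requires widening an operand before the
multiply, the kind of missing-cast mistake the revision notes above
mention. The mul_high helper below is an illustration only, not library
code:

    #include <cstdint>
    #include <type_traits>

    // Both spellings denote one and the same 128-bit type.
    static_assert(std::is_same<__uint128_t, unsigned __int128>::value,
                  "__uint128_t is unsigned __int128");

    inline std::uint64_t
    mul_high(std::uint64_t a, std::uint64_t b)
    {
      // Widen an operand first, then multiply. Casting the result
      // instead -- (unsigned __int128)(a * b) -- would multiply in
      // 64 bits and silently discard the high half.
      unsigned __int128 product = (unsigned __int128)a * b;
      return (std::uint64_t)(product >> 64); // low half is (uint64_t)product
    }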
> >
> >>Reference: Fast Random Integer Generation in an Interval,
> >>ACM Transactions on Modeling and Computer Simulation 29 (1), 2019
> >>https://arxiv.org/abs/1805.10941
> >
> >>Index: libstdc++-v3/include/bits/uniform_int_dist.h
> >>===================================================================
> >>--- libstdc++-v3/include/bits/uniform_int_dist.h	(revision 276183)
> >>+++ libstdc++-v3/include/bits/uniform_int_dist.h	(working copy)
> >>@@ -33,7 +33,8 @@
> >>
> >> #include
> >> #include
> >>-
> >>+#include
> >>+#include
> >> namespace std _GLIBCXX_VISIBILITY(default)
> >> {
> >> _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >>@@ -239,18 +240,61 @@
> >>	  = __uctype(__param.b()) - __uctype(__param.a());
> >>
> >>	__uctype __ret;
> >>-
> >>-	if (__urngrange > __urange)
> >>+	if (__urngrange > __urange)
> >>	  {
> >>-	    // downscaling
> >>-	    const __uctype __uerange = __urange + 1; // __urange can be zero
> >>-	    const __uctype __scaling = __urngrange / __uerange;
> >>-	    const __uctype __past = __uerange * __scaling;
> >>-	    do
> >>-	      __ret = __uctype(__urng()) - __urngmin;
> >>-	    while (__ret >= __past);
> >>-	    __ret /= __scaling;
> >>-	  }
> >>+	    const __uctype __uerange = __urange + 1; // __urange can be zero
> >>+#if _GLIBCXX_USE_INT128 == 1
> >>+	    if (sizeof(__uctype) == sizeof(uint64_t) and
> >>+	        (__urngrange == numeric_limits<uint64_t>::max()))
> >>+	      {
> >>+	        // 64-bit case
> >>+	        // reference: Fast Random Integer Generation in an Interval
> >>+	        // ACM Transactions on Modeling and Computer Simulation
> >>+	        // 29 (1), 2019, https://arxiv.org/abs/1805.10941
> >>+	        unsigned __int128 __product = (unsigned __int128)(__uctype(__urng()) - __urngmin) * uint64_t(__uerange);
> >
> >Is subtracting __urngmin necessary here?
> >
> >The condition above checks that __urngrange == 2**64-1, which means
> >that U::max() - U::min() is the maximum 64-bit value, which means
> >U::max() == 2**64-1 and U::min() == 0. So if U::min() is 0 we don't
> >need to subtract it.
> >
> >Also, I think the casts to uint64_t are unnecessary. We know that
> >__uctype is an unsigned integral type, and we've checked that it has
> >exactly 64 bits, so I think we can just use __uctype. It's got the
> >same width and signedness as uint64_t anyway.
> >
> >That said, the uint64_t(__uerange) above isn't redundant, because it
> >should be (unsigned __int128)__uerange, I think.
>
> Ah yes, you pointed out that last bit in your Sept 28 2019 email.
>
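A closing aside for readers checking the rejection step: the threshold
expression -__uerange % __uerange relies on the identity
(2^64 - r) mod r == 2^64 mod r. The throwaway program below verifies it
on a few values, assuming GCC/Clang __int128 support; it is not part of
the patch:

    #include <cassert>
    #include <cstdint>

    int main()
    {
      // In 64-bit unsigned arithmetic, -range is 2^64 - range, so
      // (-range) % range equals 2^64 mod range: the count of "biased"
      // low-half values the algorithm must reject. Check it against a
      // direct 128-bit computation of 2^64 mod range.
      const std::uint64_t samples[] = { 1, 2, 3, 1000003,
                                        std::uint64_t(1) << 33,
                                        UINT64_MAX / 2, UINT64_MAX };
      const unsigned __int128 two64 = (unsigned __int128)1 << 64;
      for (std::uint64_t range : samples)
        assert((-range) % range == (std::uint64_t)(two64 % range));
      return 0;
    }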