From: Adhemerval Zanella Netto
Organization: Linaro
Date: Sat, 1 Oct 2022 17:16:24 -0300
To: Oliver Schädlich, libc-alpha
Subject: Re: More efficient fmod()

On 01/10/22 13:34, Oliver Schädlich via Libc-help wrote:
> I found that fmod() could be faster.
> This is my implementation:
>
> #include <stdint.h>
> #include <string.h>
> #include <fenv.h>
> #if defined(_MSC_VER)
>     #include <intrin.h>
> #endif
>
> #if defined(__GNUC__) || defined(__clang__)
>     #define likely(x) __builtin_expect((x), 1)
>     #define unlikely(x) __builtin_expect((x), 0)
> #else
>     #define likely(x) (x)
>     #define unlikely(x) (x)
> #endif
>
> inline uint64_t bin( double d )
> {
>     uint64_t u;
>     memcpy( &u, &d, sizeof d );
>     return u;
> }
>
> inline double dbl( uint64_t u )
> {
>     double d;
>     memcpy( &d, &u, sizeof u );
>     return d;
> }
>
> inline double invalid( uint64_t u )
> {
>     feraiseexcept( FE_INVALID );
>     return dbl( u );
> }
>
> #define SIGN_BIT ((uint64_t)1 << 63)
> #define EXP_MASK ((uint64_t)0x7FF << 52)
> #define IMPLCIT_BIT ((uint64_t)1 << 52)
> #define MANT_MASK (IMPLCIT_BIT - 1)
> #define QNAN_BIT (IMPLCIT_BIT >> 1)
>
> inline void normalize( uint64_t *mant, int *exp )
> {
> #if defined(__GNUC__) || defined(__clang__)
>     unsigned bits;
>     bits = __builtin_clz( *mant ) - 11;
>     *mant <<= bits;
> #elif defined(_MSC_VER)
>     unsigned long bits;
>     _BitScanReverse64( &bits, *mant );
>     *mant <<= bits - 11;
> #else
>     unsigned bits;
>     for( bits = 0; !(*mant & IMPLCIT_BIT); *mant <<= 1, ++bits );
> #endif
>     *exp -= bits;
> }
>
> double myFmodC( double counter, double denominator )
> {
>     uint64_t const
>         bCounter = bin( counter ),
>         bDenom = bin( denominator );
>     uint64_t const sign = (bCounter ^ bDenom) & SIGN_BIT;
>     if( unlikely((bCounter & EXP_MASK) == EXP_MASK) )
>         // +/-[Inf|QNaN|SNaN] % ... = -QNaN
>         // follow SSE/AVX-rules, first NaN rules, i.e.
>         // first parameter determines non-SNaN/QNaN-bits
>         return invalid( SIGN_BIT | bCounter | QNAN_BIT );
>     if( unlikely((bDenom & EXP_MASK) == EXP_MASK) )
>         // +/-x % +/-[Inf|QNan|SNaN]
>         if( likely(!(bDenom & MANT_MASK)) )
>             // +/-x % +/-Inf = -/+x
>             return dbl( sign | bCounter & ~SIGN_BIT );
>         else
>             // +/-x % +/-[QNaN|SNaN] = -NaN
>             return invalid( SIGN_BIT | bDenom | QNAN_BIT );
>     int
>         counterExp = (bCounter & EXP_MASK) >> 52,
>         denomExp = (bDenom & EXP_MASK) >> 52;
>     uint64_t
>         counterMant = (uint64_t)!!counterExp << 52 | bCounter & MANT_MASK,
>         denomMant = (uint64_t)!!denomExp << 52 | bDenom & MANT_MASK;
>     if( unlikely(!counterExp) )
>         // counter is denormal
>         if( likely(!counterMant) )
>             // counter == +/-0.0
>             if( likely(denomMant) )
>                 // +/-0.0 % +/-x = -/+0.0
>                 return dbl( sign );
>             else
>                 // +/-0.0 % +/-0.0 = -QNaN
>                 return invalid( SIGN_BIT | EXP_MASK | QNAN_BIT );
>         else
>             // normalize counter
>             normalize( &counterMant, &counterExp ),
>             ++counterExp;
>     if( unlikely(!denomExp) )
>         // denominator is denormal
>         if( likely(!denomMant) )
>             // +/-x % +/-0.0 = -/+QNaN
>             return invalid( SIGN_BIT | EXP_MASK | QNAN_BIT );
>         else
>             // normalize denominator
>             normalize( &denomMant, &denomExp ),
>             ++denomExp;
>     int exp = counterExp;
>     uint64_t remainderMant = counterMant;
>     for( ; ; )
>     {
>         int below = remainderMant < denomMant;
>         if( unlikely(exp - below < denomExp) )
>             break;
>         exp -= below;
>         remainderMant <<= below;
>         if( unlikely(!(remainderMant -= denomMant)) )
>         {
>             exp = 0;
>             break;
>         }
>         normalize( &remainderMant, &exp );
>     };
>     if( unlikely(exp <= 0) )
>         // denormal result
>         remainderMant >>= -exp + 1,
>         exp = 0;
>     return dbl( sign | (uint64_t)exp << 52 | remainderMant & MANT_MASK );
> }
>
> If I choose random pairs of doubles, the above code takes nearly
> 40% less time on my Zen1 CPU than the current glibc code.

This is interesting. If you could work on a patch for glibc, I can
help you both with adjusting the patch and with reviewing it. Some
remarks:

1. How are you benchmarking the code? Are you comparing a statically
linked myFmodC against the dynamically linked libc? If so, the
comparison does not take the PLT call overhead into account (see the
timing sketch below).

2. On x86_64 the fmod implementation generates a tail call to
__ieee754_fmod (math/w_fmod_compat.c) plus some comparisons. That is
extra overhead glibc carries where it is not strictly required; in
recent math optimizations we follow what you did in your code and
ignore errno (it requires a new symbol version, though).

3. __builtin_clz might not be the best option on all architectures,
since it can turn into a libcall to compiler-supplied code. We have
some macros in longlong.h that avoid that (although for a performance
comparison on x86_64 it won't matter). Also, __builtin_clz is not
correct for a 64-bit mantissa, since it operates on a 32-bit unsigned
int (see the clz sketch below).

So to get a meaningful performance profile, you will need to replace
the implementation at sysdeps/ieee754/dbl-64/e_fmod.c, rebuild glibc,
run the math tests to check for regressions, and then profile to see
whether this yields better performance. It would also be good to have
a benchtests benchmark that evaluates both throughput and latency,
similar to what we have for sin/cos/etc. (a sketch of an inputs file
follows at the end).
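To illustrate remark 1, here is a rough sketch of a harness that
times both implementations under the same call convention, by calling
through a volatile function pointer so neither call can be inlined or
constant-folded. The input generation is made up for illustration and
is not a substitute for a real benchmark; link with -lm:

    #include <math.h>
    #include <stdio.h>
    #include <time.h>

    extern double myFmodC (double, double);

    int
    main (void)
    {
      /* Swap the initializer to myFmodC for the second run; the
         volatile pointer forces an indirect call in both cases, so
         the PLT-vs-direct-call difference is mostly neutralized.  */
      double (*volatile fn) (double, double) = fmod;

      struct timespec t0, t1;
      double acc = 0.0;
      clock_gettime (CLOCK_MONOTONIC, &t0);
      for (int i = 1; i < 10000000; i++)
        acc += fn (i * 1.7, i * 0.3);
      clock_gettime (CLOCK_MONOTONIC, &t1);

      double secs = (t1.tv_sec - t0.tv_sec)
                    + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
      /* Print acc so the loop cannot be optimized away.  */
      printf ("%.3f s (acc = %g)\n", secs, acc);
      return 0;
    }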
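For remark 3, a minimal sketch of a clz helper that is safe for the
64-bit mantissa (illustrative only; inside glibc the longlong.h
count_leading_zeros macro would be the natural choice, this just
shows the portability issue with the builtin):

    #include <stdint.h>

    /* __builtin_clz operates on unsigned int, which is 32 bits on
       essentially all supported targets, so applying it to a 64-bit
       mantissa truncates the argument.  Use the long long variant,
       with a portable fallback.  Undefined for x == 0.  */
    static inline int
    clz64 (uint64_t x)
    {
    #if defined __GNUC__ || defined __clang__
      return __builtin_clzll (x);
    #else
      int n = 0;
      for (uint64_t bit = (uint64_t) 1 << 63; !(x & bit); bit >>= 1)
        n++;
      return n;
    #endif
    }

With this helper, normalize() can compute the shift as
clz64 (*mant) - 11 on both 32-bit and 64-bit targets.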
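On the benchtests side, the input files are plain text with a few
directives; a hypothetical fmod-inputs file (the file name, the
values, and the Makefile hook here are assumptions; benchtests/README
documents the real conventions) could look like:

    ## args: double:double
    ## ret: double
    ## includes: math.h
    42.0, 3.0
    1e300, 3.14159
    0x1.fffffffffffffp+1023, 0x1p-1074

with fmod added to the bench-math list in benchtests/Makefile so that
the throughput and latency harness picks it up.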