From: Adhemerval Zanella Netto
Date: Sat, 1 Oct 2022 17:19:53 -0300
Subject: Re: More efficient fmod()
To: libc-help@sourceware.org
Organization: Linaro

On 01/10/22 13:34, Oliver Schädlich via Libc-help wrote:
> I found that fmod() could be faster.
> This is my implementation:
>
> #include <stdint.h>
> #include <string.h>
> #include <fenv.h>
> #if defined(_MSC_VER)
>     #include <intrin.h>
> #endif
>
> #if defined(__GNUC__) || defined(__clang__)
>     #define likely(x) __builtin_expect((x), 1)
>     #define unlikely(x) __builtin_expect((x), 0)
> #else
>     #define likely(x) (x)
>     #define unlikely(x) (x)
> #endif
>
> static inline uint64_t bin( double d )
> {
>     uint64_t u;
>     memcpy( &u, &d, sizeof d );
>     return u;
> }
>
> static inline double dbl( uint64_t u )
> {
>     double d;
>     memcpy( &d, &u, sizeof u );
>     return d;
> }
>
> static inline double invalid( uint64_t u )
> {
>     feraiseexcept( FE_INVALID );
>     return dbl( u );
> }
>
> #define SIGN_BIT ((uint64_t)1 << 63)
> #define EXP_MASK ((uint64_t)0x7FF << 52)
> #define IMPLICIT_BIT ((uint64_t)1 << 52)
> #define MANT_MASK (IMPLICIT_BIT - 1)
> #define QNAN_BIT (IMPLICIT_BIT >> 1)
>
> static inline void normalize( uint64_t *mant, int *exp )
> {
> #if defined(__GNUC__) || defined(__clang__)
>     unsigned bits;
>     bits = __builtin_clz( *mant ) - 11;
>     *mant <<= bits;
> #elif defined(_MSC_VER)
>     // _BitScanReverse64 stores the index of the highest set bit,
>     // so the shift that brings it up to bit 52 is 52 - index.
>     unsigned long msb;
>     _BitScanReverse64( &msb, *mant );
>     unsigned bits = 52 - (unsigned)msb;
>     *mant <<= bits;
> #else
>     unsigned bits;
>     for( bits = 0; !(*mant & IMPLICIT_BIT); *mant <<= 1, ++bits );
> #endif
>     *exp -= bits;
> }
>
> double myFmodC( double counter, double denominator )
> {
>     uint64_t const
>         bCounter = bin( counter ),
>         bDenom = bin( denominator );
>     uint64_t const sign = (bCounter ^ bDenom) & SIGN_BIT;
>     if( unlikely((bCounter & EXP_MASK) == EXP_MASK) )
>         // +/-[Inf|QNaN|SNaN] % ... = -QNaN
>         // follow SSE/AVX-rules, first NaN rules, i.e.
>         // first parameter determines non-SNaN/QNaN-bits
>         return invalid( SIGN_BIT | bCounter | QNAN_BIT );
>     if( unlikely((bDenom & EXP_MASK) == EXP_MASK) )
>         // +/-x % +/-[Inf|QNaN|SNaN]
>         if( likely(!(bDenom & MANT_MASK)) )
>             // +/-x % +/-Inf = -/+x
>             return dbl( sign | bCounter & ~SIGN_BIT );
>         else
>             // +/-x % +/-[QNaN|SNaN] = -NaN
>             return invalid( SIGN_BIT | bDenom | QNAN_BIT );
>     int
>         counterExp = (bCounter & EXP_MASK) >> 52,
>         denomExp = (bDenom & EXP_MASK) >> 52;
>     uint64_t
>         counterMant = (uint64_t)!!counterExp << 52 | bCounter & MANT_MASK,
>         denomMant = (uint64_t)!!denomExp << 52 | bDenom & MANT_MASK;
>     if( unlikely(!counterExp) )
>         // counter is denormal
>         if( likely(!counterMant) )
>             // counter == +/-0.0
>             if( likely(denomMant) )
>                 // +/-0.0 % +/-x = -/+0.0
>                 return dbl( sign );
>             else
>                 // +/-0.0 % +/-0.0 = -QNaN
>                 return invalid( SIGN_BIT | EXP_MASK | QNAN_BIT );
>         else
>             // normalize counter
>             normalize( &counterMant, &counterExp ),
>             ++counterExp;
>     if( unlikely(!denomExp) )
>         // denominator is denormal
>         if( likely(!denomMant) )
>             // +/-x % +/-0.0 = -/+QNaN
>             return invalid( SIGN_BIT | EXP_MASK | QNAN_BIT );
>         else
>             // normalize denominator
>             normalize( &denomMant, &denomExp ),
>             ++denomExp;
>     int exp = counterExp;
>     uint64_t remainderMant = counterMant;
>     for( ; ; )
>     {
>         int below = remainderMant < denomMant;
>         if( unlikely(exp - below < denomExp) )
>             break;
>         exp -= below;
>         remainderMant <<= below;
>         if( unlikely(!(remainderMant -= denomMant)) )
>         {
>             exp = 0;
>             break;
>         }
>         normalize( &remainderMant, &exp );
>     }
>     if( unlikely(exp <= 0) )
>         // denormal result
>         remainderMant >>= -exp + 1,
>         exp = 0;
>     return dbl( sign | (uint64_t)exp << 52 | remainderMant & MANT_MASK );
> }
>
> If I choose random pairs of doubles, the above code takes nearly
> 40% less time on my Zen1 CPU than the current glibc code.

This is interesting; if you could work on a patch for glibc, I can help you
both with adjusting the patch and with reviewing it.

Some remarks:

1. How are you benchmarking the code? Are you comparing a statically linked
myFmodC against the dynamically linked libc? If so, the comparison will not
take the PLT call overhead into account (see the harness sketch at the end
of this message).

2. On x86_64 the fmod implementation generates a tail call to __ieee754_fmod
(math/w_fmod_compat.c) plus some comparisons, so glibc carries extra overhead
there that is not strictly required. (In recent math optimizations we follow
what you did in your code, where errno is ignored; it requires a symbol
version, though.)

3. __builtin_clz might not be the best option on every architecture, since it
may incur a libcall to compiler-supplied code. We have some macros in
longlong.h that avoid that (although for a performance comparison on x86_64
it won't matter). Also, __builtin_clz takes an unsigned int, so it is not
correct for the 64-bit mantissa here; see the clz64 sketch at the end of
this message.

So to get a meaningful performance profile, you will need to replace the
implementation at sysdeps/ieee754/dbl-64/e_fmod.c, rebuild glibc, run the
math tests to check for regressions, and then profile to see whether this
yields better performance. It would also be good to have a benchtests
benchmark that evaluates both throughput and latency, similar to what we
have for sin/cos/etc.; a sketch of a possible input file follows below.
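For remark 1, here is the kind of minimal harness I mean (just a sketch:
the array size and input ranges are arbitrary, and it only illustrates the
caveat that fmod() goes through the PLT while myFmodC does not):

/* Rough timing harness, assuming myFmodC from the quoted mail is compiled
   into the same binary.  Note the asymmetry being measured: fmod() is a
   PLT call when libc is a shared library, myFmodC() is a direct call.  */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

extern double myFmodC( double counter, double denominator );

static double now( void )
{
    struct timespec ts;
    clock_gettime( CLOCK_MONOTONIC, &ts );
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

#define N (1 << 20)
static double a[N], b[N];

int main( void )
{
    for( int i = 0; i < N; ++i )
    {
        a[i] = rand() / (double)RAND_MAX * 1e6;
        b[i] = rand() / (double)RAND_MAX * 1e3 + 1e-3;
    }
    double acc = 0.0;
    double t0 = now();
    for( int i = 0; i < N; ++i )
        acc += fmod( a[i], b[i] );      /* PLT call with shared libc */
    double t1 = now();
    for( int i = 0; i < N; ++i )
        acc += myFmodC( a[i], b[i] );   /* direct call */
    double t2 = now();
    /* Print acc so the loops cannot be optimized away.  */
    printf( "fmod: %f s, myFmodC: %f s (acc=%g)\n", t1 - t0, t2 - t1, acc );
    return 0;
}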
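For remark 3, what I have in mind is along these lines (a sketch; the helper
name clz64 is not in your code, and inside glibc you would rather use the
count_leading_zeros macro from longlong.h):

/* __builtin_clz() takes an unsigned int, so passing the 64-bit mantissa
   silently truncates it; the 64-bit builtin is __builtin_clzll().  On
   targets without a native 64-bit clz instruction the builtin may still
   expand to a libcall, which is what the longlong.h macros avoid.  */
#include <stdint.h>

static inline unsigned clz64( uint64_t x )
{
#if defined(__GNUC__) || defined(__clang__)
    return (unsigned)__builtin_clzll( x );  /* undefined for x == 0 */
#else
    unsigned n = 0;                         /* portable fallback */
    while( !(x & ((uint64_t)1 << 63)) )
        x <<= 1, ++n;
    return n;
#endif
}

With that, normalize() could shift by clz64( *mant ) - 11 regardless of the
width of int on the target.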
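And for the benchtests part, an input file for fmod could look roughly like
this, modeled on the existing benchtests/*-inputs files (the sample values
are just placeholders; see benchtests/README for the exact directive
syntax):

## args: double:double
## ret: double
## includes: math.h
42.0, 3.0
0x1.fffffffffffffp+1023, 0x1.0p-1074
5e-324, 7e-300

The framework can then report both latency and reciprocal throughput over
the listed input pairs, which makes it straightforward to compare a new
implementation against the current one.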