From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <b17659bf-27fd-1cb6-a608-f5e381948917@linaro.org>
Date: Thu, 16 Mar 2023 11:28:22 -0300
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:102.0)
 Gecko/20100101 Thunderbird/102.8.0
Subject: Re: [PATCH v2 3/5] math: Improve fmod
Content-Language: en-US
To: "H.J. Lu" <hjl.tools@gmail.com>
Cc: libc-alpha@sourceware.org, Wilco Dijkstra <Wilco.Dijkstra@arm.com>,
 kirill <kirill.okhotnikov@gmail.com>
References: <20230315205910.4120377-1-adhemerval.zanella@linaro.org>
 <20230315205910.4120377-4-adhemerval.zanella@linaro.org>
 <CAMe9rOrAzcLzqL5QUQ098aYc10pGX+z8V70KR+zt7Lc-C9ry=Q@mail.gmail.com>
From: Adhemerval Zanella Netto <adhemerval.zanella@linaro.org>
Organization: Linaro
In-Reply-To: <CAMe9rOrAzcLzqL5QUQ098aYc10pGX+z8V70KR+zt7Lc-C9ry=Q@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
List-Id: <libc-alpha.sourceware.org>


On 15/03/23 21:58, H.J. Lu wrote:
> On Wed, Mar 15, 2023 at 1:59 PM Adhemerval Zanella
> <adhemerval.zanella@linaro.org> wrote:
>>
>> This uses a new algorithm similar to the one proposed earlier [1].
>> With x = mx * 2^ex and y = my * 2^ey (mx, my, ex, ey being integers),
>> the simplest implementation is:
>>
>>    mx * 2^ex == 2 * mx * 2^(ex - 1)
>>
>>    while (ex > ey)
>>      {
>>        mx *= 2;
>>        --ex;
>>        mx %= my;
>>      }
>>
>> With mx/my being the mantissas of double floating-point values, each step of
>> the argument reduction can be improved by 11 bits (which is the size of
>> uint64_t in bits minus MANTISSA_WIDTH plus the sign bit):
>>
>>    while (ex > ey)
>>      {
>>        mx <<= 11;
>>        ex -= 11;
>>        mx %= my;
>>      }
>>
>> The implementation uses builtin clz and ctz, along with shifts to
>> convert hx/hy back to doubles.  Unlike the original patch, this patch
>> assumes modulo/divide operations are slow, so it uses multiplication
>> by inverse values instead.
>>
>> I see the following performance improvements using the fmod benchtests
>> (only the 'mean' result is shown):
>>
>>   Architecture     | Input           | master   | patch
>>   -----------------|-----------------|----------|--------
>>   x86_64 (Ryzen 9) | subnormals      | 19.1584  | 12.5049
>>   x86_64 (Ryzen 9) | normal          | 1016.51  | 296.939
>>   x86_64 (Ryzen 9) | close-exponents | 18.4428  | 16.0244
> 
> I tried it with the test in
> 
> https://sourceware.org/bugzilla/show_bug.cgi?id=30179
> 
> On Intel i7-10710U, I got
> 
> time ./sse
> 3.13user 0.00system 0:03.13elapsed 99%CPU (0avgtext+0avgdata 512maxresident)k
> 0inputs+0outputs (0major+37minor)pagefaults 0swaps
> time ./x87
> 0.24user 0.00system 0:00.24elapsed 100%CPU (0avgtext+0avgdata 512maxresident)k
> 0inputs+0outputs (0major+37minor)pagefaults 0swaps
> time ./generic
> 0.55user 0.00system 0:00.55elapsed 99%CPU (0avgtext+0avgdata 512maxresident)k
> 0inputs+0outputs (0major+37minor)pagefaults 0swaps
> 
> The new generic is still slower than x87.

I think it really depends on the underlying hardware and on the input range.
Using the benchmark from the patch set and patch 66182 [1], I see:

CPU              | Input           | patch    | 66182
-----------------|-----------------|----------|--------
Ryzen 9          | subnormals      | 12.5049  | 31.2822
Ryzen 9          | normal          | 296.939  | 592.489
Ryzen 9          | close-exponents | 16.0244  | 33.5172
E5-2640          | subnormals      | 34.5454  | 652.59
E5-2640          | normal          | 473.602  | 438.836
E5-2640          | close-exponents | 39.298   | 22.2742
i7-4510U         | subnormals      | 25.2624  | 666.964
i7-4510U         | normal          | 386.489  | 454.222
i7-4510U         | close-exponents | 29.463   | 22.8572

So it seems that fprem performance is not really consistent across x86 CPUs,
and even on recent AMD it is far from great.  So I still think the generic
implementation is better for x86, and I think fprem should be used along with
ifunc, selecting it on CPUs where it really yields better numbers (and taking
into consideration that subnormal numbers seem to perform pretty badly).

You might get better x86 performance by removing the SVID wrapper, as I did
in the last patch; but that will increase the complexity of 66182 (you will
need to check for NaN/INF/0.0 and set errno).  And I doubt it will close the
gap on the AMD chip I use.
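
To illustrate the extra checks mentioned above, here is a minimal sketch of
what the entry point could look like once the wrapper is gone.  It is not
code from either patch: the names fmod_checked and fmod_core_placeholder are
hypothetical, and the placeholder core only stands in for the real fast path
(fprem loop or generic code) so that the example is self-contained.

```c
#include <errno.h>
#include <math.h>

/* Placeholder for the real fast path.  Truncating the quotient like this
   is only exact for small quotients, so it is NOT a usable fmod core; it
   exists only to keep the sketch runnable.  */
static double
fmod_core_placeholder (double x, double y)
{
  return x - (double) (long long) (x / y) * y;
}

/* Hypothetical entry point doing the checks the wrapper used to do.  */
double
fmod_checked (double x, double y)
{
  if (isnan (x) || isnan (y))
    return x + y;       /* NaN in, NaN out; errno is left untouched.  */
  if (isinf (x) || y == 0.0)
    {
      errno = EDOM;     /* Domain error: fmod (Inf, y) or fmod (x, 0).  */
      return NAN;
    }
  if (isinf (y))
    return x;           /* Finite x and infinite y: the result is x.  */
  return fmod_core_placeholder (x, y);
}
```

The sketch only covers errno; raising the invalid floating-point exception
for the domain-error cases is left out.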

I am also checking an algorithm change that uses a simple loop for normal
inputs, where an integer modulo operation is used instead of inverse
multiplication.  But as far as I have tested, performance is really bad on all
the x86 Intel chips I tried (it is not as bad on AMD).
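
For reference, a simple loop of that kind could be sketched as below.  This
is only an illustration under my own assumptions, not the code I am
benchmarking: fmod_simple_loop, split, to_bits and from_bits are made-up
names, and the function assumes both inputs are finite and y is nonzero
(NaN/INF/0.0 would have to be handled by the caller).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MANTISSA_WIDTH 52
#define MANTISSA_MASK ((UINT64_C (1) << MANTISSA_WIDTH) - 1)
#define SIGN_MASK (UINT64_C (1) << 63)

static uint64_t to_bits (double d) { uint64_t u; memcpy (&u, &d, sizeof u); return u; }
static double from_bits (uint64_t u) { double d; memcpy (&d, &u, sizeof d); return d; }

/* Split |d| so that |d| == *m * 2^(e - 1075); returns the biased exponent e.  */
static int
split (uint64_t abs_bits, uint64_t *m)
{
  int e = (int) (abs_bits >> MANTISSA_WIDTH);
  *m = abs_bits & MANTISSA_MASK;
  if (e == 0)
    return 1;           /* Subnormal: no implicit bit, same scale as e == 1.  */
  *m |= UINT64_C (1) << MANTISSA_WIDTH;
  return e;
}

/* Assumes x and y are finite and y != 0.  */
double
fmod_simple_loop (double x, double y)
{
  uint64_t hx = to_bits (x), sign = hx & SIGN_MASK;
  uint64_t ax = hx & ~SIGN_MASK, ay = to_bits (y) & ~SIGN_MASK;
  uint64_t mx, my;
  int ex = split (ax, &mx);
  int ey = split (ay, &my);
  assert (ex < 0x7ff && ey < 0x7ff && my != 0);

  if (ax < ay)
    return x;           /* |x| < |y|: nothing to reduce.  */

  mx %= my;
  while (ex > ey)
    {
      /* mx < 2^53 here, so shifting by up to 11 bits cannot overflow.  */
      int shift = ex - ey < 11 ? ex - ey : 11;
      mx <<= shift;
      ex -= shift;
      mx %= my;
    }

  if (mx == 0)
    return from_bits (sign);    /* Zero with the sign of x.  */

  /* The remainder is mx * 2^(ey - 1075); renormalize and repack it.  */
  while (mx < (UINT64_C (1) << MANTISSA_WIDTH) && ey > 1)
    {
      mx <<= 1;
      --ey;
    }
  return from_bits (sign | (((uint64_t) (ey - 1) << MANTISSA_WIDTH) + mx));
}
```

Each iteration costs one 64-bit integer modulo, and with 11 bits consumed per
step the loop runs fewer than 200 times even for the largest exponent
difference between two doubles.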

[1] https://patchwork.sourceware.org/project/glibc/patch/20230309183312.205763-1-hjl.tools@gmail.com/