From: Adhemerval Zanella Netto
Organization: Linaro
Date: Thu, 16 Mar 2023 11:28:22 -0300
Subject: Re: [PATCH v2 3/5] math: Improve fmod
To: "H.J. Lu"
Cc: libc-alpha@sourceware.org, Wilco Dijkstra, kirill
References: <20230315205910.4120377-1-adhemerval.zanella@linaro.org> <20230315205910.4120377-4-adhemerval.zanella@linaro.org>

On 15/03/23 21:58, H.J. Lu wrote:
> On Wed, Mar 15, 2023 at 1:59 PM Adhemerval Zanella wrote:
>>
>> This uses a new algorithm similar to already proposed earlier [1].
>> With x = mx * 2^ex and y = my * 2^ey (mx, my, ex, ey being integers),
>> the simplest implementation is:
>>
>>    mx * 2^ex == 2 * mx * 2^(ex - 1)
>>
>>    while (ex > ey)
>>      {
>>        mx *= 2;
>>        --ex;
>>        mx %= my;
>>      }
>>
>> With mx/my being the mantissas of double floating-point values, on each
>> step the argument reduction can be improved by 11 bits (which is the size
>> of uint64_t minus MANTISSA_WIDTH plus the sign bit):
>>
>>    while (ex > ey)
>>      {
>>        mx <<= 11;
>>        ex -= 11;
>>        mx %= my;
>>      }
>>
>> The implementation uses builtin clz and ctz, along with shifts to
>> convert hx/hy back to doubles.  Unlike the original patch, this path
>> assumes the modulo/divide operation is slow, so it uses multiplication
>> with inverse values.
>>
>> I see the following performance improvements using fmod benchtests
>> (results only show the 'mean' result):
>>
>> Architecture     | Input           | master   | patch
>> -----------------|-----------------|----------|--------
>> x86_64 (Ryzen 9) | subnormals      | 19.1584  | 12.5049
>> x86_64 (Ryzen 9) | normal          | 1016.51  | 296.939
>> x86_64 (Ryzen 9) | close-exponents | 18.4428  | 16.0244
>
> I tried it with the test in
>
> https://sourceware.org/bugzilla/show_bug.cgi?id=30179
>
> On Intel i7-10710U, I got
>
> time ./sse
> 3.13user 0.00system 0:03.13elapsed 99%CPU (0avgtext+0avgdata 512maxresident)k
> 0inputs+0outputs (0major+37minor)pagefaults 0swaps
> time ./x87
> 0.24user 0.00system 0:00.24elapsed 100%CPU (0avgtext+0avgdata 512maxresident)k
> 0inputs+0outputs (0major+37minor)pagefaults 0swaps
> time ./generic
> 0.55user 0.00system 0:00.55elapsed 99%CPU (0avgtext+0avgdata 512maxresident)k
> 0inputs+0outputs (0major+37minor)pagefaults 0swaps
>
> The new generic is still slower than x87.

I think it really depends on the underlying hardware and on the input range.
Using the benchmark from the patch set and patch 66182 [1], I see:

CPU              | Input           | patch    | 66182
-----------------|-----------------|----------|--------
Ryzen 9          | subnormals      | 12.5049  | 31.2822
Ryzen 9          | normal          | 296.939  | 592.489
Ryzen 9          | close-exponents | 16.0244  | 33.5172
E5-2640          | subnormals      | 34.5454  | 652.59
E5-2640          | normal          | 473.602  | 438.836
E5-2640          | close-exponents | 39.298   | 22.2742
i7-4510U         | subnormals      | 25.2624  | 666.964
i7-4510U         | normal          | 386.489  | 454.222
i7-4510U         | close-exponents | 29.463   | 22.8572

So it seems that fprem performance is not really consistent across x86 CPUs,
and even on recent AMD it is far from great.  So I still think the generic
implementation is better for x86, and I think fprem should be used along with
ifunc, selected only on CPUs where it really yields better numbers (taking
into consideration that the subnormal numbers seem to be pretty bad).

You might get better x86 performance by removing the SVID wrapper, as I did
in the last patch, but it will increase the complexity of 66182 (you will
need to check for NaN/Inf/0.0 and set errno).  And I doubt it will close the
gap on the AMD chip I use.

I am also checking an algorithm change to use a simple loop for the normal
inputs, where an integer modulo operation is used instead of inverse
multiplication.  But as far as I have tested, performance is really bad on
all the x86 Intel chips I tested (it is not as bad on AMD).

[1] https://patchwork.sourceware.org/project/glibc/patch/20230309183312.205763-1-hjl.tools@gmail.com/
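
For reference, the two reduction loops quoted from the patch description can
be sketched on plain integers as below.  This is only an illustration, not the
code from the patch: the function names reduce_bitwise and reduce_chunked are
hypothetical, the loop counts down an explicit shift (standing in for ex - ey)
instead of comparing exponents, and it uses the plain % operator rather than
multiplication by an inverse.  It also ignores signs, subnormals, NaN/Inf and
the conversion back to double.  It assumes my is below 2^53, as a double
mantissa with its implicit bit would be.

```c
#include <assert.h>
#include <stdint.h>

/* A value below 2^53 can be shifted left by 11 bits without
   overflowing a uint64_t (53 + 11 == 64).  */
#define STEP_BITS 11

/* Hypothetical helper: returns (mx * 2^shift) mod my, consuming one
   exponent bit per iteration.  Requires 0 < my < 2^53.  */
static uint64_t
reduce_bitwise (uint64_t mx, uint64_t my, unsigned shift)
{
  assert (my != 0 && my < (UINT64_C (1) << 53));
  mx %= my;
  while (shift > 0)
    {
      mx <<= 1;   /* mx < my < 2^53, so this cannot overflow.  */
      --shift;
      mx %= my;
    }
  return mx;
}

/* Hypothetical helper: same result, but consuming STEP_BITS exponent
   bits per iteration, followed by one final partial step.  */
static uint64_t
reduce_chunked (uint64_t mx, uint64_t my, unsigned shift)
{
  assert (my != 0 && my < (UINT64_C (1) << 53));
  mx %= my;
  while (shift >= STEP_BITS)
    {
      mx <<= STEP_BITS;   /* mx < 2^53, so mx << 11 fits in 64 bits.  */
      shift -= STEP_BITS;
      mx %= my;
    }
  mx <<= shift;           /* Fewer than STEP_BITS bits remain.  */
  return mx % my;
}
```

For example, with mx = 5, my = 3 and a shift of 4 both helpers compute
80 mod 3.  The chunked version performs roughly one modulo per 11 exponent
bits instead of one per bit, which is where the gain for inputs with a large
exponent difference comes from.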