From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=NeGK=AE=linaro.org=adhemerval.zanella@sourceware.org>
Received: from mail-ot1-x32e.google.com (mail-ot1-x32e.google.com [IPv6:2607:f8b0:4864:20::32e])
	by sourceware.org (Postfix) with ESMTPS id 372403858D20
	for <libc-alpha@sourceware.org>; Thu, 13 Apr 2023 20:56:08 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 372403858D20
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=linaro.org
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=linaro.org
Received: by mail-ot1-x32e.google.com with SMTP id k101-20020a9d19ee000000b006a14270bc7eso5351281otk.6
        for <libc-alpha@sourceware.org>; Thu, 13 Apr 2023 13:56:08 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=linaro.org; s=google; t=1681419367; x=1684011367;
        h=content-transfer-encoding:in-reply-to:organization:from:references
         :to:content-language:subject:user-agent:mime-version:date:message-id
         :from:to:cc:subject:date:message-id:reply-to;
        bh=mMgwl/cnivjQaScduyuEa5k/sUm9fcXV7ZJ2DqlpxeQ=;
        b=uPEW1AFrpgw+3N+BFUyRQOnEV5Vtrre5dY11pi3qjwRowNzzkXYhA6r+1OwJvxOhjD
         ke4sIziGMz2Is+pd6w6yfN4gEcEOIFsJVDEQMnjgFM/uy3eAS/2TkViKUCs1XVXuX0xf
         N4z7nQq+SXztkTGFoDuskjgA8D3ktIJ/2KbgsnHdQd5E0la5OZ7wZynhemcycp7PsbhR
         j+ToUf/F/UyYj6R6wt5R/762kMbUGbCvzCV4AalfI7CHjrj+pllOqpj//D1j/lT7iGIa
         2BgwvTLhavlKTl6mDjay2TdPD+EZvz8WU0HhdirbTHK15tgMy7GiMWEtZURGQgAxvVFQ
         0fIA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20221208; t=1681419367; x=1684011367;
        h=content-transfer-encoding:in-reply-to:organization:from:references
         :to:content-language:subject:user-agent:mime-version:date:message-id
         :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to;
        bh=mMgwl/cnivjQaScduyuEa5k/sUm9fcXV7ZJ2DqlpxeQ=;
        b=fTrAKO7PjsXiTXZiAZmzNb+s5z1WkewMDb14p9lSgUZSeAG+oc8bbUVLc3rIlLQky6
         4gA/K+ZRtHFNpIAqszHYL9gPJzSt/zIaXPGEmMT0APpYMBueROmKgTngIYmWPPcH1FCl
         4AGBdS66ZAPxXPBNQkoF7zRlQcZdAXGaRdql7NQPPTcm5Af35hwwpJH3xHdiLu+G7XD9
         xW+rbTNMEJf28upj8HLkSXSZFl8CsT4pdDqbzRGlGA2P00vwgPA7r02OF8y1FrPTJK/C
         /1s+Gl8WqW3mYrabBiEeeNA7y9psWRJ/DyLa+sBTfNRuNsxDdiSL0ZgD9szNcxFbHqly
         t4dQ==
X-Gm-Message-State: AAQBX9fb4cTu5/mDr+INMZcqdgYLt0uwaIv9yn+PANnlEkT0BfJAfxul
	dN5MDb4Y+uokWsBB3IGWNlWuS+Ls+Rs+CXsSfHYYOQ==
X-Google-Smtp-Source: AKy350ak0SuFG+xRvUw7NHr0OgIi2pMRgVFBt3UfWrDwWtk56XXTGzglkAkYhjWtU/7J1mDIK8jvcQ==
X-Received: by 2002:a05:6830:1e07:b0:699:5ac8:17b9 with SMTP id s7-20020a0568301e0700b006995ac817b9mr1631311otr.26.1681419367486;
        Thu, 13 Apr 2023 13:56:07 -0700 (PDT)
Received: from ?IPV6:2804:1b3:a7c2:55a1:24f5:87d:bc38:ae5e? ([2804:1b3:a7c2:55a1:24f5:87d:bc38:ae5e])
        by smtp.gmail.com with ESMTPSA id q11-20020a9d4b0b000000b0069dd3d98ec6sm1099937otf.44.2023.04.13.13.56.05
        (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128);
        Thu, 13 Apr 2023 13:56:06 -0700 (PDT)
Message-ID: <64198947-61d6-9f15-17de-a5c8c8f1e71b@linaro.org>
Date: Thu, 13 Apr 2023 17:56:03 -0300
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:102.0)
 Gecko/20100101 Thunderbird/102.9.1
Subject: Re: [PATCH] math: Improve fmod(f) performance
Content-Language: en-US
To: Wilco Dijkstra <Wilco.Dijkstra@arm.com>,
 'GNU C Library' <libc-alpha@sourceware.org>
References: <PAWPR08MB8982428AD5DD18A711D4F31783989@PAWPR08MB8982.eurprd08.prod.outlook.com>
 <0baece75-8f99-da08-4094-18f99238cb12@linaro.org>
 <PAWPR08MB89828DFAF4C1BD00EFA5B25283989@PAWPR08MB8982.eurprd08.prod.outlook.com>
From: Adhemerval Zanella Netto <adhemerval.zanella@linaro.org>
Organization: Linaro
In-Reply-To: <PAWPR08MB89828DFAF4C1BD00EFA5B25283989@PAWPR08MB8982.eurprd08.prod.outlook.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Spam-Status: No, score=-4.7 required=5.0 tests=BAYES_00,BODY_8BITS,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,NICE_REPLY_A,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <libc-alpha.sourceware.org>


On 13/04/23 17:45, Wilco Dijkstra wrote:
> Hi Adhemerval,
> 
>> So at least with current 'close-exponents' from bench-fmod, which was
>> generated from exponents between -10 and 10, the gain is more modest
> (and normal inputs do show a small regression).  This should be ok,
>> but I also think we need to outline that A72 gains might not show on
>> different hardware.
> 
> On a SkyLake I'm seeing this for fmod:
> 
>                   master   patch
> subnormals         51.34    45.92 (+11.8%)
> normal             436.9    420.5 (+3.9%)
> close-exponents    56.44    53.11 (+6.3%)
> 
> And on Zen2:
> 
>                   master   patch
> subnormals         10.83    10.39 (+4.2%)
> normal             336.1    335.8 (+0.1%)
> close-exponents    14.90    14.11 (+5.6%)
> 
> So it shows good improvements across the board. It's odd your results on AMD are
> worse than my Zen 2 results - are there large variations between runs? I did quite a
> few runs to get a fast result and increased iterations of the math benchmarks 10x.

I don't see much variation, but I think these numbers on multiple chips
are more than enough.  Could you include them in the commit message?
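
For anyone reading the archive later, here is a minimal sketch of how the
percentages in the quoted tables appear to be derived.  It assumes the
columns are timings where lower is better and that the gain is
master / patch - 1; the function name gain_percent is mine and is not part
of bench-fmod.

```c
#include <assert.h>

/* Relative gain in percent, assuming both arguments are timings where
   lower is better: (master / patch - 1) * 100.  This is an assumption
   about how the quoted tables were computed, not code from the patch.  */
static double
gain_percent (double master, double patch)
{
  assert (patch > 0.0);   /* A zero or negative timing is not meaningful.  */
  return (master / patch - 1.0) * 100.0;
}
```

With the Skylake subnormals row, gain_percent (51.34, 45.92) gives roughly
11.8, which matches the figure quoted above.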

> 
> I can't explain why the gains on AArch64 are so much larger - the reduced instruction
> counts and branches for the common cases seem to make a big difference. On x86
> there are still many MOVABS instructions which are problematic for decode.
> 
>> So maybe also add another bench-fmod set for |x/y| < 2^12 to show
>> the potential gains.
> 
> I'm not sure how that would improve things - ideally we need more realistic
> inputs (i.e. actual traces) but we could change the existing inputs into workloads
> to give it a more difficult problem. Changing close-exponents into a workload
> shows 11.0% lower latency and 11.9% better throughput on my SkyLake. On Zen 2
> I see 1% lower latency and 7.4% better throughput. Neoverse V1 shows 25.1%
> lower latency and 23.9% better throughput.

Fair enough, I think the only small nit is the clz_uint64 usage then.
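
To make the nit concrete, below is a rough sketch of what a 64-bit
count-leading-zeros helper typically looks like.  I am assuming a thin
wrapper over the GCC/Clang builtin __builtin_clzll and an unsigned long
long that is 64 bits wide; the names clz_u64_sketch and clz_u64_portable
are illustrative only and are not the ones used in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative wrapper over the compiler builtin.  __builtin_clzll has
   an undefined result for 0, so callers must pass a nonzero value.
   Assumes unsigned long long is exactly 64 bits.  */
static inline int
clz_u64_sketch (uint64_t x)
{
  assert (x != 0);
  return __builtin_clzll (x);
}

/* Portable fallback for comparison: shift left until the top bit is
   set, counting the shifts.  Same nonzero precondition.  */
static inline int
clz_u64_portable (uint64_t x)
{
  assert (x != 0);
  int n = 0;
  while ((x & (UINT64_C (1) << 63)) == 0)
    {
      x <<= 1;
      n++;
    }
  return n;
}
```

Both variants return 63 for an input of 1 and 0 when the top bit is set.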