From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <adhemerval.zanella@linaro.org>
Received: from mail-qk1-x72b.google.com (mail-qk1-x72b.google.com
 [IPv6:2607:f8b0:4864:20::72b])
 by sourceware.org (Postfix) with ESMTPS id 4A581385BF9F
 for <libc-help@sourceware.org>; Mon, 23 Aug 2021 16:51:30 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 4A581385BF9F
Received: by mail-qk1-x72b.google.com with SMTP id c10so17346906qko.11
 for <libc-help@sourceware.org>; Mon, 23 Aug 2021 09:51:30 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:subject:to:references:from:message-id:date
 :user-agent:mime-version:in-reply-to:content-language
 :content-transfer-encoding;
 bh=rbZSJNyJlIFCLumgjBHRqZJr+/VIsRazZQb+KkiHMSM=;
 b=F2eRY62wxAjhf2EijV2iHiwGcZ8i2bjYByM+9vXF2A4D6lniw98RY2XR8SWrANTLXP
 AG3wUnhYFXYYikiTBoe6GXUrYBgkcQCxFrLoOnE6dLl6iKWcspwpj4l7K3R6B+4qDZd2
 33XNh08tKtn/ZDphgvkC+2VVjOwso9LFCaRz9LRQzbfdG/+mJR0wQe+KkP9WGVLa3CJc
 ZO7NZUlrsXlpyG6pTqh4TdxyCZzkAZL23++0HHb/wOFLOjsDlck77rtEI+C4vJ90yE1E
 plccER4FRfqFYXURonbMmf/DaqI0lzQeIMH4E+pfV+XNoubEICbndYxXmIr/8bfhMFKv
 paaw==
X-Gm-Message-State: AOAM532Z2fhydgjgIJDfaR99yZid/11jwFSBailnX4HihokcyfgayfHi
 nJcmFaOIvRh9+rf66ofzp1hRGVW8K3VBPQ==
X-Google-Smtp-Source: ABdhPJz9Y4LgiweeDxCoV/09tEaO9gmNFTJGL2ZLwwkSVlELlTJ+HpFfeOWPpjrDUjuxF6c3d9YKxA==
X-Received: by 2002:a05:620a:2008:: with SMTP id
 c8mr18351303qka.493.1629737489460; 
 Mon, 23 Aug 2021 09:51:29 -0700 (PDT)
Received: from ?IPv6:2804:431:c7ca:cd83:c38b:b50d:5d9a:43d4?
 ([2804:431:c7ca:cd83:c38b:b50d:5d9a:43d4])
 by smtp.gmail.com with ESMTPSA id b65sm8868555qkc.15.2021.08.23.09.51.28
 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128);
 Mon, 23 Aug 2021 09:51:29 -0700 (PDT)
Subject: Re: Twiddling with 64-bit values as 2 ints;
To: Stefan Kanthak <stefan.kanthak@nexgo.de>, libc-help@sourceware.org
References: <4DD65B114A174A35AC6960DD2104BDE7@H270>
 <4c8ee26d-764e-736f-c3d6-5728e54c4c0f@linaro.org>
 <52E35AACEB174FDDAA3697DE66BB6ACA@H270>
 <f8460d66-dec6-9852-3710-8e5d6627df54@linaro.org>
 <3F07DF81FC2040E69CB83A78EDE05BB7@H270>
From: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Message-ID: <0978c043-b32b-ecf8-5cfe-de31c473bb4d@linaro.org>
Date: Mon, 23 Aug 2021 13:51:27 -0300
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
 Thunderbird/78.11.0
MIME-Version: 1.0
In-Reply-To: <3F07DF81FC2040E69CB83A78EDE05BB7@H270>
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: 7bit
X-Spam-Status: No, score=-6.6 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, NICE_REPLY_A, RCVD_IN_DNSWL_NONE,
 SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: libc-help@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-help mailing list <libc-help.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-help>,
 <mailto:libc-help-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-help/>
List-Help: <mailto:libc-help-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-help>,
 <mailto:libc-help-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Mon, 23 Aug 2021 16:51:40 -0000


On 23/08/2021 12:37, Stefan Kanthak wrote:
> Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote:
> 
>> On 23/08/2021 10:18, Stefan Kanthak wrote:
>>> Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote:
>>>
>>>> On 21/08/2021 10:34, Stefan Kanthak wrote:
>>>>>
>>>>> (Heretic.-) questions:
>>>>> - why does glibc still employ such ugly code?
>>>>> - Why doesn't glibc take advantage of 64-bit integers in such code?
>>>>
>>>> Because no one cared to adjust the implementation.  Recently Wilco
>>>> has removed a lot of old code that still used 32-bit instead of 64-bit
>>>> bit twiddling in the floating-point implementations (check caa884dda7
>>>> and 9e97f239eae1f2).
>>>
>>> That's good to hear.
>>>
>>>> I think we should move to the simplest code, assuming a 64-bit CPU
>>>
>>> D'accord.
>>> And there's a second direction where you might move: almost all CPUs
>>> have separate general purpose registers and floating-point registers.
>>> Bit-twiddling generally needs extra (and sometimes slow) transfers
>>> between them.
>>> In 32-bit environment, where arguments are typically passed on the
>>> stack, at least loading an argument from the stack into a GPR or FPR
>>> makes no difference.
>>> In 64-bit environment, where arguments are passed in registers, they
>>> should be operated on in these registers.
>>>
>>> So: why not implement routines like nextafter() without bit-twiddling,
>>> using floating-point as far as possible for architectures where this
>>> gives better results?
>>
>> Mainly because some math routines are not performance critical, in the
>> sense that they are usually not hotspots, and for these I would prefer
>> the simplest code that works with reasonable performance independently
>> of the underlying ABI or architecture
> 
> With this we're back at square 1: my initial post showed such simple(st)
> code.
> The performance gain I experienced in my use case was more than
> noticeable: on AMD64, the total runtime of my program decreased from
> 20s to 12s.
> 
>> (using integer operations might be better for a soft-fp ABI, for instance).
> 
> 
> 
>> For symbols that might be performance critical, we do have more optimized
>> versions.  Szabolcs and Wilco spent considerable time tuning a lot of
>> math functions and removing the slow code paths; also, for some routines
>> we have internal defines that map them to compiler builtins when we know
>> that the compiler and architecture allow us to do so (check the rounding
>> routines or sqrt for instance).
>>
>> Recently we have been aiming to avoid arch-specific code for complex
>> routines, and to prefer C implementations that leverage compiler support.
>> This makes the code *much* more maintainable, without the need to keep
>> re-evaluating the routines on each new iteration of an architecture (some
>> arch-specific routines have proven to be slower than a well-coded generic
>> implementation).
> 
> That was the goal of my patch: let the compiler operate on 64-bit integers
> instead of the C implementation on pairs of 32-bit integers.
> 
>>> The simple implementation I showed in my initial post improved the
>>> throughput in my benchmark (on AMD64) by an order of magnitude.
>>> In Szabolcs Nagy benchmark measuring latency it took 0.04ns/call
>>> longer (5.72ns vs. 5.68ns) -- despite the POOR job GCC does on FP.
>>
>> Your implementation triggered a lot of regression,
> 
> The initial, FP-preferring code was a demonstration, not a patch.

Right, but it does not make much sense to compare performance numbers against
an implementation that introduces a lot of regressions.

> 
>> you will need to sort this out before considering performance numbers.
>> Also, we will need a proper benchmark to evaluate it, as Szabolcs and
>> Wilco have done for their math work.
>>
>>>
>>> Does GLIBC offer a macro like "PREFER_FP_IMPLEMENTATION" that can be
>>> used to select between the integer bit-twiddling code and FP-preferring
>>> code during compilation?
>>
>> No, and I don't think this would be a good addition.  As before, I would
>> prefer to have a simple generic implementation that gives us good
>> performance on modern hardware instead of a configurable one with many
>> tunables.  The latter increases the maintenance cost (with testing and
>> performance evaluation).
> 
> Having dedicated implementations for different architectures is even more
> costly!
> My intention/proposal is to have at most two different generic implementations,
> one using integer bit-twiddling wherever possible, thus supporting soft-fp well,
> the second using floating-point wherever possible, thus supporting modern
> hardware well.

The only reservation I have about such an approach is that it would add some
more maintenance and testing.  I added a similar optimization for hypot on
powerpc, mainly to avoid a CPU pipeline hazard on some chips due to the GPR to
FP transfer; but I am working on a generic solution so that the powerpc-specific
implementation can simply be removed in favor of a generic one.