From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <stefan.kanthak@nexgo.de>
Received: from smtpout2.vodafonemail.de (smtpout2.vodafonemail.de
 [145.253.239.133])
 by sourceware.org (Postfix) with ESMTPS id B3C2B3858404
 for <libc-help@sourceware.org>; Mon, 23 Aug 2021 15:43:12 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org B3C2B3858404
Authentication-Results: sourceware.org;
 dmarc=none (p=none dis=none) header.from=nexgo.de
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=nexgo.de
Received: from smtp.vodafone.de (smtpa03.fra-mediabeam.com [10.2.0.34])
 by smtpout2.vodafonemail.de (Postfix) with ESMTP id 92DB0120A7B;
 Mon, 23 Aug 2021 17:43:11 +0200 (CEST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=nexgo.de;
 s=vfde-smtpout-mb-15sep; t=1629733391;
 bh=79ruWoI4lBXe5HvW/Dd57A/hBSDvEgnKflUNILRzT2Y=;
 h=From:To:References:In-Reply-To:Subject:Date;
 b=WfyCvlwLE9tt0/6D22LF6YKmjUvgCwOSMplq+AzqVWRoezWav2t6kb+MIxsACmtYs
 pVsoYJqAL/DoN+5FT3h+iHwEvsumQAPeViNwhzFQKAkjATCf4ecdObtzwZFk1HpsUw
 gZt8b6gMtAJKpHk8Nx6zvpPC+0rBF2wwKhGMu2HM=
Received: from H270 (p5b38f1bc.dip0.t-ipconnect.de [91.56.241.188])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-SHA384 (256/256 bits))
 (No client certificate requested)
 by smtp.vodafone.de (Postfix) with ESMTPSA id 0B9D11401F1;
 Mon, 23 Aug 2021 15:43:11 +0000 (UTC)
Message-ID: <3F07DF81FC2040E69CB83A78EDE05BB7@H270>
From: "Stefan Kanthak" <stefan.kanthak@nexgo.de>
To: <libc-help@sourceware.org>,
 "Adhemerval Zanella" <adhemerval.zanella@linaro.org>
References: <4DD65B114A174A35AC6960DD2104BDE7@H270>
 <4c8ee26d-764e-736f-c3d6-5728e54c4c0f@linaro.org>
 <52E35AACEB174FDDAA3697DE66BB6ACA@H270>
 <f8460d66-dec6-9852-3710-8e5d6627df54@linaro.org>
In-Reply-To: <f8460d66-dec6-9852-3710-8e5d6627df54@linaro.org>
Subject: Re: Twiddling with 64-bit values as 2 ints;
Date: Mon, 23 Aug 2021 17:37:13 +0200
Organization: Me, myself & IT
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Windows Mail 6.0.6002.18197
X-MimeOLE: Produced By Microsoft MimeOLE V6.1.7601.24158
X-purgate-type: clean
X-purgate-Ad: Categorized by eleven eXpurgate (R) http://www.eleven.de
X-purgate: This mail is considered clean (visit http://www.eleven.de for
 further information)
X-purgate: clean
X-purgate-size: 4411
X-purgate-ID: 155817::1629733391-00000B26-89386857/0/0
X-Spam-Status: No, score=-2.8 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, RCVD_IN_DNSWL_LOW, RCVD_IN_MSPIKE_H2,
 SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: libc-help@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-help mailing list <libc-help.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-help>,
 <mailto:libc-help-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-help/>
List-Help: <mailto:libc-help-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-help>,
 <mailto:libc-help-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Mon, 23 Aug 2021 15:43:23 -0000

Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote:

> On 23/08/2021 10:18, Stefan Kanthak wrote:
>> Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote:
>> 
>>> On 21/08/2021 10:34, Stefan Kanthak wrote:
>>>>
>>>> (Heretical) questions:
>>>> - Why does glibc still employ such ugly code?
>>>> - Why doesn't glibc take advantage of 64-bit integers in such code?
>>>
>>> Because no one cared to adjust the implementation.  Recently Wilco
>>> has removed a lot of old code that still used 32-bit instead of 64-bit
>>> operations to do bit twiddling in floating-point implementations (check
>>> caa884dda7 and 9e97f239eae1f2).
>> 
>> That's good to hear.
>> 
>>> I think we should move to the simplest code, assuming a 64-bit CPU
>> 
>> D'accord.
>> And there's a second direction where you might move: almost all CPUs
>> have separate general purpose registers and floating-point registers.
>> Bit-twiddling generally needs extra (and sometimes slow) transfers
>> between them.
>> In a 32-bit environment, where arguments are typically passed on the
>> stack, at least loading an argument from the stack into a GPR or FPR
>> makes no difference.
>> In a 64-bit environment, where arguments are passed in registers, they
>> should be operated on in these registers.
>> 
>> So: why not implement routines like nextafter() without bit-twiddling,
>> using floating-point as far as possible for architectures where this
>> gives better results?
> 
> Mainly because some math routines are not performance critical in the
> sense that they are usually not hotspots, and for these I would prefer the
> simplest code that works with reasonable performance independently of
> the underlying ABI or architecture

With this we're back at square 1: my initial post showed such simple(st)
code.
The performance gain I experienced in my use case was more than
noticeable: on AMD64, the total runtime of my program decreased from
20s to 12s.

> (using integer operations might be better for a soft-fp ABI, for instance).


> For symbols that might be performance critical, we do have more optimized
> versions.  Szabolcs and Wilco spent considerable time tuning a lot of
> math functions and removing the slow code paths; also, for some routines
> we have internal defines that map them to compiler builtins when we know
> that the compiler and architecture allow us to do so (check the rounding
> routines or sqrt for instance).
> 
> Recently we have been aiming to avoid arch-specific code for complex
> routines, and prefer C implementations that leverage compiler support.  It
> makes for *much* more maintainable code, without the need to keep evaluating
> the routines on each new iteration of an architecture (as some routines have
> proven to be slower than a better-coded generic implementation).

That was the goal of my patch: let the compiler operate on 64-bit integers
instead of having the C implementation operate on pairs of 32-bit integers.
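
To illustrate the difference, here is a minimal sketch (not the glibc code;
all function names are made up for this mail) of the same two predicates,
written once on a pair of 32-bit words and once on a single 64-bit integer.
It assumes IEEE 754 binary64 doubles:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy the bits of a double into a 64-bit integer.
   memcpy avoids the aliasing problems of pointer casts.  */
uint64_t double_to_bits(double x)
{
    uint64_t u;
    memcpy(&u, &x, sizeof u);
    return u;
}

/* Old style: the high and low 32-bit words are handled separately.  */
int is_zero_2x32(double x)
{
    uint64_t u = double_to_bits(x);
    uint32_t hi = (uint32_t)(u >> 32), lo = (uint32_t)u;
    return ((hi & 0x7fffffffu) | lo) == 0;
}

int is_nan_2x32(double x)
{
    uint64_t u = double_to_bits(x);
    uint32_t hi = (uint32_t)(u >> 32), lo = (uint32_t)u;
    uint32_t ahi = hi & 0x7fffffffu;        /* high word without the sign */
    return ahi > 0x7ff00000u || (ahi == 0x7ff00000u && lo != 0);
}

/* 64-bit style: shift the sign bit out, then a single compare.  */
int is_zero_u64(double x)
{
    return (double_to_bits(x) << 1) == 0;
}

int is_nan_u64(double x)
{
    return (double_to_bits(x) << 1) > UINT64_C(0xffe0000000000000);
}
```

On a 64-bit target the second pair needs one shift and one compare each,
and the compiler is free to keep the value in a single register.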

>> The simple implementation I showed in my initial post improved the
>> throughput in my benchmark (on AMD64) by an order of magnitude.
>> In Szabolcs Nagy's benchmark, which measures latency, it took 0.04ns/call
>> longer (5.72ns vs. 5.68ns) -- despite the POOR job GCC does on FP.
> 
> Your implementation triggered a lot of regressions,

The initial, FP-preferring code was a demonstration, not a patch.

> you will need to sort this out before considering performance numbers.
> Also, we will need a proper benchmark to evaluate it, as Szabolcs and
> Wilco have done for their math work.
> 
>> 
>> Does GLIBC offer a macro like "PREFER_FP_IMPLEMENTATION" that can be
>> used to select between the integer bit-twiddling code and FP-preferring
>> code during compilation?
> 
> No, and I don't think this would be a good addition.  As before, I would
> prefer to have a simple generic implementation that gives us good
> performance on modern hardware instead of a configurable one with many
> tunables.  The latter increases the maintenance cost (with testing and
> performance evaluation).

Having dedicated implementations for different architectures is even more
costly!
My intention/proposal is to have at most two different generic implementations,
one using integer bit-twiddling wherever possible, thus supporting soft-fp well,
the second using floating-point wherever possible, thus supporting modern
hardware well.
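
A rough sketch of what I mean, under loud assumptions: the function names
are hypothetical, PREFER_FP_IMPLEMENTATION is only the macro name floated
above (glibc defines no such thing), doubles are IEEE 754 binary64, and
errno and floating-point exception handling are left out entirely. This is
neither the code from my initial post nor glibc's implementation:

```c
#include <stdint.h>
#include <string.h>

#define SIGN_BIT UINT64_C(0x8000000000000000)
#define NAN_LIMIT UINT64_C(0xffe0000000000000)  /* (bits << 1) above this: NaN */

static uint64_t to_bits(double x)
{
    uint64_t u;
    memcpy(&u, &x, sizeof u);
    return u;
}

static double from_bits(uint64_t u)
{
    double x;
    memcpy(&x, &u, sizeof x);
    return x;
}

/* Variant 1: classify with FP comparisons, touch the bits only to step.  */
double nextafter_fp(double from, double to)
{
    if (from != from) return from;              /* from is NaN */
    if (to != to) return to;                    /* to is NaN */
    if (from == to) return to;                  /* also covers +0 vs. -0 */
    if (from == 0.0)                            /* smallest subnormal */
        return to < 0.0 ? -0x1p-1074 : 0x1p-1074;
    uint64_t u = to_bits(from);
    if ((from < to) == (from < 0.0))
        u--;                                    /* step toward zero */
    else
        u++;                                    /* step away from zero */
    return from_bits(u);
}

/* Variant 2: integer operations only, e.g. for soft-fp targets.  */
double nextafter_int(double from, double to)
{
    uint64_t a = to_bits(from), b = to_bits(to);
    uint64_t abs_a = a << 1, abs_b = b << 1;    /* sign bit shifted out */
    if (abs_a > NAN_LIMIT) return from;
    if (abs_b > NAN_LIMIT) return to;
    if (a == b || (abs_a | abs_b) == 0) return to;
    if (abs_a == 0)                             /* smallest subnormal */
        return from_bits((b & SIGN_BIT) | 1);
    if (((a ^ b) & SIGN_BIT) || abs_a > abs_b)
        a--;                                    /* step toward zero */
    else
        a++;                                    /* step away from zero */
    return from_bits(a);
}

/* Compile-time selection between the two generic variants.  */
#ifdef PREFER_FP_IMPLEMENTATION
#define my_nextafter nextafter_fp
#else
#define my_nextafter nextafter_int
#endif
```

Both variants are generic C; the only per-target decision left is which of
the two to select at build time.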

Stefan