From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: Twiddling with 64-bit values as 2 ints;
To: Stefan Kanthak, libc-help@sourceware.org
References: <4DD65B114A174A35AC6960DD2104BDE7@H270>
 <4c8ee26d-764e-736f-c3d6-5728e54c4c0f@linaro.org>
 <52E35AACEB174FDDAA3697DE66BB6ACA@H270>
 <3F07DF81FC2040E69CB83A78EDE05BB7@H270>
From: Adhemerval Zanella
Message-ID: <0978c043-b32b-ecf8-5cfe-de31c473bb4d@linaro.org>
Date: Mon, 23 Aug 2021 13:51:27 -0300
In-Reply-To: <3F07DF81FC2040E69CB83A78EDE05BB7@H270>
List-Id: Libc-help mailing list

On 23/08/2021 12:37, Stefan Kanthak wrote:
> Adhemerval Zanella wrote:
>
>> On 23/08/2021 10:18, Stefan Kanthak wrote:
>>> Adhemerval Zanella wrote:
>>>
>>>> On 21/08/2021 10:34, Stefan Kanthak wrote:
>>>>>
>>>>> (Heretic.-) questions:
>>>>> - why does glibc still employ such ugly code?
>>>>> - Why doesn't glibc take advantage of 64-bit integers in such code?
>>>>
>>>> Because no one cared to adjust the implementation.
>>>> Recently Wilco has removed a lot of old code that still uses 32-bit
>>>> instead of 64-bit for bit twiddling in the floating-point
>>>> implementations (check caa884dda7 and 9e97f239eae1f2).
>>>
>>> That's good to hear.
>>>
>>>> I think we should move to the simplest code, assuming a 64-bit CPU
>>>
>>> D'accord.
>>> And there's a second direction where you might move: almost all CPUs
>>> have separate general purpose registers and floating-point registers.
>>> Bit-twiddling generally needs extra (and sometimes slow) transfers
>>> between them.
>>> In a 32-bit environment, where arguments are typically passed on the
>>> stack, at least loading an argument from the stack into a GPR or FPR
>>> makes no difference.
>>> In a 64-bit environment, where arguments are passed in registers, they
>>> should be operated on in these registers.
>>>
>>> So: why not implement routines like nextafter() without bit-twiddling,
>>> using floating-point as far as possible for architectures where this
>>> gives better results?
>>
>> Mainly because some math routines are not performance critical in the
>> sense they are usually not hotspots and for these I would prefer the
>> simplest code that works with reasonable performance independently of
>> the underlying ABI or architecture
>
> With this we're back at square 1: my initial post showed such simple(st)
> code.
> The performance gain I experienced in my use case was more than
> noticeable: on AMD64, the total runtime of my program decreased from
> 20s to 12s.
>
>> (using integer operations might be better for a soft-fp ABI, for instance).
>
>> For symbols that might be performance critical, we do have more optimized
>> versions. Szabolcs and Wilco spent considerable time to tune a lot of
>> math functions and to remove the slow code paths; also for some routines
>> we have internal defines that map them to compiler builtins when we know
>> that the compiler and architecture allow us to do so (check the rounding
>> routines or sqrt, for instance).
>>
>> Recently we are aiming to avoid arch-specific code for complex routines,
>> and prefer C implementations that leverage the compiler support. It makes
>> for *much* more maintainable code, without the need to keep evaluating the
>> routines on each new iteration of an architecture (as some routines have
>> proven to be slower than a well-coded generic implementation).
>
> That was the goal of my patch: let the compiler operate on 64-bit integers
> instead of the C implementation on pairs of 32-bit integers.
>
>>> The simple implementation I showed in my initial post improved the
>>> throughput in my benchmark (on AMD64) by an order of magnitude.
>>> In Szabolcs Nagy's benchmark measuring latency it took 0.04ns/call
>>> longer (5.72ns vs. 5.68ns) -- despite the POOR job GCC does on FP.
>>
>> Your implementation triggered a lot of regressions,
>
> The initial, FP-preferring code was a demonstration, not a patch.

Right, but it does not make much sense to compare performance numbers with
an implementation that adds a lot of regressions.

>
>> you will need to sort this out before considering performance numbers.
>> Also, we will need a proper benchmark to evaluate it, as Szabolcs and
>> Wilco have done for their math work.
>>
>>>
>>> Does GLIBC offer a macro like "PREFER_FP_IMPLEMENTATION" that can be
>>> used to select between the integer bit-twiddling code and FP-preferring
>>> code during compilation?
>>
>> No, and I don't think this would be a good addition. As before, I would
>> prefer to have a simple generic implementation that gives us good
>> performance on modern hardware instead of a configurable one with many
>> tunables. The latter increases the maintenance cost (with testing and
>> performance evaluation).
>
> Having dedicated implementations for different architectures is even more
> costly!
> My intention/proposal is to have at most two different generic implementations,
> one using integer bit-twiddling wherever possible, thus supporting soft-fp well,
> the second using floating-point wherever possible, thus supporting modern
> hardware well.

The only reservation I have about such an approach is that it would add some
more maintenance and testing. I added a similar optimization for hypot on
powerpc, mainly to avoid a CPU pipeline hazard on some chips due to the
GPR-to-FP transfer; but I am working on a generic solution to just remove the
powerpc-specific implementation in favor of a generic one.
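
For readers following the thread, here is a minimal sketch of the contrast
under discussion: stepping a double to its next representable value with one
64-bit integer versus a pair of 32-bit words. It assumes IEEE 754 binary64
doubles and only handles finite, positive inputs. The helper names (bits_of,
double_of, next_up_u64, next_up_2x32) are hypothetical; this is neither
glibc's code nor the demonstration code referred to above.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers (not glibc's): reinterpret a double as one
   64-bit integer and back.  memcpy avoids aliasing problems.  */
static uint64_t
bits_of (double x)
{
  uint64_t u;
  memcpy (&u, &x, sizeof u);
  return u;
}

static double
double_of (uint64_t u)
{
  double x;
  memcpy (&x, &u, sizeof x);
  return x;
}

/* Next double above X.  Assumption: X is finite and positive.
   A single 64-bit increment; a carry out of the mantissa bumps
   the exponent automatically.  */
static double
next_up_u64 (double x)
{
  return double_of (bits_of (x) + 1);
}

/* The same step written on two 32-bit words, with the carry from
   the low word into the high word propagated by hand.  */
static double
next_up_2x32 (double x)
{
  uint64_t u = bits_of (x);
  uint32_t hi = (uint32_t) (u >> 32);
  uint32_t lo = (uint32_t) u;

  lo += 1;
  if (lo == 0)
    hi += 1;
  return double_of (((uint64_t) hi << 32) | lo);
}
```

Both functions return the same value for any finite positive input; for
example, each maps 1.0 to 1.0 + 2^-52. The 64-bit form leaves the carry to the
compiler and the hardware, which is the simplification argued for in the
quoted text.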