From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <libc-alpha-return-86163-listarch-libc-alpha=sources.redhat.com@sourceware.org>
Received: (qmail 124243 invoked by alias); 19 Oct 2017 22:31:59 -0000
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-alpha.sourceware.org>
List-Subscribe: <mailto:libc-alpha-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-help@sourceware.org>, <http://sourceware.org/ml/#faqs>
Sender: libc-alpha-owner@sourceware.org
Received: (qmail 124213 invoked by uid 89); 19 Oct 2017 22:31:59 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-26.7 required=5.0 tests=BAYES_00,GIT_PATCH_0,GIT_PATCH_1,GIT_PATCH_2,GIT_PATCH_3,KAM_LOTSOFHASH,RP_MATCHES_RCVD,SPF_PASS autolearn=ham version=3.3.2 spammy=3040, spite
X-HELO: userp1040.oracle.com
Subject: Re: [PATCH] Improves __ieee754_exp() performance by greater than 5x
 on sparc/x86.
To: Joseph Myers <joseph@codesourcery.com>
Cc: libc-alpha@sourceware.org
References: <1508172962-97543-1-git-send-email-patrick.mcgehearty@oracle.com>
 <alpine.DEB.2.20.1710181706300.28577@digraph.polyomino.org.uk>
From: Patrick McGehearty <patrick.mcgehearty@oracle.com>
Message-ID: <6d6104ad-b846-68b3-8f87-3216d1e52412@oracle.com>
Date: Thu, 19 Oct 2017 22:31:00 -0000
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:52.0) Gecko/20100101
 Thunderbird/52.4.0
MIME-Version: 1.0
In-Reply-To: <alpine.DEB.2.20.1710181706300.28577@digraph.polyomino.org.uk>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
X-SW-Source: 2017-10/txt/msg00975.txt.bz2

On 10/18/2017 12:22 PM, Joseph Myers wrote:

Thank you Joseph for your detailed review.
Comments below.
> On Mon, 16 Oct 2017, Patrick McGehearty wrote:
>
>> diff --git a/sysdeps/ieee754/dbl-64/e_exp.c b/sysdeps/ieee754/dbl-64/e_exp.c
>> index 6757a14..555a47f 100644
>> --- a/sysdeps/ieee754/dbl-64/e_exp.c
>> +++ b/sysdeps/ieee754/dbl-64/e_exp.c
>> @@ -1,238 +1,452 @@
>> +/* EXP function - Compute double precision exponential
>> +   Copyright (C) 2017 Free Software Foundation, Inc.
>> +   This file is part of the GNU C Library.
>> +
>> +   The GNU C Library is free software; you can redistribute it and/or
>> +   modify it under the terms of the GNU Lesser General Public
>> +   License as published by the Free Software Foundation; either
>> +   version 2.1 of the License, or (at your option) any later version.
>> +
>> +   The GNU C Library is distributed in the hope that it will be useful,
>> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> +   Lesser General Public License for more details.
>> +
>> +   You should have received a copy of the GNU Lesser General Public
>> +   License along with the GNU C Library; if not, see
>> +   <http://www.gnu.org/licenses/>.  */
>> +
>>   /*
>> - * IBM Accurate Mathematical Library
>> - * written by International Business Machines Corp.
>> - * Copyright (C) 2001-2017 Free Software Foundation, Inc.
> The file still contains __exp1, used by pow.  Thus, I do not think
> removing the existing copyright dates is appropriate unless as a
> preliminary patch you move __exp1 to another file (e_pow.c being the
> obvious one; watch out for the sysdeps/x86_64/fpu/multiarch/e_exp-*.c with
> their own macro definitions of __exp1 that would no longer be appropriate
> after such a move).
For the copyright issue, would it be appropriate to move the
previous copyright notice to right before __exp1?
I'm reluctant to move __exp1 as that might also require changes
to Makefiles which are not currently required.


>
> Does this patch remove all calls to __slowexp?  If so, I'd expect
> slowexp.c to be removed as part of the patch (including
> architecture-specific / multiarch variants, and Makefile references to
> special options etc. for building slowexp.c).
For slowexp, I have some reservations that removing slowexp.o
might cause older object modules to break due to the missing
reference. I know they should not be directly referencing
an internal function, but still...
Anyway, I can't find any other direct usage of __slowexp.
I will remove all references to __slowexp and see
if anything breaks to show I missed a reference.

I find the following files have references, including some more
files to be removed.

sysdeps/generic/math_private.h

manual/probes.texi - simply remove references to slowexp_p6 and slowexp_p32
math/Makefile - remove slowexp references
sysdeps/generic/math_private.h
sysdeps/ieee754/dbl-64/e_exp.cÂ Â  (removed in the new code)
sysdeps/ieee754/dbl-64/slowexp.c (file to be removed)
sysdeps/ieee754/dbl-64/e_pow.c:Â  (comment only)

The following include ieee754/e_exp.c; can remove references to slowexp
sysdeps/x86_64/fpu/multiarch/e_exp-avx.c
sysdeps/x86_64/fpu/multiarch/e_exp-fma.c
sysdeps/x86_64/fpu/multiarch/e_exp-fma4.c

The following include ieee754/dbl-64/slowexp.c; no longer needed
sysdeps/x86_64/fpu/multiarch/slowexp-avx.c
sysdeps/x86_64/fpu/multiarch/slowexp-fma.c
sysdeps/x86_64/fpu/multiarch/slowexp-fma4.c

sysdeps/x86_64/fpu/multiarch/Makefile - remove slowexp references


benchtests/strcol1-inputs/filelist#C and filelist#en_US.UTF-8
have references to slowexp*.c
I'm supposing those should also be removed.

>
>
>> +extern double __ieee754_exp (double);
>> +
>> +
>> +static const double TBL[] = {
>> +  1.00000000000000000000e+00, 0.00000000000000000000e+00,
>> +  1.02189714865411662714e+00, 5.10922502897344389359e-17,
>> +  1.04427378242741375480e+00, 8.55188970553796365958e-17,
> As per previous comments, this sort of table of precomputed values should
> use hex float constants.  You can use before-and-after comparison of
> object files to make sure the hex floats do represent the same values as
> the decimal constants.
>
>> +static const double
>> +  half		=0.5,
>> +/* Following three values used to scale x to primary range */
>> +  invln2_32	=4.61662413084468283841e+01,	/* 0x40471547, 0x652b82fe */
>> +  ln2_32hi	=2.16608493865351192653e-02,	/* 0x3f962e42, 0xfee00000 */
>> +  ln2_32lo	=5.96317165397058656257e-12,	/* 0x3d9a39ef, 0x35793c76 */
>> +/* t2-t5 terms used for polynomial computation */
>> +  t2		=1.6666666666526086527e-1,	/* 3fc5555555548f7c */
>> +  t3		=4.1666666666226079285e-2,	/* 3fa5555555545d4e */
>> +  t4		=8.3333679843421958056e-3,	/* 3f811115b7aa905e */
>> +  t5		=1.3888949086377719040e-3,	/* 3f56c1728d739765 */
>> +  one		=1.0,
>> +/* maximum value for x to not overflow */
>> +  threshold1	=7.09782712893383973096e+02,	/* 0x40862E42, 0xFEFA39EF */
>> +/* maximum value for -x to not underflow */
>> +  threshold2	=7.45133219101941108420e+02,	/* 0x40874910, 0xD52D3051 */
>> +/* scaling factor used when result near zero*/
>> +  twom54	=5.55111512312578270212e-17,	/* 0x3c900000, 0x00000000 */
>> +/* value used to force inexact condition */
>> +  small		=1.0e-100;
> Likewise hex floats for these values (other than half / one / small).
> Also, the formatting is far from GNU standard; I'd expect e.g.:
>
> /* Scaling factor used when result near zero.  */
> static const double twom54 = 0x1p-54;
>
> (single space either side of =, comment before each variable definition
> starting with a capital letter and ending with ".  ").

Table of hex float constants. I can readily adjust the formating. What
you see is the formating used in the original source.
I've been uncomfortable with hex floats approach
as it only works for ieee754 representations
that use base 2. I admit that is most current machines.
And the prior ieee754 exp table uses hex format.
My second reason for resisting the change is my philosophy
when porting code is that every change without a good reason
is an opportunity to introduce errors without corresponding benefit.

If you feel strongly about suing hex constants, I will make an effort
to convert these values to hex format.Â  This conversion seems likely
to require the most effort on my part.


>
>> +	  /*
>> +	     Use FE_TONEAREST rounding mode for computing yy.y
>> +	     Avoid set/reset of rounding mode if already in FE_TONEAREST mode
>> +	  */
>> +	  fe_val = __fegetround();
>> +	  if (fe_val == FE_TONEAREST) {
>> +	    t = xx.x * xx.x;
>> +	    yy.y = xx.x + (t * (half + xx.x * t2) +
>> +			   (t * t) * (t3 + xx.x * t4 + t * t5));
>> +	    retval = one + yy.y;
>> +	  } else {
>> +	    __fesetround(FE_TONEAREST);
>> +	    t = xx.x * xx.x;
>> +	    yy.y = xx.x + (t * (half + xx.x * t2) +
>> +			   (t * t) * (t3 + xx.x * t4 + t * t5));
>> +	    retval = one + yy.y;
>> +	    __fesetround(fe_val);
> The formatting here is off, missing space before '(' in function calls.
> And you should be using the optimized SET_RESTORE_ROUND (FE_TONEAREST),
> which in addition to avoiding setting the rounding mode to a value it
> already has, also e.g. arranges on x86_64 to set only the SSE rounding
> mode and not the x87 mode because only the SSE mode needs setting for
> types other than long double.
>
> (I'm not clear why you actually need to set the rounding mode here.  Does
> excessive inaccuracy result in this particular case if you don't set it?)
>
>> +      fe_val = __fegetround();
Failing to use __fegetround(),__fesetround() causes over 40 math test
accuracy failures for other rounding modes in exp, cexp, cpow, cosh,
ccos, ccosh, csin, and csinh.Â  If the glibc/Linux community were to
declare that the only rounding mode that was fully supported was
FE_TONEAREST, we could simplify/speed up a lot of code. :-)
If I were wearing a benchmarker's hat, I could argue for that
approach. As a math librarian developer, I felt compelled to add the
rounding mode tests to maintain accuracy for the other rounding modes
in spite of the performance cost.

While SET_RESTORE_ROUND is reasonably optimized, it is not ideal,
especially on Sparc. Empirically, using SET_RESTORE_ROUND adds more
than twice the overhead compared to using __fegetround() on Sparc.
Interestingly, for x86, the two approaches have similar performance
perhaps due to more attention paid to x86 optimization with gcc over
the years.

Underneath the definitions of SET_RESTORE_ROUND, it ultimately relies
on __fegetround() and __fesetround(). The extra cost for
SET_RESTORE_ROUND is that it requires a flag to always be set (mode
changed or did not change). A short time the flag must tested to
determine if the rounding mode needs to be restored. If the compiler
puts that flag on the stack or in memory, the reading of a value that
was just written to cache triggers a "Read after Write" HW hazard,
causing a typical delay of 30-40 cycles. Avoiding the test also
avoids a possible mis-predicted branch.Â  For M7 and earlier, Sparc
branch prediction is not ideal and mis-prediction is expensive.Â  The
code I provided avoids the need to set the flag or test it by
duplicating small segments of code for each case.


> Likewise again, and subsequently.
>
>> +      if (fe_val == FE_TONEAREST) {
>> +	z = xx.x - TBL2[j];
>> +	t = z * z;
>> +	/* the "small" term below guarantees inexact will be raised */
> Guaranteeing "inexact" is not part of the goals for most libm functions,
> so I expect you can remove that term.
The "inexact" test was required to pass the (make check) math lib tests.

>
>> +      if (-xx.x > threshold2)
>> +	{			/* set underflow error condition */
>> +	  double force_underflow = tiny * tiny;
>> +	  math_force_eval (force_underflow);
>> +	  retval = zero;
>> +	  return retval;
> I'd expect the force_underflow value to be returned in this case (so the
> return value is appropriate in round-upward mode).
When -xx.x > threshold2, e**x = 0.Â  I'll investigate when/whether 
round-upward
expects the ulp to be 1 instead of zero. My first impression is that you
are right, but I want to think about it a bit more.