Re: [PATCH] v12 Improves __ieee754_exp() performance by 6-11% on aarch64/sparc/x86.

public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed

From: Patrick McGehearty <patrick.mcgehearty@oracle.com>
To: libc-alpha@sourceware.org
Subject: Re: [PATCH] v12 Improves __ieee754_exp() performance by 6-11% on aarch64/sparc/x86.
Date: Sat, 17 Mar 2018 00:56:00 -0000	[thread overview]
Message-ID: <10a52463-35af-cc4a-ebcd-ac87ef7f7edf@oracle.com> (raw)
In-Reply-To: <b7282890-674a-88fb-07cb-c7ad0c85d251@arm.com>

On 3/15/2018 7:00 AM, Szabolcs Nagy wrote:
> On 15/03/18 04:22, Patrick McGehearty wrote:
>> New with this version:
>> Only updates to e_exp.c and eexp.tbl plus revised
>> libm-test-ulps for aarch64/sparc/x86_64 as removal of slowexp()
>> was accomplished by prior patch.
>>
>> Summary of patch rationale
>>
>> These changes will be active for all platforms that don't provide
>> their own exp() routines. They will also be active for ieee754
>> versions of ccos, ccosh, cosh, csin, csinh, sinh, exp10, gamma, and
>> erf.
>>
>> Typical performance gains are 6% on aarch64, 28% on Sparc s7 and 11%
>> on x86_64 based on the glibc_perf tests.
>>
>
> i think this is the wrong measurement for this algorithm:
>
> it uses two different methods for about |x|<1 and |x|>1
> the first is fast (but uses yet another table) the second
> is slow (!) and the branches that decide can easily
> mispredict in sensible workloads.
>
> so i think on all targets (including sparc) one could do
> better by using a single method (assuming that can give
> similar speed to the fast method but on the entire input
> range).
>
>> Glibc correctness tests for exp() and expf() were run. Within the test
>> suite 1 input value was found to cause a 1 ulp difference when
>> "FE_TONEAREST" rounding mode is set. No differences in exp()
>> were seen for the tested values for the other rounding modes.
>>
>> When tested over a range of 10 million input values, the new code
>> gets a 1 ulp error approximately 1.6 times per 1000 values.
>> That rate was similar for all four rounding modes.
>> The patch uses a 64 entry scaling table. The existing
>> code uses a 512 entry table.
>>
>> Further optimization is possible in the handling of rounding
>> modes. Using get_rounding_mode and libc_fesetround() instead of
>> SET_RESTORE_ROUND provides a measurable gain for Sparc.
>> Unfortunately, on x86, one works with sse fp unit rounding mode while
>> the other works on x87 fp unit rounding mode.Â  Adding libc_fegetround,
>> libc_fegetroundf and libc_fegetroundl to to match libc_fesetround()
>> should not be too large a task but outside the scope of this patch.
>
> the rounding mode setting should be completely removed.
> (after analysis of the worst-case non-nearest rounding errors)
>
> i think non-nearest error of this algorithm should be at most 1ulp
> without rounding mode change and functions using exp may see 1-2ulp
> error increase in non-nearest rounding mode (but if that's too
> high that should be fixed on the call site).
>
> but i dont want you to spend time changing the code, i'll
> post my exp variant soon, so you can benchmark it on
> sparc then we can continue the discussion depending on
> the results.
If the "make bench" does not use a realistic range of values
for measuring exp() performance, we should add values until
it approximates what are considered common in real applications.

The fast method is faster because it does not require exponent
scaling. As such, I don't see an easy way to use it for all values.
However, it may be possible to revise it to share TBL[] and
eliminate TBL2[]. For a quick test and as a comparison with
Szabolcs Nagy's work on avoiding testing Rounding modes,
I've run some tests removing the fast path and SET_RESTORE_ROUND.
I've also run the same tests with the current upstream code
except with SET_RESTORE_ROUND removed.

The current upsteam code without SET_RESTORE_ROUND gets over 700 errors
in the FE_TONEAREST rounding modes, especially the complex functions
that call exp() [ccos, ccosh, cexp, cpow, csin, csinh, ...].
Some of these errors are much more than one additional ulp.
I expect Szabolcs has been investigating ways to reduce that issue.

My approach also gets some additional errors in the non-FE_TONEAREST
rounding modes, but only about 50 and all are only 1 ulp greater than
the current accepted limits.Â  If we decide to remove SET_RESTORE_ROUND
from exp(), I would recommend placing it inside the various functions
that call __ieee754_exp() which would otherwise need increased ulp
values to pass the "make check" tests.

I expanded my correctness testing to range over 20million values
for each of the four rounding modes and to measure both 1 ulp
differences and differences larger than 1 ulp.

 Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  currentÂ Â Â Â Â Â Â Â Â Â Â Â Â  my version
 Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  1 ulpÂ Â Â  >1 ulpÂ Â Â Â Â  1 ulpÂ Â Â  >1 ulp
Nearest Rounding:Â Â Â Â Â Â Â Â Â Â  13Â Â Â Â Â Â Â Â  0Â Â Â Â Â  33645Â Â Â Â Â  0
Upward Rounding:Â Â Â Â Â  14993991Â Â  9985177Â Â  10086656Â Â Â Â Â  0
Downward Rounding:Â Â Â  14993697Â Â  9919060Â Â  10082580Â Â Â Â Â  0
Towards Zero Rnding:Â  14991995Â Â  9919060Â Â Â  9964926Â Â Â Â Â  0

The "1 ulp" column is the number of values out of 20M that
differ by 1 ulp or more from the true result.
The ">1 ulp" column is the number of values out of 20M that
differ by more from the true result.

As you can see, that for existing code, exp() would start
generating substantial differences if we remove SET_RESTORE_ROUND.
I was quite surprised by the number of differences greater than
1 ulp.
My code would show a 1 ulp difference approximately 1/2 the time
and any 2 ulp differences would be extremely rare.
If it is desirable to follow this approach, I can extend
my testing to a much larger set of numbers.

I ran "make bench" as a performance comparison on x86 and arm.
On x86, my code without a fast path and without SET_RESTORE_ROUND
ran 1% faster the current code without SET_RESTORE_ROUND.
On arm, my code without a fast path and without SET_RESTORE_ROUND
ran 24% slower the current code without SET_RESTORE_ROUND.

I am considering rewriting the fast path to use the same TBL[] values
but without exponent scaling.Â  That should have no effect on the
normal path, but provide a performance edge for the fast path.

- patrick

next prev parent reply	other threads:[~2018-03-17  0:56 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-03-15  4:22 Patrick McGehearty
2018-03-15 12:01 ` Szabolcs Nagy
2018-03-17  0:56   ` Patrick McGehearty [this message]
2018-03-20 22:59     ` Patrick McGehearty

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=10a52463-35af-cc4a-ebcd-ac87ef7f7edf@oracle.com \
    --to=patrick.mcgehearty@oracle.com \
    --cc=libc-alpha@sourceware.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).