From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-397187-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 79611 invoked by alias); 8 May 2015 08:15:53 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 79563 invoked by uid 89); 8 May 2015 08:15:53 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-1.8 required=5.0 tests=AWL,BAYES_00,SPF_PASS autolearn=ham version=3.3.2
X-HELO: eu-smtp-delivery-143.mimecast.com
Received: from eu-smtp-delivery-143.mimecast.com (HELO eu-smtp-delivery-143.mimecast.com) (146.101.78.143) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Fri, 08 May 2015 08:15:50 +0000
Received: from cam-owa2.Emea.Arm.com (fw-tnat.cambridge.arm.com [217.140.96.140]) by uk-mta-15.uk.mimecast.lan; Fri, 08 May 2015 09:15:47 +0100
Received: from [10.2.207.50] ([10.1.2.79]) by cam-owa2.Emea.Arm.com with Microsoft SMTPSVC(6.0.3790.3959);	 Fri, 8 May 2015 09:15:47 +0100
Message-ID: <554C70B3.6030104@arm.com>
Date: Fri, 08 May 2015 08:15:00 -0000
From: Kyrill Tkachov <kyrylo.tkachov@arm.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.2.0
MIME-Version: 1.0
To: Kyrill Tkachov <kyrylo.tkachov@foss.arm.com>,  GCC Patches <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH][tree-ssa-math-opts] Expand pow (x, CONST) using square roots when possible
References: <5543A3AB.1090309@foss.arm.com>
In-Reply-To: <5543A3AB.1090309@foss.arm.com>
X-MC-Unique: uYVR05IWTKKBLIl4GPjJIg-1
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: quoted-printable
X-IsSubscribed: yes
X-SW-Source: 2015-05/txt/msg00625.txt.bz2

Ping.
https://gcc.gnu.org/ml/gcc-patches/2015-05/msg00071.html

Thanks,
Kyrill
On 01/05/15 17:02, Kyrill Tkachov wrote:
> Hi all,
>
> GCC has some logic to expand calls to pow (x, 0.75), pow (0.25) and pow (=
x, (int)k + 0.5)
> using square roots. So, for the above examples it would generate sqrt (x)=
 * sqrt (sqrt (x)),
> sqrt (sqrt (x)) and powi (x, k) * sqrt (x) (assuming k > 0. For k < 0 it =
will calculate the
> reciprocal of that).
>
> However, the implementation of these optimisations is done on a bit of an=
 ad-hoc basis with
> the 0.25, 0.5, 0.75 cases hardcoded.
> Judging by https://gcc.gnu.org/wiki/summit2010?action=3DAttachFile&do=3Dg=
et&target=3Dmeissner2.pdf
> these are the most commonly used exponents (at least in SPEC ;))
>
> This patch generalises this optimisation into a (hopefully) more robust a=
lgorithm.
> In particular, it expands calls to pow (x, CST) by expanding the integer =
part of CST
> using a powi, like it does already, and then expanding the fractional par=
t as a product
> of repeated applications of a square root if the fractional part can be e=
xpressed
> as a multiple of a power of 0.5.
>
> I try to explain the algorithm in more detail in the comments in the patc=
h but, for example:
>
> pow (x, 5.625) is not currently handled, but with this patch will be expa=
nded
> to powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x))) because 5.625 =3D=3D 5=
.0 + 0.5 + 0.5**3
>
> Negative exponents are handled in either of two ways, depending on the ex=
ponent value:
> * Using a simple reciprocal.
>     For example:
>     pow (x, -5.625) =3D=3D 1.0 / pow (x, 5.625)
>       --> 1.0 / (powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x))))
>
> * For pow (x, EXP) with negative exponent EXP with integer part INT and f=
ractional part FRAC:
> pow (1.0 - FRAC) / powi (ceil (abs (EXP))).
>     For example:
>     pow (x, -5.875) =3D=3D pow (x, 0.125) / powi (X, 6)
>       --> sqrt (sqrt (sqrt (x))) / (powi (x, 6))
>
>
> Since hardware square root instructions tend to be expensive, we may want=
 to reduce the number
> of square roots we are willing to calculate. Since we reuse intermediate =
square root results,
> this boils down to restricting the depth of the square root chains. In al=
l the examples above
> that depth is 3. I've made this maximum depth parametrisable in params.de=
f. By adjusting that
> parameter we can adjust the resolution of this optimisation. So, if it's =
set to '4' then we
> will synthesize every exponent that is a multiple of 0.5**4 =3D=3D 0.0625=
, including negative
> multiples. Currently, GCC will not try to expand negative multiples of an=
ything else than 0.5
>
> I have tried to keep the existing functionality intact and activate this =
only for
> -funsafe-math-optimizations and only when the target has a sqrt instructi=
on.
>    An exception to that is pow (x, 0.5) which we prefer to transform to s=
qrt even
> when a hardware sqrt is not available, presumably because the library fun=
ction for
> sqrt is usually faster than pow (?).
>
>
> Having seen the glibc implementation of a fully IEEE-754-compliant pow fu=
nction, I think we
> would prefer synthesising the pow call whenever we can for -ffast-math.
>
> I have seen this optimisation trigger a few times in SPEC2k6, in particul=
ar in 447.dealII
> and 481.wrf where it replaced calls to powf (x, -0.25), pow (x, 0.125) an=
d pow (x, 0.875)
> with square roots, multiplies and, in the case of -0.25, divides.
> On 481.wrf I saw it remove a total of 22 out of 322 calls to pow
>
> On 481.wrf on aarch64 I saw about a 1% improvement.
> The cycle count on x86_64 was also smaller, but not by a significant amou=
nt (the same calls to
> pow were eliminated).
>
> In general, I think this can shine if multiple expandable calls to pow ap=
pear together.
> So, for example for code:
> double
> baz (double a)
> {
>     return __builtin_pow (a, -1.25) + __builtin_pow (a, 5.75) - __builtin=
_pow (a, 3.375);
> }
>
> we can generate:
> baz:
>           fsqrt   d3, d0
>           fmul    d4, d0, d0
>           fmov    d5, 1.0e+0
>           fmul    d6, d0, d4
>           fsqrt   d2, d3
>           fmul    d1, d0, d2
>           fsqrt   d0, d2
>           fmul    d3, d3, d2
>           fdiv    d1, d5, d1
>           fmul    d3, d3, d6
>           fmul    d2, d2, d0
>           fmadd   d0, d4, d3, d1
>           fmsub   d0, d6, d2, d0
>           ret
>
> reusing the sqrt results and doing more optimisations rather than the cur=
rent:
> baz:
>           stp     x29, x30, [sp, -48]!
>           fmov    d1, -1.25e+0
>           add     x29, sp, 0
>           stp     d8, d9, [sp, 16]
>           fmov    d9, d0
>           str     d10, [sp, 32]
>           bl      pow
>           fmov    d8, d0
>           fmov    d0, d9
>           fmov    d1, 5.75e+0
>           bl      pow
>           fmov    d10, d0
>           fmov    d0, d9
>           fmov    d1, 3.375e+0
>           bl      pow
>           fadd    d8, d8, d10
>           ldr     d10, [sp, 32]
>           fsub    d0, d8, d0
>           ldp     d8, d9, [sp, 16]
>           ldp     x29, x30, [sp], 48
>           ret
>
>
> Of course gcc could already do that if the exponents were to fall in the =
set {0.25, 0.75, k+0.5}
> but with this patch that set can be greatly expanded.
>
> I suppose if we're really lucky we might even open up new vectorisation o=
pportunities.
> For example:
> void
> vecfoo (float *a, float *b)
> {
>     for (int i =3D 0; i < N; i++)
>       a[i] =3D __builtin_powf (a[i], 1.25f) - __builtin_powf (b[i], 3.625=
);
> }
>
> will now be vectorisable if we have a hardware vector sqrt instruction. T=
hough I'm not sure
> how often this would appear in real code.
>
>
> I would like advice on some implementation details:
> - The new function representable_as_half_series_p is used to check if the=
 fractional
> part of an exponent can be represented as a multiple of a power of 0.5. I=
t does so
> by using REAL_VALUE_TYPE arithmetic and a loop. I wonder whether there is=
 a smarter
> way of doing this, considering that REAL_VALUE_TYPE holds the bit represe=
ntation of the
> floating point number?
>
> - Are there any correctness cases that I may have missed? This patch gate=
s the optimisation
> on -funsafe-math-optimizations, but maybe there are some edge cases that =
I missed?
>
> - What should be the default value of the max-pow-sqrt-depth param? In th=
is patch it's 5 which
> on second thought strikes me as a bit aggressive. To catch exponents that=
 are multiples of
> 0.25 we need it to be 2. For multiples of 0.125 it has to be 3 etc... I s=
uppose it depends on
> how likely more fine-grained powers are to appear in real code, how expen=
sive the C library
> implementation of pow is, and how expensive are the sqrt instructions in =
hardware.
>
>
> Bootstrapped and tested on x86_64, aarch64, arm (pending)
> SPEC2k6 built and ran fine. Can anyone suggest anything other codebase th=
at might
> be of interest to look at?
>
> Thanks,
> Kyrill
>
> 2015-05-01  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
>       * params.def (PARAM_MAX_POW_SQRT_DEPTH): New param.
>       * tree-ssa-math-opts.c: Include params.h
>       (pow_synth_sqrt_info): New struct.
>       (representable_as_half_series_p): New function.
>       (get_fn_chain): Likewise.
>       (print_nested_fn): Likewise.
>       (dump_fractional_sqrt_sequence): Likewise.
>       (dump_integer_part): Likewise.
>       (expand_pow_as_sqrts): Likewise.
>       (gimple_expand_builtin_pow): Use above to attempt to expand
>       pow as series of square roots.  Removed now unused variables.
>
> 2015-05-01  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
>       * gcc.dg/pow-sqrt.x: New file.
>       * gcc.dg/pow-sqrt-1.c: New test.
>       * gcc.dg/pow-sqrt-2.c: Likewise.
>       * gcc.dg/pow-sqrt-3.c: Likewise.