From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 79611 invoked by alias); 8 May 2015 08:15:53 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Received: (qmail 79563 invoked by uid 89); 8 May 2015 08:15:53 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-1.8 required=5.0 tests=AWL,BAYES_00,SPF_PASS autolearn=ham version=3.3.2 X-HELO: eu-smtp-delivery-143.mimecast.com Received: from eu-smtp-delivery-143.mimecast.com (HELO eu-smtp-delivery-143.mimecast.com) (146.101.78.143) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Fri, 08 May 2015 08:15:50 +0000 Received: from cam-owa2.Emea.Arm.com (fw-tnat.cambridge.arm.com [217.140.96.140]) by uk-mta-15.uk.mimecast.lan; Fri, 08 May 2015 09:15:47 +0100 Received: from [10.2.207.50] ([10.1.2.79]) by cam-owa2.Emea.Arm.com with Microsoft SMTPSVC(6.0.3790.3959); Fri, 8 May 2015 09:15:47 +0100 Message-ID: <554C70B3.6030104@arm.com> Date: Fri, 08 May 2015 08:15:00 -0000 From: Kyrill Tkachov User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.2.0 MIME-Version: 1.0 To: Kyrill Tkachov , GCC Patches Subject: Re: [PATCH][tree-ssa-math-opts] Expand pow (x, CONST) using square roots when possible References: <5543A3AB.1090309@foss.arm.com> In-Reply-To: <5543A3AB.1090309@foss.arm.com> X-MC-Unique: uYVR05IWTKKBLIl4GPjJIg-1 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: quoted-printable X-IsSubscribed: yes X-SW-Source: 2015-05/txt/msg00625.txt.bz2 Ping. https://gcc.gnu.org/ml/gcc-patches/2015-05/msg00071.html Thanks, Kyrill On 01/05/15 17:02, Kyrill Tkachov wrote: > Hi all, > > GCC has some logic to expand calls to pow (x, 0.75), pow (0.25) and pow (= x, (int)k + 0.5) > using square roots. So, for the above examples it would generate sqrt (x)= * sqrt (sqrt (x)), > sqrt (sqrt (x)) and powi (x, k) * sqrt (x) (assuming k > 0. For k < 0 it = will calculate the > reciprocal of that). > > However, the implementation of these optimisations is done on a bit of an= ad-hoc basis with > the 0.25, 0.5, 0.75 cases hardcoded. > Judging by https://gcc.gnu.org/wiki/summit2010?action=3DAttachFile&do=3Dg= et&target=3Dmeissner2.pdf > these are the most commonly used exponents (at least in SPEC ;)) > > This patch generalises this optimisation into a (hopefully) more robust a= lgorithm. > In particular, it expands calls to pow (x, CST) by expanding the integer = part of CST > using a powi, like it does already, and then expanding the fractional par= t as a product > of repeated applications of a square root if the fractional part can be e= xpressed > as a multiple of a power of 0.5. > > I try to explain the algorithm in more detail in the comments in the patc= h but, for example: > > pow (x, 5.625) is not currently handled, but with this patch will be expa= nded > to powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x))) because 5.625 =3D=3D 5= .0 + 0.5 + 0.5**3 > > Negative exponents are handled in either of two ways, depending on the ex= ponent value: > * Using a simple reciprocal. > For example: > pow (x, -5.625) =3D=3D 1.0 / pow (x, 5.625) > --> 1.0 / (powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x)))) > > * For pow (x, EXP) with negative exponent EXP with integer part INT and f= ractional part FRAC: > pow (1.0 - FRAC) / powi (ceil (abs (EXP))). > For example: > pow (x, -5.875) =3D=3D pow (x, 0.125) / powi (X, 6) > --> sqrt (sqrt (sqrt (x))) / (powi (x, 6)) > > > Since hardware square root instructions tend to be expensive, we may want= to reduce the number > of square roots we are willing to calculate. Since we reuse intermediate = square root results, > this boils down to restricting the depth of the square root chains. In al= l the examples above > that depth is 3. I've made this maximum depth parametrisable in params.de= f. By adjusting that > parameter we can adjust the resolution of this optimisation. So, if it's = set to '4' then we > will synthesize every exponent that is a multiple of 0.5**4 =3D=3D 0.0625= , including negative > multiples. Currently, GCC will not try to expand negative multiples of an= ything else than 0.5 > > I have tried to keep the existing functionality intact and activate this = only for > -funsafe-math-optimizations and only when the target has a sqrt instructi= on. > An exception to that is pow (x, 0.5) which we prefer to transform to s= qrt even > when a hardware sqrt is not available, presumably because the library fun= ction for > sqrt is usually faster than pow (?). > > > Having seen the glibc implementation of a fully IEEE-754-compliant pow fu= nction, I think we > would prefer synthesising the pow call whenever we can for -ffast-math. > > I have seen this optimisation trigger a few times in SPEC2k6, in particul= ar in 447.dealII > and 481.wrf where it replaced calls to powf (x, -0.25), pow (x, 0.125) an= d pow (x, 0.875) > with square roots, multiplies and, in the case of -0.25, divides. > On 481.wrf I saw it remove a total of 22 out of 322 calls to pow > > On 481.wrf on aarch64 I saw about a 1% improvement. > The cycle count on x86_64 was also smaller, but not by a significant amou= nt (the same calls to > pow were eliminated). > > In general, I think this can shine if multiple expandable calls to pow ap= pear together. > So, for example for code: > double > baz (double a) > { > return __builtin_pow (a, -1.25) + __builtin_pow (a, 5.75) - __builtin= _pow (a, 3.375); > } > > we can generate: > baz: > fsqrt d3, d0 > fmul d4, d0, d0 > fmov d5, 1.0e+0 > fmul d6, d0, d4 > fsqrt d2, d3 > fmul d1, d0, d2 > fsqrt d0, d2 > fmul d3, d3, d2 > fdiv d1, d5, d1 > fmul d3, d3, d6 > fmul d2, d2, d0 > fmadd d0, d4, d3, d1 > fmsub d0, d6, d2, d0 > ret > > reusing the sqrt results and doing more optimisations rather than the cur= rent: > baz: > stp x29, x30, [sp, -48]! > fmov d1, -1.25e+0 > add x29, sp, 0 > stp d8, d9, [sp, 16] > fmov d9, d0 > str d10, [sp, 32] > bl pow > fmov d8, d0 > fmov d0, d9 > fmov d1, 5.75e+0 > bl pow > fmov d10, d0 > fmov d0, d9 > fmov d1, 3.375e+0 > bl pow > fadd d8, d8, d10 > ldr d10, [sp, 32] > fsub d0, d8, d0 > ldp d8, d9, [sp, 16] > ldp x29, x30, [sp], 48 > ret > > > Of course gcc could already do that if the exponents were to fall in the = set {0.25, 0.75, k+0.5} > but with this patch that set can be greatly expanded. > > I suppose if we're really lucky we might even open up new vectorisation o= pportunities. > For example: > void > vecfoo (float *a, float *b) > { > for (int i =3D 0; i < N; i++) > a[i] =3D __builtin_powf (a[i], 1.25f) - __builtin_powf (b[i], 3.625= ); > } > > will now be vectorisable if we have a hardware vector sqrt instruction. T= hough I'm not sure > how often this would appear in real code. > > > I would like advice on some implementation details: > - The new function representable_as_half_series_p is used to check if the= fractional > part of an exponent can be represented as a multiple of a power of 0.5. I= t does so > by using REAL_VALUE_TYPE arithmetic and a loop. I wonder whether there is= a smarter > way of doing this, considering that REAL_VALUE_TYPE holds the bit represe= ntation of the > floating point number? > > - Are there any correctness cases that I may have missed? This patch gate= s the optimisation > on -funsafe-math-optimizations, but maybe there are some edge cases that = I missed? > > - What should be the default value of the max-pow-sqrt-depth param? In th= is patch it's 5 which > on second thought strikes me as a bit aggressive. To catch exponents that= are multiples of > 0.25 we need it to be 2. For multiples of 0.125 it has to be 3 etc... I s= uppose it depends on > how likely more fine-grained powers are to appear in real code, how expen= sive the C library > implementation of pow is, and how expensive are the sqrt instructions in = hardware. > > > Bootstrapped and tested on x86_64, aarch64, arm (pending) > SPEC2k6 built and ran fine. Can anyone suggest anything other codebase th= at might > be of interest to look at? > > Thanks, > Kyrill > > 2015-05-01 Kyrylo Tkachov > > * params.def (PARAM_MAX_POW_SQRT_DEPTH): New param. > * tree-ssa-math-opts.c: Include params.h > (pow_synth_sqrt_info): New struct. > (representable_as_half_series_p): New function. > (get_fn_chain): Likewise. > (print_nested_fn): Likewise. > (dump_fractional_sqrt_sequence): Likewise. > (dump_integer_part): Likewise. > (expand_pow_as_sqrts): Likewise. > (gimple_expand_builtin_pow): Use above to attempt to expand > pow as series of square roots. Removed now unused variables. > > 2015-05-01 Kyrylo Tkachov > > * gcc.dg/pow-sqrt.x: New file. > * gcc.dg/pow-sqrt-1.c: New test. > * gcc.dg/pow-sqrt-2.c: Likewise. > * gcc.dg/pow-sqrt-3.c: Likewise.