[PATCH][tree-ssa-math-opts] Expand pow (x, CONST) using square roots when possible

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [PATCH][tree-ssa-math-opts] Expand pow (x, CONST) using square roots when possible
@ 2015-05-01 16:02 Kyrill Tkachov
  2015-05-08  8:15 ` Kyrill Tkachov
  2015-05-08 10:18 ` Richard Biener
  0 siblings, 2 replies; 9+ messages in thread
From: Kyrill Tkachov @ 2015-05-01 16:02 UTC (permalink / raw)
  To: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 7209 bytes --]

Hi all,

GCC has some logic to expand calls to pow (x, 0.75), pow (0.25) and pow (x, (int)k + 0.5)
using square roots. So, for the above examples it would generate sqrt (x) * sqrt (sqrt (x)),
sqrt (sqrt (x)) and powi (x, k) * sqrt (x) (assuming k > 0. For k < 0 it will calculate the
reciprocal of that).

However, the implementation of these optimisations is done on a bit of an ad-hoc basis with
the 0.25, 0.5, 0.75 cases hardcoded.
Judging by https://gcc.gnu.org/wiki/summit2010?action=AttachFile&do=get&target=meissner2.pdf
these are the most commonly used exponents (at least in SPEC ;))

This patch generalises this optimisation into a (hopefully) more robust algorithm.
In particular, it expands calls to pow (x, CST) by expanding the integer part of CST
using a powi, like it does already, and then expanding the fractional part as a product
of repeated applications of a square root if the fractional part can be expressed
as a multiple of a power of 0.5.

I try to explain the algorithm in more detail in the comments in the patch but, for example:

pow (x, 5.625) is not currently handled, but with this patch will be expanded
to powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x))) because 5.625 == 5.0 + 0.5 + 0.5**3

Negative exponents are handled in either of two ways, depending on the exponent value:
* Using a simple reciprocal.
   For example:
   pow (x, -5.625) == 1.0 / pow (x, 5.625)
     --> 1.0 / (powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x))))

* For pow (x, EXP) with negative exponent EXP with integer part INT and fractional part FRAC:
pow (1.0 - FRAC) / powi (ceil (abs (EXP))).
   For example:
   pow (x, -5.875) == pow (x, 0.125) / powi (X, 6)
     --> sqrt (sqrt (sqrt (x))) / (powi (x, 6))


Since hardware square root instructions tend to be expensive, we may want to reduce the number
of square roots we are willing to calculate. Since we reuse intermediate square root results,
this boils down to restricting the depth of the square root chains. In all the examples above
that depth is 3. I've made this maximum depth parametrisable in params.def. By adjusting that
parameter we can adjust the resolution of this optimisation. So, if it's set to '4' then we
will synthesize every exponent that is a multiple of 0.5**4 == 0.0625, including negative
multiples. Currently, GCC will not try to expand negative multiples of anything else than 0.5

I have tried to keep the existing functionality intact and activate this only for
-funsafe-math-optimizations and only when the target has a sqrt instruction.
  An exception to that is pow (x, 0.5) which we prefer to transform to sqrt even
when a hardware sqrt is not available, presumably because the library function for
sqrt is usually faster than pow (?).


Having seen the glibc implementation of a fully IEEE-754-compliant pow function, I think we
would prefer synthesising the pow call whenever we can for -ffast-math.

I have seen this optimisation trigger a few times in SPEC2k6, in particular in 447.dealII
and 481.wrf where it replaced calls to powf (x, -0.25), pow (x, 0.125) and pow (x, 0.875)
with square roots, multiplies and, in the case of -0.25, divides.
On 481.wrf I saw it remove a total of 22 out of 322 calls to pow

On 481.wrf on aarch64 I saw about a 1% improvement.
The cycle count on x86_64 was also smaller, but not by a significant amount (the same calls to
pow were eliminated).

In general, I think this can shine if multiple expandable calls to pow appear together.
So, for example for code:
double
baz (double a)
{
   return __builtin_pow (a, -1.25) + __builtin_pow (a, 5.75) - __builtin_pow (a, 3.375);
}

we can generate:
baz:
         fsqrt   d3, d0
         fmul    d4, d0, d0
         fmov    d5, 1.0e+0
         fmul    d6, d0, d4
         fsqrt   d2, d3
         fmul    d1, d0, d2
         fsqrt   d0, d2
         fmul    d3, d3, d2
         fdiv    d1, d5, d1
         fmul    d3, d3, d6
         fmul    d2, d2, d0
         fmadd   d0, d4, d3, d1
         fmsub   d0, d6, d2, d0
         ret

reusing the sqrt results and doing more optimisations rather than the current:
baz:
         stp     x29, x30, [sp, -48]!
         fmov    d1, -1.25e+0
         add     x29, sp, 0
         stp     d8, d9, [sp, 16]
         fmov    d9, d0
         str     d10, [sp, 32]
         bl      pow
         fmov    d8, d0
         fmov    d0, d9
         fmov    d1, 5.75e+0
         bl      pow
         fmov    d10, d0
         fmov    d0, d9
         fmov    d1, 3.375e+0
         bl      pow
         fadd    d8, d8, d10
         ldr     d10, [sp, 32]
         fsub    d0, d8, d0
         ldp     d8, d9, [sp, 16]
         ldp     x29, x30, [sp], 48
         ret


Of course gcc could already do that if the exponents were to fall in the set {0.25, 0.75, k+0.5}
but with this patch that set can be greatly expanded.

I suppose if we're really lucky we might even open up new vectorisation opportunities.
For example:
void
vecfoo (float *a, float *b)
{
   for (int i = 0; i < N; i++)
     a[i] = __builtin_powf (a[i], 1.25f) - __builtin_powf (b[i], 3.625);
}

will now be vectorisable if we have a hardware vector sqrt instruction. Though I'm not sure
how often this would appear in real code.


I would like advice on some implementation details:
- The new function representable_as_half_series_p is used to check if the fractional
part of an exponent can be represented as a multiple of a power of 0.5. It does so
by using REAL_VALUE_TYPE arithmetic and a loop. I wonder whether there is a smarter
way of doing this, considering that REAL_VALUE_TYPE holds the bit representation of the
floating point number?

- Are there any correctness cases that I may have missed? This patch gates the optimisation
on -funsafe-math-optimizations, but maybe there are some edge cases that I missed?

- What should be the default value of the max-pow-sqrt-depth param? In this patch it's 5 which
on second thought strikes me as a bit aggressive. To catch exponents that are multiples of
0.25 we need it to be 2. For multiples of 0.125 it has to be 3 etc... I suppose it depends on
how likely more fine-grained powers are to appear in real code, how expensive the C library
implementation of pow is, and how expensive are the sqrt instructions in hardware.


Bootstrapped and tested on x86_64, aarch64, arm (pending)
SPEC2k6 built and ran fine. Can anyone suggest anything other codebase that might
be of interest to look at?

Thanks,
Kyrill

2015-05-01  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * params.def (PARAM_MAX_POW_SQRT_DEPTH): New param.
     * tree-ssa-math-opts.c: Include params.h
     (pow_synth_sqrt_info): New struct.
     (representable_as_half_series_p): New function.
     (get_fn_chain): Likewise.
     (print_nested_fn): Likewise.
     (dump_fractional_sqrt_sequence): Likewise.
     (dump_integer_part): Likewise.
     (expand_pow_as_sqrts): Likewise.
     (gimple_expand_builtin_pow): Use above to attempt to expand
     pow as series of square roots.  Removed now unused variables.

2015-05-01  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * gcc.dg/pow-sqrt.x: New file.
     * gcc.dg/pow-sqrt-1.c: New test.
     * gcc.dg/pow-sqrt-2.c: Likewise.
     * gcc.dg/pow-sqrt-3.c: Likewise.

[-- Attachment #2: pow-sqrt.patch --]
[-- Type: text/x-patch, Size: 19847 bytes --]

commit 097f1526dcc09dcacb83fd0c70043a7f50b2f078
Author: Kyrylo Tkachov <kyrylo.tkachov@arm.com>
Date:   Wed Apr 29 13:27:07 2015 +0100

    [tree-ssa-math-opts] Expand pow (x, CONST) using square roots when possible

diff --git a/gcc/params.def b/gcc/params.def
index 48b39a2..3e4ba3a 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -262,6 +262,14 @@ DEFPARAM(PARAM_MAX_HOIST_DEPTH,
 	 "Maximum depth of search in the dominator tree for expressions to hoist",
 	 30, 0, 0)
 
+
+/* When synthesizing expnonentiation by a real constant operations using square
+   roots, this controls how deep sqrt chains we are willing to generate.  */
+DEFPARAM(PARAM_MAX_POW_SQRT_DEPTH,
+	 "max-pow-sqrt-depth",
+	 "Maximum depth of sqrt chains to use when synthesizing exponentiation by a real constant",
+	 5, 1, 32)
+
 /* This parameter limits the number of insns in a loop that will be unrolled,
    and by how much the loop is unrolled.
 
diff --git a/gcc/testsuite/gcc.dg/pow-sqrt-1.c b/gcc/testsuite/gcc.dg/pow-sqrt-1.c
new file mode 100644
index 0000000..0793b6f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pow-sqrt-1.c
@@ -0,0 +1,6 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ffast-math --param max-pow-sqrt-depth=5" } */
+
+#define EXPN (-6 * (0.5*0.5*0.5*0.5))
+
+#include "pow-sqrt.x"
diff --git a/gcc/testsuite/gcc.dg/pow-sqrt-2.c b/gcc/testsuite/gcc.dg/pow-sqrt-2.c
new file mode 100644
index 0000000..b2fada4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pow-sqrt-2.c
@@ -0,0 +1,5 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ffast-math --param max-pow-sqrt-depth=5" } */
+
+#define EXPN (-5.875)
+#include "pow-sqrt.x"
diff --git a/gcc/testsuite/gcc.dg/pow-sqrt-3.c b/gcc/testsuite/gcc.dg/pow-sqrt-3.c
new file mode 100644
index 0000000..18c7231
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pow-sqrt-3.c
@@ -0,0 +1,5 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ffast-math --param max-pow-sqrt-depth=3" } */
+
+#define EXPN (1.25)
+#include "pow-sqrt.x"
diff --git a/gcc/testsuite/gcc.dg/pow-sqrt.x b/gcc/testsuite/gcc.dg/pow-sqrt.x
new file mode 100644
index 0000000..bd744c6
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pow-sqrt.x
@@ -0,0 +1,30 @@
+
+extern void abort (void);
+
+
+__attribute__((noinline)) double
+real_pow (double x, double pow_exp)
+{
+  return __builtin_pow (x, pow_exp);
+}
+
+#define EPS (0.000000000000000000001)
+
+#define SYNTH_POW(X, Y) __builtin_pow (X, Y)
+volatile double arg;
+
+int
+main (void)
+{
+  double i_arg = 0.1;
+
+  for (arg = i_arg; arg < 100.0; arg += 1.0)
+    {
+      double synth_res = SYNTH_POW (arg, EXPN);
+      double real_res = real_pow (arg, EXPN);
+
+      if (__builtin_abs (SYNTH_POW (arg, EXPN) - real_pow (arg, EXPN)) > EPS)
+	abort ();
+    }
+  return 0;
+}
diff --git a/gcc/tree-ssa-math-opts.c b/gcc/tree-ssa-math-opts.c
index c22a677..36549d9 100644
--- a/gcc/tree-ssa-math-opts.c
+++ b/gcc/tree-ssa-math-opts.c
@@ -143,6 +143,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "target.h"
 #include "gimple-pretty-print.h"
 #include "builtins.h"
+#include "params.h"
 
 /* FIXME: RTL headers have to be included here for optabs.  */
 #include "rtl.h"		/* Because optabs.h wants enum rtx_code.  */
@@ -1148,6 +1149,370 @@ build_and_insert_cast (gimple_stmt_iterator *gsi, location_t loc,
   return result;
 }
 
+struct pow_synth_sqrt_info
+{
+  bool *factors;
+  unsigned int deepest;
+  unsigned int num_mults;
+};
+
+/* Return true iff the real value C can be represented as a
+   sum of powers of 0.5 up to N.  That is:
+   C == SUM<i from 1..N> (a[i]*(0.5**i)) where a[i] is either 0 or 1.
+   Record in INFO the various parameters of the synthesis algorithm such
+   as the factors a[i], the maximum 0.5 power and the number of
+   multiplications that will be required.  */
+
+bool
+representable_as_half_series_p (REAL_VALUE_TYPE c, unsigned n,
+				 struct pow_synth_sqrt_info *info)
+{
+  REAL_VALUE_TYPE factor = dconsthalf;
+  REAL_VALUE_TYPE remainder = c;
+
+  info->deepest = 0;
+  info->num_mults = 0;
+  memset (info->factors, 0, n * sizeof (bool));
+
+  for (unsigned i = 0; i < n; i++)
+    {
+      REAL_VALUE_TYPE res;
+
+      /* If something inexact happened bail out now.  */
+      if (REAL_ARITHMETIC (res, MINUS_EXPR, remainder, factor))
+	return false;
+
+      /* We have hit zero.  The number is representable as a sum
+         of powers of 0.5.  */
+      if (REAL_VALUES_EQUAL (res, dconst0))
+	{
+	  info->factors[i] = true;
+	  info->deepest = i + 1;
+	  return true;
+	}
+      else if (!REAL_VALUE_NEGATIVE (res))
+	{
+	  remainder = res;
+	  info->factors[i] = true;
+	  info->num_mults++;
+	}
+      else
+	info->factors[i] = false;
+
+      REAL_ARITHMETIC (factor, MULT_EXPR, factor, dconsthalf);
+    }
+  return false;
+}
+
+/* Return the tree corresponding to FN being applied
+   to ARG N times at GSI and LOC.
+   Look up previous results from CACHE if need be.
+   cache[0] should contain just plain ARG i.e. FN applied to ARG 0 times.  */
+
+static tree
+get_fn_chain (tree arg, unsigned int n, gimple_stmt_iterator *gsi,
+	      tree fn, location_t loc, tree *cache)
+{
+  tree res = cache[n];
+  if (!res)
+    {
+      tree prev = get_fn_chain (arg, n - 1, gsi, fn, loc, cache);
+      res = build_and_insert_call (gsi, loc, fn, prev);
+      cache[n] = res;
+    }
+
+  return res;
+}
+
+/* Print to STREAM the repeated application of function FNAME to ARG
+   N times.  So, for FNAME = "foo", ARG = "x", N = 2 it would print:
+   "foo (foo (x))".  */
+
+static void
+print_nested_fn (FILE* stream, const char *fname, const char* arg,
+		 unsigned int n)
+{
+  if (n == 0)
+    fprintf (stream, "%s", arg);
+  else
+    {
+      fprintf (stream, "%s (", fname);
+      print_nested_fn (stream, fname, arg, n - 1);
+      fprintf (stream, ")");
+    }
+}
+
+/* Print to STREAM the fractional sequence of sqrt chains
+   applied to ARG, described by INFO.  Used for the dump file.  */
+
+static void
+dump_fractional_sqrt_sequence (FILE *stream, const char *arg,
+			        struct pow_synth_sqrt_info *info)
+{
+  for (unsigned int i = 0; i < info->deepest; i++)
+    {
+      bool is_set = info->factors[i];
+      if (is_set)
+	{
+	  print_nested_fn (stream, "sqrt", arg, i + 1);
+	  if (i != info->deepest - 1)
+	    fprintf (stream, " * ");
+	}
+    }
+}
+
+/* Print to STREAM a representation of raising ARG to an integer
+   power N.  Used for the dump file.  */
+
+static void
+dump_integer_part (FILE *stream, const char* arg, HOST_WIDE_INT n)
+{
+  if (n > 1)
+    fprintf (stream, "powi (%s, " HOST_WIDE_INT_PRINT_DEC ")", arg, n);
+  else if (n == 1)
+    fprintf (stream, "%s", arg);
+}
+
+/* Attempt to synthesize a POW[F] (ARG0, ARG1) call using chains of
+   square roots.  Place at GSI and LOC.  Limit the maximum depth
+   of the sqrt chains to MAX_DEPTH.  Return the tree holding the
+   result of the expanded sequence or NULL_TREE if the expansion failed.
+
+   This routine assumes that ARG1 is a real number with a fractional part
+   (the integer exponent case will have been handled earlier in
+   gimple_expand_builtin_pow).
+
+   For ARG1 > 0.0:
+   * For ARG1 composed of a whole part WHOLE_PART and a fractional part
+     FRAC_PART i.e. WHOLE_PART == floor (ARG1) and
+                    FRAC_PART := ARG1 - WHOLE_PART:
+     Produce POWI (ARG0, WHOLE_PART) * POW (ARG0, FRAC_PART) where
+     POW (ARG0, FRAC_PART) is expanded as a product of square root chains
+     if it can be expressed as such, that is if FRAC_PART satisfies:
+     FRAC_PART == <SUM from i = 1 until MAX_DEPTH> (a[i] * (0.5**i))
+     where integer a[i] is either 0 or 1.
+
+     Example:
+     POW (x, 3.625) == POWI (x, 3) * POW (x, 0.625)
+       --> POWI (x, 3) * SQRT (x) * SQRT (SQRT (SQRT (x)))
+
+   For ARG1 < 0.0 there are two approaches:
+   * (A) Expand to 1.0 / POW (ARG0, -ARG1) where POW (ARG0, -ARG1)
+         is calculated as above.
+
+     Example:
+     POW (x, -5.625) == 1.0 / POW (x, 5.625)
+       -->  1.0 / (POWI (x, 5) * SQRT (x) * SQRT (SQRT (SQRT (x))))
+
+   * (B) : WHOLE_PART := - ceil (abs (ARG1))
+           FRAC_PART  := ARG1 - WHOLE_PART
+     and expand to POW (x, FRAC_PART) / POWI (x, WHOLE_PART).
+     Example:
+     POW (x, -5.875) == POW (x, 0.125) / POWI (X, 6)
+       --> SQRT (SQRT (SQRT (x))) / (POWI (x, 6))
+
+   For ARG1 < 0.0 we choose between (A) and (B) depending on
+   how many multiplications we'd have to do.
+   So, for the example in (B): POW (x, -5.875), if we were to
+   follow algorithm (A) we would produce:
+   1.0 / POWI (X, 5) * SQRT (X) * SQRT (SQRT (X)) * SQRT (SQRT (SQRT (X)))
+   which contains more multiplications than approach (B).
+
+   Hopefully, this approach will eliminate potentially expensive POW library
+   calls when unsafe floating point math is enabled and allow the compiler to
+   further optimise the multiplies, square roots and divides produced by this
+   function.  When optimising for size restrict the maximum number of allowed
+   multiplications and the depth of square root chains.  */
+
+static tree
+expand_pow_as_sqrts (gimple_stmt_iterator *gsi, location_t loc,
+		      tree arg0, tree arg1, HOST_WIDE_INT max_depth)
+{
+  tree type = TREE_TYPE (arg0);
+  machine_mode mode = TYPE_MODE (type);
+  tree sqrtfn = mathfn_built_in (type, BUILT_IN_SQRT);
+  bool one_over = true;
+
+  bool speed_p = optimize_bb_for_speed_p (gsi_bb (*gsi));
+  if (!sqrtfn)
+    return NULL_TREE;
+
+  if (TREE_CODE (arg1) != REAL_CST)
+    return NULL_TREE;
+
+  REAL_VALUE_TYPE exp_init = TREE_REAL_CST (arg1);
+
+  gcc_assert (max_depth > 0);
+  tree *cache = XALLOCAVEC (tree, max_depth + 1);
+
+  struct pow_synth_sqrt_info synth_info;
+  synth_info.factors = XALLOCAVEC (bool, max_depth + 1);
+  synth_info.deepest = 0;
+  synth_info.num_mults = 0;
+
+  bool neg_exp = REAL_VALUE_NEGATIVE (exp_init);
+  REAL_VALUE_TYPE exp = real_value_abs (&exp_init);
+
+  /* If optimising for size don't try to expand negative
+     exponents as they include an extra divide step.  */
+  if (neg_exp && !speed_p)
+    return NULL_TREE;
+
+  /* The whole and fractional parts of exp.  */
+  REAL_VALUE_TYPE whole_part;
+  REAL_VALUE_TYPE frac_part;
+
+  real_floor (&whole_part, mode, &exp);
+  REAL_ARITHMETIC (frac_part, MINUS_EXPR, exp, whole_part);
+
+
+  REAL_VALUE_TYPE ceil_whole = dconst0;
+  REAL_VALUE_TYPE ceil_fract = dconst0;
+
+  if (neg_exp)
+    {
+      real_ceil (&ceil_whole, mode, &exp);
+      REAL_ARITHMETIC (ceil_fract, MINUS_EXPR, ceil_whole, exp);
+    }
+
+  if (!representable_as_half_series_p (frac_part, max_depth, &synth_info))
+    return NULL_TREE;
+
+  /* Check whether it's more profitable to not use 1.0 / ...  */
+  if (neg_exp)
+    {
+      struct pow_synth_sqrt_info alt_synth_info;
+      alt_synth_info.factors = XALLOCAVEC (bool, max_depth + 1);
+      alt_synth_info.deepest = 0;
+      alt_synth_info.num_mults = 0;
+
+      if (representable_as_half_series_p (ceil_fract, max_depth,
+					   &alt_synth_info)
+	  && alt_synth_info.deepest <= synth_info.deepest
+	  && alt_synth_info.num_mults < synth_info.num_mults)
+	{
+	  whole_part = ceil_whole;
+	  frac_part = ceil_fract;
+	  synth_info.deepest = alt_synth_info.deepest;
+	  synth_info.num_mults = alt_synth_info.num_mults;
+	  memcpy (synth_info.factors, alt_synth_info.factors,
+		  (max_depth + 1) * sizeof (bool));
+	  one_over = false;
+	}
+    }
+
+  /* When optimising for size don't allow deep sqrt chains
+     or a large number of multiplies.  */
+  if (!speed_p
+      && (synth_info.deepest + synth_info.num_mults) > 3)
+    return NULL_TREE;
+
+  HOST_WIDE_INT n = real_to_integer (&whole_part);
+  REAL_VALUE_TYPE cint;
+  real_from_integer (&cint, VOIDmode, n, SIGNED);
+
+  if (!real_identical (&whole_part, &cint))
+    return NULL_TREE;
+
+  if (powi_cost (n) + synth_info.num_mults > POWI_MAX_MULTS)
+    return NULL_TREE;
+
+  memset (cache, 0, (max_depth + 1) * sizeof (tree));
+
+  tree integer_res = n == 0 ? build_real (type, dconst1) : arg0;
+
+  /* Calculate the integer part of the exponent.  */
+  if (n > 1)
+    {
+      integer_res = gimple_expand_builtin_powi (gsi, loc, arg0, n);
+      if (!integer_res)
+	return NULL_TREE;
+    }
+
+  if (dump_file)
+    {
+      char string[64];
+
+      real_to_decimal (string, &exp_init, sizeof (string), 0, 1);
+      fprintf (dump_file, "synthesizing pow (x, %s) as:\n", string);
+
+      if (neg_exp)
+	{
+	  if (one_over)
+	    {
+	      fprintf (dump_file, "1.0 / (");
+	      dump_integer_part (dump_file, "x", n);
+	      if (n > 0)
+	        fprintf (dump_file, " * ");
+	      dump_fractional_sqrt_sequence (dump_file, "x", &synth_info);
+	      fprintf (dump_file, ")");
+	    }
+	  else
+	    {
+	      dump_fractional_sqrt_sequence (dump_file, "x", &synth_info);
+	      fprintf (dump_file, " / (");
+	      dump_integer_part (dump_file, "x", n);
+	      fprintf (dump_file, ")");
+	    }
+	}
+      else
+	{
+	  dump_fractional_sqrt_sequence (dump_file, "x", &synth_info);
+	  if (n > 0)
+	    fprintf (dump_file, " * ");
+	  dump_integer_part (dump_file, "x", n);
+	}
+
+      fprintf (dump_file, "\ndeepest sqrt chain: %d\n", synth_info.deepest);
+    }
+
+
+  tree fract_res = NULL_TREE;
+  cache[0] = arg0;
+
+  /* Calculate the fractional part of the exponent.  */
+  for (unsigned i = 0; i < synth_info.deepest; i++)
+    {
+      if (synth_info.factors[i])
+	{
+	  tree sqrt_chain = get_fn_chain (arg0, i + 1, gsi, sqrtfn, loc, cache);
+
+	  if (!fract_res)
+	      fract_res = sqrt_chain;
+
+	  else
+	    fract_res = build_and_insert_binop (gsi, loc, "powroot", MULT_EXPR,
+					   fract_res, sqrt_chain);
+	}
+    }
+
+  tree res = NULL_TREE;
+
+  if (neg_exp)
+    {
+      if (one_over)
+	{
+	  if (n > 0)
+	    res = build_and_insert_binop (gsi, loc, "powroot", MULT_EXPR,
+					   fract_res, integer_res);
+	  else
+	    res = fract_res;
+
+	  res = build_and_insert_binop (gsi, loc, "powrootrecip", RDIV_EXPR,
+					  build_real (type, dconst1), res);
+	}
+      else
+	{
+	  res = build_and_insert_binop (gsi, loc, "powroot", RDIV_EXPR,
+					 fract_res, integer_res);
+	}
+    }
+  else
+    res = build_and_insert_binop (gsi, loc, "powroot", MULT_EXPR,
+				   fract_res, integer_res);
+  return res;
+}
+
 /* ARG0 and ARG1 are the two arguments to a pow builtin call in GSI
    with location info LOC.  If possible, create an equivalent and
    less expensive sequence of statements prior to GSI, and return an
@@ -1157,10 +1522,10 @@ static tree
 gimple_expand_builtin_pow (gimple_stmt_iterator *gsi, location_t loc, 
 			   tree arg0, tree arg1)
 {
-  REAL_VALUE_TYPE c, cint, dconst1_4, dconst3_4, dconst1_3, dconst1_6;
+  REAL_VALUE_TYPE c, cint, dconst1_3, dconst1_6;
   REAL_VALUE_TYPE c2, dconst3;
   HOST_WIDE_INT n;
-  tree type, sqrtfn, cbrtfn, sqrt_arg0, sqrt_sqrt, result, cbrt_x, powi_cbrt_x;
+  tree type, sqrtfn, cbrtfn, sqrt_arg0, result, cbrt_x, powi_cbrt_x;
   machine_mode mode;
   bool hw_sqrt_exists, c_is_int, c2_is_int;
 
@@ -1188,57 +1553,8 @@ gimple_expand_builtin_pow (gimple_stmt_iterator *gsi, location_t loc,
   mode = TYPE_MODE (type);
   sqrtfn = mathfn_built_in (type, BUILT_IN_SQRT);
 
-  /* Optimize pow(x,0.5) = sqrt(x).  This replacement is always safe
-     unless signed zeros must be maintained.  pow(-0,0.5) = +0, while
-     sqrt(-0) = -0.  */
-  if (sqrtfn
-      && REAL_VALUES_EQUAL (c, dconsthalf)
-      && !HONOR_SIGNED_ZEROS (mode))
-    return build_and_insert_call (gsi, loc, sqrtfn, arg0);
-
-  /* Optimize pow(x,0.25) = sqrt(sqrt(x)).  Assume on most machines that
-     a builtin sqrt instruction is smaller than a call to pow with 0.25,
-     so do this optimization even if -Os.  Don't do this optimization
-     if we don't have a hardware sqrt insn.  */
-  dconst1_4 = dconst1;
-  SET_REAL_EXP (&dconst1_4, REAL_EXP (&dconst1_4) - 2);
   hw_sqrt_exists = optab_handler (sqrt_optab, mode) != CODE_FOR_nothing;
 
-  if (flag_unsafe_math_optimizations
-      && sqrtfn
-      && REAL_VALUES_EQUAL (c, dconst1_4)
-      && hw_sqrt_exists)
-    {
-      /* sqrt(x)  */
-      sqrt_arg0 = build_and_insert_call (gsi, loc, sqrtfn, arg0);
-
-      /* sqrt(sqrt(x))  */
-      return build_and_insert_call (gsi, loc, sqrtfn, sqrt_arg0);
-    }
-      
-  /* Optimize pow(x,0.75) = sqrt(x) * sqrt(sqrt(x)) unless we are
-     optimizing for space.  Don't do this optimization if we don't have
-     a hardware sqrt insn.  */
-  real_from_integer (&dconst3_4, VOIDmode, 3, SIGNED);
-  SET_REAL_EXP (&dconst3_4, REAL_EXP (&dconst3_4) - 2);
-
-  if (flag_unsafe_math_optimizations
-      && sqrtfn
-      && optimize_function_for_speed_p (cfun)
-      && REAL_VALUES_EQUAL (c, dconst3_4)
-      && hw_sqrt_exists)
-    {
-      /* sqrt(x)  */
-      sqrt_arg0 = build_and_insert_call (gsi, loc, sqrtfn, arg0);
-
-      /* sqrt(sqrt(x))  */
-      sqrt_sqrt = build_and_insert_call (gsi, loc, sqrtfn, sqrt_arg0);
-
-      /* sqrt(x) * sqrt(sqrt(x))  */
-      return build_and_insert_binop (gsi, loc, "powroot", MULT_EXPR,
-				     sqrt_arg0, sqrt_sqrt);
-    }
-
   /* Optimize pow(x,1./3.) = cbrt(x).  This requires unsafe math
      optimizations since 1./3. is not exactly representable.  If x
      is negative and finite, the correct value of pow(x,1./3.) is
@@ -1274,54 +1590,29 @@ gimple_expand_builtin_pow (gimple_stmt_iterator *gsi, location_t loc,
       return build_and_insert_call (gsi, loc, cbrtfn, sqrt_arg0);
     }
 
-  /* Optimize pow(x,c), where n = 2c for some nonzero integer n
-     and c not an integer, into
-
-       sqrt(x) * powi(x, n/2),                n > 0;
-       1.0 / (sqrt(x) * powi(x, abs(n/2))),   n < 0.
-
-     Do not calculate the powi factor when n/2 = 0.  */
-  real_arithmetic (&c2, MULT_EXPR, &c, &dconst2);
-  n = real_to_integer (&c2);
-  real_from_integer (&cint, VOIDmode, n, SIGNED);
-  c2_is_int = real_identical (&c2, &cint);
 
+  /* Attempt to expand the POW as a product of square root chains.
+     Do not try this if there is no hardware sqrt instruction *except*
+     in the case where the exponent is 0.5, where we assume it will always
+     be beneficial.  */
   if (flag_unsafe_math_optimizations
       && sqrtfn
-      && c2_is_int
-      && !c_is_int
-      && optimize_function_for_speed_p (cfun))
+      && (hw_sqrt_exists || REAL_VALUES_EQUAL (c, dconsthalf))
+      && !HONOR_SIGNED_ZEROS (mode))
     {
-      tree powi_x_ndiv2 = NULL_TREE;
-
-      /* Attempt to fold powi(arg0, abs(n/2)) into multiplies.  If not
-         possible or profitable, give up.  Skip the degenerate case when
-         n is 1 or -1, where the result is always 1.  */
-      if (absu_hwi (n) != 1)
-	{
-	  powi_x_ndiv2 = gimple_expand_builtin_powi (gsi, loc, arg0,
-						     abs_hwi (n / 2));
-	  if (!powi_x_ndiv2)
-	    return NULL_TREE;
-	}
-
-      /* Calculate sqrt(x).  When n is not 1 or -1, multiply it by the
-	 result of the optimal multiply sequence just calculated.  */
-      sqrt_arg0 = build_and_insert_call (gsi, loc, sqrtfn, arg0);
+      tree expand_with_sqrts
+	= expand_pow_as_sqrts (gsi, loc, arg0, arg1,
+			       PARAM_VALUE (PARAM_MAX_POW_SQRT_DEPTH));
 
-      if (absu_hwi (n) == 1)
-	result = sqrt_arg0;
-      else
-	result = build_and_insert_binop (gsi, loc, "powroot", MULT_EXPR,
-					 sqrt_arg0, powi_x_ndiv2);
-
-      /* If n is negative, reciprocate the result.  */
-      if (n < 0)
-	result = build_and_insert_binop (gsi, loc, "powroot", RDIV_EXPR,
-					 build_real (type, dconst1), result);
-      return result;
+      if (expand_with_sqrts)
+	return expand_with_sqrts;
     }
 
+  real_arithmetic (&c2, MULT_EXPR, &c, &dconst2);
+  n = real_to_integer (&c2);
+  real_from_integer (&cint, VOIDmode, n, SIGNED);
+  c2_is_int = real_identical (&c2, &cint);
+
   /* Optimize pow(x,c), where 3c = n for some nonzero integer n, into
 
      powi(x, n/3) * powi(cbrt(x), n%3),                    n > 0;

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH][tree-ssa-math-opts] Expand pow (x, CONST) using square roots when possible
  2015-05-01 16:02 [PATCH][tree-ssa-math-opts] Expand pow (x, CONST) using square roots when possible Kyrill Tkachov
@ 2015-05-08  8:15 ` Kyrill Tkachov
  2015-05-08 10:18 ` Richard Biener
  1 sibling, 0 replies; 9+ messages in thread
From: Kyrill Tkachov @ 2015-05-08  8:15 UTC (permalink / raw)
  To: Kyrill Tkachov, GCC Patches

Ping.
https://gcc.gnu.org/ml/gcc-patches/2015-05/msg00071.html

Thanks,
Kyrill
On 01/05/15 17:02, Kyrill Tkachov wrote:
> Hi all,
>
> GCC has some logic to expand calls to pow (x, 0.75), pow (0.25) and pow (x, (int)k + 0.5)
> using square roots. So, for the above examples it would generate sqrt (x) * sqrt (sqrt (x)),
> sqrt (sqrt (x)) and powi (x, k) * sqrt (x) (assuming k > 0. For k < 0 it will calculate the
> reciprocal of that).
>
> However, the implementation of these optimisations is done on a bit of an ad-hoc basis with
> the 0.25, 0.5, 0.75 cases hardcoded.
> Judging by https://gcc.gnu.org/wiki/summit2010?action=AttachFile&do=get&target=meissner2.pdf
> these are the most commonly used exponents (at least in SPEC ;))
>
> This patch generalises this optimisation into a (hopefully) more robust algorithm.
> In particular, it expands calls to pow (x, CST) by expanding the integer part of CST
> using a powi, like it does already, and then expanding the fractional part as a product
> of repeated applications of a square root if the fractional part can be expressed
> as a multiple of a power of 0.5.
>
> I try to explain the algorithm in more detail in the comments in the patch but, for example:
>
> pow (x, 5.625) is not currently handled, but with this patch will be expanded
> to powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x))) because 5.625 == 5.0 + 0.5 + 0.5**3
>
> Negative exponents are handled in either of two ways, depending on the exponent value:
> * Using a simple reciprocal.
>     For example:
>     pow (x, -5.625) == 1.0 / pow (x, 5.625)
>       --> 1.0 / (powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x))))
>
> * For pow (x, EXP) with negative exponent EXP with integer part INT and fractional part FRAC:
> pow (1.0 - FRAC) / powi (ceil (abs (EXP))).
>     For example:
>     pow (x, -5.875) == pow (x, 0.125) / powi (X, 6)
>       --> sqrt (sqrt (sqrt (x))) / (powi (x, 6))
>
>
> Since hardware square root instructions tend to be expensive, we may want to reduce the number
> of square roots we are willing to calculate. Since we reuse intermediate square root results,
> this boils down to restricting the depth of the square root chains. In all the examples above
> that depth is 3. I've made this maximum depth parametrisable in params.def. By adjusting that
> parameter we can adjust the resolution of this optimisation. So, if it's set to '4' then we
> will synthesize every exponent that is a multiple of 0.5**4 == 0.0625, including negative
> multiples. Currently, GCC will not try to expand negative multiples of anything else than 0.5
>
> I have tried to keep the existing functionality intact and activate this only for
> -funsafe-math-optimizations and only when the target has a sqrt instruction.
>    An exception to that is pow (x, 0.5) which we prefer to transform to sqrt even
> when a hardware sqrt is not available, presumably because the library function for
> sqrt is usually faster than pow (?).
>
>
> Having seen the glibc implementation of a fully IEEE-754-compliant pow function, I think we
> would prefer synthesising the pow call whenever we can for -ffast-math.
>
> I have seen this optimisation trigger a few times in SPEC2k6, in particular in 447.dealII
> and 481.wrf where it replaced calls to powf (x, -0.25), pow (x, 0.125) and pow (x, 0.875)
> with square roots, multiplies and, in the case of -0.25, divides.
> On 481.wrf I saw it remove a total of 22 out of 322 calls to pow
>
> On 481.wrf on aarch64 I saw about a 1% improvement.
> The cycle count on x86_64 was also smaller, but not by a significant amount (the same calls to
> pow were eliminated).
>
> In general, I think this can shine if multiple expandable calls to pow appear together.
> So, for example for code:
> double
> baz (double a)
> {
>     return __builtin_pow (a, -1.25) + __builtin_pow (a, 5.75) - __builtin_pow (a, 3.375);
> }
>
> we can generate:
> baz:
>           fsqrt   d3, d0
>           fmul    d4, d0, d0
>           fmov    d5, 1.0e+0
>           fmul    d6, d0, d4
>           fsqrt   d2, d3
>           fmul    d1, d0, d2
>           fsqrt   d0, d2
>           fmul    d3, d3, d2
>           fdiv    d1, d5, d1
>           fmul    d3, d3, d6
>           fmul    d2, d2, d0
>           fmadd   d0, d4, d3, d1
>           fmsub   d0, d6, d2, d0
>           ret
>
> reusing the sqrt results and doing more optimisations rather than the current:
> baz:
>           stp     x29, x30, [sp, -48]!
>           fmov    d1, -1.25e+0
>           add     x29, sp, 0
>           stp     d8, d9, [sp, 16]
>           fmov    d9, d0
>           str     d10, [sp, 32]
>           bl      pow
>           fmov    d8, d0
>           fmov    d0, d9
>           fmov    d1, 5.75e+0
>           bl      pow
>           fmov    d10, d0
>           fmov    d0, d9
>           fmov    d1, 3.375e+0
>           bl      pow
>           fadd    d8, d8, d10
>           ldr     d10, [sp, 32]
>           fsub    d0, d8, d0
>           ldp     d8, d9, [sp, 16]
>           ldp     x29, x30, [sp], 48
>           ret
>
>
> Of course gcc could already do that if the exponents were to fall in the set {0.25, 0.75, k+0.5}
> but with this patch that set can be greatly expanded.
>
> I suppose if we're really lucky we might even open up new vectorisation opportunities.
> For example:
> void
> vecfoo (float *a, float *b)
> {
>     for (int i = 0; i < N; i++)
>       a[i] = __builtin_powf (a[i], 1.25f) - __builtin_powf (b[i], 3.625);
> }
>
> will now be vectorisable if we have a hardware vector sqrt instruction. Though I'm not sure
> how often this would appear in real code.
>
>
> I would like advice on some implementation details:
> - The new function representable_as_half_series_p is used to check if the fractional
> part of an exponent can be represented as a multiple of a power of 0.5. It does so
> by using REAL_VALUE_TYPE arithmetic and a loop. I wonder whether there is a smarter
> way of doing this, considering that REAL_VALUE_TYPE holds the bit representation of the
> floating point number?
>
> - Are there any correctness cases that I may have missed? This patch gates the optimisation
> on -funsafe-math-optimizations, but maybe there are some edge cases that I missed?
>
> - What should be the default value of the max-pow-sqrt-depth param? In this patch it's 5 which
> on second thought strikes me as a bit aggressive. To catch exponents that are multiples of
> 0.25 we need it to be 2. For multiples of 0.125 it has to be 3 etc... I suppose it depends on
> how likely more fine-grained powers are to appear in real code, how expensive the C library
> implementation of pow is, and how expensive are the sqrt instructions in hardware.
>
>
> Bootstrapped and tested on x86_64, aarch64, arm (pending)
> SPEC2k6 built and ran fine. Can anyone suggest anything other codebase that might
> be of interest to look at?
>
> Thanks,
> Kyrill
>
> 2015-05-01  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
>       * params.def (PARAM_MAX_POW_SQRT_DEPTH): New param.
>       * tree-ssa-math-opts.c: Include params.h
>       (pow_synth_sqrt_info): New struct.
>       (representable_as_half_series_p): New function.
>       (get_fn_chain): Likewise.
>       (print_nested_fn): Likewise.
>       (dump_fractional_sqrt_sequence): Likewise.
>       (dump_integer_part): Likewise.
>       (expand_pow_as_sqrts): Likewise.
>       (gimple_expand_builtin_pow): Use above to attempt to expand
>       pow as series of square roots.  Removed now unused variables.
>
> 2015-05-01  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
>       * gcc.dg/pow-sqrt.x: New file.
>       * gcc.dg/pow-sqrt-1.c: New test.
>       * gcc.dg/pow-sqrt-2.c: Likewise.
>       * gcc.dg/pow-sqrt-3.c: Likewise.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH][tree-ssa-math-opts] Expand pow (x, CONST) using square roots when possible
  2015-05-01 16:02 [PATCH][tree-ssa-math-opts] Expand pow (x, CONST) using square roots when possible Kyrill Tkachov
  2015-05-08  8:15 ` Kyrill Tkachov
@ 2015-05-08 10:18 ` Richard Biener
  2015-05-08 13:57   ` Kyrill Tkachov
  1 sibling, 1 reply; 9+ messages in thread
From: Richard Biener @ 2015-05-08 10:18 UTC (permalink / raw)
  To: Kyrill Tkachov; +Cc: GCC Patches

On Fri, May 1, 2015 at 6:02 PM, Kyrill Tkachov
<kyrylo.tkachov@foss.arm.com> wrote:
> Hi all,
>
> GCC has some logic to expand calls to pow (x, 0.75), pow (0.25) and pow (x,
> (int)k + 0.5)
> using square roots. So, for the above examples it would generate sqrt (x) *
> sqrt (sqrt (x)),
> sqrt (sqrt (x)) and powi (x, k) * sqrt (x) (assuming k > 0. For k < 0 it
> will calculate the
> reciprocal of that).
>
> However, the implementation of these optimisations is done on a bit of an
> ad-hoc basis with
> the 0.25, 0.5, 0.75 cases hardcoded.
> Judging by
> https://gcc.gnu.org/wiki/summit2010?action=AttachFile&do=get&target=meissner2.pdf
> these are the most commonly used exponents (at least in SPEC ;))
>
> This patch generalises this optimisation into a (hopefully) more robust
> algorithm.
> In particular, it expands calls to pow (x, CST) by expanding the integer
> part of CST
> using a powi, like it does already, and then expanding the fractional part
> as a product
> of repeated applications of a square root if the fractional part can be
> expressed
> as a multiple of a power of 0.5.
>
> I try to explain the algorithm in more detail in the comments in the patch
> but, for example:
>
> pow (x, 5.625) is not currently handled, but with this patch will be
> expanded
> to powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x))) because 5.625 == 5.0 +
> 0.5 + 0.5**3
>
> Negative exponents are handled in either of two ways, depending on the
> exponent value:
> * Using a simple reciprocal.
>   For example:
>   pow (x, -5.625) == 1.0 / pow (x, 5.625)
>     --> 1.0 / (powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x))))
>
> * For pow (x, EXP) with negative exponent EXP with integer part INT and
> fractional part FRAC:
> pow (1.0 - FRAC) / powi (ceil (abs (EXP))).
>   For example:
>   pow (x, -5.875) == pow (x, 0.125) / powi (X, 6)
>     --> sqrt (sqrt (sqrt (x))) / (powi (x, 6))
>
>
> Since hardware square root instructions tend to be expensive, we may want to
> reduce the number
> of square roots we are willing to calculate. Since we reuse intermediate
> square root results,
> this boils down to restricting the depth of the square root chains. In all
> the examples above
> that depth is 3. I've made this maximum depth parametrisable in params.def.
> By adjusting that
> parameter we can adjust the resolution of this optimisation. So, if it's set
> to '4' then we
> will synthesize every exponent that is a multiple of 0.5**4 == 0.0625,
> including negative
> multiples. Currently, GCC will not try to expand negative multiples of
> anything else than 0.5
>
> I have tried to keep the existing functionality intact and activate this
> only for
> -funsafe-math-optimizations and only when the target has a sqrt instruction.
>  An exception to that is pow (x, 0.5) which we prefer to transform to sqrt
> even
> when a hardware sqrt is not available, presumably because the library
> function for
> sqrt is usually faster than pow (?).

Yes.  It's also a safe transform - which you seem to put under
flag_unsafe_math_optimizations only with your patch.

It would be clearer to just leave the special-case

-  /* Optimize pow(x,0.5) = sqrt(x).  This replacement is always safe
-     unless signed zeros must be maintained.  pow(-0,0.5) = +0, while
-     sqrt(-0) = -0.  */
-  if (sqrtfn
-      && REAL_VALUES_EQUAL (c, dconsthalf)
-      && !HONOR_SIGNED_ZEROS (mode))
-    return build_and_insert_call (gsi, loc, sqrtfn, arg0);

in as-is.

You also removed the Os constraint which you should put back in.
Basically if !optimize_function_for_speed_p then generate at most
two calls to sqrt (iff the HW has a sqrt instruction).

You fail to add a testcase that checks that the optimization applies.

Otherwise the idea looks good though there must be a better way
to compute the series than by using real-arithmetic and forcefully
trying out all possibilities...

Richard.

>
>
> Having seen the glibc implementation of a fully IEEE-754-compliant pow
> function, I think we
> would prefer synthesising the pow call whenever we can for -ffast-math.
>
> I have seen this optimisation trigger a few times in SPEC2k6, in particular
> in 447.dealII
> and 481.wrf where it replaced calls to powf (x, -0.25), pow (x, 0.125) and
> pow (x, 0.875)
> with square roots, multiplies and, in the case of -0.25, divides.
> On 481.wrf I saw it remove a total of 22 out of 322 calls to pow
>
> On 481.wrf on aarch64 I saw about a 1% improvement.
> The cycle count on x86_64 was also smaller, but not by a significant amount
> (the same calls to
> pow were eliminated).
>
> In general, I think this can shine if multiple expandable calls to pow
> appear together.
> So, for example for code:
> double
> baz (double a)
> {
>   return __builtin_pow (a, -1.25) + __builtin_pow (a, 5.75) - __builtin_pow
> (a, 3.375);
> }
>
> we can generate:
> baz:
>         fsqrt   d3, d0
>         fmul    d4, d0, d0
>         fmov    d5, 1.0e+0
>         fmul    d6, d0, d4
>         fsqrt   d2, d3
>         fmul    d1, d0, d2
>         fsqrt   d0, d2
>         fmul    d3, d3, d2
>         fdiv    d1, d5, d1
>         fmul    d3, d3, d6
>         fmul    d2, d2, d0
>         fmadd   d0, d4, d3, d1
>         fmsub   d0, d6, d2, d0
>         ret
>
> reusing the sqrt results and doing more optimisations rather than the
> current:
> baz:
>         stp     x29, x30, [sp, -48]!
>         fmov    d1, -1.25e+0
>         add     x29, sp, 0
>         stp     d8, d9, [sp, 16]
>         fmov    d9, d0
>         str     d10, [sp, 32]
>         bl      pow
>         fmov    d8, d0
>         fmov    d0, d9
>         fmov    d1, 5.75e+0
>         bl      pow
>         fmov    d10, d0
>         fmov    d0, d9
>         fmov    d1, 3.375e+0
>         bl      pow
>         fadd    d8, d8, d10
>         ldr     d10, [sp, 32]
>         fsub    d0, d8, d0
>         ldp     d8, d9, [sp, 16]
>         ldp     x29, x30, [sp], 48
>         ret
>
>
> Of course gcc could already do that if the exponents were to fall in the set
> {0.25, 0.75, k+0.5}
> but with this patch that set can be greatly expanded.
>
> I suppose if we're really lucky we might even open up new vectorisation
> opportunities.
> For example:
> void
> vecfoo (float *a, float *b)
> {
>   for (int i = 0; i < N; i++)
>     a[i] = __builtin_powf (a[i], 1.25f) - __builtin_powf (b[i], 3.625);
> }
>
> will now be vectorisable if we have a hardware vector sqrt instruction.
> Though I'm not sure
> how often this would appear in real code.
>
>
> I would like advice on some implementation details:
> - The new function representable_as_half_series_p is used to check if the
> fractional
> part of an exponent can be represented as a multiple of a power of 0.5. It
> does so
> by using REAL_VALUE_TYPE arithmetic and a loop. I wonder whether there is a
> smarter
> way of doing this, considering that REAL_VALUE_TYPE holds the bit
> representation of the
> floating point number?
>
> - Are there any correctness cases that I may have missed? This patch gates
> the optimisation
> on -funsafe-math-optimizations, but maybe there are some edge cases that I
> missed?
>
> - What should be the default value of the max-pow-sqrt-depth param? In this
> patch it's 5 which
> on second thought strikes me as a bit aggressive. To catch exponents that
> are multiples of
> 0.25 we need it to be 2. For multiples of 0.125 it has to be 3 etc... I
> suppose it depends on
> how likely more fine-grained powers are to appear in real code, how
> expensive the C library
> implementation of pow is, and how expensive are the sqrt instructions in
> hardware.
>
>
> Bootstrapped and tested on x86_64, aarch64, arm (pending)
> SPEC2k6 built and ran fine. Can anyone suggest anything other codebase that
> might
> be of interest to look at?
>
> Thanks,
> Kyrill
>
> 2015-05-01  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
>     * params.def (PARAM_MAX_POW_SQRT_DEPTH): New param.
>     * tree-ssa-math-opts.c: Include params.h
>     (pow_synth_sqrt_info): New struct.
>     (representable_as_half_series_p): New function.
>     (get_fn_chain): Likewise.
>     (print_nested_fn): Likewise.
>     (dump_fractional_sqrt_sequence): Likewise.
>     (dump_integer_part): Likewise.
>     (expand_pow_as_sqrts): Likewise.
>     (gimple_expand_builtin_pow): Use above to attempt to expand
>     pow as series of square roots.  Removed now unused variables.
>
> 2015-05-01  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
>     * gcc.dg/pow-sqrt.x: New file.
>     * gcc.dg/pow-sqrt-1.c: New test.
>     * gcc.dg/pow-sqrt-2.c: Likewise.
>     * gcc.dg/pow-sqrt-3.c: Likewise.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH][tree-ssa-math-opts] Expand pow (x, CONST) using square roots when possible
  2015-05-08 10:18 ` Richard Biener
@ 2015-05-08 13:57   ` Kyrill Tkachov
  2015-05-08 15:09     ` Kyrill Tkachov
  0 siblings, 1 reply; 9+ messages in thread
From: Kyrill Tkachov @ 2015-05-08 13:57 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches


On 08/05/15 11:18, Richard Biener wrote:
> On Fri, May 1, 2015 at 6:02 PM, Kyrill Tkachov
> <kyrylo.tkachov@foss.arm.com> wrote:
>> Hi all,
>>
>> GCC has some logic to expand calls to pow (x, 0.75), pow (0.25) and pow (x,
>> (int)k + 0.5)
>> using square roots. So, for the above examples it would generate sqrt (x) *
>> sqrt (sqrt (x)),
>> sqrt (sqrt (x)) and powi (x, k) * sqrt (x) (assuming k > 0. For k < 0 it
>> will calculate the
>> reciprocal of that).
>>
>> However, the implementation of these optimisations is done on a bit of an
>> ad-hoc basis with
>> the 0.25, 0.5, 0.75 cases hardcoded.
>> Judging by
>> https://gcc.gnu.org/wiki/summit2010?action=AttachFile&do=get&target=meissner2.pdf
>> these are the most commonly used exponents (at least in SPEC ;))
>>
>> This patch generalises this optimisation into a (hopefully) more robust
>> algorithm.
>> In particular, it expands calls to pow (x, CST) by expanding the integer
>> part of CST
>> using a powi, like it does already, and then expanding the fractional part
>> as a product
>> of repeated applications of a square root if the fractional part can be
>> expressed
>> as a multiple of a power of 0.5.
>>
>> I try to explain the algorithm in more detail in the comments in the patch
>> but, for example:
>>
>> pow (x, 5.625) is not currently handled, but with this patch will be
>> expanded
>> to powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x))) because 5.625 == 5.0 +
>> 0.5 + 0.5**3
>>
>> Negative exponents are handled in either of two ways, depending on the
>> exponent value:
>> * Using a simple reciprocal.
>>    For example:
>>    pow (x, -5.625) == 1.0 / pow (x, 5.625)
>>      --> 1.0 / (powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x))))
>>
>> * For pow (x, EXP) with negative exponent EXP with integer part INT and
>> fractional part FRAC:
>> pow (1.0 - FRAC) / powi (ceil (abs (EXP))).
>>    For example:
>>    pow (x, -5.875) == pow (x, 0.125) / powi (X, 6)
>>      --> sqrt (sqrt (sqrt (x))) / (powi (x, 6))
>>
>>
>> Since hardware square root instructions tend to be expensive, we may want to
>> reduce the number
>> of square roots we are willing to calculate. Since we reuse intermediate
>> square root results,
>> this boils down to restricting the depth of the square root chains. In all
>> the examples above
>> that depth is 3. I've made this maximum depth parametrisable in params.def.
>> By adjusting that
>> parameter we can adjust the resolution of this optimisation. So, if it's set
>> to '4' then we
>> will synthesize every exponent that is a multiple of 0.5**4 == 0.0625,
>> including negative
>> multiples. Currently, GCC will not try to expand negative multiples of
>> anything else than 0.5
>>
>> I have tried to keep the existing functionality intact and activate this
>> only for
>> -funsafe-math-optimizations and only when the target has a sqrt instruction.
>>   An exception to that is pow (x, 0.5) which we prefer to transform to sqrt
>> even
>> when a hardware sqrt is not available, presumably because the library
>> function for
>> sqrt is usually faster than pow (?).
> Yes.  It's also a safe transform - which you seem to put under
> flag_unsafe_math_optimizations only with your patch.
>
> It would be clearer to just leave the special-case
>
> -  /* Optimize pow(x,0.5) = sqrt(x).  This replacement is always safe
> -     unless signed zeros must be maintained.  pow(-0,0.5) = +0, while
> -     sqrt(-0) = -0.  */
> -  if (sqrtfn
> -      && REAL_VALUES_EQUAL (c, dconsthalf)
> -      && !HONOR_SIGNED_ZEROS (mode))
> -    return build_and_insert_call (gsi, loc, sqrtfn, arg0);
>
> in as-is.

Ok, I'll leave that case explicit.

>
> You also removed the Os constraint which you should put back in.
> Basically if !optimize_function_for_speed_p then generate at most
> two calls to sqrt (iff the HW has a sqrt instruction).

I tried to move that logic into expand_with_sqrts but
I'll move it outside it. It seems that this boils down to
only 0.25, as any other 2xsqrt chain will also involve a
multiply or a divide which we currently avoid.

>
> You fail to add a testcase that checks that the optimization applies.

I'll add one to scan the sincos dump.
I notice that we don't have a testuite check that the target has
a hw sqrt instructions. Would you like me to add one? Or can I make
the testcase aarch64-specific?

>
> Otherwise the idea looks good though there must be a better way
> to compute the series than by using real-arithmetic and forcefully
> trying out all possibilities...

I get that feeling too. What I need is not only a way
of figuring out if the fractional part of the exponent can be
represented in this way, but also compute the depth of the
sqrt chain and the number of multiplies...
That being said, the current approach is O(maximum depth) and
I don't expect the depth to go much beyond 3 or 4 in practice.

Thanks for looking at it!
I'll respin the patch.

Kyrill

>
> Richard.
>
>>
>> Having seen the glibc implementation of a fully IEEE-754-compliant pow
>> function, I think we
>> would prefer synthesising the pow call whenever we can for -ffast-math.
>>
>> I have seen this optimisation trigger a few times in SPEC2k6, in particular
>> in 447.dealII
>> and 481.wrf where it replaced calls to powf (x, -0.25), pow (x, 0.125) and
>> pow (x, 0.875)
>> with square roots, multiplies and, in the case of -0.25, divides.
>> On 481.wrf I saw it remove a total of 22 out of 322 calls to pow
>>
>> On 481.wrf on aarch64 I saw about a 1% improvement.
>> The cycle count on x86_64 was also smaller, but not by a significant amount
>> (the same calls to
>> pow were eliminated).
>>
>> In general, I think this can shine if multiple expandable calls to pow
>> appear together.
>> So, for example for code:
>> double
>> baz (double a)
>> {
>>    return __builtin_pow (a, -1.25) + __builtin_pow (a, 5.75) - __builtin_pow
>> (a, 3.375);
>> }
>>
>> we can generate:
>> baz:
>>          fsqrt   d3, d0
>>          fmul    d4, d0, d0
>>          fmov    d5, 1.0e+0
>>          fmul    d6, d0, d4
>>          fsqrt   d2, d3
>>          fmul    d1, d0, d2
>>          fsqrt   d0, d2
>>          fmul    d3, d3, d2
>>          fdiv    d1, d5, d1
>>          fmul    d3, d3, d6
>>          fmul    d2, d2, d0
>>          fmadd   d0, d4, d3, d1
>>          fmsub   d0, d6, d2, d0
>>          ret
>>
>> reusing the sqrt results and doing more optimisations rather than the
>> current:
>> baz:
>>          stp     x29, x30, [sp, -48]!
>>          fmov    d1, -1.25e+0
>>          add     x29, sp, 0
>>          stp     d8, d9, [sp, 16]
>>          fmov    d9, d0
>>          str     d10, [sp, 32]
>>          bl      pow
>>          fmov    d8, d0
>>          fmov    d0, d9
>>          fmov    d1, 5.75e+0
>>          bl      pow
>>          fmov    d10, d0
>>          fmov    d0, d9
>>          fmov    d1, 3.375e+0
>>          bl      pow
>>          fadd    d8, d8, d10
>>          ldr     d10, [sp, 32]
>>          fsub    d0, d8, d0
>>          ldp     d8, d9, [sp, 16]
>>          ldp     x29, x30, [sp], 48
>>          ret
>>
>>
>> Of course gcc could already do that if the exponents were to fall in the set
>> {0.25, 0.75, k+0.5}
>> but with this patch that set can be greatly expanded.
>>
>> I suppose if we're really lucky we might even open up new vectorisation
>> opportunities.
>> For example:
>> void
>> vecfoo (float *a, float *b)
>> {
>>    for (int i = 0; i < N; i++)
>>      a[i] = __builtin_powf (a[i], 1.25f) - __builtin_powf (b[i], 3.625);
>> }
>>
>> will now be vectorisable if we have a hardware vector sqrt instruction.
>> Though I'm not sure
>> how often this would appear in real code.
>>
>>
>> I would like advice on some implementation details:
>> - The new function representable_as_half_series_p is used to check if the
>> fractional
>> part of an exponent can be represented as a multiple of a power of 0.5. It
>> does so
>> by using REAL_VALUE_TYPE arithmetic and a loop. I wonder whether there is a
>> smarter
>> way of doing this, considering that REAL_VALUE_TYPE holds the bit
>> representation of the
>> floating point number?
>>
>> - Are there any correctness cases that I may have missed? This patch gates
>> the optimisation
>> on -funsafe-math-optimizations, but maybe there are some edge cases that I
>> missed?
>>
>> - What should be the default value of the max-pow-sqrt-depth param? In this
>> patch it's 5 which
>> on second thought strikes me as a bit aggressive. To catch exponents that
>> are multiples of
>> 0.25 we need it to be 2. For multiples of 0.125 it has to be 3 etc... I
>> suppose it depends on
>> how likely more fine-grained powers are to appear in real code, how
>> expensive the C library
>> implementation of pow is, and how expensive are the sqrt instructions in
>> hardware.
>>
>>
>> Bootstrapped and tested on x86_64, aarch64, arm (pending)
>> SPEC2k6 built and ran fine. Can anyone suggest anything other codebase that
>> might
>> be of interest to look at?
>>
>> Thanks,
>> Kyrill
>>
>> 2015-05-01  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>
>>      * params.def (PARAM_MAX_POW_SQRT_DEPTH): New param.
>>      * tree-ssa-math-opts.c: Include params.h
>>      (pow_synth_sqrt_info): New struct.
>>      (representable_as_half_series_p): New function.
>>      (get_fn_chain): Likewise.
>>      (print_nested_fn): Likewise.
>>      (dump_fractional_sqrt_sequence): Likewise.
>>      (dump_integer_part): Likewise.
>>      (expand_pow_as_sqrts): Likewise.
>>      (gimple_expand_builtin_pow): Use above to attempt to expand
>>      pow as series of square roots.  Removed now unused variables.
>>
>> 2015-05-01  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>
>>      * gcc.dg/pow-sqrt.x: New file.
>>      * gcc.dg/pow-sqrt-1.c: New test.
>>      * gcc.dg/pow-sqrt-2.c: Likewise.
>>      * gcc.dg/pow-sqrt-3.c: Likewise.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH][tree-ssa-math-opts] Expand pow (x, CONST) using square roots when possible
  2015-05-08 13:57   ` Kyrill Tkachov
@ 2015-05-08 15:09     ` Kyrill Tkachov
  2015-05-13 11:32       ` Richard Biener
  0 siblings, 1 reply; 9+ messages in thread
From: Kyrill Tkachov @ 2015-05-08 15:09 UTC (permalink / raw)
  To: Kyrill Tkachov, Richard Biener; +Cc: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 10368 bytes --]


On 08/05/15 14:56, Kyrill Tkachov wrote:
> On 08/05/15 11:18, Richard Biener wrote:
>> On Fri, May 1, 2015 at 6:02 PM, Kyrill Tkachov
>> <kyrylo.tkachov@foss.arm.com> wrote:
>>> Hi all,
>>>
>>> GCC has some logic to expand calls to pow (x, 0.75), pow (0.25) and pow (x,
>>> (int)k + 0.5)
>>> using square roots. So, for the above examples it would generate sqrt (x) *
>>> sqrt (sqrt (x)),
>>> sqrt (sqrt (x)) and powi (x, k) * sqrt (x) (assuming k > 0. For k < 0 it
>>> will calculate the
>>> reciprocal of that).
>>>
>>> However, the implementation of these optimisations is done on a bit of an
>>> ad-hoc basis with
>>> the 0.25, 0.5, 0.75 cases hardcoded.
>>> Judging by
>>> https://gcc.gnu.org/wiki/summit2010?action=AttachFile&do=get&target=meissner2.pdf
>>> these are the most commonly used exponents (at least in SPEC ;))
>>>
>>> This patch generalises this optimisation into a (hopefully) more robust
>>> algorithm.
>>> In particular, it expands calls to pow (x, CST) by expanding the integer
>>> part of CST
>>> using a powi, like it does already, and then expanding the fractional part
>>> as a product
>>> of repeated applications of a square root if the fractional part can be
>>> expressed
>>> as a multiple of a power of 0.5.
>>>
>>> I try to explain the algorithm in more detail in the comments in the patch
>>> but, for example:
>>>
>>> pow (x, 5.625) is not currently handled, but with this patch will be
>>> expanded
>>> to powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x))) because 5.625 == 5.0 +
>>> 0.5 + 0.5**3
>>>
>>> Negative exponents are handled in either of two ways, depending on the
>>> exponent value:
>>> * Using a simple reciprocal.
>>>     For example:
>>>     pow (x, -5.625) == 1.0 / pow (x, 5.625)
>>>       --> 1.0 / (powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x))))
>>>
>>> * For pow (x, EXP) with negative exponent EXP with integer part INT and
>>> fractional part FRAC:
>>> pow (1.0 - FRAC) / powi (ceil (abs (EXP))).
>>>     For example:
>>>     pow (x, -5.875) == pow (x, 0.125) / powi (X, 6)
>>>       --> sqrt (sqrt (sqrt (x))) / (powi (x, 6))
>>>
>>>
>>> Since hardware square root instructions tend to be expensive, we may want to
>>> reduce the number
>>> of square roots we are willing to calculate. Since we reuse intermediate
>>> square root results,
>>> this boils down to restricting the depth of the square root chains. In all
>>> the examples above
>>> that depth is 3. I've made this maximum depth parametrisable in params.def.
>>> By adjusting that
>>> parameter we can adjust the resolution of this optimisation. So, if it's set
>>> to '4' then we
>>> will synthesize every exponent that is a multiple of 0.5**4 == 0.0625,
>>> including negative
>>> multiples. Currently, GCC will not try to expand negative multiples of
>>> anything else than 0.5
>>>
>>> I have tried to keep the existing functionality intact and activate this
>>> only for
>>> -funsafe-math-optimizations and only when the target has a sqrt instruction.
>>>    An exception to that is pow (x, 0.5) which we prefer to transform to sqrt
>>> even
>>> when a hardware sqrt is not available, presumably because the library
>>> function for
>>> sqrt is usually faster than pow (?).
>> Yes.  It's also a safe transform - which you seem to put under
>> flag_unsafe_math_optimizations only with your patch.
>>
>> It would be clearer to just leave the special-case
>>
>> -  /* Optimize pow(x,0.5) = sqrt(x).  This replacement is always safe
>> -     unless signed zeros must be maintained.  pow(-0,0.5) = +0, while
>> -     sqrt(-0) = -0.  */
>> -  if (sqrtfn
>> -      && REAL_VALUES_EQUAL (c, dconsthalf)
>> -      && !HONOR_SIGNED_ZEROS (mode))
>> -    return build_and_insert_call (gsi, loc, sqrtfn, arg0);
>>
>> in as-is.
> Ok, I'll leave that case explicit.
>
>> You also removed the Os constraint which you should put back in.
>> Basically if !optimize_function_for_speed_p then generate at most
>> two calls to sqrt (iff the HW has a sqrt instruction).
> I tried to move that logic into expand_with_sqrts but
> I'll move it outside it. It seems that this boils down to
> only 0.25, as any other 2xsqrt chain will also involve a
> multiply or a divide which we currently avoid.
>
>> You fail to add a testcase that checks that the optimization applies.
> I'll add one to scan the sincos dump.
> I notice that we don't have a testuite check that the target has
> a hw sqrt instructions. Would you like me to add one? Or can I make
> the testcase aarch64-specific?
>
>> Otherwise the idea looks good though there must be a better way
>> to compute the series than by using real-arithmetic and forcefully
>> trying out all possibilities...
> I get that feeling too. What I need is not only a way
> of figuring out if the fractional part of the exponent can be
> represented in this way, but also compute the depth of the
> sqrt chain and the number of multiplies...
> That being said, the current approach is O(maximum depth) and
> I don't expect the depth to go much beyond 3 or 4 in practice.
>
> Thanks for looking at it!
> I'll respin the patch.

And here it is, with my above comments implemented.
Bootstrapped on x86_64 and tested on aarch64.
Full testing on arm and aarch64 ongoing.

Is this ok if testing comes clean?

Thanks,
Kyrill

> Kyrill
>
>> Richard.
>>
>>> Having seen the glibc implementation of a fully IEEE-754-compliant pow
>>> function, I think we
>>> would prefer synthesising the pow call whenever we can for -ffast-math.
>>>
>>> I have seen this optimisation trigger a few times in SPEC2k6, in particular
>>> in 447.dealII
>>> and 481.wrf where it replaced calls to powf (x, -0.25), pow (x, 0.125) and
>>> pow (x, 0.875)
>>> with square roots, multiplies and, in the case of -0.25, divides.
>>> On 481.wrf I saw it remove a total of 22 out of 322 calls to pow
>>>
>>> On 481.wrf on aarch64 I saw about a 1% improvement.
>>> The cycle count on x86_64 was also smaller, but not by a significant amount
>>> (the same calls to
>>> pow were eliminated).
>>>
>>> In general, I think this can shine if multiple expandable calls to pow
>>> appear together.
>>> So, for example for code:
>>> double
>>> baz (double a)
>>> {
>>>     return __builtin_pow (a, -1.25) + __builtin_pow (a, 5.75) - __builtin_pow
>>> (a, 3.375);
>>> }
>>>
>>> we can generate:
>>> baz:
>>>           fsqrt   d3, d0
>>>           fmul    d4, d0, d0
>>>           fmov    d5, 1.0e+0
>>>           fmul    d6, d0, d4
>>>           fsqrt   d2, d3
>>>           fmul    d1, d0, d2
>>>           fsqrt   d0, d2
>>>           fmul    d3, d3, d2
>>>           fdiv    d1, d5, d1
>>>           fmul    d3, d3, d6
>>>           fmul    d2, d2, d0
>>>           fmadd   d0, d4, d3, d1
>>>           fmsub   d0, d6, d2, d0
>>>           ret
>>>
>>> reusing the sqrt results and doing more optimisations rather than the
>>> current:
>>> baz:
>>>           stp     x29, x30, [sp, -48]!
>>>           fmov    d1, -1.25e+0
>>>           add     x29, sp, 0
>>>           stp     d8, d9, [sp, 16]
>>>           fmov    d9, d0
>>>           str     d10, [sp, 32]
>>>           bl      pow
>>>           fmov    d8, d0
>>>           fmov    d0, d9
>>>           fmov    d1, 5.75e+0
>>>           bl      pow
>>>           fmov    d10, d0
>>>           fmov    d0, d9
>>>           fmov    d1, 3.375e+0
>>>           bl      pow
>>>           fadd    d8, d8, d10
>>>           ldr     d10, [sp, 32]
>>>           fsub    d0, d8, d0
>>>           ldp     d8, d9, [sp, 16]
>>>           ldp     x29, x30, [sp], 48
>>>           ret
>>>
>>>
>>> Of course gcc could already do that if the exponents were to fall in the set
>>> {0.25, 0.75, k+0.5}
>>> but with this patch that set can be greatly expanded.
>>>
>>> I suppose if we're really lucky we might even open up new vectorisation
>>> opportunities.
>>> For example:
>>> void
>>> vecfoo (float *a, float *b)
>>> {
>>>     for (int i = 0; i < N; i++)
>>>       a[i] = __builtin_powf (a[i], 1.25f) - __builtin_powf (b[i], 3.625);
>>> }
>>>
>>> will now be vectorisable if we have a hardware vector sqrt instruction.
>>> Though I'm not sure
>>> how often this would appear in real code.
>>>
>>>
>>> I would like advice on some implementation details:
>>> - The new function representable_as_half_series_p is used to check if the
>>> fractional
>>> part of an exponent can be represented as a multiple of a power of 0.5. It
>>> does so
>>> by using REAL_VALUE_TYPE arithmetic and a loop. I wonder whether there is a
>>> smarter
>>> way of doing this, considering that REAL_VALUE_TYPE holds the bit
>>> representation of the
>>> floating point number?
>>>
>>> - Are there any correctness cases that I may have missed? This patch gates
>>> the optimisation
>>> on -funsafe-math-optimizations, but maybe there are some edge cases that I
>>> missed?
>>>
>>> - What should be the default value of the max-pow-sqrt-depth param? In this
>>> patch it's 5 which
>>> on second thought strikes me as a bit aggressive. To catch exponents that
>>> are multiples of
>>> 0.25 we need it to be 2. For multiples of 0.125 it has to be 3 etc... I
>>> suppose it depends on
>>> how likely more fine-grained powers are to appear in real code, how
>>> expensive the C library
>>> implementation of pow is, and how expensive are the sqrt instructions in
>>> hardware.
>>>
>>>
>>> Bootstrapped and tested on x86_64, aarch64, arm (pending)
>>> SPEC2k6 built and ran fine. Can anyone suggest anything other codebase that
>>> might
>>> be of interest to look at?
>>>
>>> Thanks,
>>> Kyrill
>>>
>>> 2015-05-01  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>
>>>       * params.def (PARAM_MAX_POW_SQRT_DEPTH): New param.
>>>       * tree-ssa-math-opts.c: Include params.h
>>>       (pow_synth_sqrt_info): New struct.
>>>       (representable_as_half_series_p): New function.
>>>       (get_fn_chain): Likewise.
>>>       (print_nested_fn): Likewise.
>>>       (dump_fractional_sqrt_sequence): Likewise.
>>>       (dump_integer_part): Likewise.
>>>       (expand_pow_as_sqrts): Likewise.
>>>       (gimple_expand_builtin_pow): Use above to attempt to expand
>>>       pow as series of square roots.  Removed now unused variables.
>>>
>>> 2015-05-01  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>
>>>       * gcc.dg/pow-sqrt.x: New file.
>>>       * gcc.dg/pow-sqrt-1.c: New test.
>>>       * gcc.dg/pow-sqrt-2.c: Likewise.
>>>       * gcc.dg/pow-sqrt-3.c: Likewise.


[-- Attachment #2: pow-sqrt.patch --]
[-- Type: text/x-patch, Size: 20892 bytes --]

commit 1c55f44e3294b17ba113032c2d05bb9e09c8fc69
Author: Kyrylo Tkachov <kyrylo.tkachov@arm.com>
Date:   Wed Apr 29 13:27:07 2015 +0100

    [tree-ssa-math-opts] Expand pow (x, CONST) using square roots when possible

diff --git a/gcc/params.def b/gcc/params.def
index 48b39a2..3e4ba3a 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -262,6 +262,14 @@ DEFPARAM(PARAM_MAX_HOIST_DEPTH,
 	 "Maximum depth of search in the dominator tree for expressions to hoist",
 	 30, 0, 0)
 
+
+/* When synthesizing expnonentiation by a real constant operations using square
+   roots, this controls how deep sqrt chains we are willing to generate.  */
+DEFPARAM(PARAM_MAX_POW_SQRT_DEPTH,
+	 "max-pow-sqrt-depth",
+	 "Maximum depth of sqrt chains to use when synthesizing exponentiation by a real constant",
+	 5, 1, 32)
+
 /* This parameter limits the number of insns in a loop that will be unrolled,
    and by how much the loop is unrolled.
 
diff --git a/gcc/testsuite/gcc.dg/pow-sqrt-1.c b/gcc/testsuite/gcc.dg/pow-sqrt-1.c
new file mode 100644
index 0000000..0793b6f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pow-sqrt-1.c
@@ -0,0 +1,6 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ffast-math --param max-pow-sqrt-depth=5" } */
+
+#define EXPN (-6 * (0.5*0.5*0.5*0.5))
+
+#include "pow-sqrt.x"
diff --git a/gcc/testsuite/gcc.dg/pow-sqrt-2.c b/gcc/testsuite/gcc.dg/pow-sqrt-2.c
new file mode 100644
index 0000000..b2fada4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pow-sqrt-2.c
@@ -0,0 +1,5 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ffast-math --param max-pow-sqrt-depth=5" } */
+
+#define EXPN (-5.875)
+#include "pow-sqrt.x"
diff --git a/gcc/testsuite/gcc.dg/pow-sqrt-3.c b/gcc/testsuite/gcc.dg/pow-sqrt-3.c
new file mode 100644
index 0000000..18c7231
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pow-sqrt-3.c
@@ -0,0 +1,5 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ffast-math --param max-pow-sqrt-depth=3" } */
+
+#define EXPN (1.25)
+#include "pow-sqrt.x"
diff --git a/gcc/testsuite/gcc.dg/pow-sqrt.x b/gcc/testsuite/gcc.dg/pow-sqrt.x
new file mode 100644
index 0000000..bd744c6
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pow-sqrt.x
@@ -0,0 +1,30 @@
+
+extern void abort (void);
+
+
+__attribute__((noinline)) double
+real_pow (double x, double pow_exp)
+{
+  return __builtin_pow (x, pow_exp);
+}
+
+#define EPS (0.000000000000000000001)
+
+#define SYNTH_POW(X, Y) __builtin_pow (X, Y)
+volatile double arg;
+
+int
+main (void)
+{
+  double i_arg = 0.1;
+
+  for (arg = i_arg; arg < 100.0; arg += 1.0)
+    {
+      double synth_res = SYNTH_POW (arg, EXPN);
+      double real_res = real_pow (arg, EXPN);
+
+      if (__builtin_abs (SYNTH_POW (arg, EXPN) - real_pow (arg, EXPN)) > EPS)
+	abort ();
+    }
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.target/aarch64/pow-sqrt-synth-1.c b/gcc/testsuite/gcc.target/aarch64/pow-sqrt-synth-1.c
new file mode 100644
index 0000000..52514fb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pow-sqrt-synth-1.c
@@ -0,0 +1,38 @@
+/* { dg-do compile } */
+/* { dg-options "-fdump-tree-sincos -Ofast --param max-pow-sqrt-depth=8" } */
+
+
+double
+foo (double a)
+{
+  return __builtin_pow (a, -5.875);
+}
+
+double
+foof (double a)
+{
+  return __builtin_pow (a, 0.75f);
+}
+
+double
+bar (double a)
+{
+  return __builtin_pow (a, 1.0 + 0.00390625);
+}
+
+double
+baz (double a)
+{
+  return __builtin_pow (a, -1.25) + __builtin_pow (a, 5.75) - __builtin_pow (a, 3.375);
+}
+
+#define N 256
+void
+vecfoo (double *a)
+{
+  for (int i = 0; i < N; i++)
+    a[i] = __builtin_pow (a[i], 1.25);
+}
+
+/* { dg-final { scan-tree-dump-times "synthesizing" 7 "sincos" } } */
+/* { dg-final { cleanup-tree-dump "sincos" } } */
\ No newline at end of file
diff --git a/gcc/tree-ssa-math-opts.c b/gcc/tree-ssa-math-opts.c
index c22a677..11288b1 100644
--- a/gcc/tree-ssa-math-opts.c
+++ b/gcc/tree-ssa-math-opts.c
@@ -143,6 +143,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "target.h"
 #include "gimple-pretty-print.h"
 #include "builtins.h"
+#include "params.h"
 
 /* FIXME: RTL headers have to be included here for optabs.  */
 #include "rtl.h"		/* Because optabs.h wants enum rtx_code.  */
@@ -1148,6 +1149,357 @@ build_and_insert_cast (gimple_stmt_iterator *gsi, location_t loc,
   return result;
 }
 
+struct pow_synth_sqrt_info
+{
+  bool *factors;
+  unsigned int deepest;
+  unsigned int num_mults;
+};
+
+/* Return true iff the real value C can be represented as a
+   sum of powers of 0.5 up to N.  That is:
+   C == SUM<i from 1..N> (a[i]*(0.5**i)) where a[i] is either 0 or 1.
+   Record in INFO the various parameters of the synthesis algorithm such
+   as the factors a[i], the maximum 0.5 power and the number of
+   multiplications that will be required.  */
+
+bool
+representable_as_half_series_p (REAL_VALUE_TYPE c, unsigned n,
+				 struct pow_synth_sqrt_info *info)
+{
+  REAL_VALUE_TYPE factor = dconsthalf;
+  REAL_VALUE_TYPE remainder = c;
+
+  info->deepest = 0;
+  info->num_mults = 0;
+  memset (info->factors, 0, n * sizeof (bool));
+
+  for (unsigned i = 0; i < n; i++)
+    {
+      REAL_VALUE_TYPE res;
+
+      /* If something inexact happened bail out now.  */
+      if (REAL_ARITHMETIC (res, MINUS_EXPR, remainder, factor))
+	return false;
+
+      /* We have hit zero.  The number is representable as a sum
+         of powers of 0.5.  */
+      if (REAL_VALUES_EQUAL (res, dconst0))
+	{
+	  info->factors[i] = true;
+	  info->deepest = i + 1;
+	  return true;
+	}
+      else if (!REAL_VALUE_NEGATIVE (res))
+	{
+	  remainder = res;
+	  info->factors[i] = true;
+	  info->num_mults++;
+	}
+      else
+	info->factors[i] = false;
+
+      REAL_ARITHMETIC (factor, MULT_EXPR, factor, dconsthalf);
+    }
+  return false;
+}
+
+/* Return the tree corresponding to FN being applied
+   to ARG N times at GSI and LOC.
+   Look up previous results from CACHE if need be.
+   cache[0] should contain just plain ARG i.e. FN applied to ARG 0 times.  */
+
+static tree
+get_fn_chain (tree arg, unsigned int n, gimple_stmt_iterator *gsi,
+	      tree fn, location_t loc, tree *cache)
+{
+  tree res = cache[n];
+  if (!res)
+    {
+      tree prev = get_fn_chain (arg, n - 1, gsi, fn, loc, cache);
+      res = build_and_insert_call (gsi, loc, fn, prev);
+      cache[n] = res;
+    }
+
+  return res;
+}
+
+/* Print to STREAM the repeated application of function FNAME to ARG
+   N times.  So, for FNAME = "foo", ARG = "x", N = 2 it would print:
+   "foo (foo (x))".  */
+
+static void
+print_nested_fn (FILE* stream, const char *fname, const char* arg,
+		 unsigned int n)
+{
+  if (n == 0)
+    fprintf (stream, "%s", arg);
+  else
+    {
+      fprintf (stream, "%s (", fname);
+      print_nested_fn (stream, fname, arg, n - 1);
+      fprintf (stream, ")");
+    }
+}
+
+/* Print to STREAM the fractional sequence of sqrt chains
+   applied to ARG, described by INFO.  Used for the dump file.  */
+
+static void
+dump_fractional_sqrt_sequence (FILE *stream, const char *arg,
+			        struct pow_synth_sqrt_info *info)
+{
+  for (unsigned int i = 0; i < info->deepest; i++)
+    {
+      bool is_set = info->factors[i];
+      if (is_set)
+	{
+	  print_nested_fn (stream, "sqrt", arg, i + 1);
+	  if (i != info->deepest - 1)
+	    fprintf (stream, " * ");
+	}
+    }
+}
+
+/* Print to STREAM a representation of raising ARG to an integer
+   power N.  Used for the dump file.  */
+
+static void
+dump_integer_part (FILE *stream, const char* arg, HOST_WIDE_INT n)
+{
+  if (n > 1)
+    fprintf (stream, "powi (%s, " HOST_WIDE_INT_PRINT_DEC ")", arg, n);
+  else if (n == 1)
+    fprintf (stream, "%s", arg);
+}
+
+/* Attempt to synthesize a POW[F] (ARG0, ARG1) call using chains of
+   square roots.  Place at GSI and LOC.  Limit the maximum depth
+   of the sqrt chains to MAX_DEPTH.  Return the tree holding the
+   result of the expanded sequence or NULL_TREE if the expansion failed.
+
+   This routine assumes that ARG1 is a real number with a fractional part
+   (the integer exponent case will have been handled earlier in
+   gimple_expand_builtin_pow).
+
+   For ARG1 > 0.0:
+   * For ARG1 composed of a whole part WHOLE_PART and a fractional part
+     FRAC_PART i.e. WHOLE_PART == floor (ARG1) and
+                    FRAC_PART == ARG1 - WHOLE_PART:
+     Produce POWI (ARG0, WHOLE_PART) * POW (ARG0, FRAC_PART) where
+     POW (ARG0, FRAC_PART) is expanded as a product of square root chains
+     if it can be expressed as such, that is if FRAC_PART satisfies:
+     FRAC_PART == <SUM from i = 1 until MAX_DEPTH> (a[i] * (0.5**i))
+     where integer a[i] is either 0 or 1.
+
+     Example:
+     POW (x, 3.625) == POWI (x, 3) * POW (x, 0.625)
+       --> POWI (x, 3) * SQRT (x) * SQRT (SQRT (SQRT (x)))
+
+   For ARG1 < 0.0 there are two approaches:
+   * (A) Expand to 1.0 / POW (ARG0, -ARG1) where POW (ARG0, -ARG1)
+         is calculated as above.
+
+     Example:
+     POW (x, -5.625) == 1.0 / POW (x, 5.625)
+       -->  1.0 / (POWI (x, 5) * SQRT (x) * SQRT (SQRT (SQRT (x))))
+
+   * (B) : WHOLE_PART := - ceil (abs (ARG1))
+           FRAC_PART  := ARG1 - WHOLE_PART
+     and expand to POW (x, FRAC_PART) / POWI (x, WHOLE_PART).
+     Example:
+     POW (x, -5.875) == POW (x, 0.125) / POWI (X, 6)
+       --> SQRT (SQRT (SQRT (x))) / (POWI (x, 6))
+
+   For ARG1 < 0.0 we choose between (A) and (B) depending on
+   how many multiplications we'd have to do.
+   So, for the example in (B): POW (x, -5.875), if we were to
+   follow algorithm (A) we would produce:
+   1.0 / POWI (X, 5) * SQRT (X) * SQRT (SQRT (X)) * SQRT (SQRT (SQRT (X)))
+   which contains more multiplications than approach (B).
+
+   Hopefully, this approach will eliminate potentially expensive POW library
+   calls when unsafe floating point math is enabled and allow the compiler to
+   further optimise the multiplies, square roots and divides produced by this
+   function.  */
+
+static tree
+expand_pow_as_sqrts (gimple_stmt_iterator *gsi, location_t loc,
+		     tree arg0, tree arg1, HOST_WIDE_INT max_depth)
+{
+  tree type = TREE_TYPE (arg0);
+  machine_mode mode = TYPE_MODE (type);
+  tree sqrtfn = mathfn_built_in (type, BUILT_IN_SQRT);
+  bool one_over = true;
+
+  if (!sqrtfn)
+    return NULL_TREE;
+
+  if (TREE_CODE (arg1) != REAL_CST)
+    return NULL_TREE;
+
+  REAL_VALUE_TYPE exp_init = TREE_REAL_CST (arg1);
+
+  gcc_assert (max_depth > 0);
+  tree *cache = XALLOCAVEC (tree, max_depth + 1);
+
+  struct pow_synth_sqrt_info synth_info;
+  synth_info.factors = XALLOCAVEC (bool, max_depth + 1);
+  synth_info.deepest = 0;
+  synth_info.num_mults = 0;
+
+  bool neg_exp = REAL_VALUE_NEGATIVE (exp_init);
+  REAL_VALUE_TYPE exp = real_value_abs (&exp_init);
+
+  /* The whole and fractional parts of exp.  */
+  REAL_VALUE_TYPE whole_part;
+  REAL_VALUE_TYPE frac_part;
+
+  real_floor (&whole_part, mode, &exp);
+  REAL_ARITHMETIC (frac_part, MINUS_EXPR, exp, whole_part);
+
+
+  REAL_VALUE_TYPE ceil_whole = dconst0;
+  REAL_VALUE_TYPE ceil_fract = dconst0;
+
+  if (neg_exp)
+    {
+      real_ceil (&ceil_whole, mode, &exp);
+      REAL_ARITHMETIC (ceil_fract, MINUS_EXPR, ceil_whole, exp);
+    }
+
+  if (!representable_as_half_series_p (frac_part, max_depth, &synth_info))
+    return NULL_TREE;
+
+  /* Check whether it's more profitable to not use 1.0 / ...  */
+  if (neg_exp)
+    {
+      struct pow_synth_sqrt_info alt_synth_info;
+      alt_synth_info.factors = XALLOCAVEC (bool, max_depth + 1);
+      alt_synth_info.deepest = 0;
+      alt_synth_info.num_mults = 0;
+
+      if (representable_as_half_series_p (ceil_fract, max_depth,
+					   &alt_synth_info)
+	  && alt_synth_info.deepest <= synth_info.deepest
+	  && alt_synth_info.num_mults < synth_info.num_mults)
+	{
+	  whole_part = ceil_whole;
+	  frac_part = ceil_fract;
+	  synth_info.deepest = alt_synth_info.deepest;
+	  synth_info.num_mults = alt_synth_info.num_mults;
+	  memcpy (synth_info.factors, alt_synth_info.factors,
+		  (max_depth + 1) * sizeof (bool));
+	  one_over = false;
+	}
+    }
+
+  HOST_WIDE_INT n = real_to_integer (&whole_part);
+  REAL_VALUE_TYPE cint;
+  real_from_integer (&cint, VOIDmode, n, SIGNED);
+
+  if (!real_identical (&whole_part, &cint))
+    return NULL_TREE;
+
+  if (powi_cost (n) + synth_info.num_mults > POWI_MAX_MULTS)
+    return NULL_TREE;
+
+  memset (cache, 0, (max_depth + 1) * sizeof (tree));
+
+  tree integer_res = n == 0 ? build_real (type, dconst1) : arg0;
+
+  /* Calculate the integer part of the exponent.  */
+  if (n > 1)
+    {
+      integer_res = gimple_expand_builtin_powi (gsi, loc, arg0, n);
+      if (!integer_res)
+	return NULL_TREE;
+    }
+
+  if (dump_file)
+    {
+      char string[64];
+
+      real_to_decimal (string, &exp_init, sizeof (string), 0, 1);
+      fprintf (dump_file, "synthesizing pow (x, %s) as:\n", string);
+
+      if (neg_exp)
+	{
+	  if (one_over)
+	    {
+	      fprintf (dump_file, "1.0 / (");
+	      dump_integer_part (dump_file, "x", n);
+	      if (n > 0)
+	        fprintf (dump_file, " * ");
+	      dump_fractional_sqrt_sequence (dump_file, "x", &synth_info);
+	      fprintf (dump_file, ")");
+	    }
+	  else
+	    {
+	      dump_fractional_sqrt_sequence (dump_file, "x", &synth_info);
+	      fprintf (dump_file, " / (");
+	      dump_integer_part (dump_file, "x", n);
+	      fprintf (dump_file, ")");
+	    }
+	}
+      else
+	{
+	  dump_fractional_sqrt_sequence (dump_file, "x", &synth_info);
+	  if (n > 0)
+	    fprintf (dump_file, " * ");
+	  dump_integer_part (dump_file, "x", n);
+	}
+
+      fprintf (dump_file, "\ndeepest sqrt chain: %d\n", synth_info.deepest);
+    }
+
+
+  tree fract_res = NULL_TREE;
+  cache[0] = arg0;
+
+  /* Calculate the fractional part of the exponent.  */
+  for (unsigned i = 0; i < synth_info.deepest; i++)
+    {
+      if (synth_info.factors[i])
+	{
+	  tree sqrt_chain = get_fn_chain (arg0, i + 1, gsi, sqrtfn, loc, cache);
+
+	  if (!fract_res)
+	      fract_res = sqrt_chain;
+
+	  else
+	    fract_res = build_and_insert_binop (gsi, loc, "powroot", MULT_EXPR,
+					   fract_res, sqrt_chain);
+	}
+    }
+
+  tree res = NULL_TREE;
+
+  if (neg_exp)
+    {
+      if (one_over)
+	{
+	  if (n > 0)
+	    res = build_and_insert_binop (gsi, loc, "powroot", MULT_EXPR,
+					   fract_res, integer_res);
+	  else
+	    res = fract_res;
+
+	  res = build_and_insert_binop (gsi, loc, "powrootrecip", RDIV_EXPR,
+					  build_real (type, dconst1), res);
+	}
+      else
+	{
+	  res = build_and_insert_binop (gsi, loc, "powroot", RDIV_EXPR,
+					 fract_res, integer_res);
+	}
+    }
+  else
+    res = build_and_insert_binop (gsi, loc, "powroot", MULT_EXPR,
+				   fract_res, integer_res);
+  return res;
+}
+
 /* ARG0 and ARG1 are the two arguments to a pow builtin call in GSI
    with location info LOC.  If possible, create an equivalent and
    less expensive sequence of statements prior to GSI, and return an
@@ -1157,13 +1509,17 @@ static tree
 gimple_expand_builtin_pow (gimple_stmt_iterator *gsi, location_t loc, 
 			   tree arg0, tree arg1)
 {
-  REAL_VALUE_TYPE c, cint, dconst1_4, dconst3_4, dconst1_3, dconst1_6;
+  REAL_VALUE_TYPE c, cint, dconst1_3, dconst1_4, dconst1_6;
   REAL_VALUE_TYPE c2, dconst3;
   HOST_WIDE_INT n;
-  tree type, sqrtfn, cbrtfn, sqrt_arg0, sqrt_sqrt, result, cbrt_x, powi_cbrt_x;
+  tree type, sqrtfn, cbrtfn, sqrt_arg0, result, cbrt_x, powi_cbrt_x;
   machine_mode mode;
+  bool speed_p = optimize_bb_for_speed_p (gsi_bb (*gsi));
   bool hw_sqrt_exists, c_is_int, c2_is_int;
 
+  dconst1_4 = dconst1;
+  SET_REAL_EXP (&dconst1_4, REAL_EXP (&dconst1_4) - 2);
+
   /* If the exponent isn't a constant, there's nothing of interest
      to be done.  */
   if (TREE_CODE (arg1) != REAL_CST)
@@ -1179,7 +1535,7 @@ gimple_expand_builtin_pow (gimple_stmt_iterator *gsi, location_t loc,
   if (c_is_int
       && ((n >= -1 && n <= 2)
 	  || (flag_unsafe_math_optimizations
-	      && optimize_bb_for_speed_p (gsi_bb (*gsi))
+	      && speed_p
 	      && powi_cost (n) <= POWI_MAX_MULTS)))
     return gimple_expand_builtin_powi (gsi, loc, arg0, n);
 
@@ -1196,49 +1552,8 @@ gimple_expand_builtin_pow (gimple_stmt_iterator *gsi, location_t loc,
       && !HONOR_SIGNED_ZEROS (mode))
     return build_and_insert_call (gsi, loc, sqrtfn, arg0);
 
-  /* Optimize pow(x,0.25) = sqrt(sqrt(x)).  Assume on most machines that
-     a builtin sqrt instruction is smaller than a call to pow with 0.25,
-     so do this optimization even if -Os.  Don't do this optimization
-     if we don't have a hardware sqrt insn.  */
-  dconst1_4 = dconst1;
-  SET_REAL_EXP (&dconst1_4, REAL_EXP (&dconst1_4) - 2);
   hw_sqrt_exists = optab_handler (sqrt_optab, mode) != CODE_FOR_nothing;
 
-  if (flag_unsafe_math_optimizations
-      && sqrtfn
-      && REAL_VALUES_EQUAL (c, dconst1_4)
-      && hw_sqrt_exists)
-    {
-      /* sqrt(x)  */
-      sqrt_arg0 = build_and_insert_call (gsi, loc, sqrtfn, arg0);
-
-      /* sqrt(sqrt(x))  */
-      return build_and_insert_call (gsi, loc, sqrtfn, sqrt_arg0);
-    }
-      
-  /* Optimize pow(x,0.75) = sqrt(x) * sqrt(sqrt(x)) unless we are
-     optimizing for space.  Don't do this optimization if we don't have
-     a hardware sqrt insn.  */
-  real_from_integer (&dconst3_4, VOIDmode, 3, SIGNED);
-  SET_REAL_EXP (&dconst3_4, REAL_EXP (&dconst3_4) - 2);
-
-  if (flag_unsafe_math_optimizations
-      && sqrtfn
-      && optimize_function_for_speed_p (cfun)
-      && REAL_VALUES_EQUAL (c, dconst3_4)
-      && hw_sqrt_exists)
-    {
-      /* sqrt(x)  */
-      sqrt_arg0 = build_and_insert_call (gsi, loc, sqrtfn, arg0);
-
-      /* sqrt(sqrt(x))  */
-      sqrt_sqrt = build_and_insert_call (gsi, loc, sqrtfn, sqrt_arg0);
-
-      /* sqrt(x) * sqrt(sqrt(x))  */
-      return build_and_insert_binop (gsi, loc, "powroot", MULT_EXPR,
-				     sqrt_arg0, sqrt_sqrt);
-    }
-
   /* Optimize pow(x,1./3.) = cbrt(x).  This requires unsafe math
      optimizations since 1./3. is not exactly representable.  If x
      is negative and finite, the correct value of pow(x,1./3.) is
@@ -1263,7 +1578,7 @@ gimple_expand_builtin_pow (gimple_stmt_iterator *gsi, location_t loc,
       && sqrtfn
       && cbrtfn
       && (gimple_val_nonnegative_real_p (arg0) || !HONOR_NANS (mode))
-      && optimize_function_for_speed_p (cfun)
+      && speed_p
       && hw_sqrt_exists
       && REAL_VALUES_EQUAL (c, dconst1_6))
     {
@@ -1274,54 +1589,31 @@ gimple_expand_builtin_pow (gimple_stmt_iterator *gsi, location_t loc,
       return build_and_insert_call (gsi, loc, cbrtfn, sqrt_arg0);
     }
 
-  /* Optimize pow(x,c), where n = 2c for some nonzero integer n
-     and c not an integer, into
-
-       sqrt(x) * powi(x, n/2),                n > 0;
-       1.0 / (sqrt(x) * powi(x, abs(n/2))),   n < 0.
-
-     Do not calculate the powi factor when n/2 = 0.  */
-  real_arithmetic (&c2, MULT_EXPR, &c, &dconst2);
-  n = real_to_integer (&c2);
-  real_from_integer (&cint, VOIDmode, n, SIGNED);
-  c2_is_int = real_identical (&c2, &cint);
 
+  /* Attempt to expand the POW as a product of square root chains.
+     Expand the 0.25 case even when otpimising for size.  */
   if (flag_unsafe_math_optimizations
       && sqrtfn
-      && c2_is_int
-      && !c_is_int
-      && optimize_function_for_speed_p (cfun))
+      && hw_sqrt_exists
+      && (speed_p || REAL_VALUES_EQUAL (c, dconst1_4))
+      && !HONOR_SIGNED_ZEROS (mode))
     {
-      tree powi_x_ndiv2 = NULL_TREE;
-
-      /* Attempt to fold powi(arg0, abs(n/2)) into multiplies.  If not
-         possible or profitable, give up.  Skip the degenerate case when
-         n is 1 or -1, where the result is always 1.  */
-      if (absu_hwi (n) != 1)
-	{
-	  powi_x_ndiv2 = gimple_expand_builtin_powi (gsi, loc, arg0,
-						     abs_hwi (n / 2));
-	  if (!powi_x_ndiv2)
-	    return NULL_TREE;
-	}
+      unsigned int max_depth = speed_p
+				? PARAM_VALUE (PARAM_MAX_POW_SQRT_DEPTH)
+				: 2;
 
-      /* Calculate sqrt(x).  When n is not 1 or -1, multiply it by the
-	 result of the optimal multiply sequence just calculated.  */
-      sqrt_arg0 = build_and_insert_call (gsi, loc, sqrtfn, arg0);
+      tree expand_with_sqrts
+	= expand_pow_as_sqrts (gsi, loc, arg0, arg1, max_depth);
 
-      if (absu_hwi (n) == 1)
-	result = sqrt_arg0;
-      else
-	result = build_and_insert_binop (gsi, loc, "powroot", MULT_EXPR,
-					 sqrt_arg0, powi_x_ndiv2);
-
-      /* If n is negative, reciprocate the result.  */
-      if (n < 0)
-	result = build_and_insert_binop (gsi, loc, "powroot", RDIV_EXPR,
-					 build_real (type, dconst1), result);
-      return result;
+      if (expand_with_sqrts)
+	return expand_with_sqrts;
     }
 
+  real_arithmetic (&c2, MULT_EXPR, &c, &dconst2);
+  n = real_to_integer (&c2);
+  real_from_integer (&cint, VOIDmode, n, SIGNED);
+  c2_is_int = real_identical (&c2, &cint);
+
   /* Optimize pow(x,c), where 3c = n for some nonzero integer n, into
 
      powi(x, n/3) * powi(cbrt(x), n%3),                    n > 0;

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH][tree-ssa-math-opts] Expand pow (x, CONST) using square roots when possible
  2015-05-08 15:09     ` Kyrill Tkachov
@ 2015-05-13 11:32       ` Richard Biener
  2015-05-13 16:06         ` Kyrill Tkachov
  0 siblings, 1 reply; 9+ messages in thread
From: Richard Biener @ 2015-05-13 11:32 UTC (permalink / raw)
  To: Kyrill Tkachov; +Cc: GCC Patches

On Fri, May 8, 2015 at 5:09 PM, Kyrill Tkachov
<kyrylo.tkachov@foss.arm.com> wrote:
>
> On 08/05/15 14:56, Kyrill Tkachov wrote:
>>
>> On 08/05/15 11:18, Richard Biener wrote:
>>>
>>> On Fri, May 1, 2015 at 6:02 PM, Kyrill Tkachov
>>> <kyrylo.tkachov@foss.arm.com> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> GCC has some logic to expand calls to pow (x, 0.75), pow (0.25) and pow
>>>> (x,
>>>> (int)k + 0.5)
>>>> using square roots. So, for the above examples it would generate sqrt
>>>> (x) *
>>>> sqrt (sqrt (x)),
>>>> sqrt (sqrt (x)) and powi (x, k) * sqrt (x) (assuming k > 0. For k < 0 it
>>>> will calculate the
>>>> reciprocal of that).
>>>>
>>>> However, the implementation of these optimisations is done on a bit of
>>>> an
>>>> ad-hoc basis with
>>>> the 0.25, 0.5, 0.75 cases hardcoded.
>>>> Judging by
>>>>
>>>> https://gcc.gnu.org/wiki/summit2010?action=AttachFile&do=get&target=meissner2.pdf
>>>> these are the most commonly used exponents (at least in SPEC ;))
>>>>
>>>> This patch generalises this optimisation into a (hopefully) more robust
>>>> algorithm.
>>>> In particular, it expands calls to pow (x, CST) by expanding the integer
>>>> part of CST
>>>> using a powi, like it does already, and then expanding the fractional
>>>> part
>>>> as a product
>>>> of repeated applications of a square root if the fractional part can be
>>>> expressed
>>>> as a multiple of a power of 0.5.
>>>>
>>>> I try to explain the algorithm in more detail in the comments in the
>>>> patch
>>>> but, for example:
>>>>
>>>> pow (x, 5.625) is not currently handled, but with this patch will be
>>>> expanded
>>>> to powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x))) because 5.625 == 5.0
>>>> +
>>>> 0.5 + 0.5**3
>>>>
>>>> Negative exponents are handled in either of two ways, depending on the
>>>> exponent value:
>>>> * Using a simple reciprocal.
>>>>     For example:
>>>>     pow (x, -5.625) == 1.0 / pow (x, 5.625)
>>>>       --> 1.0 / (powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x))))
>>>>
>>>> * For pow (x, EXP) with negative exponent EXP with integer part INT and
>>>> fractional part FRAC:
>>>> pow (1.0 - FRAC) / powi (ceil (abs (EXP))).
>>>>     For example:
>>>>     pow (x, -5.875) == pow (x, 0.125) / powi (X, 6)
>>>>       --> sqrt (sqrt (sqrt (x))) / (powi (x, 6))
>>>>
>>>>
>>>> Since hardware square root instructions tend to be expensive, we may
>>>> want to
>>>> reduce the number
>>>> of square roots we are willing to calculate. Since we reuse intermediate
>>>> square root results,
>>>> this boils down to restricting the depth of the square root chains. In
>>>> all
>>>> the examples above
>>>> that depth is 3. I've made this maximum depth parametrisable in
>>>> params.def.
>>>> By adjusting that
>>>> parameter we can adjust the resolution of this optimisation. So, if it's
>>>> set
>>>> to '4' then we
>>>> will synthesize every exponent that is a multiple of 0.5**4 == 0.0625,
>>>> including negative
>>>> multiples. Currently, GCC will not try to expand negative multiples of
>>>> anything else than 0.5
>>>>
>>>> I have tried to keep the existing functionality intact and activate this
>>>> only for
>>>> -funsafe-math-optimizations and only when the target has a sqrt
>>>> instruction.
>>>>    An exception to that is pow (x, 0.5) which we prefer to transform to
>>>> sqrt
>>>> even
>>>> when a hardware sqrt is not available, presumably because the library
>>>> function for
>>>> sqrt is usually faster than pow (?).
>>>
>>> Yes.  It's also a safe transform - which you seem to put under
>>> flag_unsafe_math_optimizations only with your patch.
>>>
>>> It would be clearer to just leave the special-case
>>>
>>> -  /* Optimize pow(x,0.5) = sqrt(x).  This replacement is always safe
>>> -     unless signed zeros must be maintained.  pow(-0,0.5) = +0, while
>>> -     sqrt(-0) = -0.  */
>>> -  if (sqrtfn
>>> -      && REAL_VALUES_EQUAL (c, dconsthalf)
>>> -      && !HONOR_SIGNED_ZEROS (mode))
>>> -    return build_and_insert_call (gsi, loc, sqrtfn, arg0);
>>>
>>> in as-is.
>>
>> Ok, I'll leave that case explicit.
>>
>>> You also removed the Os constraint which you should put back in.
>>> Basically if !optimize_function_for_speed_p then generate at most
>>> two calls to sqrt (iff the HW has a sqrt instruction).
>>
>> I tried to move that logic into expand_with_sqrts but
>> I'll move it outside it. It seems that this boils down to
>> only 0.25, as any other 2xsqrt chain will also involve a
>> multiply or a divide which we currently avoid.
>>
>>> You fail to add a testcase that checks that the optimization applies.
>>
>> I'll add one to scan the sincos dump.
>> I notice that we don't have a testuite check that the target has
>> a hw sqrt instructions. Would you like me to add one? Or can I make
>> the testcase aarch64-specific?

Would be great to have a testsuite check for this.

>>
>>> Otherwise the idea looks good though there must be a better way
>>> to compute the series than by using real-arithmetic and forcefully
>>> trying out all possibilities...
>>
>> I get that feeling too. What I need is not only a way
>> of figuring out if the fractional part of the exponent can be
>> represented in this way, but also compute the depth of the
>> sqrt chain and the number of multiplies...
>> That being said, the current approach is O(maximum depth) and
>> I don't expect the depth to go much beyond 3 or 4 in practice.
>>
>> Thanks for looking at it!
>> I'll respin the patch.
>
>
> And here it is, with my above comments implemented.
> Bootstrapped on x86_64 and tested on aarch64.
> Full testing on arm and aarch64 ongoing.
>
> Is this ok if testing comes clean?

Ok.

Thanks,
Richard.

> Thanks,
> Kyrill
>
>
>> Kyrill
>>
>>> Richard.
>>>
>>>> Having seen the glibc implementation of a fully IEEE-754-compliant pow
>>>> function, I think we
>>>> would prefer synthesising the pow call whenever we can for -ffast-math.
>>>>
>>>> I have seen this optimisation trigger a few times in SPEC2k6, in
>>>> particular
>>>> in 447.dealII
>>>> and 481.wrf where it replaced calls to powf (x, -0.25), pow (x, 0.125)
>>>> and
>>>> pow (x, 0.875)
>>>> with square roots, multiplies and, in the case of -0.25, divides.
>>>> On 481.wrf I saw it remove a total of 22 out of 322 calls to pow
>>>>
>>>> On 481.wrf on aarch64 I saw about a 1% improvement.
>>>> The cycle count on x86_64 was also smaller, but not by a significant
>>>> amount
>>>> (the same calls to
>>>> pow were eliminated).
>>>>
>>>> In general, I think this can shine if multiple expandable calls to pow
>>>> appear together.
>>>> So, for example for code:
>>>> double
>>>> baz (double a)
>>>> {
>>>>     return __builtin_pow (a, -1.25) + __builtin_pow (a, 5.75) -
>>>> __builtin_pow
>>>> (a, 3.375);
>>>> }
>>>>
>>>> we can generate:
>>>> baz:
>>>>           fsqrt   d3, d0
>>>>           fmul    d4, d0, d0
>>>>           fmov    d5, 1.0e+0
>>>>           fmul    d6, d0, d4
>>>>           fsqrt   d2, d3
>>>>           fmul    d1, d0, d2
>>>>           fsqrt   d0, d2
>>>>           fmul    d3, d3, d2
>>>>           fdiv    d1, d5, d1
>>>>           fmul    d3, d3, d6
>>>>           fmul    d2, d2, d0
>>>>           fmadd   d0, d4, d3, d1
>>>>           fmsub   d0, d6, d2, d0
>>>>           ret
>>>>
>>>> reusing the sqrt results and doing more optimisations rather than the
>>>> current:
>>>> baz:
>>>>           stp     x29, x30, [sp, -48]!
>>>>           fmov    d1, -1.25e+0
>>>>           add     x29, sp, 0
>>>>           stp     d8, d9, [sp, 16]
>>>>           fmov    d9, d0
>>>>           str     d10, [sp, 32]
>>>>           bl      pow
>>>>           fmov    d8, d0
>>>>           fmov    d0, d9
>>>>           fmov    d1, 5.75e+0
>>>>           bl      pow
>>>>           fmov    d10, d0
>>>>           fmov    d0, d9
>>>>           fmov    d1, 3.375e+0
>>>>           bl      pow
>>>>           fadd    d8, d8, d10
>>>>           ldr     d10, [sp, 32]
>>>>           fsub    d0, d8, d0
>>>>           ldp     d8, d9, [sp, 16]
>>>>           ldp     x29, x30, [sp], 48
>>>>           ret
>>>>
>>>>
>>>> Of course gcc could already do that if the exponents were to fall in the
>>>> set
>>>> {0.25, 0.75, k+0.5}
>>>> but with this patch that set can be greatly expanded.
>>>>
>>>> I suppose if we're really lucky we might even open up new vectorisation
>>>> opportunities.
>>>> For example:
>>>> void
>>>> vecfoo (float *a, float *b)
>>>> {
>>>>     for (int i = 0; i < N; i++)
>>>>       a[i] = __builtin_powf (a[i], 1.25f) - __builtin_powf (b[i],
>>>> 3.625);
>>>> }
>>>>
>>>> will now be vectorisable if we have a hardware vector sqrt instruction.
>>>> Though I'm not sure
>>>> how often this would appear in real code.
>>>>
>>>>
>>>> I would like advice on some implementation details:
>>>> - The new function representable_as_half_series_p is used to check if
>>>> the
>>>> fractional
>>>> part of an exponent can be represented as a multiple of a power of 0.5.
>>>> It
>>>> does so
>>>> by using REAL_VALUE_TYPE arithmetic and a loop. I wonder whether there
>>>> is a
>>>> smarter
>>>> way of doing this, considering that REAL_VALUE_TYPE holds the bit
>>>> representation of the
>>>> floating point number?
>>>>
>>>> - Are there any correctness cases that I may have missed? This patch
>>>> gates
>>>> the optimisation
>>>> on -funsafe-math-optimizations, but maybe there are some edge cases that
>>>> I
>>>> missed?
>>>>
>>>> - What should be the default value of the max-pow-sqrt-depth param? In
>>>> this
>>>> patch it's 5 which
>>>> on second thought strikes me as a bit aggressive. To catch exponents
>>>> that
>>>> are multiples of
>>>> 0.25 we need it to be 2. For multiples of 0.125 it has to be 3 etc... I
>>>> suppose it depends on
>>>> how likely more fine-grained powers are to appear in real code, how
>>>> expensive the C library
>>>> implementation of pow is, and how expensive are the sqrt instructions in
>>>> hardware.
>>>>
>>>>
>>>> Bootstrapped and tested on x86_64, aarch64, arm (pending)
>>>> SPEC2k6 built and ran fine. Can anyone suggest anything other codebase
>>>> that
>>>> might
>>>> be of interest to look at?
>>>>
>>>> Thanks,
>>>> Kyrill
>>>>
>>>> 2015-05-01  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>>
>>>>       * params.def (PARAM_MAX_POW_SQRT_DEPTH): New param.
>>>>       * tree-ssa-math-opts.c: Include params.h
>>>>       (pow_synth_sqrt_info): New struct.
>>>>       (representable_as_half_series_p): New function.
>>>>       (get_fn_chain): Likewise.
>>>>       (print_nested_fn): Likewise.
>>>>       (dump_fractional_sqrt_sequence): Likewise.
>>>>       (dump_integer_part): Likewise.
>>>>       (expand_pow_as_sqrts): Likewise.
>>>>       (gimple_expand_builtin_pow): Use above to attempt to expand
>>>>       pow as series of square roots.  Removed now unused variables.
>>>>
>>>> 2015-05-01  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>>
>>>>       * gcc.dg/pow-sqrt.x: New file.
>>>>       * gcc.dg/pow-sqrt-1.c: New test.
>>>>       * gcc.dg/pow-sqrt-2.c: Likewise.
>>>>       * gcc.dg/pow-sqrt-3.c: Likewise.
>
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH][tree-ssa-math-opts] Expand pow (x, CONST) using square roots when possible
  2015-05-13 11:32       ` Richard Biener
@ 2015-05-13 16:06         ` Kyrill Tkachov
  2015-05-18 10:46           ` Richard Biener
  0 siblings, 1 reply; 9+ messages in thread
From: Kyrill Tkachov @ 2015-05-13 16:06 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 1007 bytes --]

Hi Richard,

On 13/05/15 12:27, Richard Biener wrote:
>>> I notice that we don't have a testuite check that the target has
>>> >>a hw sqrt instructions. Would you like me to add one? Or can I make
>>> >>the testcase aarch64-specific?
> Would be great to have a testsuite check for this.
>

I've committed the patch with r223167.

The attached patch adds a testsuite check for hardware sqrt instructions.
In this version I've included arm (on the condition that vfp is possible),
aarch64, x86_64 and powerpc with vsx.
Is this definition ok?

I'm particularly not familiar with the powerpc architectures.

With this check in place, I've migrated the pow synthesis test from gcc.target/aarch64 to gcc.dg.

This test passes on arm-none-eabi, aarch64-none-elf and x86_64-linux.

Ok?

2015-05-13  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * lib/target-supports.exp (check_effective_target_hw_sqrt): New check.
     * gcc.dg/pow-sqrt-synth-1.c: New test.
     * gcc.target/aarch64/pow-sqrt-synth-1.c: Delete.

[-- Attachment #2: hw-sqrt-testsuite.patch --]
[-- Type: text/x-patch, Size: 3285 bytes --]

commit e30889fa7024ccfd47731aafbaf2288646b65504
Author: Kyrylo Tkachov <kyrylo.tkachov@arm.com>
Date:   Wed May 13 16:08:03 2015 +0100

    Add testsuite check for hw sqrt. Add generic test for pow sqrt synthesis

diff --git a/gcc/testsuite/gcc.dg/pow-sqrt-synth-1.c b/gcc/testsuite/gcc.dg/pow-sqrt-synth-1.c
new file mode 100644
index 0000000..a65efeb
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pow-sqrt-synth-1.c
@@ -0,0 +1,38 @@
+/* { dg-do compile { target hw_sqrt } } */
+/* { dg-options "-fdump-tree-sincos -Ofast --param max-pow-sqrt-depth=8" } */
+/* { dg-additional-options "-mfloat-abi=softfp -mfpu=neon-vfpv4" { target arm*-*-* } } */
+
+double
+foo (double a)
+{
+  return __builtin_pow (a, -5.875);
+}
+
+double
+foof (double a)
+{
+  return __builtin_pow (a, 0.75f);
+}
+
+double
+bar (double a)
+{
+  return __builtin_pow (a, 1.0 + 0.00390625);
+}
+
+double
+baz (double a)
+{
+  return __builtin_pow (a, -1.25) + __builtin_pow (a, 5.75) - __builtin_pow (a, 3.375);
+}
+
+#define N 256
+void
+vecfoo (double *a)
+{
+  for (int i = 0; i < N; i++)
+    a[i] = __builtin_pow (a[i], 1.25);
+}
+
+/* { dg-final { scan-tree-dump-times "synthesizing" 7 "sincos" } } */
+/* { dg-final { cleanup-tree-dump "sincos" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/pow-sqrt-synth-1.c b/gcc/testsuite/gcc.target/aarch64/pow-sqrt-synth-1.c
deleted file mode 100644
index 52514fb..0000000
--- a/gcc/testsuite/gcc.target/aarch64/pow-sqrt-synth-1.c
+++ /dev/null
@@ -1,38 +0,0 @@
-/* { dg-do compile } */
-/* { dg-options "-fdump-tree-sincos -Ofast --param max-pow-sqrt-depth=8" } */
-
-
-double
-foo (double a)
-{
-  return __builtin_pow (a, -5.875);
-}
-
-double
-foof (double a)
-{
-  return __builtin_pow (a, 0.75f);
-}
-
-double
-bar (double a)
-{
-  return __builtin_pow (a, 1.0 + 0.00390625);
-}
-
-double
-baz (double a)
-{
-  return __builtin_pow (a, -1.25) + __builtin_pow (a, 5.75) - __builtin_pow (a, 3.375);
-}
-
-#define N 256
-void
-vecfoo (double *a)
-{
-  for (int i = 0; i < N; i++)
-    a[i] = __builtin_pow (a[i], 1.25);
-}
-
-/* { dg-final { scan-tree-dump-times "synthesizing" 7 "sincos" } } */
-/* { dg-final { cleanup-tree-dump "sincos" } } */
\ No newline at end of file
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index 3728927..c8f20a4 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -4668,6 +4668,27 @@ proc check_effective_target_vect_call_copysignf { } {
     return $et_vect_call_copysignf_saved
 }
 
+# Return 1 if the target supports hardware square root instructions.
+
+proc check_effective_target_hw_sqrt { } {
+    global et_hw_sqrt_saved
+
+    if [info exists et_hw_sqrt_saved] {
+	verbose "check_effective_target_hw_sqrt: using cached result" 2
+    } else {
+	set et_hw_sqrt_saved 0
+	if { [istarget x86_64-*-*]
+	     || ([istarget powerpc*-*-*] && [check_vsx_hw_available])
+	     || [istarget aarch64*-*-*]
+	     || ([istarget arm*-*-*] && [check_effective_target_arm_vfp_ok]) } {
+	   set et_hw_sqrt_saved 1
+	}
+    }
+
+    verbose "check_effective_target_hw_sqrt: returning et_hw_sqrt_saved" 2
+    return $et_hw_sqrt_saved
+}
+
 # Return 1 if the target supports vector sqrtf calls.
 
 proc check_effective_target_vect_call_sqrtf { } {

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH][tree-ssa-math-opts] Expand pow (x, CONST) using square roots when possible
  2015-05-13 16:06         ` Kyrill Tkachov
@ 2015-05-18 10:46           ` Richard Biener
  2015-05-20 13:54             ` Kyrill Tkachov
  0 siblings, 1 reply; 9+ messages in thread
From: Richard Biener @ 2015-05-18 10:46 UTC (permalink / raw)
  To: Kyrill Tkachov; +Cc: GCC Patches

On Wed, May 13, 2015 at 5:33 PM, Kyrill Tkachov
<kyrylo.tkachov@foss.arm.com> wrote:
> Hi Richard,
>
> On 13/05/15 12:27, Richard Biener wrote:
>>>>
>>>> I notice that we don't have a testuite check that the target has
>>>> >>a hw sqrt instructions. Would you like me to add one? Or can I make
>>>> >>the testcase aarch64-specific?
>>
>> Would be great to have a testsuite check for this.
>>
>
> I've committed the patch with r223167.
>
> The attached patch adds a testsuite check for hardware sqrt instructions.
> In this version I've included arm (on the condition that vfp is possible),
> aarch64, x86_64 and powerpc with vsx.
> Is this definition ok?
>
> I'm particularly not familiar with the powerpc architectures.
>
> With this check in place, I've migrated the pow synthesis test from
> gcc.target/aarch64 to gcc.dg.
>
> This test passes on arm-none-eabi, aarch64-none-elf and x86_64-linux.
>
> Ok?

Ok.

Thanks,
Richard.

> 2015-05-13  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
>     * lib/target-supports.exp (check_effective_target_hw_sqrt): New check.
>     * gcc.dg/pow-sqrt-synth-1.c: New test.
>     * gcc.target/aarch64/pow-sqrt-synth-1.c: Delete.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH][tree-ssa-math-opts] Expand pow (x, CONST) using square roots when possible
  2015-05-18 10:46           ` Richard Biener
@ 2015-05-20 13:54             ` Kyrill Tkachov
  0 siblings, 0 replies; 9+ messages in thread
From: Kyrill Tkachov @ 2015-05-20 13:54 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 1969 bytes --]


On 18/05/15 11:32, Richard Biener wrote:
> On Wed, May 13, 2015 at 5:33 PM, Kyrill Tkachov
> <kyrylo.tkachov@foss.arm.com> wrote:
>> Hi Richard,
>>
>> On 13/05/15 12:27, Richard Biener wrote:
>>>>> I notice that we don't have a testuite check that the target has
>>>>>>> a hw sqrt instructions. Would you like me to add one? Or can I make
>>>>>>> the testcase aarch64-specific?
>>> Would be great to have a testsuite check for this.
>>>
>> I've committed the patch with r223167.
>>
>> The attached patch adds a testsuite check for hardware sqrt instructions.
>> In this version I've included arm (on the condition that vfp is possible),
>> aarch64, x86_64 and powerpc with vsx.
>> Is this definition ok?
>>
>> I'm particularly not familiar with the powerpc architectures.
>>
>> With this check in place, I've migrated the pow synthesis test from
>> gcc.target/aarch64 to gcc.dg.
>>
>> This test passes on arm-none-eabi, aarch64-none-elf and x86_64-linux.
>>
>> Ok?
> Ok.

Thanks.
However, after some discussion on IRC I'd prefer to rename the testsuite check
to sqrt_insn so as not to give the impression that it is a runtime hardware check.

Also, this version adds an entry in sourcebuild.texi.

I'll commit this version in 24 hours unless someone objects.
Test still passes on arm, x86_64 and aarch64.

Cheers,
Kyrill

2015-05-20  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

      * lib/target-supports.exp (check_effective_target_sqrt_insn): New check.
      * gcc.dg/pow-sqrt-synth-1.c: New test.
      * gcc.target/aarch64/pow-sqrt-synth-1.c: Delete.

2015-05-20  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * doc/sourcebuild.texi (7.2.3.9 Other hardware attributes):
     Document sqrt_insn.
>
> Thanks,
> Richard.
>
>> 2015-05-13  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>
>>      * lib/target-supports.exp (check_effective_target_hw_sqrt): New check.
>>      * gcc.dg/pow-sqrt-synth-1.c: New test.
>>      * gcc.target/aarch64/pow-sqrt-synth-1.c: Delete.


[-- Attachment #2: pow-sqrt-testsuite.patch --]
[-- Type: text/x-patch, Size: 3707 bytes --]

commit e35362535c9888daf00d1430e2d3a932d7ece228
Author: Kyrylo Tkachov <kyrylo.tkachov@arm.com>
Date:   Wed May 13 16:08:03 2015 +0100

    Add testsuite check for hw sqrt. Add generic test for pow sqrt synthesis

diff --git a/gcc/doc/sourcebuild.texi b/gcc/doc/sourcebuild.texi
index c6ef40e..abe0779 100644
--- a/gcc/doc/sourcebuild.texi
+++ b/gcc/doc/sourcebuild.texi
@@ -1695,6 +1695,9 @@ Target supports FPU instructions.
 @item non_strict_align
 Target does not require strict alignment.
 
+@item sqrt_insn
+Target has a square root instruction that the compiler can generate.
+
 @item sse
 Target supports compiling @code{sse} instructions.
 
diff --git a/gcc/testsuite/gcc.dg/pow-sqrt-synth-1.c b/gcc/testsuite/gcc.dg/pow-sqrt-synth-1.c
new file mode 100644
index 0000000..d55b626
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pow-sqrt-synth-1.c
@@ -0,0 +1,38 @@
+/* { dg-do compile { target sqrt_insn } } */
+/* { dg-options "-fdump-tree-sincos -Ofast --param max-pow-sqrt-depth=8" } */
+/* { dg-additional-options "-mfloat-abi=softfp -mfpu=neon-vfpv4" { target arm*-*-* } } */
+
+double
+foo (double a)
+{
+  return __builtin_pow (a, -5.875);
+}
+
+double
+foof (double a)
+{
+  return __builtin_pow (a, 0.75f);
+}
+
+double
+bar (double a)
+{
+  return __builtin_pow (a, 1.0 + 0.00390625);
+}
+
+double
+baz (double a)
+{
+  return __builtin_pow (a, -1.25) + __builtin_pow (a, 5.75) - __builtin_pow (a, 3.375);
+}
+
+#define N 256
+void
+vecfoo (double *a)
+{
+  for (int i = 0; i < N; i++)
+    a[i] = __builtin_pow (a[i], 1.25);
+}
+
+/* { dg-final { scan-tree-dump-times "synthesizing" 7 "sincos" } } */
+/* { dg-final { cleanup-tree-dump "sincos" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/pow-sqrt-synth-1.c b/gcc/testsuite/gcc.target/aarch64/pow-sqrt-synth-1.c
deleted file mode 100644
index 52514fb..0000000
--- a/gcc/testsuite/gcc.target/aarch64/pow-sqrt-synth-1.c
+++ /dev/null
@@ -1,38 +0,0 @@
-/* { dg-do compile } */
-/* { dg-options "-fdump-tree-sincos -Ofast --param max-pow-sqrt-depth=8" } */
-
-
-double
-foo (double a)
-{
-  return __builtin_pow (a, -5.875);
-}
-
-double
-foof (double a)
-{
-  return __builtin_pow (a, 0.75f);
-}
-
-double
-bar (double a)
-{
-  return __builtin_pow (a, 1.0 + 0.00390625);
-}
-
-double
-baz (double a)
-{
-  return __builtin_pow (a, -1.25) + __builtin_pow (a, 5.75) - __builtin_pow (a, 3.375);
-}
-
-#define N 256
-void
-vecfoo (double *a)
-{
-  for (int i = 0; i < N; i++)
-    a[i] = __builtin_pow (a[i], 1.25);
-}
-
-/* { dg-final { scan-tree-dump-times "synthesizing" 7 "sincos" } } */
-/* { dg-final { cleanup-tree-dump "sincos" } } */
\ No newline at end of file
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index 3728927..e3c4416 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -4668,6 +4668,27 @@ proc check_effective_target_vect_call_copysignf { } {
     return $et_vect_call_copysignf_saved
 }
 
+# Return 1 if the target supports hardware square root instructions.
+
+proc check_effective_target_sqrt_insn { } {
+    global et_sqrt_insn_saved
+
+    if [info exists et_sqrt_insn_saved] {
+	verbose "check_effective_target_hw_sqrt: using cached result" 2
+    } else {
+	set et_sqrt_insn_saved 0
+	if { [istarget x86_64-*-*]
+	     || [istarget powerpc*-*-*]
+	     || [istarget aarch64*-*-*]
+	     || ([istarget arm*-*-*] && [check_effective_target_arm_vfp_ok]) } {
+	   set et_sqrt_insn_saved 1
+	}
+    }
+
+    verbose "check_effective_target_hw_sqrt: returning et_sqrt_insn_saved" 2
+    return $et_sqrt_insn_saved
+}
+
 # Return 1 if the target supports vector sqrtf calls.
 
 proc check_effective_target_vect_call_sqrtf { } {

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2015-05-20 13:46 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-01 16:02 [PATCH][tree-ssa-math-opts] Expand pow (x, CONST) using square roots when possible Kyrill Tkachov
2015-05-08  8:15 ` Kyrill Tkachov
2015-05-08 10:18 ` Richard Biener
2015-05-08 13:57   ` Kyrill Tkachov
2015-05-08 15:09     ` Kyrill Tkachov
2015-05-13 11:32       ` Richard Biener
2015-05-13 16:06         ` Kyrill Tkachov
2015-05-18 10:46           ` Richard Biener
2015-05-20 13:54             ` Kyrill Tkachov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).