public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH] Factor out division by squares and remove division around comparisons (1/2)
@ 2017-08-10 14:11 Jackson Woodruff
  2017-08-10 15:26 ` Jackson Woodruff
  2017-08-15 14:22 ` Richard Biener
  0 siblings, 2 replies; 22+ messages in thread
From: Jackson Woodruff @ 2017-08-10 14:11 UTC (permalink / raw)
  To: Wilco.dijkstra, richard.guenther, kyrylo.tkachov, joseph, gcc-patches

Hi all,

The patch implements the division opitmizations discussed in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 .

The implemented change differs slightly from the
proposed one in that we re-associate:

     C / x comparison 0.0 -> x comparison' 0.0

Where C is any constant and comparison' is changed as
appropriate if C is negative.

The implementations also removes the division from:

     x / C comparison 0.0 -> x comparison' 0.0

Where again, comparison' is changed as appropriate if C is
negative.

We also change the association of

      x / (y * C) -> (x / C) / y

If C is a constant. All of the above require
-funsafe-math-optimizations.

We also change:

   x / (- y) -> (-x) / y

Which requires -fno-trapping-math.

Bootstrapped and regtested (with part 2 of this patch) on aarch64.

OK for trunk?

(Apologies if the recipients in the 'to' field received this twice,
I accidentally sent this from an email gcc-patches doesn't accept)

Jackson

gcc/

2017-08-03  Jackson Woodruff  <jackson.woodruff@arm.com>

	PR 71026/tree-optimization
	* match.pd: New patterns.


gcc/testsuite

2017-08-03  Jackson Woodruff  <jackson.woodruff@arm.com>

	PR 71026/tree-optimization
	* gcc.dg/associate_comparison_1.c: New.
	* gcc.dg/associate_division_2.c: New.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)
  2017-08-10 14:11 [PATCH] Factor out division by squares and remove division around comparisons (1/2) Jackson Woodruff
@ 2017-08-10 15:26 ` Jackson Woodruff
  2017-08-11  0:15   ` Joseph Myers
  2017-08-15 14:22 ` Richard Biener
  1 sibling, 1 reply; 22+ messages in thread
From: Jackson Woodruff @ 2017-08-10 15:26 UTC (permalink / raw)
  To: Wilco.dijkstra, richard.guenther, kyrylo.tkachov, joseph, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1466 bytes --]

And have now attached that patch.

Jackson


On 08/10/2017 03:09 PM, Jackson Woodruff wrote:
> Hi all,
>
> The patch implements the division opitmizations discussed in
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 .
>
> The implemented change differs slightly from the
> proposed one in that we re-associate:
>
>     C / x comparison 0.0 -> x comparison' 0.0
>
> Where C is any constant and comparison' is changed as
> appropriate if C is negative.
>
> The implementations also removes the division from:
>
>     x / C comparison 0.0 -> x comparison' 0.0
>
> Where again, comparison' is changed as appropriate if C is
> negative.
>
> We also change the association of
>
>      x / (y * C) -> (x / C) / y
>
> If C is a constant. All of the above require
> -funsafe-math-optimizations.
>
> We also change:
>
>   x / (- y) -> (-x) / y
>
> Which requires -fno-trapping-math.
>
> Bootstrapped and regtested (with part 2 of this patch) on aarch64.
>
> OK for trunk?
>
> (Apologies if the recipients in the 'to' field received this twice,
> I accidentally sent this from an email gcc-patches doesn't accept)
>
> Jackson
>
> gcc/
>
> 2017-08-03  Jackson Woodruff  <jackson.woodruff@arm.com>
>
>     PR 71026/tree-optimization
>     * match.pd: New patterns.
>
>
> gcc/testsuite
>
> 2017-08-03  Jackson Woodruff  <jackson.woodruff@arm.com>
>
>     PR 71026/tree-optimization
>     * gcc.dg/associate_comparison_1.c: New.
>     * gcc.dg/associate_division_2.c: New.
>


[-- Attachment #2: patchfile-1 --]
[-- Type: text/plain, Size: 3318 bytes --]

diff --git a/gcc/match.pd b/gcc/match.pd
index e98db52af84946cf579c6434e06d450713a47162..7abfd11601a956ecb59c945256c9f30dc0b6f6c9 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -342,16 +342,42 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
   (negate @0)))
 
 (if (flag_reciprocal_math)
- /* Convert (A/B)/C to A/(B*C)  */
+ /* Convert (A/B)/C to A/(B*C) where B is not a constant.  */
  (simplify
   (rdiv (rdiv:s @0 @1) @2)
-   (rdiv @0 (mult @1 @2)))
+   (if (TREE_CODE (@1) != REAL_CST)
+    (rdiv @0 (mult @1 @2))))
+
+ /* Simplify x / (C * y) to (x / C) / y where C is a constant.  */
+ (simplify
+  (rdiv @0
+   (mult @1 REAL_CST@2))
+  (rdiv (rdiv @0 @2) @1))
 
  /* Convert A/(B/C) to (A/B)*C  */
  (simplify
   (rdiv @0 (rdiv:s @1 @2))
    (mult (rdiv @0 @1) @2)))
 
+(if (!flag_trapping_math)
+  /* Simplify x / (- y) to -x / y.  */
+  (simplify
+    (rdiv @0 (negate @1))
+    (rdiv (negate @0) @1)))
+
+(if (flag_unsafe_math_optimizations)
+  /* Simplify (C / x op 0.0) to x op 0.0 for C > 0.  */
+  (for op (lt le gt ge)
+	   neg_op (gt ge lt le)
+    (simplify
+      (op (rdiv REAL_CST@0 @1) real_zerop@2)
+      (switch
+       (if (!real_equal (TREE_REAL_CST_PTR (@0), &dconst0)
+	    && !REAL_VALUE_NEGATIVE (TREE_REAL_CST (@0)))
+	(op @1 @2))
+       (if (REAL_VALUE_NEGATIVE (TREE_REAL_CST (@0)))
+	(neg_op @1 @2))))))
+
 /* Optimize (X & (-A)) / A where A is a power of 2, to X >> log2(A) */
 (for div (trunc_div ceil_div floor_div round_div exact_div)
  (simplify
@@ -3446,6 +3472,21 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
   (if (!HONOR_SNANS (type))
    @0))
 
+ /* Simplify (x * C1) cmp C2 -> x cmp C2 / C1
+    and simplify (x / C1) cmp C2 -> x cmp C2 * C1.  */
+ (for cmp (lt le gt ge)
+      neg_cmp (gt ge lt le)
+  (for op (mult rdiv)
+       inv_op (rdiv mult)
+      (simplify
+       (cmp (op @0 REAL_CST@1) REAL_CST@2)
+       (switch
+	(if (!real_equal (TREE_REAL_CST_PTR (@1), &dconst0)
+	     && !REAL_VALUE_NEGATIVE (TREE_REAL_CST (@1)))
+	 (cmp @0 (inv_op @2 @1)))
+	(if (REAL_VALUE_NEGATIVE (TREE_REAL_CST (@1)))
+	 (neg_cmp @0 (inv_op @2 @1)))))))
+
  /* Simplify sqrt(x) * sqrt(y) -> sqrt(x*y).  */
  (for root (SQRT CBRT)
   (simplify
diff --git a/gcc/testsuite/gcc.dg/associate_comparison_1.c b/gcc/testsuite/gcc.dg/associate_comparison_1.c
new file mode 100644
index 0000000000000000000000000000000000000000..c12e5cfb2a8388068fa1731cc2305f9fe5139fd6
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/associate_comparison_1.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-funsafe-math-optimizations -fdump-tree-optimized" } */
+
+int
+lt_cmp (float x, float y)
+{
+  return x / 2 <= 0;
+}
+
+int
+inv_cmp (float x, float y)
+{
+  return 5 / x >= 0;
+}
+
+
+/* { dg-final { scan-tree-dump-not " / " "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/associate_division_2.c b/gcc/testsuite/gcc.dg/associate_division_2.c
new file mode 100644
index 0000000000000000000000000000000000000000..1f9c1741ce980f1148016e37399134aa8a619b1d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/associate_division_2.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -fdump-tree-optimized" } */
+
+float
+neg_mov (float w, float x, float *y, float *z)
+{
+  *z = x / (- w);
+  *y = 3 / w;
+
+  return 5 / w;
+}
+
+/* { dg-final { scan-tree-dump-times " / " 1 "optimized" } } */

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)
  2017-08-10 15:26 ` Jackson Woodruff
@ 2017-08-11  0:15   ` Joseph Myers
  0 siblings, 0 replies; 22+ messages in thread
From: Joseph Myers @ 2017-08-11  0:15 UTC (permalink / raw)
  To: Jackson Woodruff
  Cc: Wilco.dijkstra, richard.guenther, kyrylo.tkachov, gcc-patches

On Thu, 10 Aug 2017, Jackson Woodruff wrote:

> > We also change:
> > 
> >   x / (- y) -> (-x) / y
> > 
> > Which requires -fno-trapping-math.

I don't see why that requires -fno-trapping-math.  The exceptions should 
be identical from both variants, as should the result, as far as defined 
by C or IEEE 754 (it's possible it might affect the sign of a NaN result 
on some architectures, but only where that sign is unspecified in C and 
IEEE 754 terms, so such a change is OK).

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)
  2017-08-10 14:11 [PATCH] Factor out division by squares and remove division around comparisons (1/2) Jackson Woodruff
  2017-08-10 15:26 ` Jackson Woodruff
@ 2017-08-15 14:22 ` Richard Biener
  2017-08-15 14:43   ` Wilco Dijkstra
  1 sibling, 1 reply; 22+ messages in thread
From: Richard Biener @ 2017-08-15 14:22 UTC (permalink / raw)
  To: Jackson Woodruff
  Cc: Wilco.dijkstra, kyrylo.tkachov, Joseph S. Myers, GCC Patches

On Thu, Aug 10, 2017 at 4:09 PM, Jackson Woodruff
<jackson.woodruff@foss.arm.com> wrote:
> Hi all,
>
> The patch implements the division opitmizations discussed in
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 .
>
> The implemented change differs slightly from the
> proposed one in that we re-associate:
>
>     C / x comparison 0.0 -> x comparison' 0.0
>
> Where C is any constant and comparison' is changed as
> appropriate if C is negative.
>
> The implementations also removes the division from:
>
>     x / C comparison 0.0 -> x comparison' 0.0
>
> Where again, comparison' is changed as appropriate if C is
> negative.
>
> We also change the association of
>
>      x / (y * C) -> (x / C) / y
>
> If C is a constant.

Why's that profitable?

> All of the above require
> -funsafe-math-optimizations.
>
> We also change:
>
>   x / (- y) -> (-x) / y

Why?  (it's only one of the possible canonicalizations)

Note there are exactly two RDIV_EXPR simplifications remaining
in fold_binary_loc, maybe you can move them to match.pd
(beware of negate_expr_p ... this negate optimization should eventually
move to backprop).

> Which requires -fno-trapping-math.
>
> Bootstrapped and regtested (with part 2 of this patch) on aarch64.
>
> OK for trunk?
>
> (Apologies if the recipients in the 'to' field received this twice,
> I accidentally sent this from an email gcc-patches doesn't accept)
>
> Jackson
>
> gcc/
>
> 2017-08-03  Jackson Woodruff  <jackson.woodruff@arm.com>
>
>         PR 71026/tree-optimization
>         * match.pd: New patterns.
>
>
> gcc/testsuite
>
> 2017-08-03  Jackson Woodruff  <jackson.woodruff@arm.com>
>
>         PR 71026/tree-optimization
>         * gcc.dg/associate_comparison_1.c: New.
>         * gcc.dg/associate_division_2.c: New.
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)
  2017-08-15 14:22 ` Richard Biener
@ 2017-08-15 14:43   ` Wilco Dijkstra
  2017-08-17 10:20     ` Richard Biener
  0 siblings, 1 reply; 22+ messages in thread
From: Wilco Dijkstra @ 2017-08-15 14:43 UTC (permalink / raw)
  To: Richard Biener, Jackson Woodruff
  Cc: kyrylo.tkachov, Joseph S. Myers, GCC Patches, nd

Richard Biener wrote:
> > We also change the association of
> >
> >      x / (y * C) -> (x / C) / y
> >
> > If C is a constant.
>
> Why's that profitable?

It enables (x * C1) / (y * C2) -> (x * C1/C2) / y for example.
Also 1/y is now available to the reciprocal optimization, see 
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 for details.

> >   x / (- y) -> (-x) / y
> 
> Why?  (it's only one of the possible canonicalizations)

Same here, y is now available for reciprocal optimization. The
negate may now be optimized, for example (a * b) / -y -> (-a*b) / y
will use a negated multiple on various targets.

Wilco

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)
  2017-08-15 14:43   ` Wilco Dijkstra
@ 2017-08-17 10:20     ` Richard Biener
  2017-08-17 10:29       ` Wilco Dijkstra
  0 siblings, 1 reply; 22+ messages in thread
From: Richard Biener @ 2017-08-17 10:20 UTC (permalink / raw)
  To: Wilco Dijkstra
  Cc: Jackson Woodruff, kyrylo.tkachov, Joseph S. Myers, GCC Patches, nd

On Tue, Aug 15, 2017 at 4:11 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> Richard Biener wrote:
>> > We also change the association of
>> >
>> >      x / (y * C) -> (x / C) / y
>> >
>> > If C is a constant.
>>
>> Why's that profitable?
>
> It enables (x * C1) / (y * C2) -> (x * C1/C2) / y for example.
> Also 1/y is now available to the reciprocal optimization, see
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 for details.

Sure, but on its own it's going to be slower.  So this isn't the
correct way to enable those followup transforms.

>> >   x / (- y) -> (-x) / y
>>
>> Why?  (it's only one of the possible canonicalizations)
>
> Same here, y is now available for reciprocal optimization. The
> negate may now be optimized, for example (a * b) / -y -> (-a*b) / y
> will use a negated multiple on various targets.

Fair enough.  Though if it were x / -(a*b) you'd regress that case.

Richard.

>
> Wilco

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)
  2017-08-17 10:20     ` Richard Biener
@ 2017-08-17 10:29       ` Wilco Dijkstra
  2017-08-17 10:39         ` Richard Biener
  2017-08-24 21:26         ` Jeff Law
  0 siblings, 2 replies; 22+ messages in thread
From: Wilco Dijkstra @ 2017-08-17 10:29 UTC (permalink / raw)
  To: Richard Biener
  Cc: Jackson Woodruff, kyrylo.tkachov, Joseph S. Myers, GCC Patches, nd

Richard Biener wrote:
> On Tue, Aug 15, 2017 at 4:11 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> > Richard Biener wrote:
>>> > We also change the association of
>>> >
>>> >      x / (y * C) -> (x / C) / y
>>> >
>>> > If C is a constant.
>>>
>>> Why's that profitable?
>>
>> It enables (x * C1) / (y * C2) -> (x * C1/C2) / y for example.
>> Also 1/y is now available to the reciprocal optimization, see
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 for details.
>
> Sure, but on its own it's going to be slower.  So this isn't the
> correct way to enable those followup transforms.

How can it be any slower? It's one division and one multiply in both cases.

>>> >   x / (- y) -> (-x) / y
>>>
>>> Why?  (it's only one of the possible canonicalizations)
>>
>> Same here, y is now available for reciprocal optimization. The
>> negate may now be optimized, for example (a * b) / -y -> (-a*b) / y
>> will use a negated multiple on various targets.
>
> Fair enough.  Though if it were x / -(a*b) you'd regress that case.

Possibly. You might still be able to merge the negate if the result is used in an
addition or multiply, which wouldn't be possible if it were hiding in a subexpression.
Without global analysis it seems best to move constants/negates to the toplevel
if they can't be trivially removed in a subexpression. Eg. -x / (a * b * -c).

Wilco

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)
  2017-08-17 10:29       ` Wilco Dijkstra
@ 2017-08-17 10:39         ` Richard Biener
  2017-08-17 12:46           ` Joseph Myers
  2017-08-24 21:26         ` Jeff Law
  1 sibling, 1 reply; 22+ messages in thread
From: Richard Biener @ 2017-08-17 10:39 UTC (permalink / raw)
  To: Wilco Dijkstra
  Cc: Jackson Woodruff, kyrylo.tkachov, Joseph S. Myers, GCC Patches, nd

On Thu, Aug 17, 2017 at 11:55 AM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> Richard Biener wrote:
>> On Tue, Aug 15, 2017 at 4:11 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>> > Richard Biener wrote:
>>>> > We also change the association of
>>>> >
>>>> >      x / (y * C) -> (x / C) / y
>>>> >
>>>> > If C is a constant.
>>>>
>>>> Why's that profitable?
>>>
>>> It enables (x * C1) / (y * C2) -> (x * C1/C2) / y for example.
>>> Also 1/y is now available to the reciprocal optimization, see
>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 for details.
>>
>> Sure, but on its own it's going to be slower.  So this isn't the
>> correct way to enable those followup transforms.
>
> How can it be any slower? It's one division and one multiply in both cases.

(x / C) / y is two divisions.  If you restrict it to the case where we can
transform this to (x * C') / y then it's indeed one.

>>>> >   x / (- y) -> (-x) / y
>>>>
>>>> Why?  (it's only one of the possible canonicalizations)
>>>
>>> Same here, y is now available for reciprocal optimization. The
>>> negate may now be optimized, for example (a * b) / -y -> (-a*b) / y
>>> will use a negated multiple on various targets.
>>
>> Fair enough.  Though if it were x / -(a*b) you'd regress that case.
>
> Possibly. You might still be able to merge the negate if the result is used in an
> addition or multiply, which wouldn't be possible if it were hiding in a subexpression.
> Without global analysis it seems best to move constants/negates to the toplevel
> if they can't be trivially removed in a subexpression. Eg. -x / (a * b * -c).

Sure.  So both patterns are canonicalization which is fine for match.pd.  Those
followup transforms should be done at a place that can look at more complicated
patterns.  We have the reassoc pass, then backprop (not exactly matching),
and the recip pattern matching / cse pass.

Richard.

> Wilco

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)
  2017-08-17 10:39         ` Richard Biener
@ 2017-08-17 12:46           ` Joseph Myers
  0 siblings, 0 replies; 22+ messages in thread
From: Joseph Myers @ 2017-08-17 12:46 UTC (permalink / raw)
  To: Richard Biener
  Cc: Wilco Dijkstra, Jackson Woodruff, kyrylo.tkachov, GCC Patches, nd

On Thu, 17 Aug 2017, Richard Biener wrote:

> > Without global analysis it seems best to move constants/negates to the 
> > toplevel if they can't be trivially removed in a subexpression. Eg. -x 
> > / (a * b * -c).
> 
> Sure.  So both patterns are canonicalization which is fine for match.pd.  
> Those followup transforms should be done at a place that can look at 
> more complicated patterns.  We have the reassoc pass, then backprop (not 
> exactly matching), and the recip pattern matching / cse pass.

Watch out for -frounding-math issues, though.  -a * b is always the same 
as a * -b, but not the same as -(a * b) with -frounding-math, and 
similarly, -frounding-math should prevent eliminating the negations from 
the -x / (a * b * -c) expression (whereas elimination from -x / -(a * b * 
c) would be fine).

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)
  2017-08-17 10:29       ` Wilco Dijkstra
  2017-08-17 10:39         ` Richard Biener
@ 2017-08-24 21:26         ` Jeff Law
  2017-08-29 12:13           ` Jackson Woodruff
  1 sibling, 1 reply; 22+ messages in thread
From: Jeff Law @ 2017-08-24 21:26 UTC (permalink / raw)
  To: Wilco Dijkstra, Richard Biener
  Cc: Jackson Woodruff, kyrylo.tkachov, Joseph S. Myers, GCC Patches, nd

On 08/17/2017 03:55 AM, Wilco Dijkstra wrote:
> Richard Biener wrote:
>> On Tue, Aug 15, 2017 at 4:11 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>>> Richard Biener wrote:
>>>>> We also change the association of
>>>>>
>>>>>       x / (y * C) -> (x / C) / y
>>>>>
>>>>> If C is a constant.
>>>>
>>>> Why's that profitable?
>>>
>>> It enables (x * C1) / (y * C2) -> (x * C1/C2) / y for example.
>>> Also 1/y is now available to the reciprocal optimization, see
>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 for details.
>>
>> Sure, but on its own it's going to be slower.  So this isn't the
>> correct way to enable those followup transforms.
> 
> How can it be any slower? It's one division and one multiply in both cases.
x / (y * C) -> (x / C) / y

Goes from one division and one multiplication to two divisions.  I'm
guessing that's what Richi is (reasonably) concerned about.

So it may be the case that we need more complex pattern matching here
for when to perform the first transformation to ultimately enable the
second.


Jeff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)
  2017-08-24 21:26         ` Jeff Law
@ 2017-08-29 12:13           ` Jackson Woodruff
  2017-08-29 13:31             ` Richard Biener
  0 siblings, 1 reply; 22+ messages in thread
From: Jackson Woodruff @ 2017-08-29 12:13 UTC (permalink / raw)
  To: Jeff Law, Wilco Dijkstra, Richard Biener
  Cc: kyrylo.tkachov, Joseph S. Myers, GCC Patches, nd

[-- Attachment #1: Type: text/plain, Size: 2088 bytes --]

Hi all,

Apologies again to those CC'ed, who (again) received this twice.

Joseph: Yes you are correct. I misread the original thread, now fixed.

Richard: I've moved the optimizations out of fold-const.c. One has been 
replicated in match.pd, and the other (x / C +- y / C -> (x +- y) / C) 
I've deleted as it only introduced a new optimization when running with 
the flags '-O0 -funsafe-math-optimizations'.

On O1 and up, the pattern that replaces 'x / C' with 'x * (1 / C)'
is enabled and then the pattern is covered by that and
(x * C +- y * C -> C * (x +- y)) (which is already present in match.pd)

I have also updated the testcase for those optimizations to use 'O1' to 
avoid that case.


On 08/24/2017 10:06 PM, Jeff Law wrote:
> On 08/17/2017 03:55 AM, Wilco Dijkstra wrote:
>> Richard Biener wrote:
>>> On Tue, Aug 15, 2017 at 4:11 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>>>> Richard Biener wrote:
>>>>>> We also change the association of
>>>>>>
>>>>>>        x / (y * C) -> (x / C) / y
>>>>>>
>>>>>> If C is a constant.
>>>>>
>>>>> Why's that profitable?
>>>>
>>>> It enables (x * C1) / (y * C2) -> (x * C1/C2) / y for example.
>>>> Also 1/y is now available to the reciprocal optimization, see
>>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 for details.
>>>
>>> Sure, but on its own it's going to be slower.  So this isn't the
>>> correct way to enable those followup transforms.
>>
>> How can it be any slower? It's one division and one multiply in both cases.
> x / (y * C) -> (x / C) / y
> 
> Goes from one division and one multiplication to two divisions.  I'm
> guessing that's what Richi is (reasonably) concerned about.
> 
> So it may be the case that we need more complex pattern matching here
> for when to perform the first transformation to ultimately enable the
> second.
> 

The only case where we don't remove the division but still execute this 
pattern is when run with'-O0 -freciprocal-math'.

As long as we have 'O1' or greater (and -freciprocal-math), that is 
enough to enable the removal of the reciprocal.


Jackson.

> 
> Jeff
> 

[-- Attachment #2: patchfile-1 --]
[-- Type: text/plain, Size: 7089 bytes --]

diff --git a/gcc/fold-const.c b/gcc/fold-const.c
index eeeff1ed166734328a612142fdf6235274f9e858..907d96c3d994db1ec8dc0ef692ac0b4d59c99a4c 100644
--- a/gcc/fold-const.c
+++ b/gcc/fold-const.c
@@ -3834,47 +3834,6 @@ invert_truthvalue_loc (location_t loc, tree arg)
 			       : TRUTH_NOT_EXPR,
 			  type, arg);
 }
-
-/* Knowing that ARG0 and ARG1 are both RDIV_EXPRs, simplify a binary operation
-   with code CODE.  This optimization is unsafe.  */
-static tree
-distribute_real_division (location_t loc, enum tree_code code, tree type,
-			  tree arg0, tree arg1)
-{
-  bool mul0 = TREE_CODE (arg0) == MULT_EXPR;
-  bool mul1 = TREE_CODE (arg1) == MULT_EXPR;
-
-  /* (A / C) +- (B / C) -> (A +- B) / C.  */
-  if (mul0 == mul1
-      && operand_equal_p (TREE_OPERAND (arg0, 1),
-		       TREE_OPERAND (arg1, 1), 0))
-    return fold_build2_loc (loc, mul0 ? MULT_EXPR : RDIV_EXPR, type,
-			fold_build2_loc (loc, code, type,
-				     TREE_OPERAND (arg0, 0),
-				     TREE_OPERAND (arg1, 0)),
-			TREE_OPERAND (arg0, 1));
-
-  /* (A / C1) +- (A / C2) -> A * (1 / C1 +- 1 / C2).  */
-  if (operand_equal_p (TREE_OPERAND (arg0, 0),
-		       TREE_OPERAND (arg1, 0), 0)
-      && TREE_CODE (TREE_OPERAND (arg0, 1)) == REAL_CST
-      && TREE_CODE (TREE_OPERAND (arg1, 1)) == REAL_CST)
-    {
-      REAL_VALUE_TYPE r0, r1;
-      r0 = TREE_REAL_CST (TREE_OPERAND (arg0, 1));
-      r1 = TREE_REAL_CST (TREE_OPERAND (arg1, 1));
-      if (!mul0)
-	real_arithmetic (&r0, RDIV_EXPR, &dconst1, &r0);
-      if (!mul1)
-        real_arithmetic (&r1, RDIV_EXPR, &dconst1, &r1);
-      real_arithmetic (&r0, code, &r0, &r1);
-      return fold_build2_loc (loc, MULT_EXPR, type,
-			  TREE_OPERAND (arg0, 0),
-			  build_real (type, r0));
-    }
-
-  return NULL_TREE;
-}
 \f
 /* Return a BIT_FIELD_REF of type TYPE to refer to BITSIZE bits of INNER
    starting at BITPOS.  The field is unsigned if UNSIGNEDP is nonzero
@@ -9419,12 +9378,6 @@ fold_binary_loc (location_t loc,
 		}
 	    }
 
-	  if (flag_unsafe_math_optimizations
-	      && (TREE_CODE (arg0) == RDIV_EXPR || TREE_CODE (arg0) == MULT_EXPR)
-	      && (TREE_CODE (arg1) == RDIV_EXPR || TREE_CODE (arg1) == MULT_EXPR)
-	      && (tem = distribute_real_division (loc, code, type, arg0, arg1)))
-	    return tem;
-
           /* Convert a + (b*c + d*e) into (a + b*c) + d*e.
              We associate floats only if the user has specified
              -fassociative-math.  */
@@ -9772,13 +9725,6 @@ fold_binary_loc (location_t loc,
 	    return tem;
 	}
 
-      if (FLOAT_TYPE_P (type)
-	  && flag_unsafe_math_optimizations
-	  && (TREE_CODE (arg0) == RDIV_EXPR || TREE_CODE (arg0) == MULT_EXPR)
-	  && (TREE_CODE (arg1) == RDIV_EXPR || TREE_CODE (arg1) == MULT_EXPR)
-	  && (tem = distribute_real_division (loc, code, type, arg0, arg1)))
-	return tem;
-
       /* Handle (A1 * C1) - (A2 * C2) with A1, A2 or C1, C2 being the same or
 	 one.  Make sure the type is not saturating and has the signedness of
 	 the stripped operands, as fold_plusminus_mult_expr will re-associate.
diff --git a/gcc/match.pd b/gcc/match.pd
index e98db52af84946cf579c6434e06d450713a47162..2714b404826c2b4c74e3f382b85bf2ec183ee4f8 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -342,16 +342,41 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
   (negate @0)))
 
 (if (flag_reciprocal_math)
- /* Convert (A/B)/C to A/(B*C)  */
+ /* Convert (A/B)/C to A/(B*C) where neither B nor C are constant.  */
  (simplify
   (rdiv (rdiv:s @0 @1) @2)
-   (rdiv @0 (mult @1 @2)))
+   (if (TREE_CODE (@1) != REAL_CST && TREE_CODE (@1) != REAL_CST)
+    (rdiv @0 (mult @1 @2))))
+
+ /* Simplify x / (C * y) to (x / C) / y where C is a constant.  */
+ (simplify
+  (rdiv @0
+   (mult @1 REAL_CST@2))
+  (rdiv (rdiv @0 @2) @1))
 
  /* Convert A/(B/C) to (A/B)*C  */
  (simplify
   (rdiv @0 (rdiv:s @1 @2))
    (mult (rdiv @0 @1) @2)))
 
+/* Simplify x / (- y) to -x / y.  */
+(simplify
+ (rdiv @0 (negate @1))
+ (rdiv (negate @0) @1))
+
+(if (flag_unsafe_math_optimizations)
+  /* Simplify (C / x op 0.0) to x op 0.0 for C > 0.  */
+  (for op (lt le gt ge)
+       neg_op (gt ge lt le)
+    (simplify
+      (op (rdiv REAL_CST@0 @1) real_zerop@2)
+      (switch
+       (if (real_less (&dconst0, TREE_REAL_CST_PTR (@0)))
+	(op @1 @2))
+       /* For C < 0, use the inverted operator.  */
+       (if (real_less (TREE_REAL_CST_PTR (@0), &dconst0))
+	(neg_op @1 @2))))))
+
 /* Optimize (X & (-A)) / A where A is a power of 2, to X >> log2(A) */
 (for div (trunc_div ceil_div floor_div round_div exact_div)
  (simplify
@@ -610,15 +635,6 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
     (if (tem)
      (rdiv { tem; } @1)))))
 
-/* Convert C1/(X*C2) into (C1/C2)/X  */
-(simplify
- (rdiv REAL_CST@0 (mult @1 REAL_CST@2))
-  (if (flag_reciprocal_math)
-   (with
-    { tree tem = const_binop (RDIV_EXPR, type, @0, @2); }
-    (if (tem)
-     (rdiv { tem; } @1)))))
-
 /* Simplify ~X & X as zero.  */
 (simplify
  (bit_and:c (convert? @0) (convert? (bit_not @0)))
@@ -3446,6 +3462,34 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
   (if (!HONOR_SNANS (type))
    @0))
 
+ (for cmp (lt le gt ge)
+      neg_cmp (gt ge lt le)
+  /* Simplify (x / C1) cmp y -> x cmp y * C1.  */
+  (simplify
+   (cmp (rdiv @0 REAL_CST@1) @2)
+   (switch
+    (if (real_less (&dconst0, TREE_REAL_CST_PTR (@1)))
+     (cmp @0 (mult @2 @1)))
+    (if (real_less (TREE_REAL_CST_PTR (@1), &dconst0))
+     (neg_cmp @0 (mult @2 @1)))))
+
+  /* Simplify (x * C1) cmp C2 -> x cmp C2 / C1, where C1 != 0.  */
+  (simplify
+   (cmp (mult @0 REAL_CST@1) REAL_CST@2)
+   (switch
+    (if (real_less (&dconst0, TREE_REAL_CST_PTR (@1)))
+     (cmp @0 (rdiv @2 @1)))
+    (if (real_less (TREE_REAL_CST_PTR (@1), &dconst0))
+     (neg_cmp @0 (rdiv @2 @1))))))
+
+  (for op (plus minus)
+   /* Simplify (A / C) +- (B / C) -> (A +- B) / C.  */
+   (simplify
+    (op (rdiv @0 @1)
+	(rdiv @2 @1))
+    (rdiv (op @0 @2) @1)))
+
+
  /* Simplify sqrt(x) * sqrt(y) -> sqrt(x*y).  */
  (for root (SQRT CBRT)
   (simplify
diff --git a/gcc/testsuite/gcc.dg/associate_comparison_1.c b/gcc/testsuite/gcc.dg/associate_comparison_1.c
new file mode 100644
index 0000000000000000000000000000000000000000..d0a197b48d660d3f2e9fabda855670ba477afc4d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/associate_comparison_1.c
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-options "-funsafe-math-optimizations -fdump-tree-optimized" } */
+
+int
+lt_cmp (float x, float y)
+{
+  return x / 2 <= 0;
+}
+
+int lt_cmp_2 (float x, float y)
+{
+  return x / 3 <= y;
+}
+
+int
+inv_cmp (float x, float y)
+{
+  return 5 / x >= 0;
+}
+
+
+/* { dg-final { scan-tree-dump-not " / " "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/fold-div-1.c b/gcc/testsuite/gcc.dg/fold-div-1.c
index c1c7250f882cfed7705ba60994e47440580f4c76..73b75861166f40733d9768e9703664d51ee1a9ef 100644
--- a/gcc/testsuite/gcc.dg/fold-div-1.c
+++ b/gcc/testsuite/gcc.dg/fold-div-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-funsafe-math-optimizations -fdump-tree-gimple" } */
+/* { dg-options "-O1 -funsafe-math-optimizations -fdump-tree-gimple" } */
 
 float f(float x)
 {

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)
  2017-08-29 12:13           ` Jackson Woodruff
@ 2017-08-29 13:31             ` Richard Biener
  2017-08-30 10:45               ` Jackson Woodruff
  0 siblings, 1 reply; 22+ messages in thread
From: Richard Biener @ 2017-08-29 13:31 UTC (permalink / raw)
  To: Jackson Woodruff
  Cc: Jeff Law, Wilco Dijkstra, kyrylo.tkachov, Joseph S. Myers,
	GCC Patches, nd

On Tue, Aug 29, 2017 at 1:35 PM, Jackson Woodruff
<jackson.woodruff@foss.arm.com> wrote:
> Hi all,
>
> Apologies again to those CC'ed, who (again) received this twice.
>
> Joseph: Yes you are correct. I misread the original thread, now fixed.
>
> Richard: I've moved the optimizations out of fold-const.c. One has been
> replicated in match.pd, and the other (x / C +- y / C -> (x +- y) / C) I've
> deleted as it only introduced a new optimization when running with the flags
> '-O0 -funsafe-math-optimizations'.

Hmm, how did you verify that, that it only adds sth with -O0
-funsafe-math-optimizations?
Is that because in GIMPLE the reassoc pass should do this transform?

You added

+/* Simplify x / (- y) to -x / y.  */
+(simplify
+ (rdiv @0 (negate @1))
+ (rdiv (negate @0) @1))

for

      /* (-A) / (-B) -> A / B  */
      if (TREE_CODE (arg0) == NEGATE_EXPR && negate_expr_p (arg1))
        return fold_build2_loc (loc, RDIV_EXPR, type,
                            TREE_OPERAND (arg0, 0),
                            negate_expr (arg1));
      if (TREE_CODE (arg1) == NEGATE_EXPR && negate_expr_p (arg0))
        return fold_build2_loc (loc, RDIV_EXPR, type,
                            negate_expr (arg0),
                            TREE_OPERAND (arg1, 0));

presumably?  This isn't equivalent.  It's more like

(simplify
  (rdiv (negate_expr_p @0) (negate @1))
  (rdiv (negate @0) @1))
(simplify
  (rdiv (negate @0) (negate_expr_p @1))
  (rdiv @0 (negate @1)))

and you should remove the corresponding fold-const.c code.

Please do these changes independently to aid bisecting in case of issues.

 (if (flag_reciprocal_math)
- /* Convert (A/B)/C to A/(B*C)  */
+ /* Convert (A/B)/C to A/(B*C) where neither B nor C are constant.  */
  (simplify
   (rdiv (rdiv:s @0 @1) @2)
-   (rdiv @0 (mult @1 @2)))
+   (if (TREE_CODE (@1) != REAL_CST && TREE_CODE (@1) != REAL_CST)
+    (rdiv @0 (mult @1 @2))))

why?  I guess to avoid ping-poning with

+ /* Simplify x / (C * y) to (x / C) / y where C is a constant.  */
+ (simplify
+  (rdiv @0
+   (mult @1 REAL_CST@2))
+  (rdiv (rdiv @0 @2) @1))

?  If so why not just disallow for @1 == REAL_CST?

> On O1 and up, the pattern that replaces 'x / C' with 'x * (1 / C)'
> is enabled and then the pattern is covered by that and
> (x * C +- y * C -> C * (x +- y)) (which is already present in match.pd)
>
> I have also updated the testcase for those optimizations to use 'O1' to
> avoid that case.
>
>
> On 08/24/2017 10:06 PM, Jeff Law wrote:
>>
>> On 08/17/2017 03:55 AM, Wilco Dijkstra wrote:
>>>
>>> Richard Biener wrote:
>>>>
>>>> On Tue, Aug 15, 2017 at 4:11 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com>
>>>> wrote:
>>>>>
>>>>> Richard Biener wrote:
>>>>>>>
>>>>>>> We also change the association of
>>>>>>>
>>>>>>>        x / (y * C) -> (x / C) / y
>>>>>>>
>>>>>>> If C is a constant.
>>>>>>
>>>>>>
>>>>>> Why's that profitable?
>>>>>
>>>>>
>>>>> It enables (x * C1) / (y * C2) -> (x * C1/C2) / y for example.
>>>>> Also 1/y is now available to the reciprocal optimization, see
>>>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 for details.
>>>>
>>>>
>>>> Sure, but on its own it's going to be slower.  So this isn't the
>>>> correct way to enable those followup transforms.
>>>
>>>
>>> How can it be any slower? It's one division and one multiply in both
>>> cases.
>>
>> x / (y * C) -> (x / C) / y
>>
>> Goes from one division and one multiplication to two divisions.  I'm
>> guessing that's what Richi is (reasonably) concerned about.
>>
>> So it may be the case that we need more complex pattern matching here
>> for when to perform the first transformation to ultimately enable the
>> second.
>>
>
> The only case where we don't remove the division but still execute this
> pattern is when run with'-O0 -freciprocal-math'.
>
> As long as we have 'O1' or greater (and -freciprocal-math), that is enough
> to enable the removal of the reciprocal.

I don't see this.  I presume you mean this happens in the recip pass?
But we don't optimize this when optimizing for size (but the above pattern
still applies) or when targetm.min_divisions_for_recip_mul is too large.

So I still think this is the wrong place to do this and instead the recip
pass should be extended.



>
> Jackson.
>
>>
>> Jeff
>>
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)
  2017-08-29 13:31             ` Richard Biener
@ 2017-08-30 10:45               ` Jackson Woodruff
  2017-08-30 13:26                 ` Richard Biener
  0 siblings, 1 reply; 22+ messages in thread
From: Jackson Woodruff @ 2017-08-30 10:45 UTC (permalink / raw)
  To: Richard Biener
  Cc: Jeff Law, Wilco Dijkstra, kyrylo.tkachov, Joseph S. Myers,
	GCC Patches, nd

On 08/29/2017 01:13 PM, Richard Biener wrote:
> On Tue, Aug 29, 2017 at 1:35 PM, Jackson Woodruff
> <jackson.woodruff@foss.arm.com> wrote:
>> Hi all,
>>
>> Apologies again to those CC'ed, who (again) received this twice.
>>
>> Joseph: Yes you are correct. I misread the original thread, now fixed.
>>
>> Richard: I've moved the optimizations out of fold-const.c. One has been
>> replicated in match.pd, and the other (x / C +- y / C -> (x +- y) / C) I've
>> deleted as it only introduced a new optimization when running with the flags
>> '-O0 -funsafe-math-optimizations'.
> 
> Hmm, how did you verify that, that it only adds sth with -O0
> -funsafe-math-optimizations?

By checking with various flags, although not exhaustively. I looked for 
reasons for the behavior in match.pd (explained below).

I have also since discovered that the combinations of 
'-funsafe-math-optimizations -frounding-math' (at all levels) and 
'-fno-recriprocal-math -funsafe-math-optimizations' mean this pattern 
adds something.

> Is that because in GIMPLE the reassoc pass should do this transform?
It's because the pattern that changes (X / C) -> X * (1 / C) is gated 
with O1:

   (for cst (REAL_CST COMPLEX_CST VECTOR_CST)
    (simplify
     (rdiv @0 cst@1)
->    (if (optimize)
->     (if (flag_reciprocal_math
       && !real_zerop (@1))
       (with
        { tree tem = const_binop (RDIV_EXPR, type, build_one_cst (type), 
@1); }
        (if (tem)
         (mult @0 { tem; } )))
       (if (cst != COMPLEX_CST)
        (with { tree inverse = exact_inverse (type, @1); }
         (if (inverse)
          (mult @0 { inverse; } ))))))))


I've flagged the two lines that are particularly relevant to this.

Removing this pattern, as I would expect, means that the divisions in 
the above optimization (and the one further down) are not removed.

So then there is the question of edge cases. This pattern is (ignoring 
the second case) going to fail when const_binop returns null. Looking 
through that function says that it will fail (for reals) when:

- Either argument is null (not the case)

- The operation is not in the list (PLUS_EXPR,
   MINUS_EXPR, MULT_EXPR, RDIV_EXPR, MIN_EXPR, MAX_EXPR)
   (again not the case)

- We honor Signalling NaNs and one of the operands is a sNaN.

- The operation is a division, and the second argument is zero 	
   and dividing by 0.0 raises an exception.

- The result is infinity and neither of the operands were infinity
   and flag_trapping_math is set.

- The result isn't exact and flag_rounding_math is set.


For (x / ( y * C) -> (x / C) / y), I will add some gating for each of 
these so that the pattern is never executed if one of these would be the 
case.

The additional cases where this isn't converted to a multiplication by 
the reciprocal appear to be when -freciprocal-math is used, but we don't 
have -fno-rounding-math, or funsafe-math-optimizations.

For the removal of (x / C +- y / C -> (x +- y) / C), there is a 
trade-off of whether it is worth having an extra pattern for these edge 
cases. On this I'm not sure.

> 
> You added
> 
> +/* Simplify x / (- y) to -x / y.  */
> +(simplify
> + (rdiv @0 (negate @1))
> + (rdiv (negate @0) @1))

Sorry, this was unclear, the changes to fold const should be split up 
from the additions to match.pd.

This was one of the changes discussed before. The idea is to 
canonicalize this so that y can be extracted in the cse_reciprocals pass.


> 
> for
> 
>        /* (-A) / (-B) -> A / B  */
>        if (TREE_CODE (arg0) == NEGATE_EXPR && negate_expr_p (arg1))
>          return fold_build2_loc (loc, RDIV_EXPR, type,
>                              TREE_OPERAND (arg0, 0),
>                              negate_expr (arg1));
>        if (TREE_CODE (arg1) == NEGATE_EXPR && negate_expr_p (arg0))
>          return fold_build2_loc (loc, RDIV_EXPR, type,
>                              negate_expr (arg0),
>                              TREE_OPERAND (arg1, 0));
> 
> presumably?  This isn't equivalent.  It's more like
> 
> (simplify
>    (rdiv (negate_expr_p @0) (negate @1))
>    (rdiv (negate @0) @1))
> (simplify
>    (rdiv (negate @0) (negate_expr_p @1))
>    (rdiv @0 (negate @1)))
> 
> and you should remove the corresponding fold-const.c code.
> 
> Please do these changes independently to aid bisecting in case of issues.

I'll resubmit two different patches when I can get them separated. It 
should also make it easier to review when it is clearer what belongs where.

> 
>   (if (flag_reciprocal_math)
> - /* Convert (A/B)/C to A/(B*C)  */
> + /* Convert (A/B)/C to A/(B*C) where neither B nor C are constant.  */
>    (simplify
>     (rdiv (rdiv:s @0 @1) @2)
> -   (rdiv @0 (mult @1 @2)))
> +   (if (TREE_CODE (@1) != REAL_CST && TREE_CODE (@1) != REAL_CST)
> +    (rdiv @0 (mult @1 @2))))
> 
> why?  I guess to avoid ping-poning with

Yes, although there is a mistake there. It should be:

(if (TREE_CODE (@1) != REAL_CST && TREE_CODE (@2) != REAL_CST)

I'll fix that up.

> 
> + /* Simplify x / (C * y) to (x / C) / y where C is a constant.  */
> + (simplify
> +  (rdiv @0
> +   (mult @1 REAL_CST@2))
> +  (rdiv (rdiv @0 @2) @1))
> 
> ?  If so why not just disallow for @1 == REAL_CST?

The ping-ponging issue is there even when only one of the operands is a 
constant (which of course it has to be for this pattern to apply). For 
example, something like x / (y * 3), with '-O0 -ffast-math',
when the reciprocal isn't computed, ping pongs back and forth.

I'm not sure that it can be fixed without a condition on the pattern 
already there.

> 
>> On O1 and up, the pattern that replaces 'x / C' with 'x * (1 / C)'
>> is enabled and then the pattern is covered by that and
>> (x * C +- y * C -> C * (x +- y)) (which is already present in match.pd)
>>
>> I have also updated the testcase for those optimizations to use 'O1' to
>> avoid that case.
>>
>>
>> On 08/24/2017 10:06 PM, Jeff Law wrote:
>>>
>>> On 08/17/2017 03:55 AM, Wilco Dijkstra wrote:
>>>>
>>>> Richard Biener wrote:
>>>>>
>>>>> On Tue, Aug 15, 2017 at 4:11 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com>
>>>>> wrote:
>>>>>>
>>>>>> Richard Biener wrote:
>>>>>>>>
>>>>>>>> We also change the association of
>>>>>>>>
>>>>>>>>         x / (y * C) -> (x / C) / y
>>>>>>>>
>>>>>>>> If C is a constant.
>>>>>>>
>>>>>>>
>>>>>>> Why's that profitable?
>>>>>>
>>>>>>
>>>>>> It enables (x * C1) / (y * C2) -> (x * C1/C2) / y for example.
>>>>>> Also 1/y is now available to the reciprocal optimization, see
>>>>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 for details.
>>>>>
>>>>>
>>>>> Sure, but on its own it's going to be slower.  So this isn't the
>>>>> correct way to enable those followup transforms.
>>>>
>>>>
>>>> How can it be any slower? It's one division and one multiply in both
>>>> cases.
>>>
>>> x / (y * C) -> (x / C) / y
>>>
>>> Goes from one division and one multiplication to two divisions.  I'm
>>> guessing that's what Richi is (reasonably) concerned about.
>>>
>>> So it may be the case that we need more complex pattern matching here
>>> for when to perform the first transformation to ultimately enable the
>>> second.
>>>
>>
>> The only case where we don't remove the division but still execute this
>> pattern is when run with'-O0 -freciprocal-math'.
>>
>> As long as we have 'O1' or greater (and -freciprocal-math), that is enough
>> to enable the removal of the reciprocal.
> 
> I don't see this.  I presume you mean this happens in the recip pass?

It's one of the match.pd patterns that actually does the extraction 
since C is always constant. I've copied in the pattern above in response 
to one of your previous comments.

> But we don't optimize this when optimizing for size (but the above pattern > still applies) or when targetm.min_divisions_for_recip_mul is too large

It was my understanding that this is used to specify the number of 
divisions you have to be able to replace with a multiplication before it 
is worthwhile inserting an extra multiplication.

So for situations like this, I think that this is not quite what is 
being measured, because there isn't an extra multiplication being inserted?

> 
> So I still think this is the wrong place to do this and instead the recip
> pass should be extended. 
>
> 
> 
>>
>> Jackson.
>>
>>>
>>> Jeff
>>>
>>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)
  2017-08-30 10:45               ` Jackson Woodruff
@ 2017-08-30 13:26                 ` Richard Biener
  2017-09-06  9:55                   ` Jackson Woodruff
  0 siblings, 1 reply; 22+ messages in thread
From: Richard Biener @ 2017-08-30 13:26 UTC (permalink / raw)
  To: Jackson Woodruff
  Cc: Jeff Law, Wilco Dijkstra, kyrylo.tkachov, Joseph S. Myers,
	GCC Patches, nd

On Wed, Aug 30, 2017 at 11:46 AM, Jackson Woodruff
<jackson.woodruff@foss.arm.com> wrote:
> On 08/29/2017 01:13 PM, Richard Biener wrote:
>>
>> On Tue, Aug 29, 2017 at 1:35 PM, Jackson Woodruff
>> <jackson.woodruff@foss.arm.com> wrote:
>>>
>>> Hi all,
>>>
>>> Apologies again to those CC'ed, who (again) received this twice.
>>>
>>> Joseph: Yes you are correct. I misread the original thread, now fixed.
>>>
>>> Richard: I've moved the optimizations out of fold-const.c. One has been
>>> replicated in match.pd, and the other (x / C +- y / C -> (x +- y) / C)
>>> I've
>>> deleted as it only introduced a new optimization when running with the
>>> flags
>>> '-O0 -funsafe-math-optimizations'.
>>
>>
>> Hmm, how did you verify that, that it only adds sth with -O0
>> -funsafe-math-optimizations?
>
>
> By checking with various flags, although not exhaustively. I looked for
> reasons for the behavior in match.pd (explained below).
>
> I have also since discovered that the combinations of
> '-funsafe-math-optimizations -frounding-math' (at all levels) and
> '-fno-recriprocal-math -funsafe-math-optimizations' mean this pattern adds
> something.
>
>> Is that because in GIMPLE the reassoc pass should do this transform?
>
> It's because the pattern that changes (X / C) -> X * (1 / C) is gated with
> O1:
>
>   (for cst (REAL_CST COMPLEX_CST VECTOR_CST)
>    (simplify
>     (rdiv @0 cst@1)
> ->    (if (optimize)
> ->     (if (flag_reciprocal_math
>       && !real_zerop (@1))
>       (with
>        { tree tem = const_binop (RDIV_EXPR, type, build_one_cst (type), @1);
> }
>        (if (tem)
>         (mult @0 { tem; } )))
>       (if (cst != COMPLEX_CST)
>        (with { tree inverse = exact_inverse (type, @1); }
>         (if (inverse)
>          (mult @0 { inverse; } ))))))))
>
>
> I've flagged the two lines that are particularly relevant to this.

So this means we go x / (C * y) -> (x / C) / y -> (x * (1/C)) / y
why's that in any way preferable?  I suppose this is again to enable
the recip pass to detect / y (as opposed to / (C * y))?  What's the
reason to believe that / y is more "frequent"?

> Removing this pattern, as I would expect, means that the divisions in the
> above optimization (and the one further down) are not removed.
>
> So then there is the question of edge cases. This pattern is (ignoring the
> second case) going to fail when const_binop returns null. Looking through
> that function says that it will fail (for reals) when:
>
> - Either argument is null (not the case)
>
> - The operation is not in the list (PLUS_EXPR,
>   MINUS_EXPR, MULT_EXPR, RDIV_EXPR, MIN_EXPR, MAX_EXPR)
>   (again not the case)
>
> - We honor Signalling NaNs and one of the operands is a sNaN.
>
> - The operation is a division, and the second argument is zero
>   and dividing by 0.0 raises an exception.
>
> - The result is infinity and neither of the operands were infinity
>   and flag_trapping_math is set.
>
> - The result isn't exact and flag_rounding_math is set.
>
>
> For (x / ( y * C) -> (x / C) / y), I will add some gating for each of these
> so that the pattern is never executed if one of these would be the case.

Why not transform this directly to (x * (1/C)) / y then (and only then)?
That makes it obvious not two divisions prevail.

That said, I'm questioning this canonicalization.  I can come up with a
testcase where it makes things worse:

 tem = x / (y * C);
 tem2 = z / (y * C);

should generate

 rdivtmp = 1 / (y*C);
 tem = x *rdivtmp;
 tem2= z * rdivtmp;

instead of

 rdivtmp = 1/y;
 tem = x * 1/C * rdivtmp;
 tem2 = z * 1/C * rdivtmp;

> The additional cases where this isn't converted to a multiplication by the
> reciprocal appear to be when -freciprocal-math is used, but we don't have
> -fno-rounding-math, or funsafe-math-optimizations.
>
> For the removal of (x / C +- y / C -> (x +- y) / C), there is a trade-off of
> whether it is worth having an extra pattern for these edge cases. On this
> I'm not sure.

Well, at least leave it in fold-const.c if you can't move it.

>>
>> You added
>>
>> +/* Simplify x / (- y) to -x / y.  */
>> +(simplify
>> + (rdiv @0 (negate @1))
>> + (rdiv (negate @0) @1))
>
>
> Sorry, this was unclear, the changes to fold const should be split up from
> the additions to match.pd.
>
> This was one of the changes discussed before. The idea is to canonicalize
> this so that y can be extracted in the cse_reciprocals pass.

As said elsewhere I think cse_reciprocals needs a rewrite to not rely on
arbitrary canonicalization to do its work but consider association
opportunities globally with a focus on CSE.

>
>>
>> for
>>
>>        /* (-A) / (-B) -> A / B  */
>>        if (TREE_CODE (arg0) == NEGATE_EXPR && negate_expr_p (arg1))
>>          return fold_build2_loc (loc, RDIV_EXPR, type,
>>                              TREE_OPERAND (arg0, 0),
>>                              negate_expr (arg1));
>>        if (TREE_CODE (arg1) == NEGATE_EXPR && negate_expr_p (arg0))
>>          return fold_build2_loc (loc, RDIV_EXPR, type,
>>                              negate_expr (arg0),
>>                              TREE_OPERAND (arg1, 0));
>>
>> presumably?  This isn't equivalent.  It's more like
>>
>> (simplify
>>    (rdiv (negate_expr_p @0) (negate @1))
>>    (rdiv (negate @0) @1))
>> (simplify
>>    (rdiv (negate @0) (negate_expr_p @1))
>>    (rdiv @0 (negate @1)))
>>
>> and you should remove the corresponding fold-const.c code.
>>
>> Please do these changes independently to aid bisecting in case of issues.
>
>
> I'll resubmit two different patches when I can get them separated. It should
> also make it easier to review when it is clearer what belongs where.

Thanks.

>>
>>   (if (flag_reciprocal_math)
>> - /* Convert (A/B)/C to A/(B*C)  */
>> + /* Convert (A/B)/C to A/(B*C) where neither B nor C are constant.  */
>>    (simplify
>>     (rdiv (rdiv:s @0 @1) @2)
>> -   (rdiv @0 (mult @1 @2)))
>> +   (if (TREE_CODE (@1) != REAL_CST && TREE_CODE (@1) != REAL_CST)
>> +    (rdiv @0 (mult @1 @2))))
>>
>> why?  I guess to avoid ping-poning with
>
>
> Yes, although there is a mistake there. It should be:
>
> (if (TREE_CODE (@1) != REAL_CST && TREE_CODE (@2) != REAL_CST)
>
> I'll fix that up.
>
>>
>> + /* Simplify x / (C * y) to (x / C) / y where C is a constant.  */
>> + (simplify
>> +  (rdiv @0
>> +   (mult @1 REAL_CST@2))
>> +  (rdiv (rdiv @0 @2) @1))
>>
>> ?  If so why not just disallow for @1 == REAL_CST?
>
>
> The ping-ponging issue is there even when only one of the operands is a
> constant (which of course it has to be for this pattern to apply). For
> example, something like x / (y * 3), with '-O0 -ffast-math',
> when the reciprocal isn't computed, ping pongs back and forth.
>
> I'm not sure that it can be fixed without a condition on the pattern already
> there.
>
>
>>
>>> On O1 and up, the pattern that replaces 'x / C' with 'x * (1 / C)'
>>> is enabled and then the pattern is covered by that and
>>> (x * C +- y * C -> C * (x +- y)) (which is already present in match.pd)
>>>
>>> I have also updated the testcase for those optimizations to use 'O1' to
>>> avoid that case.
>>>
>>>
>>> On 08/24/2017 10:06 PM, Jeff Law wrote:
>>>>
>>>>
>>>> On 08/17/2017 03:55 AM, Wilco Dijkstra wrote:
>>>>>
>>>>>
>>>>> Richard Biener wrote:
>>>>>>
>>>>>>
>>>>>> On Tue, Aug 15, 2017 at 4:11 PM, Wilco Dijkstra
>>>>>> <Wilco.Dijkstra@arm.com>
>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>> Richard Biener wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> We also change the association of
>>>>>>>>>
>>>>>>>>>         x / (y * C) -> (x / C) / y
>>>>>>>>>
>>>>>>>>> If C is a constant.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Why's that profitable?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> It enables (x * C1) / (y * C2) -> (x * C1/C2) / y for example.
>>>>>>> Also 1/y is now available to the reciprocal optimization, see
>>>>>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 for details.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Sure, but on its own it's going to be slower.  So this isn't the
>>>>>> correct way to enable those followup transforms.
>>>>>
>>>>>
>>>>>
>>>>> How can it be any slower? It's one division and one multiply in both
>>>>> cases.
>>>>
>>>>
>>>> x / (y * C) -> (x / C) / y
>>>>
>>>> Goes from one division and one multiplication to two divisions.  I'm
>>>> guessing that's what Richi is (reasonably) concerned about.
>>>>
>>>> So it may be the case that we need more complex pattern matching here
>>>> for when to perform the first transformation to ultimately enable the
>>>> second.
>>>>
>>>
>>> The only case where we don't remove the division but still execute this
>>> pattern is when run with'-O0 -freciprocal-math'.
>>>
>>> As long as we have 'O1' or greater (and -freciprocal-math), that is
>>> enough
>>> to enable the removal of the reciprocal.
>>
>>
>> I don't see this.  I presume you mean this happens in the recip pass?
>
>
> It's one of the match.pd patterns that actually does the extraction since C
> is always constant. I've copied in the pattern above in response to one of
> your previous comments.
>
>> But we don't optimize this when optimizing for size (but the above pattern
>> > still applies) or when targetm.min_divisions_for_recip_mul is too large
>
>
> It was my understanding that this is used to specify the number of divisions
> you have to be able to replace with a multiplication before it is worthwhile
> inserting an extra multiplication.
>
> So for situations like this, I think that this is not quite what is being
> measured, because there isn't an extra multiplication being inserted?
>
>
>>
>> So I still think this is the wrong place to do this and instead the recip
>> pass should be extended.
>>
>>
>>>
>>> Jackson.
>>>
>>>>
>>>> Jeff
>>>>
>>>
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)
  2017-08-30 13:26                 ` Richard Biener
@ 2017-09-06  9:55                   ` Jackson Woodruff
  2017-09-13 18:37                     ` Jeff Law
  0 siblings, 1 reply; 22+ messages in thread
From: Jackson Woodruff @ 2017-09-06  9:55 UTC (permalink / raw)
  To: Richard Biener
  Cc: Jeff Law, Wilco Dijkstra, kyrylo.tkachov, Joseph S. Myers,
	GCC Patches, nd

[-- Attachment #1: Type: text/plain, Size: 7960 bytes --]

On 08/30/2017 01:46 PM, Richard Biener wrote:
> On Wed, Aug 30, 2017 at 11:46 AM, Jackson Woodruff
> <jackson.woodruff@foss.arm.com> wrote:
>> On 08/29/2017 01:13 PM, Richard Biener wrote:
>>>
>>> On Tue, Aug 29, 2017 at 1:35 PM, Jackson Woodruff
>>> <jackson.woodruff@foss.arm.com> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> Apologies again to those CC'ed, who (again) received this twice.
>>>>
>>>> Joseph: Yes you are correct. I misread the original thread, now fixed.
>>>>
>>>> Richard: I've moved the optimizations out of fold-const.c. One has been
>>>> replicated in match.pd, and the other (x / C +- y / C -> (x +- y) / C)
>>>> I've
>>>> deleted as it only introduced a new optimization when running with the
>>>> flags
>>>> '-O0 -funsafe-math-optimizations'.
>>>
>>>
>>> Hmm, how did you verify that, that it only adds sth with -O0
>>> -funsafe-math-optimizations?
>>
>>
>> By checking with various flags, although not exhaustively. I looked for
>> reasons for the behavior in match.pd (explained below).
>>
>> I have also since discovered that the combinations of
>> '-funsafe-math-optimizations -frounding-math' (at all levels) and
>> '-fno-recriprocal-math -funsafe-math-optimizations' mean this pattern adds
>> something.
>>
>>> Is that because in GIMPLE the reassoc pass should do this transform?
>>
>> It's because the pattern that changes (X / C) -> X * (1 / C) is gated with
>> O1:
>>
>>    (for cst (REAL_CST COMPLEX_CST VECTOR_CST)
>>     (simplify
>>      (rdiv @0 cst@1)
>> ->    (if (optimize)
>> ->     (if (flag_reciprocal_math
>>        && !real_zerop (@1))
>>        (with
>>         { tree tem = const_binop (RDIV_EXPR, type, build_one_cst (type), @1);
>> }
>>         (if (tem)
>>          (mult @0 { tem; } )))
>>        (if (cst != COMPLEX_CST)
>>         (with { tree inverse = exact_inverse (type, @1); }
>>          (if (inverse)
>>           (mult @0 { inverse; } ))))))))
>>
>>
>> I've flagged the two lines that are particularly relevant to this.
> 
> So this means we go x / (C * y) -> (x / C) / y -> (x * (1/C)) / y
> why's that in any way preferable?  I suppose this is again to enable
> the recip pass to detect / y (as opposed to / (C * y))?  What's the
> reason to believe that / y is more "frequent"?
> 
>> Removing this pattern, as I would expect, means that the divisions in the
>> above optimization (and the one further down) are not removed.
>>
>> So then there is the question of edge cases. This pattern is (ignoring the
>> second case) going to fail when const_binop returns null. Looking through
>> that function says that it will fail (for reals) when:
>>
>> - Either argument is null (not the case)
>>
>> - The operation is not in the list (PLUS_EXPR,
>>    MINUS_EXPR, MULT_EXPR, RDIV_EXPR, MIN_EXPR, MAX_EXPR)
>>    (again not the case)
>>
>> - We honor Signalling NaNs and one of the operands is a sNaN.
>>
>> - The operation is a division, and the second argument is zero
>>    and dividing by 0.0 raises an exception.
>>
>> - The result is infinity and neither of the operands were infinity
>>    and flag_trapping_math is set.
>>
>> - The result isn't exact and flag_rounding_math is set.
>>
>>
>> For (x / ( y * C) -> (x / C) / y), I will add some gating for each of these
>> so that the pattern is never executed if one of these would be the case.
> 
> Why not transform this directly to (x * (1/C)) / y then (and only then)?
> That makes it obvious not two divisions prevail.

Done.

> 
> That said, I'm questioning this canonicalization.  I can come up with a
> testcase where it makes things worse:
> 
>   tem = x / (y * C);
>   tem2 = z / (y * C);
> 
> should generate
> 
>   rdivtmp = 1 / (y*C);
>   tem = x *rdivtmp;
>   tem2= z * rdivtmp;
> 
> instead of
> 
>   rdivtmp = 1/y;
>   tem = x * 1/C * rdivtmp;
>   tem2 = z * 1/C * rdivtmp;

Ideally we would be able to CSE that into

rdivtmp = 1/y * 1/C;
tem = x * rdivtmp;
tem2 = z * rdivtmp;

Although we currently do not. An equally (perhaps more?) problematic 
case is something like:

tem = x / (y * C)
tem2 = y * C

Which becomes:

tem = x * (1 / C) / y
tem2 = y * C

Instead of

K = y * C
tem = x / K
tem2 = K

Which ultimately requires context awareness to avoid. This does seem to 
be a general problem with a large number of match.pd patterns rather 
than anything specific to this one. For example, a similar example can 
be constructed for (say) (A / B) / C -> (A / (B * C)).

> 
>> The additional cases where this isn't converted to a multiplication by the
>> reciprocal appear to be when -freciprocal-math is used, but we don't have
>> -fno-rounding-math, or funsafe-math-optimizations.
>> >>
>>>
>>>> On O1 and up, the pattern that replaces 'x / C' with 'x * (1 / C)'
>>>> is enabled and then the pattern is covered by that and
>>>> (x * C +- y * C -> C * (x +- y)) (which is already present in match.pd)
>>>>
>>>> I have also updated the testcase for those optimizations to use 'O1' to
>>>> avoid that case.
>>>>
>>>>
>>>> On 08/24/2017 10:06 PM, Jeff Law wrote:
>>>>>
>>>>>
>>>>> On 08/17/2017 03:55 AM, Wilco Dijkstra wrote:
>>>>>>
>>>>>>
>>>>>> Richard Biener wrote:
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Aug 15, 2017 at 4:11 PM, Wilco Dijkstra
>>>>>>> <Wilco.Dijkstra@arm.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Richard Biener wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> We also change the association of
>>>>>>>>>>
>>>>>>>>>>          x / (y * C) -> (x / C) / y
>>>>>>>>>>
>>>>>>>>>> If C is a constant.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Why's that profitable?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> It enables (x * C1) / (y * C2) -> (x * C1/C2) / y for example.
>>>>>>>> Also 1/y is now available to the reciprocal optimization, see
>>>>>>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 for details.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Sure, but on its own it's going to be slower.  So this isn't the
>>>>>>> correct way to enable those followup transforms.
>>>>>>
>>>>>>
>>>>>>
>>>>>> How can it be any slower? It's one division and one multiply in both
>>>>>> cases.
>>>>>
>>>>>
>>>>> x / (y * C) -> (x / C) / y
>>>>>
>>>>> Goes from one division and one multiplication to two divisions.  I'm
>>>>> guessing that's what Richi is (reasonably) concerned about.
>>>>>
>>>>> So it may be the case that we need more complex pattern matching here
>>>>> for when to perform the first transformation to ultimately enable the
>>>>> second.
>>>>>
>>>>
>>>> The only case where we don't remove the division but still execute this
>>>> pattern is when run with'-O0 -freciprocal-math'.
>>>>
>>>> As long as we have 'O1' or greater (and -freciprocal-math), that is
>>>> enough
>>>> to enable the removal of the reciprocal.
>>>
>>>
>>> I don't see this.  I presume you mean this happens in the recip pass?
>>
>>
>> It's one of the match.pd patterns that actually does the extraction since C
>> is always constant. I've copied in the pattern above in response to one of
>> your previous comments.
>>
>>> But we don't optimize this when optimizing for size (but the above pattern
>>>> still applies) or when targetm.min_divisions_for_recip_mul is too large
>>
>>
>> It was my understanding that this is used to specify the number of divisions
>> you have to be able to replace with a multiplication before it is worthwhile
>> inserting an extra multiplication.
>>
>> So for situations like this, I think that this is not quite what is being
>> measured, because there isn't an extra multiplication being inserted?
>>
>>
>>>
>>> So I still think this is the wrong place to do this and instead the recip
>>> pass should be extended.
>>>
>>>
>>>>
>>>> Jackson.
>>>>
>>>>>
>>>>> Jeff
>>>>>
>>>>
>>

Updated ChangeLog:

gcc/

2017-08-03  Jackson Woodruff  <jackson.woodruff@arm.com>

	PR 71026/tree-optimization
	* match.pd: New patterns.

gcc/testsuite

2017-08-03  Jackson Woodruff  <jackson.woodruff@arm.com>

	PR 71026/tree-optimization
	* gcc.dg/associate_comparison_1.c: New.
	* gcc.dg/extract_recip_1.c: New.
	* gcc.dg/extract_recip_2.c: New.

[-- Attachment #2: patchfile-1 --]
[-- Type: text/plain, Size: 4856 bytes --]

diff --git a/gcc/match.pd b/gcc/match.pd
index 69dd8193cd0524d99fba8be8da8183230b8d621a..d4f56f5f332f9076481bb5db26264116c7f18728 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -342,16 +342,45 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
   (negate @0)))
 
 (if (flag_reciprocal_math)
- /* Convert (A/B)/C to A/(B*C)  */
+ /* Convert (A/B)/C to A/(B*C). */
  (simplify
   (rdiv (rdiv:s @0 @1) @2)
-   (rdiv @0 (mult @1 @2)))
+  (rdiv @0 (mult @1 @2)))
+
+ /* Simplify x / (C * y) to (x * (1 / C)) / y where C is a constant.  */
+ (if (optimize)
+  (simplify
+   (rdiv @0
+    (mult @1 REAL_CST@2))
+   (if (!real_zerop (@1))
+    (with
+     { tree tem = const_binop (RDIV_EXPR, type, build_one_cst (type), @2); }
+     (if (tem)
+      (rdiv (mult @0 { tem; } ) @1))))))
 
  /* Convert A/(B/C) to (A/B)*C  */
  (simplify
   (rdiv @0 (rdiv:s @1 @2))
    (mult (rdiv @0 @1) @2)))
 
+/* Simplify x / (- y) to -x / y.  */
+(simplify
+ (rdiv @0 (negate @1))
+ (rdiv (negate @0) @1))
+
+(if (flag_unsafe_math_optimizations)
+  /* Simplify (C / x op 0.0) to x op 0.0 for C > 0.  */
+  (for op (lt le gt ge)
+       neg_op (gt ge lt le)
+    (simplify
+      (op (rdiv REAL_CST@0 @1) real_zerop@2)
+      (switch
+       (if (real_less (&dconst0, TREE_REAL_CST_PTR (@0)))
+	(op @1 @2))
+       /* For C < 0, use the inverted operator.  */
+       (if (real_less (TREE_REAL_CST_PTR (@0), &dconst0))
+	(neg_op @1 @2))))))
+
 /* Optimize (X & (-A)) / A where A is a power of 2, to X >> log2(A) */
 (for div (trunc_div ceil_div floor_div round_div exact_div)
  (simplify
@@ -610,15 +639,6 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
     (if (tem)
      (rdiv { tem; } @1)))))
 
-/* Convert C1/(X*C2) into (C1/C2)/X  */
-(simplify
- (rdiv REAL_CST@0 (mult @1 REAL_CST@2))
-  (if (flag_reciprocal_math)
-   (with
-    { tree tem = const_binop (RDIV_EXPR, type, @0, @2); }
-    (if (tem)
-     (rdiv { tem; } @1)))))
-
 /* Simplify ~X & X as zero.  */
 (simplify
  (bit_and:c (convert? @0) (convert? (bit_not @0)))
@@ -3517,6 +3537,26 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
   (if (!HONOR_SNANS (type))
    @0))
 
+ (for cmp (lt le gt ge)
+      neg_cmp (gt ge lt le)
+  /* Simplify (x / C1) cmp y -> x cmp y * C1.  */
+  (simplify
+   (cmp (rdiv @0 REAL_CST@1) @2)
+   (switch
+    (if (real_less (&dconst0, TREE_REAL_CST_PTR (@1)))
+     (cmp @0 (mult @2 @1)))
+    (if (real_less (TREE_REAL_CST_PTR (@1), &dconst0))
+     (neg_cmp @0 (mult @2 @1)))))
+
+  /* Simplify (x * C1) cmp C2 -> x cmp C2 / C1, where C1 != 0.  */
+  (simplify
+   (cmp (mult @0 REAL_CST@1) REAL_CST@2)
+   (switch
+    (if (real_less (&dconst0, TREE_REAL_CST_PTR (@1)))
+     (cmp @0 (rdiv @2 @1)))
+    (if (real_less (TREE_REAL_CST_PTR (@1), &dconst0))
+     (neg_cmp @0 (rdiv @2 @1))))))
+
  /* Simplify sqrt(x) * sqrt(y) -> sqrt(x*y).  */
  (for root (SQRT CBRT)
   (simplify
diff --git a/gcc/testsuite/gcc.dg/associate_comparison_1.c b/gcc/testsuite/gcc.dg/associate_comparison_1.c
new file mode 100644
index 0000000000000000000000000000000000000000..d0a197b48d660d3f2e9fabda855670ba477afc4d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/associate_comparison_1.c
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-options "-funsafe-math-optimizations -fdump-tree-optimized" } */
+
+int
+lt_cmp (float x, float y)
+{
+  return x / 2 <= 0;
+}
+
+int lt_cmp_2 (float x, float y)
+{
+  return x / 3 <= y;
+}
+
+int
+inv_cmp (float x, float y)
+{
+  return 5 / x >= 0;
+}
+
+
+/* { dg-final { scan-tree-dump-not " / " "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/extract_recip_1.c b/gcc/testsuite/gcc.dg/extract_recip_1.c
new file mode 100644
index 0000000000000000000000000000000000000000..54556587889848f6da090207690608e34e19d407
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/extract_recip_1.c
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -fdump-tree-optimized" } */
+
+float
+neg_recip (float x, float y, float z)
+{
+  return (x / y) + (z / -y);
+}
+
+float
+const_divisor (float x, float y, float z)
+{
+  return x / (y * 3.0f) + z / y;
+}
+
+/* 0 multiplications in 'neg_recip', 1 multiplcations in
+   'const_divisor' expected.  */
+/* { dg-final { scan-tree-dump-times " \\* " 1 "optimized" } } */
+
+/* 1 division in 'neg_recip', 1 division in 'const_divisor'.  */
+/* { dg-final { scan-tree-dump-times " / " 2 "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/extract_recip_2.c b/gcc/testsuite/gcc.dg/extract_recip_2.c
new file mode 100644
index 0000000000000000000000000000000000000000..1f9c1741ce980f1148016e37399134aa8a619b1d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/extract_recip_2.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -fdump-tree-optimized" } */
+
+float
+neg_mov (float w, float x, float *y, float *z)
+{
+  *z = x / (- w);
+  *y = 3 / w;
+
+  return 5 / w;
+}
+
+/* { dg-final { scan-tree-dump-times " / " 1 "optimized" } } */

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)
  2017-09-06  9:55                   ` Jackson Woodruff
@ 2017-09-13 18:37                     ` Jeff Law
  2017-09-13 21:20                       ` Wilco Dijkstra
  0 siblings, 1 reply; 22+ messages in thread
From: Jeff Law @ 2017-09-13 18:37 UTC (permalink / raw)
  To: Jackson Woodruff, Richard Biener
  Cc: Wilco Dijkstra, kyrylo.tkachov, Joseph S. Myers, GCC Patches, nd

On 09/06/2017 03:55 AM, Jackson Woodruff wrote:
> On 08/30/2017 01:46 PM, Richard Biener wrote:
>> On Wed, Aug 30, 2017 at 11:46 AM, Jackson Woodruff
>> <jackson.woodruff@foss.arm.com> wrote:
>>> On 08/29/2017 01:13 PM, Richard Biener wrote:
>>>>
>>>> On Tue, Aug 29, 2017 at 1:35 PM, Jackson Woodruff
>>>> <jackson.woodruff@foss.arm.com> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> Apologies again to those CC'ed, who (again) received this twice.
>>>>>
>>>>> Joseph: Yes you are correct. I misread the original thread, now fixed.
>>>>>
>>>>> Richard: I've moved the optimizations out of fold-const.c. One has
>>>>> been
>>>>> replicated in match.pd, and the other (x / C +- y / C -> (x +- y) / C)
>>>>> I've
>>>>> deleted as it only introduced a new optimization when running with the
>>>>> flags
>>>>> '-O0 -funsafe-math-optimizations'.
>>>>
>>>>
>>>> Hmm, how did you verify that, that it only adds sth with -O0
>>>> -funsafe-math-optimizations?
>>>
>>>
>>> By checking with various flags, although not exhaustively. I looked for
>>> reasons for the behavior in match.pd (explained below).
>>>
>>> I have also since discovered that the combinations of
>>> '-funsafe-math-optimizations -frounding-math' (at all levels) and
>>> '-fno-recriprocal-math -funsafe-math-optimizations' mean this pattern
>>> adds
>>> something.
>>>
>>>> Is that because in GIMPLE the reassoc pass should do this transform?
>>>
>>> It's because the pattern that changes (X / C) -> X * (1 / C) is gated
>>> with
>>> O1:
>>>
>>>    (for cst (REAL_CST COMPLEX_CST VECTOR_CST)
>>>     (simplify
>>>      (rdiv @0 cst@1)
>>> ->    (if (optimize)
>>> ->     (if (flag_reciprocal_math
>>>        && !real_zerop (@1))
>>>        (with
>>>         { tree tem = const_binop (RDIV_EXPR, type, build_one_cst
>>> (type), @1);
>>> }
>>>         (if (tem)
>>>          (mult @0 { tem; } )))
>>>        (if (cst != COMPLEX_CST)
>>>         (with { tree inverse = exact_inverse (type, @1); }
>>>          (if (inverse)
>>>           (mult @0 { inverse; } ))))))))
>>>
>>>
>>> I've flagged the two lines that are particularly relevant to this.
>>
>> So this means we go x / (C * y) -> (x / C) / y -> (x * (1/C)) / y
>> why's that in any way preferable?  I suppose this is again to enable
>> the recip pass to detect / y (as opposed to / (C * y))?  What's the
>> reason to believe that / y is more "frequent"?
>>
>>> Removing this pattern, as I would expect, means that the divisions in
>>> the
>>> above optimization (and the one further down) are not removed.
>>>
>>> So then there is the question of edge cases. This pattern is
>>> (ignoring the
>>> second case) going to fail when const_binop returns null. Looking
>>> through
>>> that function says that it will fail (for reals) when:
>>>
>>> - Either argument is null (not the case)
>>>
>>> - The operation is not in the list (PLUS_EXPR,
>>>    MINUS_EXPR, MULT_EXPR, RDIV_EXPR, MIN_EXPR, MAX_EXPR)
>>>    (again not the case)
>>>
>>> - We honor Signalling NaNs and one of the operands is a sNaN.
>>>
>>> - The operation is a division, and the second argument is zero
>>>    and dividing by 0.0 raises an exception.
>>>
>>> - The result is infinity and neither of the operands were infinity
>>>    and flag_trapping_math is set.
>>>
>>> - The result isn't exact and flag_rounding_math is set.
>>>
>>>
>>> For (x / ( y * C) -> (x / C) / y), I will add some gating for each of
>>> these
>>> so that the pattern is never executed if one of these would be the case.
>>
>> Why not transform this directly to (x * (1/C)) / y then (and only then)?
>> That makes it obvious not two divisions prevail.
> 
> Done.
> 
>>
>> That said, I'm questioning this canonicalization.  I can come up with a
>> testcase where it makes things worse:
>>
>>   tem = x / (y * C);
>>   tem2 = z / (y * C);
>>
>> should generate
>>
>>   rdivtmp = 1 / (y*C);
>>   tem = x *rdivtmp;
>>   tem2= z * rdivtmp;
>>
>> instead of
>>
>>   rdivtmp = 1/y;
>>   tem = x * 1/C * rdivtmp;
>>   tem2 = z * 1/C * rdivtmp;
> 
> Ideally we would be able to CSE that into
> 
> rdivtmp = 1/y * 1/C;
> tem = x * rdivtmp;
> tem2 = z * rdivtmp;
So why is your sequence significantly better than Richi's desired
seqeuence?  They both seem to need 3 mults and a division (which in both
cases might be a reciprocal estimation).    In Richi's sequence we have
to mult and feed the result as an operand into the reciprocal insn.  In
yours we feed the result of the reciprocal into the multiply.

ISTM in the end if  y*C or 1/(y*C) is a CSE, then Richi's sequence wins.
 Similarly if 1/y is a CSE, then yours wins.  Is there some reason to
believe that one is a more likely CSE than the other?

> Although we currently do not. An equally (perhaps more?) problematic
> case is something like:
> 
> tem = x / (y * C)
> tem2 = y * C
> 
> Which becomes:
> 
> tem = x * (1 / C) / y
> tem2 = y * C
> 
> Instead of
> 
> K = y * C
> tem = x / K
> tem2 = K
> 
> Which ultimately requires context awareness to avoid. This does seem to
> be a general problem with a large number of match.pd patterns rather
> than anything specific to this one. For example, a similar example can
> be constructed for (say) (A / B) / C -> (A / (B * C)).
I think there's a fundamental phase ordering problem here.  You want to
CSE stuff as much as possible, then expose reciprocals, then CSE again
because exposing reciprocals can expose new CSE opportunities.

And I suspect that no matter how hard we try, there's going to be cases
that get exposed by various transformations in the pipeline such that to
fully optimize the cse - reciprocal - cse sequence would need to be
repeated to fully optimize.  We may have to live with not being
particularly good at picking up the those second order effects.

I do think that the need for cse - reciprocal - cse  sequencing suggests
that match.pd may not be the right place for these transformations.  I
think Richi has pointed this out a couple times already.

I'm not going to accept or reject at this time.  I think we need to make
a higher level decision.  Are these transformations better suited for
match.pd or the reciprocal transformation pass?  I realize that some
patterns are already in match.pd, but let's try to settle the higher
level issue before we add more.

jeff


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)
  2017-09-13 18:37                     ` Jeff Law
@ 2017-09-13 21:20                       ` Wilco Dijkstra
  2017-09-15 12:07                         ` Richard Biener
  2017-09-15 16:50                         ` Jeff Law
  0 siblings, 2 replies; 22+ messages in thread
From: Wilco Dijkstra @ 2017-09-13 21:20 UTC (permalink / raw)
  To: Jeff Law, Jackson Woodruff, Richard Biener
  Cc: kyrylo.tkachov, Joseph S. Myers, GCC Patches, nd

Jeff Law wrote:
> On 09/06/2017 03:55 AM, Jackson Woodruff wrote:
> > On 08/30/2017 01:46 PM, Richard Biener wrote:

>>>   rdivtmp = 1 / (y*C);
>>>   tem = x *rdivtmp;
>>>   tem2= z * rdivtmp;
>>>
>>> instead of
>>>
>>>   rdivtmp = 1/y;
>>>   tem = x * 1/C * rdivtmp;
>>>   tem2 = z * 1/C * rdivtmp;
>> 
>> Ideally we would be able to CSE that into
>> 
>> rdivtmp = 1/y * 1/C;
>> tem = x * rdivtmp;
>> tem2 = z * rdivtmp;
> So why is your sequence significantly better than Richi's desired
> seqeuence?  They both seem to need 3 mults and a division (which in both
> cases might be a reciprocal estimation).    In Richi's sequence we have
> to mult and feed the result as an operand into the reciprocal insn.  In
> yours we feed the result of the reciprocal into the multiply.

Basically this stuff happens a lot in real code, which is exactly why I proposed it.
I even provided counts of how many divisions each transformation avoids:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026. 

Note this transformation is a canonicalization - if you can't merge a
constant somehow, moving it out to the LHS can expose more
opportunities, like in (C1 * x) / (C2 * y) -> (C1 * x * C2) / y -> (C3 * x) / y.
Same for negation as it behaves like * -1.

The key here is that it is at least an order of magnitude worse if you have
to execute an extra division than if you have an extra multiply.

> ISTM in the end if  y*C or 1/(y*C) is a CSE, then Richi's sequence wins.
> Similarly if 1/y is a CSE, then yours wins.  Is there some reason to
> believe that one is a more likely CSE than the other?

The idea is that 1/y is much more likely a CSE than 1/(y*C).

We could make the pattern only fire in single use cases and see whether
that makes a difference. It would be easy to test old vs new vs single-use
new and count how many divisions we end up with. We could add 1/ (y * C)
to the reciprocal phase if it is unacceptable as a canonicalization, but then
you won't be able to optimize (C1 * x * C2) / y.

> I think there's a fundamental phase ordering problem here.  You want to
> CSE stuff as much as possible, then expose reciprocals, then CSE again
> because exposing reciprocals can expose new CSE opportunities.

I agree there are phase ordering issues and various problems in
reassociation, CSE and division optimizations not being able to find and
optimize complex cases that are worthwhile.

However I don't agree doing CSE before reciprocals is a good idea. We
want to expose reciprocals early on, even if that means we find fewer
CSEs as a result - again because division is so much more expensive than
any other operation. CSE is generally not smart enough to CSE a * x in
a * b * x and a * c * x, something which is likely to happen quite frequently -
unlike the made up division examples here.

> And I suspect that no matter how hard we try, there's going to be cases
> that get exposed by various transformations in the pipeline such that to
> fully optimize the cse - reciprocal - cse sequence would need to be
> repeated to fully optimize.  We may have to live with not being
> particularly good at picking up the those second order effects.
> 
> I do think that the need for cse - reciprocal - cse  sequencing suggests
> that match.pd may not be the right place for these transformations.  I
> think Richi has pointed this out a couple times already.

I don't think you can ever expect to find the optimal case by repeating
optimizations. It's quite easy to construct examples where first doing CSE
makes things significantly worse. Ultimately to get something optimal you'd
need to try lots of permutations and count for each possible permutation
how many multiplies and divisons you end up after full optimization.
Quite impractical...

> I'm not going to accept or reject at this time.  I think we need to make
> a higher level decision.  Are these transformations better suited for
> match.pd or the reciprocal transformation pass?  I realize that some
> patterns are already in match.pd, but let's try to settle the higher
> level issue before we add more.

The first question is whether you see it as a canonicalization. If so, then
match.pd should be fine.

Wilco

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)
  2017-09-13 21:20                       ` Wilco Dijkstra
@ 2017-09-15 12:07                         ` Richard Biener
  2017-09-15 16:50                         ` Jeff Law
  1 sibling, 0 replies; 22+ messages in thread
From: Richard Biener @ 2017-09-15 12:07 UTC (permalink / raw)
  To: Wilco Dijkstra
  Cc: Jeff Law, Jackson Woodruff, kyrylo.tkachov, Joseph S. Myers,
	GCC Patches, nd

On Wed, Sep 13, 2017 at 11:20 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> Jeff Law wrote:
>> On 09/06/2017 03:55 AM, Jackson Woodruff wrote:
>> > On 08/30/2017 01:46 PM, Richard Biener wrote:
>
>>>>   rdivtmp = 1 / (y*C);
>>>>   tem = x *rdivtmp;
>>>>   tem2= z * rdivtmp;
>>>>
>>>> instead of
>>>>
>>>>   rdivtmp = 1/y;
>>>>   tem = x * 1/C * rdivtmp;
>>>>   tem2 = z * 1/C * rdivtmp;
>>>
>>> Ideally we would be able to CSE that into
>>>
>>> rdivtmp = 1/y * 1/C;
>>> tem = x * rdivtmp;
>>> tem2 = z * rdivtmp;
>> So why is your sequence significantly better than Richi's desired
>> seqeuence?  They both seem to need 3 mults and a division (which in both
>> cases might be a reciprocal estimation).    In Richi's sequence we have
>> to mult and feed the result as an operand into the reciprocal insn.  In
>> yours we feed the result of the reciprocal into the multiply.
>
> Basically this stuff happens a lot in real code, which is exactly why I proposed it.
> I even provided counts of how many divisions each transformation avoids:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026.
>
> Note this transformation is a canonicalization - if you can't merge a
> constant somehow, moving it out to the LHS can expose more
> opportunities, like in (C1 * x) / (C2 * y) -> (C1 * x * C2) / y -> (C3 * x) / y.
> Same for negation as it behaves like * -1.
>
> The key here is that it is at least an order of magnitude worse if you have
> to execute an extra division than if you have an extra multiply.
>
>> ISTM in the end if  y*C or 1/(y*C) is a CSE, then Richi's sequence wins.
>> Similarly if 1/y is a CSE, then yours wins.  Is there some reason to
>> believe that one is a more likely CSE than the other?
>
> The idea is that 1/y is much more likely a CSE than 1/(y*C).
>
> We could make the pattern only fire in single use cases and see whether
> that makes a difference. It would be easy to test old vs new vs single-use
> new and count how many divisions we end up with. We could add 1/ (y * C)
> to the reciprocal phase if it is unacceptable as a canonicalization, but then
> you won't be able to optimize (C1 * x * C2) / y.
>
>> I think there's a fundamental phase ordering problem here.  You want to
>> CSE stuff as much as possible, then expose reciprocals, then CSE again
>> because exposing reciprocals can expose new CSE opportunities.
>
> I agree there are phase ordering issues and various problems in
> reassociation, CSE and division optimizations not being able to find and
> optimize complex cases that are worthwhile.
>
> However I don't agree doing CSE before reciprocals is a good idea. We
> want to expose reciprocals early on, even if that means we find fewer
> CSEs as a result - again because division is so much more expensive than
> any other operation. CSE is generally not smart enough to CSE a * x in
> a * b * x and a * c * x, something which is likely to happen quite frequently -
> unlike the made up division examples here.
>
>> And I suspect that no matter how hard we try, there's going to be cases
>> that get exposed by various transformations in the pipeline such that to
>> fully optimize the cse - reciprocal - cse sequence would need to be
>> repeated to fully optimize.  We may have to live with not being
>> particularly good at picking up the those second order effects.
>>
>> I do think that the need for cse - reciprocal - cse  sequencing suggests
>> that match.pd may not be the right place for these transformations.  I
>> think Richi has pointed this out a couple times already.
>
> I don't think you can ever expect to find the optimal case by repeating
> optimizations. It's quite easy to construct examples where first doing CSE
> makes things significantly worse. Ultimately to get something optimal you'd
> need to try lots of permutations and count for each possible permutation
> how many multiplies and divisons you end up after full optimization.
> Quite impractical...
>
>> I'm not going to accept or reject at this time.  I think we need to make
>> a higher level decision.  Are these transformations better suited for
>> match.pd or the reciprocal transformation pass?  I realize that some
>> patterns are already in match.pd, but let's try to settle the higher
>> level issue before we add more.
>
> The first question is whether you see it as a canonicalization. If so, then
> match.pd should be fine.

A canonicalization to more divisions is not fine.  That is, if we think
that x / (C * y) is non-canonical because constant parts should be
in the denominator then fine, canonicalize it as (x * C') / y with
C' = 1/C.  But then implement it so, not in a pattern that suggests
you'll end up with two divisions.

Let me repeat though that this looks like a job for re-association.

From that perspective the part of the patch doing

+ /* Simplify x / (C * y) to (x * (1 / C)) / y where C is a constant.  */
+ (if (optimize)
+  (simplify
+   (rdiv @0
+    (mult @1 REAL_CST@2))
+   (if (!real_zerop (@1))
+    (with
+     { tree tem = const_binop (RDIV_EXPR, type, build_one_cst (type), @2); }
+     (if (tem)
+      (rdiv (mult @0 { tem; } ) @1))))))

looks ok apart from the if (optimize) check and the real_zerop (@1) check
which should test @2.

_please_ split the patch up.

+/* Simplify x / (- y) to -x / y.  */
+(simplify
+ (rdiv @0 (negate @1))
+ (rdiv (negate @0) @1))

I think the comments on both patterns should say 'Canonicalize' instead of
'Simplify' as those are not simplifications, they are even Canonicalizations
that change the value in one case (which usually makes canonicalizations
questionable IMHO given nothign guarantees the followup optimziation they
should enable).

Richard.

> Wilco

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)
  2017-09-13 21:20                       ` Wilco Dijkstra
  2017-09-15 12:07                         ` Richard Biener
@ 2017-09-15 16:50                         ` Jeff Law
  2017-09-16 21:39                           ` Bernhard Reutner-Fischer
  1 sibling, 1 reply; 22+ messages in thread
From: Jeff Law @ 2017-09-15 16:50 UTC (permalink / raw)
  To: Wilco Dijkstra, Jackson Woodruff, Richard Biener
  Cc: kyrylo.tkachov, Joseph S. Myers, GCC Patches, nd

On 09/13/2017 03:20 PM, Wilco Dijkstra wrote:
> Jeff Law wrote:
>> On 09/06/2017 03:55 AM, Jackson Woodruff wrote:
>>> On 08/30/2017 01:46 PM, Richard Biener wrote:
> 
>>>>    rdivtmp = 1 / (y*C);
>>>>    tem = x *rdivtmp;
>>>>    tem2= z * rdivtmp;
>>>>
>>>> instead of
>>>>
>>>>    rdivtmp = 1/y;
>>>>    tem = x * 1/C * rdivtmp;
>>>>    tem2 = z * 1/C * rdivtmp;
>>>
>>> Ideally we would be able to CSE that into
>>>
>>> rdivtmp = 1/y * 1/C;
>>> tem = x * rdivtmp;
>>> tem2 = z * rdivtmp;
>> So why is your sequence significantly better than Richi's desired
>> seqeuence?  They both seem to need 3 mults and a division (which in both
>> cases might be a reciprocal estimation).    In Richi's sequence we have
>> to mult and feed the result as an operand into the reciprocal insn.  In
>> yours we feed the result of the reciprocal into the multiply.
> 
> Basically this stuff happens a lot in real code, which is exactly why I proposed it.
> I even provided counts of how many divisions each transformation avoids:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026. 
I don't doubt that it happens in real code.  There's a reason why MIPS
IV added recip and rsqrt 20 years ago.  Our ability to exploit them has
always been fairly limited though.

What I'm trying to avoid is a transformation where the two forms are
roughly equal in terms of what they expose vs what they hide.

If pulling the 1/c out is consistently better, then that's obviously
good.  If it's essentially a toss-up because of the other interactions
with CSE, then we need to think about it more deeply.

I _suspect_ that pulling the 1/c out is generally better, but something
more concrete than my intuition would be helpful.




> 
> Note this transformation is a canonicalization - if you can't merge a
> constant somehow, moving it out to the LHS can expose more
> opportunities, like in (C1 * x) / (C2 * y) -> (C1 * x * C2) / y -> (C3 * x) / y.
> Same for negation as it behaves like * -1.
FWIW, I generally agree.

> 
> The key here is that it is at least an order of magnitude worse if you have
> to execute an extra division than if you have an extra multiply.
No doubt.  I'll trade a division for a multiply any day.  Similarly 1/C
is just a constant, so I consider it essentially free.


> 
>> ISTM in the end if  y*C or 1/(y*C) is a CSE, then Richi's sequence wins.
>>  Similarly if 1/y is a CSE, then yours wins.  Is there some reason to
>> believe that one is a more likely CSE than the other?
> 
> The idea is that 1/y is much more likely a CSE than 1/(y*C).
Do we have anything other than intuition to back that up?
> 
> We could make the pattern only fire in single use cases and see whether
> that makes a difference. It would be easy to test old vs new vs single-use
> new and count how many divisions we end up with. We could add 1/ (y * C)
> to the reciprocal phase if it is unacceptable as a canonicalization, but then
> you won't be able to optimize (C1 * x * C2) / y.
We could.  I think the question would then become does restricting to
the single-use case kill too many opportunities.

Sigh.  I think the 4 of us could go round and round on this forever in
the pursuit of the perfect code.  Though in reality I'm happy with a
clear improvement.


> 
>> I think there's a fundamental phase ordering problem here.  You want to
>> CSE stuff as much as possible, then expose reciprocals, then CSE again
>> because exposing reciprocals can expose new CSE opportunities.
> 
> I agree there are phase ordering issues and various problems in
> reassociation, CSE and division optimizations not being able to find and
> optimize complex cases that are worthwhile.
> 
> However I don't agree doing CSE before reciprocals is a good idea. We
> want to expose reciprocals early on, even if that means we find fewer
> CSEs as a result - again because division is so much more expensive than
> any other operation. CSE is generally not smart enough to CSE a * x in
> a * b * x and a * c * x, something which is likely to happen quite frequently -
> unlike the made up division examples here.
We have much stronger reassociation capabilities for multiplication --
it's a well understood problem and if we fail to pick up a * x across
those two kind of expressions, I'd consider our reassociation code as
failing pretty badly, particularly for integer types.

BUt yes, division is expensive.  And I'm all for a tranformation that
turns a division into a multiplication.  That's almost always a win.

> 
> The first question is whether you see it as a canonicalization. If so, then
> match.pd should be fine.
Pulling the constant part out of the denominator and turning it into a
multiplication by the recip constant should likely be considered
canonicalization.  I think Richi largely agreed to this in the thread as
well and asked for that hunk of the patch to be extracted and submitted
independent of the other changes so that it could go ahead and move
forward while we figure out the rest.

Note that tree-ssa-reassoc.c has a fair amount of this kind of
canonicalization.  In fact, I think that's where we handle things like
pulling out negations.  You may actually do better handling it there.



Jeff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)
  2017-09-15 16:50                         ` Jeff Law
@ 2017-09-16 21:39                           ` Bernhard Reutner-Fischer
  2017-10-13 18:18                             ` Jeff Law
  0 siblings, 1 reply; 22+ messages in thread
From: Bernhard Reutner-Fischer @ 2017-09-16 21:39 UTC (permalink / raw)
  To: gcc-patches, Jeff Law, Wilco Dijkstra, Jackson Woodruff, Richard Biener
  Cc: kyrylo.tkachov, Joseph S. Myers, GCC Patches, nd

On 15 September 2017 18:50:26 CEST, Jeff Law <law@redhat.com> wrote:
>On 09/13/2017 03:20 PM, Wilco Dijkstra wrote:
>> Jeff Law wrote:
>>> On 09/06/2017 03:55 AM, Jackson Woodruff wrote:
>>>> On 08/30/2017 01:46 PM, Richard Biener wrote:
>> 
>>>>>    rdivtmp = 1 / (y*C);
>>>>>    tem = x *rdivtmp;
>>>>>    tem2= z * rdivtmp;
>>>>>
>>>>> instead of
>>>>>
>>>>>    rdivtmp = 1/y;
>>>>>    tem = x * 1/C * rdivtmp;
>>>>>    tem2 = z * 1/C * rdivtmp;
>>>>
>>>> Ideally we would be able to CSE that into
>>>>
>>>> rdivtmp = 1/y * 1/C;
>>>> tem = x * rdivtmp;
>>>> tem2 = z * rdivtmp;
>>> So why is your sequence significantly better than Richi's desired
>>> seqeuence?  They both seem to need 3 mults and a division (which in
>both
>>> cases might be a reciprocal estimation).    In Richi's sequence we
>have
>>> to mult and feed the result as an operand into the reciprocal insn. 
>In
>>> yours we feed the result of the reciprocal into the multiply.
>> 
>> Basically this stuff happens a lot in real code, which is exactly why
>I proposed it.
>> I even provided counts of how many divisions each transformation
>avoids:
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026. 
>I don't doubt that it happens in real code.  There's a reason why MIPS
>IV added recip and rsqrt 20 years ago.  Our ability to exploit them has
>always been fairly limited though.
>
>What I'm trying to avoid is a transformation where the two forms are
>roughly equal in terms of what they expose vs what they hide.
>
>If pulling the 1/c out is consistently better, then that's obviously
>good.  If it's essentially a toss-up because of the other interactions
>with CSE, then we need to think about it more deeply.
>
>I _suspect_ that pulling the 1/c out is generally better, but something
>more concrete than my intuition would be helpful.
>
>
>
>
>> 
>> Note this transformation is a canonicalization - if you can't merge a
>> constant somehow, moving it out to the LHS can expose more
>> opportunities, like in (C1 * x) / (C2 * y) -> (C1 * x * C2) / y ->
>(C3 * x) / y.
>> Same for negation as it behaves like * -1.
>FWIW, I generally agree.
>
>> 
>> The key here is that it is at least an order of magnitude worse if
>you have
>> to execute an extra division than if you have an extra multiply.
>No doubt.  I'll trade a division for a multiply any day.  Similarly 1/C
>is just a constant, so I consider it essentially free.
>
>
>> 
>>> ISTM in the end if  y*C or 1/(y*C) is a CSE, then Richi's sequence
>wins.
>>>  Similarly if 1/y is a CSE, then yours wins.  Is there some reason
>to
>>> believe that one is a more likely CSE than the other?
>> 
>> The idea is that 1/y is much more likely a CSE than 1/(y*C).
>Do we have anything other than intuition to back that up?
>> 
>> We could make the pattern only fire in single use cases and see
>whether
>> that makes a difference. It would be easy to test old vs new vs
>single-use
>> new and count how many divisions we end up with. We could add 1/ (y *
>C)
>> to the reciprocal phase if it is unacceptable as a canonicalization,
>but then
>> you won't be able to optimize (C1 * x * C2) / y.
>We could.  I think the question would then become does restricting to
>the single-use case kill too many opportunities.
>
>Sigh.  I think the 4 of us could go round and round on this forever in
>the pursuit of the perfect code.  Though in reality I'm happy with a
>clear improvement.


Btw reminds me a little bit of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=28417
which IIRC never was applied. Maybe someone could talk Denys (a colleague of yours nowadays, Jeff) into submitting a real patch? ;)

Cheers,
>
>> 
>>> I think there's a fundamental phase ordering problem here.  You want
>to
>>> CSE stuff as much as possible, then expose reciprocals, then CSE
>again
>>> because exposing reciprocals can expose new CSE opportunities.
>> 
>> I agree there are phase ordering issues and various problems in
>> reassociation, CSE and division optimizations not being able to find
>and
>> optimize complex cases that are worthwhile.
>> 
>> However I don't agree doing CSE before reciprocals is a good idea. We
>> want to expose reciprocals early on, even if that means we find fewer
>> CSEs as a result - again because division is so much more expensive
>than
>> any other operation. CSE is generally not smart enough to CSE a * x
>in
>> a * b * x and a * c * x, something which is likely to happen quite
>frequently -
>> unlike the made up division examples here.
>We have much stronger reassociation capabilities for multiplication --
>it's a well understood problem and if we fail to pick up a * x across
>those two kind of expressions, I'd consider our reassociation code as
>failing pretty badly, particularly for integer types.
>
>BUt yes, division is expensive.  And I'm all for a tranformation that
>turns a division into a multiplication.  That's almost always a win.
>
>> 
>> The first question is whether you see it as a canonicalization. If
>so, then
>> match.pd should be fine.
>Pulling the constant part out of the denominator and turning it into a
>multiplication by the recip constant should likely be considered
>canonicalization.  I think Richi largely agreed to this in the thread
>as
>well and asked for that hunk of the patch to be extracted and submitted
>independent of the other changes so that it could go ahead and move
>forward while we figure out the rest.
>
>Note that tree-ssa-reassoc.c has a fair amount of this kind of
>canonicalization.  In fact, I think that's where we handle things like
>pulling out negations.  You may actually do better handling it there.
>
>
>
>Jeff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)
  2017-09-16 21:39                           ` Bernhard Reutner-Fischer
@ 2017-10-13 18:18                             ` Jeff Law
  2017-10-13 18:53                               ` Wilco Dijkstra
  0 siblings, 1 reply; 22+ messages in thread
From: Jeff Law @ 2017-10-13 18:18 UTC (permalink / raw)
  To: Bernhard Reutner-Fischer, gcc-patches, Wilco Dijkstra,
	Jackson Woodruff, Richard Biener
  Cc: kyrylo.tkachov, Joseph S. Myers, nd

On 09/16/2017 03:39 PM, Bernhard Reutner-Fischer wrote:
> On 15 September 2017 18:50:26 CEST, Jeff Law <law@redhat.com> wrote:
>> On 09/13/2017 03:20 PM, Wilco Dijkstra wrote:
>>> Jeff Law wrote:
>>>> On 09/06/2017 03:55 AM, Jackson Woodruff wrote:
>>>>> On 08/30/2017 01:46 PM, Richard Biener wrote:
>>>
>>>>>>    rdivtmp = 1 / (y*C);
>>>>>>    tem = x *rdivtmp;
>>>>>>    tem2= z * rdivtmp;
>>>>>>
>>>>>> instead of
>>>>>>
>>>>>>    rdivtmp = 1/y;
>>>>>>    tem = x * 1/C * rdivtmp;
>>>>>>    tem2 = z * 1/C * rdivtmp;
>>>>>
>>>>> Ideally we would be able to CSE that into
>>>>>
>>>>> rdivtmp = 1/y * 1/C;
>>>>> tem = x * rdivtmp;
>>>>> tem2 = z * rdivtmp;
>>>> So why is your sequence significantly better than Richi's desired
>>>> seqeuence?  They both seem to need 3 mults and a division (which in
>> both
>>>> cases might be a reciprocal estimation).    In Richi's sequence we
>> have
>>>> to mult and feed the result as an operand into the reciprocal insn. 
>> In
>>>> yours we feed the result of the reciprocal into the multiply.
>>>
>>> Basically this stuff happens a lot in real code, which is exactly why
>> I proposed it.
>>> I even provided counts of how many divisions each transformation
>> avoids:
>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026. 
>> I don't doubt that it happens in real code.  There's a reason why MIPS
>> IV added recip and rsqrt 20 years ago.  Our ability to exploit them has
>> always been fairly limited though.
>>
>> What I'm trying to avoid is a transformation where the two forms are
>> roughly equal in terms of what they expose vs what they hide.
>>
>> If pulling the 1/c out is consistently better, then that's obviously
>> good.  If it's essentially a toss-up because of the other interactions
>> with CSE, then we need to think about it more deeply.
>>
>> I _suspect_ that pulling the 1/c out is generally better, but something
>> more concrete than my intuition would be helpful.
>>
>>
>>
>>
>>>
>>> Note this transformation is a canonicalization - if you can't merge a
>>> constant somehow, moving it out to the LHS can expose more
>>> opportunities, like in (C1 * x) / (C2 * y) -> (C1 * x * C2) / y ->
>> (C3 * x) / y.
>>> Same for negation as it behaves like * -1.
>> FWIW, I generally agree.
>>
>>>
>>> The key here is that it is at least an order of magnitude worse if
>> you have
>>> to execute an extra division than if you have an extra multiply.
>> No doubt.  I'll trade a division for a multiply any day.  Similarly 1/C
>> is just a constant, so I consider it essentially free.
>>
>>
>>>
>>>> ISTM in the end if  y*C or 1/(y*C) is a CSE, then Richi's sequence
>> wins.
>>>>  Similarly if 1/y is a CSE, then yours wins.  Is there some reason
>> to
>>>> believe that one is a more likely CSE than the other?
>>>
>>> The idea is that 1/y is much more likely a CSE than 1/(y*C).
>> Do we have anything other than intuition to back that up?
>>>
>>> We could make the pattern only fire in single use cases and see
>> whether
>>> that makes a difference. It would be easy to test old vs new vs
>> single-use
>>> new and count how many divisions we end up with. We could add 1/ (y *
>> C)
>>> to the reciprocal phase if it is unacceptable as a canonicalization,
>> but then
>>> you won't be able to optimize (C1 * x * C2) / y.
>> We could.  I think the question would then become does restricting to
>> the single-use case kill too many opportunities.
>>
>> Sigh.  I think the 4 of us could go round and round on this forever in
>> the pursuit of the perfect code.  Though in reality I'm happy with a
>> clear improvement.
> 
> 
> Btw reminds me a little bit of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=28417
> which IIRC never was applied. Maybe someone could talk Denys (a colleague of yours nowadays, Jeff) into submitting a real patch? ;)
Denys has been a Red Hatter for a long time (approaching 10 years now I
think).  He's not in the compiler group, but is in the same organization
as the compiler group.

Tege hasn't worked on GCC regularly in years and Denys hasn't ever
really engaged the GCC community all that well.  I wouldn't expect 28417
to move forward without something other than Tege and Denys pushing on it.


jeff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)
  2017-10-13 18:18                             ` Jeff Law
@ 2017-10-13 18:53                               ` Wilco Dijkstra
  0 siblings, 0 replies; 22+ messages in thread
From: Wilco Dijkstra @ 2017-10-13 18:53 UTC (permalink / raw)
  To: Jeff Law, Bernhard Reutner-Fischer, gcc-patches,
	Jackson Woodruff, Richard Biener
  Cc: kyrylo.tkachov, Joseph S. Myers, nd

Jeff Law wrote:

>> Btw reminds me a little bit of  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=28417

> I wouldn't expect 28417
> to move forward without something other than Tege and Denys pushing on it.

Hmm that doesn't look optimal. You can typically do a multiply with the magic
constant arranged in such a way that on 32-bit targets you don't need a shift at all.

There isn't much to the proof either, code like this is always self-proving: it tries
several possible magic constants until a solution is found that has no error for the
max input. Given there are usually several valid solutions, you can then select the
best shift/magic value combination.

Wilco

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2017-10-13 18:52 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-10 14:11 [PATCH] Factor out division by squares and remove division around comparisons (1/2) Jackson Woodruff
2017-08-10 15:26 ` Jackson Woodruff
2017-08-11  0:15   ` Joseph Myers
2017-08-15 14:22 ` Richard Biener
2017-08-15 14:43   ` Wilco Dijkstra
2017-08-17 10:20     ` Richard Biener
2017-08-17 10:29       ` Wilco Dijkstra
2017-08-17 10:39         ` Richard Biener
2017-08-17 12:46           ` Joseph Myers
2017-08-24 21:26         ` Jeff Law
2017-08-29 12:13           ` Jackson Woodruff
2017-08-29 13:31             ` Richard Biener
2017-08-30 10:45               ` Jackson Woodruff
2017-08-30 13:26                 ` Richard Biener
2017-09-06  9:55                   ` Jackson Woodruff
2017-09-13 18:37                     ` Jeff Law
2017-09-13 21:20                       ` Wilco Dijkstra
2017-09-15 12:07                         ` Richard Biener
2017-09-15 16:50                         ` Jeff Law
2017-09-16 21:39                           ` Bernhard Reutner-Fischer
2017-10-13 18:18                             ` Jeff Law
2017-10-13 18:53                               ` Wilco Dijkstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).