*[PATCH] Factor out division by squares and remove division around comparisons (1/2)@ 2017-08-10 14:11 Jackson Woodruff2017-08-10 15:26 ` Jackson Woodruff 2017-08-15 14:22 ` Richard Biener 0 siblings, 2 replies; 22+ messages in thread From: Jackson Woodruff @ 2017-08-10 14:11 UTC (permalink / raw) To: Wilco.dijkstra, richard.guenther, kyrylo.tkachov, joseph, gcc-patches Hi all, The patch implements the division opitmizations discussed in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 . The implemented change differs slightly from the proposed one in that we re-associate: C / x comparison 0.0 -> x comparison' 0.0 Where C is any constant and comparison' is changed as appropriate if C is negative. The implementations also removes the division from: x / C comparison 0.0 -> x comparison' 0.0 Where again, comparison' is changed as appropriate if C is negative. We also change the association of x / (y * C) -> (x / C) / y If C is a constant. All of the above require -funsafe-math-optimizations. We also change: x / (- y) -> (-x) / y Which requires -fno-trapping-math. Bootstrapped and regtested (with part 2 of this patch) on aarch64. OK for trunk? (Apologies if the recipients in the 'to' field received this twice, I accidentally sent this from an email gcc-patches doesn't accept) Jackson gcc/ 2017-08-03 Jackson Woodruff <jackson.woodruff@arm.com> PR 71026/tree-optimization * match.pd: New patterns. gcc/testsuite 2017-08-03 Jackson Woodruff <jackson.woodruff@arm.com> PR 71026/tree-optimization * gcc.dg/associate_comparison_1.c: New. * gcc.dg/associate_division_2.c: New. ^ permalink raw reply [flat|nested] 22+ messages in thread

*2017-08-10 14:11 [PATCH] Factor out division by squares and remove division around comparisons (1/2) Jackson WoodruffRe: [PATCH] Factor out division by squares and remove division around comparisons (1/2)@ 2017-08-10 15:26 ` Jackson Woodruff2017-08-11 0:15 ` Joseph Myers 2017-08-15 14:22 ` Richard Biener 1 sibling, 1 reply; 22+ messages in thread From: Jackson Woodruff @ 2017-08-10 15:26 UTC (permalink / raw) To: Wilco.dijkstra, richard.guenther, kyrylo.tkachov, joseph, gcc-patches [-- Attachment #1: Type: text/plain, Size: 1466 bytes --] And have now attached that patch. Jackson On 08/10/2017 03:09 PM, Jackson Woodruff wrote: > Hi all, > > The patch implements the division opitmizations discussed in > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 . > > The implemented change differs slightly from the > proposed one in that we re-associate: > > C / x comparison 0.0 -> x comparison' 0.0 > > Where C is any constant and comparison' is changed as > appropriate if C is negative. > > The implementations also removes the division from: > > x / C comparison 0.0 -> x comparison' 0.0 > > Where again, comparison' is changed as appropriate if C is > negative. > > We also change the association of > > x / (y * C) -> (x / C) / y > > If C is a constant. All of the above require > -funsafe-math-optimizations. > > We also change: > > x / (- y) -> (-x) / y > > Which requires -fno-trapping-math. > > Bootstrapped and regtested (with part 2 of this patch) on aarch64. > > OK for trunk? > > (Apologies if the recipients in the 'to' field received this twice, > I accidentally sent this from an email gcc-patches doesn't accept) > > Jackson > > gcc/ > > 2017-08-03 Jackson Woodruff <jackson.woodruff@arm.com> > > PR 71026/tree-optimization > * match.pd: New patterns. > > > gcc/testsuite > > 2017-08-03 Jackson Woodruff <jackson.woodruff@arm.com> > > PR 71026/tree-optimization > * gcc.dg/associate_comparison_1.c: New. > * gcc.dg/associate_division_2.c: New. > [-- Attachment #2: patchfile-1 --] [-- Type: text/plain, Size: 3318 bytes --] diff --git a/gcc/match.pd b/gcc/match.pd index e98db52af84946cf579c6434e06d450713a47162..7abfd11601a956ecb59c945256c9f30dc0b6f6c9 100644 --- a/gcc/match.pd +++ b/gcc/match.pd @@ -342,16 +342,42 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) (negate @0))) (if (flag_reciprocal_math) - /* Convert (A/B)/C to A/(B*C) */ + /* Convert (A/B)/C to A/(B*C) where B is not a constant. */ (simplify (rdiv (rdiv:s @0 @1) @2) - (rdiv @0 (mult @1 @2))) + (if (TREE_CODE (@1) != REAL_CST) + (rdiv @0 (mult @1 @2)))) + + /* Simplify x / (C * y) to (x / C) / y where C is a constant. */ + (simplify + (rdiv @0 + (mult @1 REAL_CST@2)) + (rdiv (rdiv @0 @2) @1)) /* Convert A/(B/C) to (A/B)*C */ (simplify (rdiv @0 (rdiv:s @1 @2)) (mult (rdiv @0 @1) @2))) +(if (!flag_trapping_math) + /* Simplify x / (- y) to -x / y. */ + (simplify + (rdiv @0 (negate @1)) + (rdiv (negate @0) @1))) + +(if (flag_unsafe_math_optimizations) + /* Simplify (C / x op 0.0) to x op 0.0 for C > 0. */ + (for op (lt le gt ge) + neg_op (gt ge lt le) + (simplify + (op (rdiv REAL_CST@0 @1) real_zerop@2) + (switch + (if (!real_equal (TREE_REAL_CST_PTR (@0), &dconst0) + && !REAL_VALUE_NEGATIVE (TREE_REAL_CST (@0))) + (op @1 @2)) + (if (REAL_VALUE_NEGATIVE (TREE_REAL_CST (@0))) + (neg_op @1 @2)))))) + /* Optimize (X & (-A)) / A where A is a power of 2, to X >> log2(A) */ (for div (trunc_div ceil_div floor_div round_div exact_div) (simplify @@ -3446,6 +3472,21 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) (if (!HONOR_SNANS (type)) @0)) + /* Simplify (x * C1) cmp C2 -> x cmp C2 / C1 + and simplify (x / C1) cmp C2 -> x cmp C2 * C1. */ + (for cmp (lt le gt ge) + neg_cmp (gt ge lt le) + (for op (mult rdiv) + inv_op (rdiv mult) + (simplify + (cmp (op @0 REAL_CST@1) REAL_CST@2) + (switch + (if (!real_equal (TREE_REAL_CST_PTR (@1), &dconst0) + && !REAL_VALUE_NEGATIVE (TREE_REAL_CST (@1))) + (cmp @0 (inv_op @2 @1))) + (if (REAL_VALUE_NEGATIVE (TREE_REAL_CST (@1))) + (neg_cmp @0 (inv_op @2 @1))))))) + /* Simplify sqrt(x) * sqrt(y) -> sqrt(x*y). */ (for root (SQRT CBRT) (simplify diff --git a/gcc/testsuite/gcc.dg/associate_comparison_1.c b/gcc/testsuite/gcc.dg/associate_comparison_1.c new file mode 100644 index 0000000000000000000000000000000000000000..c12e5cfb2a8388068fa1731cc2305f9fe5139fd6 --- /dev/null +++ b/gcc/testsuite/gcc.dg/associate_comparison_1.c @@ -0,0 +1,17 @@ +/* { dg-do compile } */ +/* { dg-options "-funsafe-math-optimizations -fdump-tree-optimized" } */ + +int +lt_cmp (float x, float y) +{ + return x / 2 <= 0; +} + +int +inv_cmp (float x, float y) +{ + return 5 / x >= 0; +} + + +/* { dg-final { scan-tree-dump-not " / " "optimized" } } */ diff --git a/gcc/testsuite/gcc.dg/associate_division_2.c b/gcc/testsuite/gcc.dg/associate_division_2.c new file mode 100644 index 0000000000000000000000000000000000000000..1f9c1741ce980f1148016e37399134aa8a619b1d --- /dev/null +++ b/gcc/testsuite/gcc.dg/associate_division_2.c @@ -0,0 +1,13 @@ +/* { dg-do compile } */ +/* { dg-options "-Ofast -fdump-tree-optimized" } */ + +float +neg_mov (float w, float x, float *y, float *z) +{ + *z = x / (- w); + *y = 3 / w; + + return 5 / w; +} + +/* { dg-final { scan-tree-dump-times " / " 1 "optimized" } } */ ^ permalink raw reply [flat|nested] 22+ messages in thread

*Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)2017-08-10 15:26 ` Jackson Woodruff@ 2017-08-11 0:15 ` Joseph Myers0 siblings, 0 replies; 22+ messages in thread From: Joseph Myers @ 2017-08-11 0:15 UTC (permalink / raw) To: Jackson Woodruff Cc: Wilco.dijkstra, richard.guenther, kyrylo.tkachov, gcc-patches On Thu, 10 Aug 2017, Jackson Woodruff wrote: > > We also change: > > > > x / (- y) -> (-x) / y > > > > Which requires -fno-trapping-math. I don't see why that requires -fno-trapping-math. The exceptions should be identical from both variants, as should the result, as far as defined by C or IEEE 754 (it's possible it might affect the sign of a NaN result on some architectures, but only where that sign is unspecified in C and IEEE 754 terms, so such a change is OK). -- Joseph S. Myers joseph@codesourcery.com ^ permalink raw reply [flat|nested] 22+ messages in thread

*Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)2017-08-10 14:11 [PATCH] Factor out division by squares and remove division around comparisons (1/2) Jackson Woodruff 2017-08-10 15:26 ` Jackson Woodruff@ 2017-08-15 14:22 ` Richard Biener2017-08-15 14:43 ` Wilco Dijkstra 1 sibling, 1 reply; 22+ messages in thread From: Richard Biener @ 2017-08-15 14:22 UTC (permalink / raw) To: Jackson Woodruff Cc: Wilco.dijkstra, kyrylo.tkachov, Joseph S. Myers, GCC Patches On Thu, Aug 10, 2017 at 4:09 PM, Jackson Woodruff <jackson.woodruff@foss.arm.com> wrote: > Hi all, > > The patch implements the division opitmizations discussed in > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 . > > The implemented change differs slightly from the > proposed one in that we re-associate: > > C / x comparison 0.0 -> x comparison' 0.0 > > Where C is any constant and comparison' is changed as > appropriate if C is negative. > > The implementations also removes the division from: > > x / C comparison 0.0 -> x comparison' 0.0 > > Where again, comparison' is changed as appropriate if C is > negative. > > We also change the association of > > x / (y * C) -> (x / C) / y > > If C is a constant. Why's that profitable? > All of the above require > -funsafe-math-optimizations. > > We also change: > > x / (- y) -> (-x) / y Why? (it's only one of the possible canonicalizations) Note there are exactly two RDIV_EXPR simplifications remaining in fold_binary_loc, maybe you can move them to match.pd (beware of negate_expr_p ... this negate optimization should eventually move to backprop). > Which requires -fno-trapping-math. > > Bootstrapped and regtested (with part 2 of this patch) on aarch64. > > OK for trunk? > > (Apologies if the recipients in the 'to' field received this twice, > I accidentally sent this from an email gcc-patches doesn't accept) > > Jackson > > gcc/ > > 2017-08-03 Jackson Woodruff <jackson.woodruff@arm.com> > > PR 71026/tree-optimization > * match.pd: New patterns. > > > gcc/testsuite > > 2017-08-03 Jackson Woodruff <jackson.woodruff@arm.com> > > PR 71026/tree-optimization > * gcc.dg/associate_comparison_1.c: New. > * gcc.dg/associate_division_2.c: New. > ^ permalink raw reply [flat|nested] 22+ messages in thread

*Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)2017-08-15 14:22 ` Richard Biener@ 2017-08-15 14:43 ` Wilco Dijkstra2017-08-17 10:20 ` Richard Biener 0 siblings, 1 reply; 22+ messages in thread From: Wilco Dijkstra @ 2017-08-15 14:43 UTC (permalink / raw) To: Richard Biener, Jackson Woodruff Cc: kyrylo.tkachov, Joseph S. Myers, GCC Patches, nd Richard Biener wrote: > > We also change the association of > > > > x / (y * C) -> (x / C) / y > > > > If C is a constant. > > Why's that profitable? It enables (x * C1) / (y * C2) -> (x * C1/C2) / y for example. Also 1/y is now available to the reciprocal optimization, see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 for details. > > x / (- y) -> (-x) / y > > Why? (it's only one of the possible canonicalizations) Same here, y is now available for reciprocal optimization. The negate may now be optimized, for example (a * b) / -y -> (-a*b) / y will use a negated multiple on various targets. Wilco ^ permalink raw reply [flat|nested] 22+ messages in thread

*Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)2017-08-15 14:43 ` Wilco Dijkstra@ 2017-08-17 10:20 ` Richard Biener2017-08-17 10:29 ` Wilco Dijkstra 0 siblings, 1 reply; 22+ messages in thread From: Richard Biener @ 2017-08-17 10:20 UTC (permalink / raw) To: Wilco Dijkstra Cc: Jackson Woodruff, kyrylo.tkachov, Joseph S. Myers, GCC Patches, nd On Tue, Aug 15, 2017 at 4:11 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote: > Richard Biener wrote: >> > We also change the association of >> > >> > x / (y * C) -> (x / C) / y >> > >> > If C is a constant. >> >> Why's that profitable? > > It enables (x * C1) / (y * C2) -> (x * C1/C2) / y for example. > Also 1/y is now available to the reciprocal optimization, see > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 for details. Sure, but on its own it's going to be slower. So this isn't the correct way to enable those followup transforms. >> > x / (- y) -> (-x) / y >> >> Why? (it's only one of the possible canonicalizations) > > Same here, y is now available for reciprocal optimization. The > negate may now be optimized, for example (a * b) / -y -> (-a*b) / y > will use a negated multiple on various targets. Fair enough. Though if it were x / -(a*b) you'd regress that case. Richard. > > Wilco ^ permalink raw reply [flat|nested] 22+ messages in thread

*Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)2017-08-17 10:20 ` Richard Biener@ 2017-08-17 10:29 ` Wilco Dijkstra2017-08-17 10:39 ` Richard Biener 2017-08-24 21:26 ` Jeff Law 0 siblings, 2 replies; 22+ messages in thread From: Wilco Dijkstra @ 2017-08-17 10:29 UTC (permalink / raw) To: Richard Biener Cc: Jackson Woodruff, kyrylo.tkachov, Joseph S. Myers, GCC Patches, nd Richard Biener wrote: > On Tue, Aug 15, 2017 at 4:11 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote: > > Richard Biener wrote: >>> > We also change the association of >>> > >>> > x / (y * C) -> (x / C) / y >>> > >>> > If C is a constant. >>> >>> Why's that profitable? >> >> It enables (x * C1) / (y * C2) -> (x * C1/C2) / y for example. >> Also 1/y is now available to the reciprocal optimization, see >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 for details. > > Sure, but on its own it's going to be slower. So this isn't the > correct way to enable those followup transforms. How can it be any slower? It's one division and one multiply in both cases. >>> > x / (- y) -> (-x) / y >>> >>> Why? (it's only one of the possible canonicalizations) >> >> Same here, y is now available for reciprocal optimization. The >> negate may now be optimized, for example (a * b) / -y -> (-a*b) / y >> will use a negated multiple on various targets. > > Fair enough. Though if it were x / -(a*b) you'd regress that case. Possibly. You might still be able to merge the negate if the result is used in an addition or multiply, which wouldn't be possible if it were hiding in a subexpression. Without global analysis it seems best to move constants/negates to the toplevel if they can't be trivially removed in a subexpression. Eg. -x / (a * b * -c). Wilco ^ permalink raw reply [flat|nested] 22+ messages in thread

*Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)2017-08-17 10:29 ` Wilco Dijkstra@ 2017-08-17 10:39 ` Richard Biener2017-08-17 12:46 ` Joseph Myers 2017-08-24 21:26 ` Jeff Law 1 sibling, 1 reply; 22+ messages in thread From: Richard Biener @ 2017-08-17 10:39 UTC (permalink / raw) To: Wilco Dijkstra Cc: Jackson Woodruff, kyrylo.tkachov, Joseph S. Myers, GCC Patches, nd On Thu, Aug 17, 2017 at 11:55 AM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote: > Richard Biener wrote: >> On Tue, Aug 15, 2017 at 4:11 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote: >> > Richard Biener wrote: >>>> > We also change the association of >>>> > >>>> > x / (y * C) -> (x / C) / y >>>> > >>>> > If C is a constant. >>>> >>>> Why's that profitable? >>> >>> It enables (x * C1) / (y * C2) -> (x * C1/C2) / y for example. >>> Also 1/y is now available to the reciprocal optimization, see >>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 for details. >> >> Sure, but on its own it's going to be slower. So this isn't the >> correct way to enable those followup transforms. > > How can it be any slower? It's one division and one multiply in both cases. (x / C) / y is two divisions. If you restrict it to the case where we can transform this to (x * C') / y then it's indeed one. >>>> > x / (- y) -> (-x) / y >>>> >>>> Why? (it's only one of the possible canonicalizations) >>> >>> Same here, y is now available for reciprocal optimization. The >>> negate may now be optimized, for example (a * b) / -y -> (-a*b) / y >>> will use a negated multiple on various targets. >> >> Fair enough. Though if it were x / -(a*b) you'd regress that case. > > Possibly. You might still be able to merge the negate if the result is used in an > addition or multiply, which wouldn't be possible if it were hiding in a subexpression. > Without global analysis it seems best to move constants/negates to the toplevel > if they can't be trivially removed in a subexpression. Eg. -x / (a * b * -c). Sure. So both patterns are canonicalization which is fine for match.pd. Those followup transforms should be done at a place that can look at more complicated patterns. We have the reassoc pass, then backprop (not exactly matching), and the recip pattern matching / cse pass. Richard. > Wilco ^ permalink raw reply [flat|nested] 22+ messages in thread

*Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)2017-08-17 10:39 ` Richard Biener@ 2017-08-17 12:46 ` Joseph Myers0 siblings, 0 replies; 22+ messages in thread From: Joseph Myers @ 2017-08-17 12:46 UTC (permalink / raw) To: Richard Biener Cc: Wilco Dijkstra, Jackson Woodruff, kyrylo.tkachov, GCC Patches, nd On Thu, 17 Aug 2017, Richard Biener wrote: > > Without global analysis it seems best to move constants/negates to the > > toplevel if they can't be trivially removed in a subexpression. Eg. -x > > / (a * b * -c). > > Sure. So both patterns are canonicalization which is fine for match.pd. > Those followup transforms should be done at a place that can look at > more complicated patterns. We have the reassoc pass, then backprop (not > exactly matching), and the recip pattern matching / cse pass. Watch out for -frounding-math issues, though. -a * b is always the same as a * -b, but not the same as -(a * b) with -frounding-math, and similarly, -frounding-math should prevent eliminating the negations from the -x / (a * b * -c) expression (whereas elimination from -x / -(a * b * c) would be fine). -- Joseph S. Myers joseph@codesourcery.com ^ permalink raw reply [flat|nested] 22+ messages in thread

*Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)2017-08-17 10:29 ` Wilco Dijkstra 2017-08-17 10:39 ` Richard Biener@ 2017-08-24 21:26 ` Jeff Law2017-08-29 12:13 ` Jackson Woodruff 1 sibling, 1 reply; 22+ messages in thread From: Jeff Law @ 2017-08-24 21:26 UTC (permalink / raw) To: Wilco Dijkstra, Richard Biener Cc: Jackson Woodruff, kyrylo.tkachov, Joseph S. Myers, GCC Patches, nd On 08/17/2017 03:55 AM, Wilco Dijkstra wrote: > Richard Biener wrote: >> On Tue, Aug 15, 2017 at 4:11 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote: >>> Richard Biener wrote: >>>>> We also change the association of >>>>> >>>>> x / (y * C) -> (x / C) / y >>>>> >>>>> If C is a constant. >>>> >>>> Why's that profitable? >>> >>> It enables (x * C1) / (y * C2) -> (x * C1/C2) / y for example. >>> Also 1/y is now available to the reciprocal optimization, see >>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 for details. >> >> Sure, but on its own it's going to be slower. So this isn't the >> correct way to enable those followup transforms. > > How can it be any slower? It's one division and one multiply in both cases. x / (y * C) -> (x / C) / y Goes from one division and one multiplication to two divisions. I'm guessing that's what Richi is (reasonably) concerned about. So it may be the case that we need more complex pattern matching here for when to perform the first transformation to ultimately enable the second. Jeff ^ permalink raw reply [flat|nested] 22+ messages in thread

*Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)2017-08-24 21:26 ` Jeff Law@ 2017-08-29 12:13 ` Jackson Woodruff2017-08-29 13:31 ` Richard Biener 0 siblings, 1 reply; 22+ messages in thread From: Jackson Woodruff @ 2017-08-29 12:13 UTC (permalink / raw) To: Jeff Law, Wilco Dijkstra, Richard Biener Cc: kyrylo.tkachov, Joseph S. Myers, GCC Patches, nd [-- Attachment #1: Type: text/plain, Size: 2088 bytes --] Hi all, Apologies again to those CC'ed, who (again) received this twice. Joseph: Yes you are correct. I misread the original thread, now fixed. Richard: I've moved the optimizations out of fold-const.c. One has been replicated in match.pd, and the other (x / C +- y / C -> (x +- y) / C) I've deleted as it only introduced a new optimization when running with the flags '-O0 -funsafe-math-optimizations'. On O1 and up, the pattern that replaces 'x / C' with 'x * (1 / C)' is enabled and then the pattern is covered by that and (x * C +- y * C -> C * (x +- y)) (which is already present in match.pd) I have also updated the testcase for those optimizations to use 'O1' to avoid that case. On 08/24/2017 10:06 PM, Jeff Law wrote: > On 08/17/2017 03:55 AM, Wilco Dijkstra wrote: >> Richard Biener wrote: >>> On Tue, Aug 15, 2017 at 4:11 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote: >>>> Richard Biener wrote: >>>>>> We also change the association of >>>>>> >>>>>> x / (y * C) -> (x / C) / y >>>>>> >>>>>> If C is a constant. >>>>> >>>>> Why's that profitable? >>>> >>>> It enables (x * C1) / (y * C2) -> (x * C1/C2) / y for example. >>>> Also 1/y is now available to the reciprocal optimization, see >>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 for details. >>> >>> Sure, but on its own it's going to be slower. So this isn't the >>> correct way to enable those followup transforms. >> >> How can it be any slower? It's one division and one multiply in both cases. > x / (y * C) -> (x / C) / y > > Goes from one division and one multiplication to two divisions. I'm > guessing that's what Richi is (reasonably) concerned about. > > So it may be the case that we need more complex pattern matching here > for when to perform the first transformation to ultimately enable the > second. > The only case where we don't remove the division but still execute this pattern is when run with'-O0 -freciprocal-math'. As long as we have 'O1' or greater (and -freciprocal-math), that is enough to enable the removal of the reciprocal. Jackson. > > Jeff > [-- Attachment #2: patchfile-1 --] [-- Type: text/plain, Size: 7089 bytes --] diff --git a/gcc/fold-const.c b/gcc/fold-const.c index eeeff1ed166734328a612142fdf6235274f9e858..907d96c3d994db1ec8dc0ef692ac0b4d59c99a4c 100644 --- a/gcc/fold-const.c +++ b/gcc/fold-const.c @@ -3834,47 +3834,6 @@ invert_truthvalue_loc (location_t loc, tree arg) : TRUTH_NOT_EXPR, type, arg); } - -/* Knowing that ARG0 and ARG1 are both RDIV_EXPRs, simplify a binary operation - with code CODE. This optimization is unsafe. */ -static tree -distribute_real_division (location_t loc, enum tree_code code, tree type, - tree arg0, tree arg1) -{ - bool mul0 = TREE_CODE (arg0) == MULT_EXPR; - bool mul1 = TREE_CODE (arg1) == MULT_EXPR; - - /* (A / C) +- (B / C) -> (A +- B) / C. */ - if (mul0 == mul1 - && operand_equal_p (TREE_OPERAND (arg0, 1), - TREE_OPERAND (arg1, 1), 0)) - return fold_build2_loc (loc, mul0 ? MULT_EXPR : RDIV_EXPR, type, - fold_build2_loc (loc, code, type, - TREE_OPERAND (arg0, 0), - TREE_OPERAND (arg1, 0)), - TREE_OPERAND (arg0, 1)); - - /* (A / C1) +- (A / C2) -> A * (1 / C1 +- 1 / C2). */ - if (operand_equal_p (TREE_OPERAND (arg0, 0), - TREE_OPERAND (arg1, 0), 0) - && TREE_CODE (TREE_OPERAND (arg0, 1)) == REAL_CST - && TREE_CODE (TREE_OPERAND (arg1, 1)) == REAL_CST) - { - REAL_VALUE_TYPE r0, r1; - r0 = TREE_REAL_CST (TREE_OPERAND (arg0, 1)); - r1 = TREE_REAL_CST (TREE_OPERAND (arg1, 1)); - if (!mul0) - real_arithmetic (&r0, RDIV_EXPR, &dconst1, &r0); - if (!mul1) - real_arithmetic (&r1, RDIV_EXPR, &dconst1, &r1); - real_arithmetic (&r0, code, &r0, &r1); - return fold_build2_loc (loc, MULT_EXPR, type, - TREE_OPERAND (arg0, 0), - build_real (type, r0)); - } - - return NULL_TREE; -} \f /* Return a BIT_FIELD_REF of type TYPE to refer to BITSIZE bits of INNER starting at BITPOS. The field is unsigned if UNSIGNEDP is nonzero @@ -9419,12 +9378,6 @@ fold_binary_loc (location_t loc, } } - if (flag_unsafe_math_optimizations - && (TREE_CODE (arg0) == RDIV_EXPR || TREE_CODE (arg0) == MULT_EXPR) - && (TREE_CODE (arg1) == RDIV_EXPR || TREE_CODE (arg1) == MULT_EXPR) - && (tem = distribute_real_division (loc, code, type, arg0, arg1))) - return tem; - /* Convert a + (b*c + d*e) into (a + b*c) + d*e. We associate floats only if the user has specified -fassociative-math. */ @@ -9772,13 +9725,6 @@ fold_binary_loc (location_t loc, return tem; } - if (FLOAT_TYPE_P (type) - && flag_unsafe_math_optimizations - && (TREE_CODE (arg0) == RDIV_EXPR || TREE_CODE (arg0) == MULT_EXPR) - && (TREE_CODE (arg1) == RDIV_EXPR || TREE_CODE (arg1) == MULT_EXPR) - && (tem = distribute_real_division (loc, code, type, arg0, arg1))) - return tem; - /* Handle (A1 * C1) - (A2 * C2) with A1, A2 or C1, C2 being the same or one. Make sure the type is not saturating and has the signedness of the stripped operands, as fold_plusminus_mult_expr will re-associate. diff --git a/gcc/match.pd b/gcc/match.pd index e98db52af84946cf579c6434e06d450713a47162..2714b404826c2b4c74e3f382b85bf2ec183ee4f8 100644 --- a/gcc/match.pd +++ b/gcc/match.pd @@ -342,16 +342,41 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) (negate @0))) (if (flag_reciprocal_math) - /* Convert (A/B)/C to A/(B*C) */ + /* Convert (A/B)/C to A/(B*C) where neither B nor C are constant. */ (simplify (rdiv (rdiv:s @0 @1) @2) - (rdiv @0 (mult @1 @2))) + (if (TREE_CODE (@1) != REAL_CST && TREE_CODE (@1) != REAL_CST) + (rdiv @0 (mult @1 @2)))) + + /* Simplify x / (C * y) to (x / C) / y where C is a constant. */ + (simplify + (rdiv @0 + (mult @1 REAL_CST@2)) + (rdiv (rdiv @0 @2) @1)) /* Convert A/(B/C) to (A/B)*C */ (simplify (rdiv @0 (rdiv:s @1 @2)) (mult (rdiv @0 @1) @2))) +/* Simplify x / (- y) to -x / y. */ +(simplify + (rdiv @0 (negate @1)) + (rdiv (negate @0) @1)) + +(if (flag_unsafe_math_optimizations) + /* Simplify (C / x op 0.0) to x op 0.0 for C > 0. */ + (for op (lt le gt ge) + neg_op (gt ge lt le) + (simplify + (op (rdiv REAL_CST@0 @1) real_zerop@2) + (switch + (if (real_less (&dconst0, TREE_REAL_CST_PTR (@0))) + (op @1 @2)) + /* For C < 0, use the inverted operator. */ + (if (real_less (TREE_REAL_CST_PTR (@0), &dconst0)) + (neg_op @1 @2)))))) + /* Optimize (X & (-A)) / A where A is a power of 2, to X >> log2(A) */ (for div (trunc_div ceil_div floor_div round_div exact_div) (simplify @@ -610,15 +635,6 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) (if (tem) (rdiv { tem; } @1))))) -/* Convert C1/(X*C2) into (C1/C2)/X */ -(simplify - (rdiv REAL_CST@0 (mult @1 REAL_CST@2)) - (if (flag_reciprocal_math) - (with - { tree tem = const_binop (RDIV_EXPR, type, @0, @2); } - (if (tem) - (rdiv { tem; } @1))))) - /* Simplify ~X & X as zero. */ (simplify (bit_and:c (convert? @0) (convert? (bit_not @0))) @@ -3446,6 +3462,34 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) (if (!HONOR_SNANS (type)) @0)) + (for cmp (lt le gt ge) + neg_cmp (gt ge lt le) + /* Simplify (x / C1) cmp y -> x cmp y * C1. */ + (simplify + (cmp (rdiv @0 REAL_CST@1) @2) + (switch + (if (real_less (&dconst0, TREE_REAL_CST_PTR (@1))) + (cmp @0 (mult @2 @1))) + (if (real_less (TREE_REAL_CST_PTR (@1), &dconst0)) + (neg_cmp @0 (mult @2 @1))))) + + /* Simplify (x * C1) cmp C2 -> x cmp C2 / C1, where C1 != 0. */ + (simplify + (cmp (mult @0 REAL_CST@1) REAL_CST@2) + (switch + (if (real_less (&dconst0, TREE_REAL_CST_PTR (@1))) + (cmp @0 (rdiv @2 @1))) + (if (real_less (TREE_REAL_CST_PTR (@1), &dconst0)) + (neg_cmp @0 (rdiv @2 @1)))))) + + (for op (plus minus) + /* Simplify (A / C) +- (B / C) -> (A +- B) / C. */ + (simplify + (op (rdiv @0 @1) + (rdiv @2 @1)) + (rdiv (op @0 @2) @1))) + + /* Simplify sqrt(x) * sqrt(y) -> sqrt(x*y). */ (for root (SQRT CBRT) (simplify diff --git a/gcc/testsuite/gcc.dg/associate_comparison_1.c b/gcc/testsuite/gcc.dg/associate_comparison_1.c new file mode 100644 index 0000000000000000000000000000000000000000..d0a197b48d660d3f2e9fabda855670ba477afc4d --- /dev/null +++ b/gcc/testsuite/gcc.dg/associate_comparison_1.c @@ -0,0 +1,22 @@ +/* { dg-do compile } */ +/* { dg-options "-funsafe-math-optimizations -fdump-tree-optimized" } */ + +int +lt_cmp (float x, float y) +{ + return x / 2 <= 0; +} + +int lt_cmp_2 (float x, float y) +{ + return x / 3 <= y; +} + +int +inv_cmp (float x, float y) +{ + return 5 / x >= 0; +} + + +/* { dg-final { scan-tree-dump-not " / " "optimized" } } */ diff --git a/gcc/testsuite/gcc.dg/fold-div-1.c b/gcc/testsuite/gcc.dg/fold-div-1.c index c1c7250f882cfed7705ba60994e47440580f4c76..73b75861166f40733d9768e9703664d51ee1a9ef 100644 --- a/gcc/testsuite/gcc.dg/fold-div-1.c +++ b/gcc/testsuite/gcc.dg/fold-div-1.c @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-options "-funsafe-math-optimizations -fdump-tree-gimple" } */ +/* { dg-options "-O1 -funsafe-math-optimizations -fdump-tree-gimple" } */ float f(float x) { ^ permalink raw reply [flat|nested] 22+ messages in thread

*Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)2017-08-29 12:13 ` Jackson Woodruff@ 2017-08-29 13:31 ` Richard Biener2017-08-30 10:45 ` Jackson Woodruff 0 siblings, 1 reply; 22+ messages in thread From: Richard Biener @ 2017-08-29 13:31 UTC (permalink / raw) To: Jackson Woodruff Cc: Jeff Law, Wilco Dijkstra, kyrylo.tkachov, Joseph S. Myers, GCC Patches, nd On Tue, Aug 29, 2017 at 1:35 PM, Jackson Woodruff <jackson.woodruff@foss.arm.com> wrote: > Hi all, > > Apologies again to those CC'ed, who (again) received this twice. > > Joseph: Yes you are correct. I misread the original thread, now fixed. > > Richard: I've moved the optimizations out of fold-const.c. One has been > replicated in match.pd, and the other (x / C +- y / C -> (x +- y) / C) I've > deleted as it only introduced a new optimization when running with the flags > '-O0 -funsafe-math-optimizations'. Hmm, how did you verify that, that it only adds sth with -O0 -funsafe-math-optimizations? Is that because in GIMPLE the reassoc pass should do this transform? You added +/* Simplify x / (- y) to -x / y. */ +(simplify + (rdiv @0 (negate @1)) + (rdiv (negate @0) @1)) for /* (-A) / (-B) -> A / B */ if (TREE_CODE (arg0) == NEGATE_EXPR && negate_expr_p (arg1)) return fold_build2_loc (loc, RDIV_EXPR, type, TREE_OPERAND (arg0, 0), negate_expr (arg1)); if (TREE_CODE (arg1) == NEGATE_EXPR && negate_expr_p (arg0)) return fold_build2_loc (loc, RDIV_EXPR, type, negate_expr (arg0), TREE_OPERAND (arg1, 0)); presumably? This isn't equivalent. It's more like (simplify (rdiv (negate_expr_p @0) (negate @1)) (rdiv (negate @0) @1)) (simplify (rdiv (negate @0) (negate_expr_p @1)) (rdiv @0 (negate @1))) and you should remove the corresponding fold-const.c code. Please do these changes independently to aid bisecting in case of issues. (if (flag_reciprocal_math) - /* Convert (A/B)/C to A/(B*C) */ + /* Convert (A/B)/C to A/(B*C) where neither B nor C are constant. */ (simplify (rdiv (rdiv:s @0 @1) @2) - (rdiv @0 (mult @1 @2))) + (if (TREE_CODE (@1) != REAL_CST && TREE_CODE (@1) != REAL_CST) + (rdiv @0 (mult @1 @2)))) why? I guess to avoid ping-poning with + /* Simplify x / (C * y) to (x / C) / y where C is a constant. */ + (simplify + (rdiv @0 + (mult @1 REAL_CST@2)) + (rdiv (rdiv @0 @2) @1)) ? If so why not just disallow for @1 == REAL_CST? > On O1 and up, the pattern that replaces 'x / C' with 'x * (1 / C)' > is enabled and then the pattern is covered by that and > (x * C +- y * C -> C * (x +- y)) (which is already present in match.pd) > > I have also updated the testcase for those optimizations to use 'O1' to > avoid that case. > > > On 08/24/2017 10:06 PM, Jeff Law wrote: >> >> On 08/17/2017 03:55 AM, Wilco Dijkstra wrote: >>> >>> Richard Biener wrote: >>>> >>>> On Tue, Aug 15, 2017 at 4:11 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> >>>> wrote: >>>>> >>>>> Richard Biener wrote: >>>>>>> >>>>>>> We also change the association of >>>>>>> >>>>>>> x / (y * C) -> (x / C) / y >>>>>>> >>>>>>> If C is a constant. >>>>>> >>>>>> >>>>>> Why's that profitable? >>>>> >>>>> >>>>> It enables (x * C1) / (y * C2) -> (x * C1/C2) / y for example. >>>>> Also 1/y is now available to the reciprocal optimization, see >>>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 for details. >>>> >>>> >>>> Sure, but on its own it's going to be slower. So this isn't the >>>> correct way to enable those followup transforms. >>> >>> >>> How can it be any slower? It's one division and one multiply in both >>> cases. >> >> x / (y * C) -> (x / C) / y >> >> Goes from one division and one multiplication to two divisions. I'm >> guessing that's what Richi is (reasonably) concerned about. >> >> So it may be the case that we need more complex pattern matching here >> for when to perform the first transformation to ultimately enable the >> second. >> > > The only case where we don't remove the division but still execute this > pattern is when run with'-O0 -freciprocal-math'. > > As long as we have 'O1' or greater (and -freciprocal-math), that is enough > to enable the removal of the reciprocal. I don't see this. I presume you mean this happens in the recip pass? But we don't optimize this when optimizing for size (but the above pattern still applies) or when targetm.min_divisions_for_recip_mul is too large. So I still think this is the wrong place to do this and instead the recip pass should be extended. > > Jackson. > >> >> Jeff >> > ^ permalink raw reply [flat|nested] 22+ messages in thread

*Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)2017-08-29 13:31 ` Richard Biener@ 2017-08-30 10:45 ` Jackson Woodruff2017-08-30 13:26 ` Richard Biener 0 siblings, 1 reply; 22+ messages in thread From: Jackson Woodruff @ 2017-08-30 10:45 UTC (permalink / raw) To: Richard Biener Cc: Jeff Law, Wilco Dijkstra, kyrylo.tkachov, Joseph S. Myers, GCC Patches, nd On 08/29/2017 01:13 PM, Richard Biener wrote: > On Tue, Aug 29, 2017 at 1:35 PM, Jackson Woodruff > <jackson.woodruff@foss.arm.com> wrote: >> Hi all, >> >> Apologies again to those CC'ed, who (again) received this twice. >> >> Joseph: Yes you are correct. I misread the original thread, now fixed. >> >> Richard: I've moved the optimizations out of fold-const.c. One has been >> replicated in match.pd, and the other (x / C +- y / C -> (x +- y) / C) I've >> deleted as it only introduced a new optimization when running with the flags >> '-O0 -funsafe-math-optimizations'. > > Hmm, how did you verify that, that it only adds sth with -O0 > -funsafe-math-optimizations? By checking with various flags, although not exhaustively. I looked for reasons for the behavior in match.pd (explained below). I have also since discovered that the combinations of '-funsafe-math-optimizations -frounding-math' (at all levels) and '-fno-recriprocal-math -funsafe-math-optimizations' mean this pattern adds something. > Is that because in GIMPLE the reassoc pass should do this transform? It's because the pattern that changes (X / C) -> X * (1 / C) is gated with O1: (for cst (REAL_CST COMPLEX_CST VECTOR_CST) (simplify (rdiv @0 cst@1) -> (if (optimize) -> (if (flag_reciprocal_math && !real_zerop (@1)) (with { tree tem = const_binop (RDIV_EXPR, type, build_one_cst (type), @1); } (if (tem) (mult @0 { tem; } ))) (if (cst != COMPLEX_CST) (with { tree inverse = exact_inverse (type, @1); } (if (inverse) (mult @0 { inverse; } )))))))) I've flagged the two lines that are particularly relevant to this. Removing this pattern, as I would expect, means that the divisions in the above optimization (and the one further down) are not removed. So then there is the question of edge cases. This pattern is (ignoring the second case) going to fail when const_binop returns null. Looking through that function says that it will fail (for reals) when: - Either argument is null (not the case) - The operation is not in the list (PLUS_EXPR, MINUS_EXPR, MULT_EXPR, RDIV_EXPR, MIN_EXPR, MAX_EXPR) (again not the case) - We honor Signalling NaNs and one of the operands is a sNaN. - The operation is a division, and the second argument is zero and dividing by 0.0 raises an exception. - The result is infinity and neither of the operands were infinity and flag_trapping_math is set. - The result isn't exact and flag_rounding_math is set. For (x / ( y * C) -> (x / C) / y), I will add some gating for each of these so that the pattern is never executed if one of these would be the case. The additional cases where this isn't converted to a multiplication by the reciprocal appear to be when -freciprocal-math is used, but we don't have -fno-rounding-math, or funsafe-math-optimizations. For the removal of (x / C +- y / C -> (x +- y) / C), there is a trade-off of whether it is worth having an extra pattern for these edge cases. On this I'm not sure. > > You added > > +/* Simplify x / (- y) to -x / y. */ > +(simplify > + (rdiv @0 (negate @1)) > + (rdiv (negate @0) @1)) Sorry, this was unclear, the changes to fold const should be split up from the additions to match.pd. This was one of the changes discussed before. The idea is to canonicalize this so that y can be extracted in the cse_reciprocals pass. > > for > > /* (-A) / (-B) -> A / B */ > if (TREE_CODE (arg0) == NEGATE_EXPR && negate_expr_p (arg1)) > return fold_build2_loc (loc, RDIV_EXPR, type, > TREE_OPERAND (arg0, 0), > negate_expr (arg1)); > if (TREE_CODE (arg1) == NEGATE_EXPR && negate_expr_p (arg0)) > return fold_build2_loc (loc, RDIV_EXPR, type, > negate_expr (arg0), > TREE_OPERAND (arg1, 0)); > > presumably? This isn't equivalent. It's more like > > (simplify > (rdiv (negate_expr_p @0) (negate @1)) > (rdiv (negate @0) @1)) > (simplify > (rdiv (negate @0) (negate_expr_p @1)) > (rdiv @0 (negate @1))) > > and you should remove the corresponding fold-const.c code. > > Please do these changes independently to aid bisecting in case of issues. I'll resubmit two different patches when I can get them separated. It should also make it easier to review when it is clearer what belongs where. > > (if (flag_reciprocal_math) > - /* Convert (A/B)/C to A/(B*C) */ > + /* Convert (A/B)/C to A/(B*C) where neither B nor C are constant. */ > (simplify > (rdiv (rdiv:s @0 @1) @2) > - (rdiv @0 (mult @1 @2))) > + (if (TREE_CODE (@1) != REAL_CST && TREE_CODE (@1) != REAL_CST) > + (rdiv @0 (mult @1 @2)))) > > why? I guess to avoid ping-poning with Yes, although there is a mistake there. It should be: (if (TREE_CODE (@1) != REAL_CST && TREE_CODE (@2) != REAL_CST) I'll fix that up. > > + /* Simplify x / (C * y) to (x / C) / y where C is a constant. */ > + (simplify > + (rdiv @0 > + (mult @1 REAL_CST@2)) > + (rdiv (rdiv @0 @2) @1)) > > ? If so why not just disallow for @1 == REAL_CST? The ping-ponging issue is there even when only one of the operands is a constant (which of course it has to be for this pattern to apply). For example, something like x / (y * 3), with '-O0 -ffast-math', when the reciprocal isn't computed, ping pongs back and forth. I'm not sure that it can be fixed without a condition on the pattern already there. > >> On O1 and up, the pattern that replaces 'x / C' with 'x * (1 / C)' >> is enabled and then the pattern is covered by that and >> (x * C +- y * C -> C * (x +- y)) (which is already present in match.pd) >> >> I have also updated the testcase for those optimizations to use 'O1' to >> avoid that case. >> >> >> On 08/24/2017 10:06 PM, Jeff Law wrote: >>> >>> On 08/17/2017 03:55 AM, Wilco Dijkstra wrote: >>>> >>>> Richard Biener wrote: >>>>> >>>>> On Tue, Aug 15, 2017 at 4:11 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> >>>>> wrote: >>>>>> >>>>>> Richard Biener wrote: >>>>>>>> >>>>>>>> We also change the association of >>>>>>>> >>>>>>>> x / (y * C) -> (x / C) / y >>>>>>>> >>>>>>>> If C is a constant. >>>>>>> >>>>>>> >>>>>>> Why's that profitable? >>>>>> >>>>>> >>>>>> It enables (x * C1) / (y * C2) -> (x * C1/C2) / y for example. >>>>>> Also 1/y is now available to the reciprocal optimization, see >>>>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 for details. >>>>> >>>>> >>>>> Sure, but on its own it's going to be slower. So this isn't the >>>>> correct way to enable those followup transforms. >>>> >>>> >>>> How can it be any slower? It's one division and one multiply in both >>>> cases. >>> >>> x / (y * C) -> (x / C) / y >>> >>> Goes from one division and one multiplication to two divisions. I'm >>> guessing that's what Richi is (reasonably) concerned about. >>> >>> So it may be the case that we need more complex pattern matching here >>> for when to perform the first transformation to ultimately enable the >>> second. >>> >> >> The only case where we don't remove the division but still execute this >> pattern is when run with'-O0 -freciprocal-math'. >> >> As long as we have 'O1' or greater (and -freciprocal-math), that is enough >> to enable the removal of the reciprocal. > > I don't see this. I presume you mean this happens in the recip pass? It's one of the match.pd patterns that actually does the extraction since C is always constant. I've copied in the pattern above in response to one of your previous comments. > But we don't optimize this when optimizing for size (but the above pattern > still applies) or when targetm.min_divisions_for_recip_mul is too large It was my understanding that this is used to specify the number of divisions you have to be able to replace with a multiplication before it is worthwhile inserting an extra multiplication. So for situations like this, I think that this is not quite what is being measured, because there isn't an extra multiplication being inserted? > > So I still think this is the wrong place to do this and instead the recip > pass should be extended. > > > >> >> Jackson. >> >>> >>> Jeff >>> >> ^ permalink raw reply [flat|nested] 22+ messages in thread

*Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)2017-08-30 10:45 ` Jackson Woodruff@ 2017-08-30 13:26 ` Richard Biener2017-09-06 9:55 ` Jackson Woodruff 0 siblings, 1 reply; 22+ messages in thread From: Richard Biener @ 2017-08-30 13:26 UTC (permalink / raw) To: Jackson Woodruff Cc: Jeff Law, Wilco Dijkstra, kyrylo.tkachov, Joseph S. Myers, GCC Patches, nd On Wed, Aug 30, 2017 at 11:46 AM, Jackson Woodruff <jackson.woodruff@foss.arm.com> wrote: > On 08/29/2017 01:13 PM, Richard Biener wrote: >> >> On Tue, Aug 29, 2017 at 1:35 PM, Jackson Woodruff >> <jackson.woodruff@foss.arm.com> wrote: >>> >>> Hi all, >>> >>> Apologies again to those CC'ed, who (again) received this twice. >>> >>> Joseph: Yes you are correct. I misread the original thread, now fixed. >>> >>> Richard: I've moved the optimizations out of fold-const.c. One has been >>> replicated in match.pd, and the other (x / C +- y / C -> (x +- y) / C) >>> I've >>> deleted as it only introduced a new optimization when running with the >>> flags >>> '-O0 -funsafe-math-optimizations'. >> >> >> Hmm, how did you verify that, that it only adds sth with -O0 >> -funsafe-math-optimizations? > > > By checking with various flags, although not exhaustively. I looked for > reasons for the behavior in match.pd (explained below). > > I have also since discovered that the combinations of > '-funsafe-math-optimizations -frounding-math' (at all levels) and > '-fno-recriprocal-math -funsafe-math-optimizations' mean this pattern adds > something. > >> Is that because in GIMPLE the reassoc pass should do this transform? > > It's because the pattern that changes (X / C) -> X * (1 / C) is gated with > O1: > > (for cst (REAL_CST COMPLEX_CST VECTOR_CST) > (simplify > (rdiv @0 cst@1) > -> (if (optimize) > -> (if (flag_reciprocal_math > && !real_zerop (@1)) > (with > { tree tem = const_binop (RDIV_EXPR, type, build_one_cst (type), @1); > } > (if (tem) > (mult @0 { tem; } ))) > (if (cst != COMPLEX_CST) > (with { tree inverse = exact_inverse (type, @1); } > (if (inverse) > (mult @0 { inverse; } )))))))) > > > I've flagged the two lines that are particularly relevant to this. So this means we go x / (C * y) -> (x / C) / y -> (x * (1/C)) / y why's that in any way preferable? I suppose this is again to enable the recip pass to detect / y (as opposed to / (C * y))? What's the reason to believe that / y is more "frequent"? > Removing this pattern, as I would expect, means that the divisions in the > above optimization (and the one further down) are not removed. > > So then there is the question of edge cases. This pattern is (ignoring the > second case) going to fail when const_binop returns null. Looking through > that function says that it will fail (for reals) when: > > - Either argument is null (not the case) > > - The operation is not in the list (PLUS_EXPR, > MINUS_EXPR, MULT_EXPR, RDIV_EXPR, MIN_EXPR, MAX_EXPR) > (again not the case) > > - We honor Signalling NaNs and one of the operands is a sNaN. > > - The operation is a division, and the second argument is zero > and dividing by 0.0 raises an exception. > > - The result is infinity and neither of the operands were infinity > and flag_trapping_math is set. > > - The result isn't exact and flag_rounding_math is set. > > > For (x / ( y * C) -> (x / C) / y), I will add some gating for each of these > so that the pattern is never executed if one of these would be the case. Why not transform this directly to (x * (1/C)) / y then (and only then)? That makes it obvious not two divisions prevail. That said, I'm questioning this canonicalization. I can come up with a testcase where it makes things worse: tem = x / (y * C); tem2 = z / (y * C); should generate rdivtmp = 1 / (y*C); tem = x *rdivtmp; tem2= z * rdivtmp; instead of rdivtmp = 1/y; tem = x * 1/C * rdivtmp; tem2 = z * 1/C * rdivtmp; > The additional cases where this isn't converted to a multiplication by the > reciprocal appear to be when -freciprocal-math is used, but we don't have > -fno-rounding-math, or funsafe-math-optimizations. > > For the removal of (x / C +- y / C -> (x +- y) / C), there is a trade-off of > whether it is worth having an extra pattern for these edge cases. On this > I'm not sure. Well, at least leave it in fold-const.c if you can't move it. >> >> You added >> >> +/* Simplify x / (- y) to -x / y. */ >> +(simplify >> + (rdiv @0 (negate @1)) >> + (rdiv (negate @0) @1)) > > > Sorry, this was unclear, the changes to fold const should be split up from > the additions to match.pd. > > This was one of the changes discussed before. The idea is to canonicalize > this so that y can be extracted in the cse_reciprocals pass. As said elsewhere I think cse_reciprocals needs a rewrite to not rely on arbitrary canonicalization to do its work but consider association opportunities globally with a focus on CSE. > >> >> for >> >> /* (-A) / (-B) -> A / B */ >> if (TREE_CODE (arg0) == NEGATE_EXPR && negate_expr_p (arg1)) >> return fold_build2_loc (loc, RDIV_EXPR, type, >> TREE_OPERAND (arg0, 0), >> negate_expr (arg1)); >> if (TREE_CODE (arg1) == NEGATE_EXPR && negate_expr_p (arg0)) >> return fold_build2_loc (loc, RDIV_EXPR, type, >> negate_expr (arg0), >> TREE_OPERAND (arg1, 0)); >> >> presumably? This isn't equivalent. It's more like >> >> (simplify >> (rdiv (negate_expr_p @0) (negate @1)) >> (rdiv (negate @0) @1)) >> (simplify >> (rdiv (negate @0) (negate_expr_p @1)) >> (rdiv @0 (negate @1))) >> >> and you should remove the corresponding fold-const.c code. >> >> Please do these changes independently to aid bisecting in case of issues. > > > I'll resubmit two different patches when I can get them separated. It should > also make it easier to review when it is clearer what belongs where. Thanks. >> >> (if (flag_reciprocal_math) >> - /* Convert (A/B)/C to A/(B*C) */ >> + /* Convert (A/B)/C to A/(B*C) where neither B nor C are constant. */ >> (simplify >> (rdiv (rdiv:s @0 @1) @2) >> - (rdiv @0 (mult @1 @2))) >> + (if (TREE_CODE (@1) != REAL_CST && TREE_CODE (@1) != REAL_CST) >> + (rdiv @0 (mult @1 @2)))) >> >> why? I guess to avoid ping-poning with > > > Yes, although there is a mistake there. It should be: > > (if (TREE_CODE (@1) != REAL_CST && TREE_CODE (@2) != REAL_CST) > > I'll fix that up. > >> >> + /* Simplify x / (C * y) to (x / C) / y where C is a constant. */ >> + (simplify >> + (rdiv @0 >> + (mult @1 REAL_CST@2)) >> + (rdiv (rdiv @0 @2) @1)) >> >> ? If so why not just disallow for @1 == REAL_CST? > > > The ping-ponging issue is there even when only one of the operands is a > constant (which of course it has to be for this pattern to apply). For > example, something like x / (y * 3), with '-O0 -ffast-math', > when the reciprocal isn't computed, ping pongs back and forth. > > I'm not sure that it can be fixed without a condition on the pattern already > there. > > >> >>> On O1 and up, the pattern that replaces 'x / C' with 'x * (1 / C)' >>> is enabled and then the pattern is covered by that and >>> (x * C +- y * C -> C * (x +- y)) (which is already present in match.pd) >>> >>> I have also updated the testcase for those optimizations to use 'O1' to >>> avoid that case. >>> >>> >>> On 08/24/2017 10:06 PM, Jeff Law wrote: >>>> >>>> >>>> On 08/17/2017 03:55 AM, Wilco Dijkstra wrote: >>>>> >>>>> >>>>> Richard Biener wrote: >>>>>> >>>>>> >>>>>> On Tue, Aug 15, 2017 at 4:11 PM, Wilco Dijkstra >>>>>> <Wilco.Dijkstra@arm.com> >>>>>> wrote: >>>>>>> >>>>>>> >>>>>>> Richard Biener wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> We also change the association of >>>>>>>>> >>>>>>>>> x / (y * C) -> (x / C) / y >>>>>>>>> >>>>>>>>> If C is a constant. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Why's that profitable? >>>>>>> >>>>>>> >>>>>>> >>>>>>> It enables (x * C1) / (y * C2) -> (x * C1/C2) / y for example. >>>>>>> Also 1/y is now available to the reciprocal optimization, see >>>>>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 for details. >>>>>> >>>>>> >>>>>> >>>>>> Sure, but on its own it's going to be slower. So this isn't the >>>>>> correct way to enable those followup transforms. >>>>> >>>>> >>>>> >>>>> How can it be any slower? It's one division and one multiply in both >>>>> cases. >>>> >>>> >>>> x / (y * C) -> (x / C) / y >>>> >>>> Goes from one division and one multiplication to two divisions. I'm >>>> guessing that's what Richi is (reasonably) concerned about. >>>> >>>> So it may be the case that we need more complex pattern matching here >>>> for when to perform the first transformation to ultimately enable the >>>> second. >>>> >>> >>> The only case where we don't remove the division but still execute this >>> pattern is when run with'-O0 -freciprocal-math'. >>> >>> As long as we have 'O1' or greater (and -freciprocal-math), that is >>> enough >>> to enable the removal of the reciprocal. >> >> >> I don't see this. I presume you mean this happens in the recip pass? > > > It's one of the match.pd patterns that actually does the extraction since C > is always constant. I've copied in the pattern above in response to one of > your previous comments. > >> But we don't optimize this when optimizing for size (but the above pattern >> > still applies) or when targetm.min_divisions_for_recip_mul is too large > > > It was my understanding that this is used to specify the number of divisions > you have to be able to replace with a multiplication before it is worthwhile > inserting an extra multiplication. > > So for situations like this, I think that this is not quite what is being > measured, because there isn't an extra multiplication being inserted? > > >> >> So I still think this is the wrong place to do this and instead the recip >> pass should be extended. >> >> >>> >>> Jackson. >>> >>>> >>>> Jeff >>>> >>> > ^ permalink raw reply [flat|nested] 22+ messages in thread

*Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)2017-08-30 13:26 ` Richard Biener@ 2017-09-06 9:55 ` Jackson Woodruff2017-09-13 18:37 ` Jeff Law 0 siblings, 1 reply; 22+ messages in thread From: Jackson Woodruff @ 2017-09-06 9:55 UTC (permalink / raw) To: Richard Biener Cc: Jeff Law, Wilco Dijkstra, kyrylo.tkachov, Joseph S. Myers, GCC Patches, nd [-- Attachment #1: Type: text/plain, Size: 7960 bytes --] On 08/30/2017 01:46 PM, Richard Biener wrote: > On Wed, Aug 30, 2017 at 11:46 AM, Jackson Woodruff > <jackson.woodruff@foss.arm.com> wrote: >> On 08/29/2017 01:13 PM, Richard Biener wrote: >>> >>> On Tue, Aug 29, 2017 at 1:35 PM, Jackson Woodruff >>> <jackson.woodruff@foss.arm.com> wrote: >>>> >>>> Hi all, >>>> >>>> Apologies again to those CC'ed, who (again) received this twice. >>>> >>>> Joseph: Yes you are correct. I misread the original thread, now fixed. >>>> >>>> Richard: I've moved the optimizations out of fold-const.c. One has been >>>> replicated in match.pd, and the other (x / C +- y / C -> (x +- y) / C) >>>> I've >>>> deleted as it only introduced a new optimization when running with the >>>> flags >>>> '-O0 -funsafe-math-optimizations'. >>> >>> >>> Hmm, how did you verify that, that it only adds sth with -O0 >>> -funsafe-math-optimizations? >> >> >> By checking with various flags, although not exhaustively. I looked for >> reasons for the behavior in match.pd (explained below). >> >> I have also since discovered that the combinations of >> '-funsafe-math-optimizations -frounding-math' (at all levels) and >> '-fno-recriprocal-math -funsafe-math-optimizations' mean this pattern adds >> something. >> >>> Is that because in GIMPLE the reassoc pass should do this transform? >> >> It's because the pattern that changes (X / C) -> X * (1 / C) is gated with >> O1: >> >> (for cst (REAL_CST COMPLEX_CST VECTOR_CST) >> (simplify >> (rdiv @0 cst@1) >> -> (if (optimize) >> -> (if (flag_reciprocal_math >> && !real_zerop (@1)) >> (with >> { tree tem = const_binop (RDIV_EXPR, type, build_one_cst (type), @1); >> } >> (if (tem) >> (mult @0 { tem; } ))) >> (if (cst != COMPLEX_CST) >> (with { tree inverse = exact_inverse (type, @1); } >> (if (inverse) >> (mult @0 { inverse; } )))))))) >> >> >> I've flagged the two lines that are particularly relevant to this. > > So this means we go x / (C * y) -> (x / C) / y -> (x * (1/C)) / y > why's that in any way preferable? I suppose this is again to enable > the recip pass to detect / y (as opposed to / (C * y))? What's the > reason to believe that / y is more "frequent"? > >> Removing this pattern, as I would expect, means that the divisions in the >> above optimization (and the one further down) are not removed. >> >> So then there is the question of edge cases. This pattern is (ignoring the >> second case) going to fail when const_binop returns null. Looking through >> that function says that it will fail (for reals) when: >> >> - Either argument is null (not the case) >> >> - The operation is not in the list (PLUS_EXPR, >> MINUS_EXPR, MULT_EXPR, RDIV_EXPR, MIN_EXPR, MAX_EXPR) >> (again not the case) >> >> - We honor Signalling NaNs and one of the operands is a sNaN. >> >> - The operation is a division, and the second argument is zero >> and dividing by 0.0 raises an exception. >> >> - The result is infinity and neither of the operands were infinity >> and flag_trapping_math is set. >> >> - The result isn't exact and flag_rounding_math is set. >> >> >> For (x / ( y * C) -> (x / C) / y), I will add some gating for each of these >> so that the pattern is never executed if one of these would be the case. > > Why not transform this directly to (x * (1/C)) / y then (and only then)? > That makes it obvious not two divisions prevail. Done. > > That said, I'm questioning this canonicalization. I can come up with a > testcase where it makes things worse: > > tem = x / (y * C); > tem2 = z / (y * C); > > should generate > > rdivtmp = 1 / (y*C); > tem = x *rdivtmp; > tem2= z * rdivtmp; > > instead of > > rdivtmp = 1/y; > tem = x * 1/C * rdivtmp; > tem2 = z * 1/C * rdivtmp; Ideally we would be able to CSE that into rdivtmp = 1/y * 1/C; tem = x * rdivtmp; tem2 = z * rdivtmp; Although we currently do not. An equally (perhaps more?) problematic case is something like: tem = x / (y * C) tem2 = y * C Which becomes: tem = x * (1 / C) / y tem2 = y * C Instead of K = y * C tem = x / K tem2 = K Which ultimately requires context awareness to avoid. This does seem to be a general problem with a large number of match.pd patterns rather than anything specific to this one. For example, a similar example can be constructed for (say) (A / B) / C -> (A / (B * C)). > >> The additional cases where this isn't converted to a multiplication by the >> reciprocal appear to be when -freciprocal-math is used, but we don't have >> -fno-rounding-math, or funsafe-math-optimizations. >> >> >>> >>>> On O1 and up, the pattern that replaces 'x / C' with 'x * (1 / C)' >>>> is enabled and then the pattern is covered by that and >>>> (x * C +- y * C -> C * (x +- y)) (which is already present in match.pd) >>>> >>>> I have also updated the testcase for those optimizations to use 'O1' to >>>> avoid that case. >>>> >>>> >>>> On 08/24/2017 10:06 PM, Jeff Law wrote: >>>>> >>>>> >>>>> On 08/17/2017 03:55 AM, Wilco Dijkstra wrote: >>>>>> >>>>>> >>>>>> Richard Biener wrote: >>>>>>> >>>>>>> >>>>>>> On Tue, Aug 15, 2017 at 4:11 PM, Wilco Dijkstra >>>>>>> <Wilco.Dijkstra@arm.com> >>>>>>> wrote: >>>>>>>> >>>>>>>> >>>>>>>> Richard Biener wrote: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> We also change the association of >>>>>>>>>> >>>>>>>>>> x / (y * C) -> (x / C) / y >>>>>>>>>> >>>>>>>>>> If C is a constant. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Why's that profitable? >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> It enables (x * C1) / (y * C2) -> (x * C1/C2) / y for example. >>>>>>>> Also 1/y is now available to the reciprocal optimization, see >>>>>>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 for details. >>>>>>> >>>>>>> >>>>>>> >>>>>>> Sure, but on its own it's going to be slower. So this isn't the >>>>>>> correct way to enable those followup transforms. >>>>>> >>>>>> >>>>>> >>>>>> How can it be any slower? It's one division and one multiply in both >>>>>> cases. >>>>> >>>>> >>>>> x / (y * C) -> (x / C) / y >>>>> >>>>> Goes from one division and one multiplication to two divisions. I'm >>>>> guessing that's what Richi is (reasonably) concerned about. >>>>> >>>>> So it may be the case that we need more complex pattern matching here >>>>> for when to perform the first transformation to ultimately enable the >>>>> second. >>>>> >>>> >>>> The only case where we don't remove the division but still execute this >>>> pattern is when run with'-O0 -freciprocal-math'. >>>> >>>> As long as we have 'O1' or greater (and -freciprocal-math), that is >>>> enough >>>> to enable the removal of the reciprocal. >>> >>> >>> I don't see this. I presume you mean this happens in the recip pass? >> >> >> It's one of the match.pd patterns that actually does the extraction since C >> is always constant. I've copied in the pattern above in response to one of >> your previous comments. >> >>> But we don't optimize this when optimizing for size (but the above pattern >>>> still applies) or when targetm.min_divisions_for_recip_mul is too large >> >> >> It was my understanding that this is used to specify the number of divisions >> you have to be able to replace with a multiplication before it is worthwhile >> inserting an extra multiplication. >> >> So for situations like this, I think that this is not quite what is being >> measured, because there isn't an extra multiplication being inserted? >> >> >>> >>> So I still think this is the wrong place to do this and instead the recip >>> pass should be extended. >>> >>> >>>> >>>> Jackson. >>>> >>>>> >>>>> Jeff >>>>> >>>> >> Updated ChangeLog: gcc/ 2017-08-03 Jackson Woodruff <jackson.woodruff@arm.com> PR 71026/tree-optimization * match.pd: New patterns. gcc/testsuite 2017-08-03 Jackson Woodruff <jackson.woodruff@arm.com> PR 71026/tree-optimization * gcc.dg/associate_comparison_1.c: New. * gcc.dg/extract_recip_1.c: New. * gcc.dg/extract_recip_2.c: New. [-- Attachment #2: patchfile-1 --] [-- Type: text/plain, Size: 4856 bytes --] diff --git a/gcc/match.pd b/gcc/match.pd index 69dd8193cd0524d99fba8be8da8183230b8d621a..d4f56f5f332f9076481bb5db26264116c7f18728 100644 --- a/gcc/match.pd +++ b/gcc/match.pd @@ -342,16 +342,45 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) (negate @0))) (if (flag_reciprocal_math) - /* Convert (A/B)/C to A/(B*C) */ + /* Convert (A/B)/C to A/(B*C). */ (simplify (rdiv (rdiv:s @0 @1) @2) - (rdiv @0 (mult @1 @2))) + (rdiv @0 (mult @1 @2))) + + /* Simplify x / (C * y) to (x * (1 / C)) / y where C is a constant. */ + (if (optimize) + (simplify + (rdiv @0 + (mult @1 REAL_CST@2)) + (if (!real_zerop (@1)) + (with + { tree tem = const_binop (RDIV_EXPR, type, build_one_cst (type), @2); } + (if (tem) + (rdiv (mult @0 { tem; } ) @1)))))) /* Convert A/(B/C) to (A/B)*C */ (simplify (rdiv @0 (rdiv:s @1 @2)) (mult (rdiv @0 @1) @2))) +/* Simplify x / (- y) to -x / y. */ +(simplify + (rdiv @0 (negate @1)) + (rdiv (negate @0) @1)) + +(if (flag_unsafe_math_optimizations) + /* Simplify (C / x op 0.0) to x op 0.0 for C > 0. */ + (for op (lt le gt ge) + neg_op (gt ge lt le) + (simplify + (op (rdiv REAL_CST@0 @1) real_zerop@2) + (switch + (if (real_less (&dconst0, TREE_REAL_CST_PTR (@0))) + (op @1 @2)) + /* For C < 0, use the inverted operator. */ + (if (real_less (TREE_REAL_CST_PTR (@0), &dconst0)) + (neg_op @1 @2)))))) + /* Optimize (X & (-A)) / A where A is a power of 2, to X >> log2(A) */ (for div (trunc_div ceil_div floor_div round_div exact_div) (simplify @@ -610,15 +639,6 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) (if (tem) (rdiv { tem; } @1))))) -/* Convert C1/(X*C2) into (C1/C2)/X */ -(simplify - (rdiv REAL_CST@0 (mult @1 REAL_CST@2)) - (if (flag_reciprocal_math) - (with - { tree tem = const_binop (RDIV_EXPR, type, @0, @2); } - (if (tem) - (rdiv { tem; } @1))))) - /* Simplify ~X & X as zero. */ (simplify (bit_and:c (convert? @0) (convert? (bit_not @0))) @@ -3517,6 +3537,26 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) (if (!HONOR_SNANS (type)) @0)) + (for cmp (lt le gt ge) + neg_cmp (gt ge lt le) + /* Simplify (x / C1) cmp y -> x cmp y * C1. */ + (simplify + (cmp (rdiv @0 REAL_CST@1) @2) + (switch + (if (real_less (&dconst0, TREE_REAL_CST_PTR (@1))) + (cmp @0 (mult @2 @1))) + (if (real_less (TREE_REAL_CST_PTR (@1), &dconst0)) + (neg_cmp @0 (mult @2 @1))))) + + /* Simplify (x * C1) cmp C2 -> x cmp C2 / C1, where C1 != 0. */ + (simplify + (cmp (mult @0 REAL_CST@1) REAL_CST@2) + (switch + (if (real_less (&dconst0, TREE_REAL_CST_PTR (@1))) + (cmp @0 (rdiv @2 @1))) + (if (real_less (TREE_REAL_CST_PTR (@1), &dconst0)) + (neg_cmp @0 (rdiv @2 @1)))))) + /* Simplify sqrt(x) * sqrt(y) -> sqrt(x*y). */ (for root (SQRT CBRT) (simplify diff --git a/gcc/testsuite/gcc.dg/associate_comparison_1.c b/gcc/testsuite/gcc.dg/associate_comparison_1.c new file mode 100644 index 0000000000000000000000000000000000000000..d0a197b48d660d3f2e9fabda855670ba477afc4d --- /dev/null +++ b/gcc/testsuite/gcc.dg/associate_comparison_1.c @@ -0,0 +1,22 @@ +/* { dg-do compile } */ +/* { dg-options "-funsafe-math-optimizations -fdump-tree-optimized" } */ + +int +lt_cmp (float x, float y) +{ + return x / 2 <= 0; +} + +int lt_cmp_2 (float x, float y) +{ + return x / 3 <= y; +} + +int +inv_cmp (float x, float y) +{ + return 5 / x >= 0; +} + + +/* { dg-final { scan-tree-dump-not " / " "optimized" } } */ diff --git a/gcc/testsuite/gcc.dg/extract_recip_1.c b/gcc/testsuite/gcc.dg/extract_recip_1.c new file mode 100644 index 0000000000000000000000000000000000000000..54556587889848f6da090207690608e34e19d407 --- /dev/null +++ b/gcc/testsuite/gcc.dg/extract_recip_1.c @@ -0,0 +1,21 @@ +/* { dg-do compile } */ +/* { dg-options "-Ofast -fdump-tree-optimized" } */ + +float +neg_recip (float x, float y, float z) +{ + return (x / y) + (z / -y); +} + +float +const_divisor (float x, float y, float z) +{ + return x / (y * 3.0f) + z / y; +} + +/* 0 multiplications in 'neg_recip', 1 multiplcations in + 'const_divisor' expected. */ +/* { dg-final { scan-tree-dump-times " \\* " 1 "optimized" } } */ + +/* 1 division in 'neg_recip', 1 division in 'const_divisor'. */ +/* { dg-final { scan-tree-dump-times " / " 2 "optimized" } } */ diff --git a/gcc/testsuite/gcc.dg/extract_recip_2.c b/gcc/testsuite/gcc.dg/extract_recip_2.c new file mode 100644 index 0000000000000000000000000000000000000000..1f9c1741ce980f1148016e37399134aa8a619b1d --- /dev/null +++ b/gcc/testsuite/gcc.dg/extract_recip_2.c @@ -0,0 +1,13 @@ +/* { dg-do compile } */ +/* { dg-options "-Ofast -fdump-tree-optimized" } */ + +float +neg_mov (float w, float x, float *y, float *z) +{ + *z = x / (- w); + *y = 3 / w; + + return 5 / w; +} + +/* { dg-final { scan-tree-dump-times " / " 1 "optimized" } } */ ^ permalink raw reply [flat|nested] 22+ messages in thread

*Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)2017-09-06 9:55 ` Jackson Woodruff@ 2017-09-13 18:37 ` Jeff Law2017-09-13 21:20 ` Wilco Dijkstra 0 siblings, 1 reply; 22+ messages in thread From: Jeff Law @ 2017-09-13 18:37 UTC (permalink / raw) To: Jackson Woodruff, Richard Biener Cc: Wilco Dijkstra, kyrylo.tkachov, Joseph S. Myers, GCC Patches, nd On 09/06/2017 03:55 AM, Jackson Woodruff wrote: > On 08/30/2017 01:46 PM, Richard Biener wrote: >> On Wed, Aug 30, 2017 at 11:46 AM, Jackson Woodruff >> <jackson.woodruff@foss.arm.com> wrote: >>> On 08/29/2017 01:13 PM, Richard Biener wrote: >>>> >>>> On Tue, Aug 29, 2017 at 1:35 PM, Jackson Woodruff >>>> <jackson.woodruff@foss.arm.com> wrote: >>>>> >>>>> Hi all, >>>>> >>>>> Apologies again to those CC'ed, who (again) received this twice. >>>>> >>>>> Joseph: Yes you are correct. I misread the original thread, now fixed. >>>>> >>>>> Richard: I've moved the optimizations out of fold-const.c. One has >>>>> been >>>>> replicated in match.pd, and the other (x / C +- y / C -> (x +- y) / C) >>>>> I've >>>>> deleted as it only introduced a new optimization when running with the >>>>> flags >>>>> '-O0 -funsafe-math-optimizations'. >>>> >>>> >>>> Hmm, how did you verify that, that it only adds sth with -O0 >>>> -funsafe-math-optimizations? >>> >>> >>> By checking with various flags, although not exhaustively. I looked for >>> reasons for the behavior in match.pd (explained below). >>> >>> I have also since discovered that the combinations of >>> '-funsafe-math-optimizations -frounding-math' (at all levels) and >>> '-fno-recriprocal-math -funsafe-math-optimizations' mean this pattern >>> adds >>> something. >>> >>>> Is that because in GIMPLE the reassoc pass should do this transform? >>> >>> It's because the pattern that changes (X / C) -> X * (1 / C) is gated >>> with >>> O1: >>> >>> (for cst (REAL_CST COMPLEX_CST VECTOR_CST) >>> (simplify >>> (rdiv @0 cst@1) >>> -> (if (optimize) >>> -> (if (flag_reciprocal_math >>> && !real_zerop (@1)) >>> (with >>> { tree tem = const_binop (RDIV_EXPR, type, build_one_cst >>> (type), @1); >>> } >>> (if (tem) >>> (mult @0 { tem; } ))) >>> (if (cst != COMPLEX_CST) >>> (with { tree inverse = exact_inverse (type, @1); } >>> (if (inverse) >>> (mult @0 { inverse; } )))))))) >>> >>> >>> I've flagged the two lines that are particularly relevant to this. >> >> So this means we go x / (C * y) -> (x / C) / y -> (x * (1/C)) / y >> why's that in any way preferable? I suppose this is again to enable >> the recip pass to detect / y (as opposed to / (C * y))? What's the >> reason to believe that / y is more "frequent"? >> >>> Removing this pattern, as I would expect, means that the divisions in >>> the >>> above optimization (and the one further down) are not removed. >>> >>> So then there is the question of edge cases. This pattern is >>> (ignoring the >>> second case) going to fail when const_binop returns null. Looking >>> through >>> that function says that it will fail (for reals) when: >>> >>> - Either argument is null (not the case) >>> >>> - The operation is not in the list (PLUS_EXPR, >>> MINUS_EXPR, MULT_EXPR, RDIV_EXPR, MIN_EXPR, MAX_EXPR) >>> (again not the case) >>> >>> - We honor Signalling NaNs and one of the operands is a sNaN. >>> >>> - The operation is a division, and the second argument is zero >>> and dividing by 0.0 raises an exception. >>> >>> - The result is infinity and neither of the operands were infinity >>> and flag_trapping_math is set. >>> >>> - The result isn't exact and flag_rounding_math is set. >>> >>> >>> For (x / ( y * C) -> (x / C) / y), I will add some gating for each of >>> these >>> so that the pattern is never executed if one of these would be the case. >> >> Why not transform this directly to (x * (1/C)) / y then (and only then)? >> That makes it obvious not two divisions prevail. > > Done. > >> >> That said, I'm questioning this canonicalization. I can come up with a >> testcase where it makes things worse: >> >> tem = x / (y * C); >> tem2 = z / (y * C); >> >> should generate >> >> rdivtmp = 1 / (y*C); >> tem = x *rdivtmp; >> tem2= z * rdivtmp; >> >> instead of >> >> rdivtmp = 1/y; >> tem = x * 1/C * rdivtmp; >> tem2 = z * 1/C * rdivtmp; > > Ideally we would be able to CSE that into > > rdivtmp = 1/y * 1/C; > tem = x * rdivtmp; > tem2 = z * rdivtmp; So why is your sequence significantly better than Richi's desired seqeuence? They both seem to need 3 mults and a division (which in both cases might be a reciprocal estimation). In Richi's sequence we have to mult and feed the result as an operand into the reciprocal insn. In yours we feed the result of the reciprocal into the multiply. ISTM in the end if y*C or 1/(y*C) is a CSE, then Richi's sequence wins. Similarly if 1/y is a CSE, then yours wins. Is there some reason to believe that one is a more likely CSE than the other? > Although we currently do not. An equally (perhaps more?) problematic > case is something like: > > tem = x / (y * C) > tem2 = y * C > > Which becomes: > > tem = x * (1 / C) / y > tem2 = y * C > > Instead of > > K = y * C > tem = x / K > tem2 = K > > Which ultimately requires context awareness to avoid. This does seem to > be a general problem with a large number of match.pd patterns rather > than anything specific to this one. For example, a similar example can > be constructed for (say) (A / B) / C -> (A / (B * C)). I think there's a fundamental phase ordering problem here. You want to CSE stuff as much as possible, then expose reciprocals, then CSE again because exposing reciprocals can expose new CSE opportunities. And I suspect that no matter how hard we try, there's going to be cases that get exposed by various transformations in the pipeline such that to fully optimize the cse - reciprocal - cse sequence would need to be repeated to fully optimize. We may have to live with not being particularly good at picking up the those second order effects. I do think that the need for cse - reciprocal - cse sequencing suggests that match.pd may not be the right place for these transformations. I think Richi has pointed this out a couple times already. I'm not going to accept or reject at this time. I think we need to make a higher level decision. Are these transformations better suited for match.pd or the reciprocal transformation pass? I realize that some patterns are already in match.pd, but let's try to settle the higher level issue before we add more. jeff ^ permalink raw reply [flat|nested] 22+ messages in thread

*Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)2017-09-13 18:37 ` Jeff Law@ 2017-09-13 21:20 ` Wilco Dijkstra2017-09-15 12:07 ` Richard Biener 2017-09-15 16:50 ` Jeff Law 0 siblings, 2 replies; 22+ messages in thread From: Wilco Dijkstra @ 2017-09-13 21:20 UTC (permalink / raw) To: Jeff Law, Jackson Woodruff, Richard Biener Cc: kyrylo.tkachov, Joseph S. Myers, GCC Patches, nd Jeff Law wrote: > On 09/06/2017 03:55 AM, Jackson Woodruff wrote: > > On 08/30/2017 01:46 PM, Richard Biener wrote: >>> rdivtmp = 1 / (y*C); >>> tem = x *rdivtmp; >>> tem2= z * rdivtmp; >>> >>> instead of >>> >>> rdivtmp = 1/y; >>> tem = x * 1/C * rdivtmp; >>> tem2 = z * 1/C * rdivtmp; >> >> Ideally we would be able to CSE that into >> >> rdivtmp = 1/y * 1/C; >> tem = x * rdivtmp; >> tem2 = z * rdivtmp; > So why is your sequence significantly better than Richi's desired > seqeuence? They both seem to need 3 mults and a division (which in both > cases might be a reciprocal estimation). In Richi's sequence we have > to mult and feed the result as an operand into the reciprocal insn. In > yours we feed the result of the reciprocal into the multiply. Basically this stuff happens a lot in real code, which is exactly why I proposed it. I even provided counts of how many divisions each transformation avoids: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026. Note this transformation is a canonicalization - if you can't merge a constant somehow, moving it out to the LHS can expose more opportunities, like in (C1 * x) / (C2 * y) -> (C1 * x * C2) / y -> (C3 * x) / y. Same for negation as it behaves like * -1. The key here is that it is at least an order of magnitude worse if you have to execute an extra division than if you have an extra multiply. > ISTM in the end if y*C or 1/(y*C) is a CSE, then Richi's sequence wins. > Similarly if 1/y is a CSE, then yours wins. Is there some reason to > believe that one is a more likely CSE than the other? The idea is that 1/y is much more likely a CSE than 1/(y*C). We could make the pattern only fire in single use cases and see whether that makes a difference. It would be easy to test old vs new vs single-use new and count how many divisions we end up with. We could add 1/ (y * C) to the reciprocal phase if it is unacceptable as a canonicalization, but then you won't be able to optimize (C1 * x * C2) / y. > I think there's a fundamental phase ordering problem here. You want to > CSE stuff as much as possible, then expose reciprocals, then CSE again > because exposing reciprocals can expose new CSE opportunities. I agree there are phase ordering issues and various problems in reassociation, CSE and division optimizations not being able to find and optimize complex cases that are worthwhile. However I don't agree doing CSE before reciprocals is a good idea. We want to expose reciprocals early on, even if that means we find fewer CSEs as a result - again because division is so much more expensive than any other operation. CSE is generally not smart enough to CSE a * x in a * b * x and a * c * x, something which is likely to happen quite frequently - unlike the made up division examples here. > And I suspect that no matter how hard we try, there's going to be cases > that get exposed by various transformations in the pipeline such that to > fully optimize the cse - reciprocal - cse sequence would need to be > repeated to fully optimize. We may have to live with not being > particularly good at picking up the those second order effects. > > I do think that the need for cse - reciprocal - cse sequencing suggests > that match.pd may not be the right place for these transformations. I > think Richi has pointed this out a couple times already. I don't think you can ever expect to find the optimal case by repeating optimizations. It's quite easy to construct examples where first doing CSE makes things significantly worse. Ultimately to get something optimal you'd need to try lots of permutations and count for each possible permutation how many multiplies and divisons you end up after full optimization. Quite impractical... > I'm not going to accept or reject at this time. I think we need to make > a higher level decision. Are these transformations better suited for > match.pd or the reciprocal transformation pass? I realize that some > patterns are already in match.pd, but let's try to settle the higher > level issue before we add more. The first question is whether you see it as a canonicalization. If so, then match.pd should be fine. Wilco ^ permalink raw reply [flat|nested] 22+ messages in thread

*Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)2017-09-13 21:20 ` Wilco Dijkstra@ 2017-09-15 12:07 ` Richard Biener2017-09-15 16:50 ` Jeff Law 1 sibling, 0 replies; 22+ messages in thread From: Richard Biener @ 2017-09-15 12:07 UTC (permalink / raw) To: Wilco Dijkstra Cc: Jeff Law, Jackson Woodruff, kyrylo.tkachov, Joseph S. Myers, GCC Patches, nd On Wed, Sep 13, 2017 at 11:20 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote: > Jeff Law wrote: >> On 09/06/2017 03:55 AM, Jackson Woodruff wrote: >> > On 08/30/2017 01:46 PM, Richard Biener wrote: > >>>> rdivtmp = 1 / (y*C); >>>> tem = x *rdivtmp; >>>> tem2= z * rdivtmp; >>>> >>>> instead of >>>> >>>> rdivtmp = 1/y; >>>> tem = x * 1/C * rdivtmp; >>>> tem2 = z * 1/C * rdivtmp; >>> >>> Ideally we would be able to CSE that into >>> >>> rdivtmp = 1/y * 1/C; >>> tem = x * rdivtmp; >>> tem2 = z * rdivtmp; >> So why is your sequence significantly better than Richi's desired >> seqeuence? They both seem to need 3 mults and a division (which in both >> cases might be a reciprocal estimation). In Richi's sequence we have >> to mult and feed the result as an operand into the reciprocal insn. In >> yours we feed the result of the reciprocal into the multiply. > > Basically this stuff happens a lot in real code, which is exactly why I proposed it. > I even provided counts of how many divisions each transformation avoids: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026. > > Note this transformation is a canonicalization - if you can't merge a > constant somehow, moving it out to the LHS can expose more > opportunities, like in (C1 * x) / (C2 * y) -> (C1 * x * C2) / y -> (C3 * x) / y. > Same for negation as it behaves like * -1. > > The key here is that it is at least an order of magnitude worse if you have > to execute an extra division than if you have an extra multiply. > >> ISTM in the end if y*C or 1/(y*C) is a CSE, then Richi's sequence wins. >> Similarly if 1/y is a CSE, then yours wins. Is there some reason to >> believe that one is a more likely CSE than the other? > > The idea is that 1/y is much more likely a CSE than 1/(y*C). > > We could make the pattern only fire in single use cases and see whether > that makes a difference. It would be easy to test old vs new vs single-use > new and count how many divisions we end up with. We could add 1/ (y * C) > to the reciprocal phase if it is unacceptable as a canonicalization, but then > you won't be able to optimize (C1 * x * C2) / y. > >> I think there's a fundamental phase ordering problem here. You want to >> CSE stuff as much as possible, then expose reciprocals, then CSE again >> because exposing reciprocals can expose new CSE opportunities. > > I agree there are phase ordering issues and various problems in > reassociation, CSE and division optimizations not being able to find and > optimize complex cases that are worthwhile. > > However I don't agree doing CSE before reciprocals is a good idea. We > want to expose reciprocals early on, even if that means we find fewer > CSEs as a result - again because division is so much more expensive than > any other operation. CSE is generally not smart enough to CSE a * x in > a * b * x and a * c * x, something which is likely to happen quite frequently - > unlike the made up division examples here. > >> And I suspect that no matter how hard we try, there's going to be cases >> that get exposed by various transformations in the pipeline such that to >> fully optimize the cse - reciprocal - cse sequence would need to be >> repeated to fully optimize. We may have to live with not being >> particularly good at picking up the those second order effects. >> >> I do think that the need for cse - reciprocal - cse sequencing suggests >> that match.pd may not be the right place for these transformations. I >> think Richi has pointed this out a couple times already. > > I don't think you can ever expect to find the optimal case by repeating > optimizations. It's quite easy to construct examples where first doing CSE > makes things significantly worse. Ultimately to get something optimal you'd > need to try lots of permutations and count for each possible permutation > how many multiplies and divisons you end up after full optimization. > Quite impractical... > >> I'm not going to accept or reject at this time. I think we need to make >> a higher level decision. Are these transformations better suited for >> match.pd or the reciprocal transformation pass? I realize that some >> patterns are already in match.pd, but let's try to settle the higher >> level issue before we add more. > > The first question is whether you see it as a canonicalization. If so, then > match.pd should be fine. A canonicalization to more divisions is not fine. That is, if we think that x / (C * y) is non-canonical because constant parts should be in the denominator then fine, canonicalize it as (x * C') / y with C' = 1/C. But then implement it so, not in a pattern that suggests you'll end up with two divisions. Let me repeat though that this looks like a job for re-association. From that perspective the part of the patch doing + /* Simplify x / (C * y) to (x * (1 / C)) / y where C is a constant. */ + (if (optimize) + (simplify + (rdiv @0 + (mult @1 REAL_CST@2)) + (if (!real_zerop (@1)) + (with + { tree tem = const_binop (RDIV_EXPR, type, build_one_cst (type), @2); } + (if (tem) + (rdiv (mult @0 { tem; } ) @1)))))) looks ok apart from the if (optimize) check and the real_zerop (@1) check which should test @2. _please_ split the patch up. +/* Simplify x / (- y) to -x / y. */ +(simplify + (rdiv @0 (negate @1)) + (rdiv (negate @0) @1)) I think the comments on both patterns should say 'Canonicalize' instead of 'Simplify' as those are not simplifications, they are even Canonicalizations that change the value in one case (which usually makes canonicalizations questionable IMHO given nothign guarantees the followup optimziation they should enable). Richard. > Wilco ^ permalink raw reply [flat|nested] 22+ messages in thread

*Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)2017-09-13 21:20 ` Wilco Dijkstra 2017-09-15 12:07 ` Richard Biener@ 2017-09-15 16:50 ` Jeff Law2017-09-16 21:39 ` Bernhard Reutner-Fischer 1 sibling, 1 reply; 22+ messages in thread From: Jeff Law @ 2017-09-15 16:50 UTC (permalink / raw) To: Wilco Dijkstra, Jackson Woodruff, Richard Biener Cc: kyrylo.tkachov, Joseph S. Myers, GCC Patches, nd On 09/13/2017 03:20 PM, Wilco Dijkstra wrote: > Jeff Law wrote: >> On 09/06/2017 03:55 AM, Jackson Woodruff wrote: >>> On 08/30/2017 01:46 PM, Richard Biener wrote: > >>>> rdivtmp = 1 / (y*C); >>>> tem = x *rdivtmp; >>>> tem2= z * rdivtmp; >>>> >>>> instead of >>>> >>>> rdivtmp = 1/y; >>>> tem = x * 1/C * rdivtmp; >>>> tem2 = z * 1/C * rdivtmp; >>> >>> Ideally we would be able to CSE that into >>> >>> rdivtmp = 1/y * 1/C; >>> tem = x * rdivtmp; >>> tem2 = z * rdivtmp; >> So why is your sequence significantly better than Richi's desired >> seqeuence? They both seem to need 3 mults and a division (which in both >> cases might be a reciprocal estimation). In Richi's sequence we have >> to mult and feed the result as an operand into the reciprocal insn. In >> yours we feed the result of the reciprocal into the multiply. > > Basically this stuff happens a lot in real code, which is exactly why I proposed it. > I even provided counts of how many divisions each transformation avoids: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026. I don't doubt that it happens in real code. There's a reason why MIPS IV added recip and rsqrt 20 years ago. Our ability to exploit them has always been fairly limited though. What I'm trying to avoid is a transformation where the two forms are roughly equal in terms of what they expose vs what they hide. If pulling the 1/c out is consistently better, then that's obviously good. If it's essentially a toss-up because of the other interactions with CSE, then we need to think about it more deeply. I _suspect_ that pulling the 1/c out is generally better, but something more concrete than my intuition would be helpful. > > Note this transformation is a canonicalization - if you can't merge a > constant somehow, moving it out to the LHS can expose more > opportunities, like in (C1 * x) / (C2 * y) -> (C1 * x * C2) / y -> (C3 * x) / y. > Same for negation as it behaves like * -1. FWIW, I generally agree. > > The key here is that it is at least an order of magnitude worse if you have > to execute an extra division than if you have an extra multiply. No doubt. I'll trade a division for a multiply any day. Similarly 1/C is just a constant, so I consider it essentially free. > >> ISTM in the end if y*C or 1/(y*C) is a CSE, then Richi's sequence wins. >> Similarly if 1/y is a CSE, then yours wins. Is there some reason to >> believe that one is a more likely CSE than the other? > > The idea is that 1/y is much more likely a CSE than 1/(y*C). Do we have anything other than intuition to back that up? > > We could make the pattern only fire in single use cases and see whether > that makes a difference. It would be easy to test old vs new vs single-use > new and count how many divisions we end up with. We could add 1/ (y * C) > to the reciprocal phase if it is unacceptable as a canonicalization, but then > you won't be able to optimize (C1 * x * C2) / y. We could. I think the question would then become does restricting to the single-use case kill too many opportunities. Sigh. I think the 4 of us could go round and round on this forever in the pursuit of the perfect code. Though in reality I'm happy with a clear improvement. > >> I think there's a fundamental phase ordering problem here. You want to >> CSE stuff as much as possible, then expose reciprocals, then CSE again >> because exposing reciprocals can expose new CSE opportunities. > > I agree there are phase ordering issues and various problems in > reassociation, CSE and division optimizations not being able to find and > optimize complex cases that are worthwhile. > > However I don't agree doing CSE before reciprocals is a good idea. We > want to expose reciprocals early on, even if that means we find fewer > CSEs as a result - again because division is so much more expensive than > any other operation. CSE is generally not smart enough to CSE a * x in > a * b * x and a * c * x, something which is likely to happen quite frequently - > unlike the made up division examples here. We have much stronger reassociation capabilities for multiplication -- it's a well understood problem and if we fail to pick up a * x across those two kind of expressions, I'd consider our reassociation code as failing pretty badly, particularly for integer types. BUt yes, division is expensive. And I'm all for a tranformation that turns a division into a multiplication. That's almost always a win. > > The first question is whether you see it as a canonicalization. If so, then > match.pd should be fine. Pulling the constant part out of the denominator and turning it into a multiplication by the recip constant should likely be considered canonicalization. I think Richi largely agreed to this in the thread as well and asked for that hunk of the patch to be extracted and submitted independent of the other changes so that it could go ahead and move forward while we figure out the rest. Note that tree-ssa-reassoc.c has a fair amount of this kind of canonicalization. In fact, I think that's where we handle things like pulling out negations. You may actually do better handling it there. Jeff ^ permalink raw reply [flat|nested] 22+ messages in thread

*Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)2017-09-15 16:50 ` Jeff Law@ 2017-09-16 21:39 ` Bernhard Reutner-Fischer2017-10-13 18:18 ` Jeff Law 0 siblings, 1 reply; 22+ messages in thread From: Bernhard Reutner-Fischer @ 2017-09-16 21:39 UTC (permalink / raw) To: gcc-patches, Jeff Law, Wilco Dijkstra, Jackson Woodruff, Richard Biener Cc: kyrylo.tkachov, Joseph S. Myers, GCC Patches, nd On 15 September 2017 18:50:26 CEST, Jeff Law <law@redhat.com> wrote: >On 09/13/2017 03:20 PM, Wilco Dijkstra wrote: >> Jeff Law wrote: >>> On 09/06/2017 03:55 AM, Jackson Woodruff wrote: >>>> On 08/30/2017 01:46 PM, Richard Biener wrote: >> >>>>> rdivtmp = 1 / (y*C); >>>>> tem = x *rdivtmp; >>>>> tem2= z * rdivtmp; >>>>> >>>>> instead of >>>>> >>>>> rdivtmp = 1/y; >>>>> tem = x * 1/C * rdivtmp; >>>>> tem2 = z * 1/C * rdivtmp; >>>> >>>> Ideally we would be able to CSE that into >>>> >>>> rdivtmp = 1/y * 1/C; >>>> tem = x * rdivtmp; >>>> tem2 = z * rdivtmp; >>> So why is your sequence significantly better than Richi's desired >>> seqeuence? They both seem to need 3 mults and a division (which in >both >>> cases might be a reciprocal estimation). In Richi's sequence we >have >>> to mult and feed the result as an operand into the reciprocal insn. >In >>> yours we feed the result of the reciprocal into the multiply. >> >> Basically this stuff happens a lot in real code, which is exactly why >I proposed it. >> I even provided counts of how many divisions each transformation >avoids: >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026. >I don't doubt that it happens in real code. There's a reason why MIPS >IV added recip and rsqrt 20 years ago. Our ability to exploit them has >always been fairly limited though. > >What I'm trying to avoid is a transformation where the two forms are >roughly equal in terms of what they expose vs what they hide. > >If pulling the 1/c out is consistently better, then that's obviously >good. If it's essentially a toss-up because of the other interactions >with CSE, then we need to think about it more deeply. > >I _suspect_ that pulling the 1/c out is generally better, but something >more concrete than my intuition would be helpful. > > > > >> >> Note this transformation is a canonicalization - if you can't merge a >> constant somehow, moving it out to the LHS can expose more >> opportunities, like in (C1 * x) / (C2 * y) -> (C1 * x * C2) / y -> >(C3 * x) / y. >> Same for negation as it behaves like * -1. >FWIW, I generally agree. > >> >> The key here is that it is at least an order of magnitude worse if >you have >> to execute an extra division than if you have an extra multiply. >No doubt. I'll trade a division for a multiply any day. Similarly 1/C >is just a constant, so I consider it essentially free. > > >> >>> ISTM in the end if y*C or 1/(y*C) is a CSE, then Richi's sequence >wins. >>> Similarly if 1/y is a CSE, then yours wins. Is there some reason >to >>> believe that one is a more likely CSE than the other? >> >> The idea is that 1/y is much more likely a CSE than 1/(y*C). >Do we have anything other than intuition to back that up? >> >> We could make the pattern only fire in single use cases and see >whether >> that makes a difference. It would be easy to test old vs new vs >single-use >> new and count how many divisions we end up with. We could add 1/ (y * >C) >> to the reciprocal phase if it is unacceptable as a canonicalization, >but then >> you won't be able to optimize (C1 * x * C2) / y. >We could. I think the question would then become does restricting to >the single-use case kill too many opportunities. > >Sigh. I think the 4 of us could go round and round on this forever in >the pursuit of the perfect code. Though in reality I'm happy with a >clear improvement. Btw reminds me a little bit of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=28417 which IIRC never was applied. Maybe someone could talk Denys (a colleague of yours nowadays, Jeff) into submitting a real patch? ;) Cheers, > >> >>> I think there's a fundamental phase ordering problem here. You want >to >>> CSE stuff as much as possible, then expose reciprocals, then CSE >again >>> because exposing reciprocals can expose new CSE opportunities. >> >> I agree there are phase ordering issues and various problems in >> reassociation, CSE and division optimizations not being able to find >and >> optimize complex cases that are worthwhile. >> >> However I don't agree doing CSE before reciprocals is a good idea. We >> want to expose reciprocals early on, even if that means we find fewer >> CSEs as a result - again because division is so much more expensive >than >> any other operation. CSE is generally not smart enough to CSE a * x >in >> a * b * x and a * c * x, something which is likely to happen quite >frequently - >> unlike the made up division examples here. >We have much stronger reassociation capabilities for multiplication -- >it's a well understood problem and if we fail to pick up a * x across >those two kind of expressions, I'd consider our reassociation code as >failing pretty badly, particularly for integer types. > >BUt yes, division is expensive. And I'm all for a tranformation that >turns a division into a multiplication. That's almost always a win. > >> >> The first question is whether you see it as a canonicalization. If >so, then >> match.pd should be fine. >Pulling the constant part out of the denominator and turning it into a >multiplication by the recip constant should likely be considered >canonicalization. I think Richi largely agreed to this in the thread >as >well and asked for that hunk of the patch to be extracted and submitted >independent of the other changes so that it could go ahead and move >forward while we figure out the rest. > >Note that tree-ssa-reassoc.c has a fair amount of this kind of >canonicalization. In fact, I think that's where we handle things like >pulling out negations. You may actually do better handling it there. > > > >Jeff ^ permalink raw reply [flat|nested] 22+ messages in thread

*Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)2017-09-16 21:39 ` Bernhard Reutner-Fischer@ 2017-10-13 18:18 ` Jeff Law2017-10-13 18:53 ` Wilco Dijkstra 0 siblings, 1 reply; 22+ messages in thread From: Jeff Law @ 2017-10-13 18:18 UTC (permalink / raw) To: Bernhard Reutner-Fischer, gcc-patches, Wilco Dijkstra, Jackson Woodruff, Richard Biener Cc: kyrylo.tkachov, Joseph S. Myers, nd On 09/16/2017 03:39 PM, Bernhard Reutner-Fischer wrote: > On 15 September 2017 18:50:26 CEST, Jeff Law <law@redhat.com> wrote: >> On 09/13/2017 03:20 PM, Wilco Dijkstra wrote: >>> Jeff Law wrote: >>>> On 09/06/2017 03:55 AM, Jackson Woodruff wrote: >>>>> On 08/30/2017 01:46 PM, Richard Biener wrote: >>> >>>>>> rdivtmp = 1 / (y*C); >>>>>> tem = x *rdivtmp; >>>>>> tem2= z * rdivtmp; >>>>>> >>>>>> instead of >>>>>> >>>>>> rdivtmp = 1/y; >>>>>> tem = x * 1/C * rdivtmp; >>>>>> tem2 = z * 1/C * rdivtmp; >>>>> >>>>> Ideally we would be able to CSE that into >>>>> >>>>> rdivtmp = 1/y * 1/C; >>>>> tem = x * rdivtmp; >>>>> tem2 = z * rdivtmp; >>>> So why is your sequence significantly better than Richi's desired >>>> seqeuence? They both seem to need 3 mults and a division (which in >> both >>>> cases might be a reciprocal estimation). In Richi's sequence we >> have >>>> to mult and feed the result as an operand into the reciprocal insn. >> In >>>> yours we feed the result of the reciprocal into the multiply. >>> >>> Basically this stuff happens a lot in real code, which is exactly why >> I proposed it. >>> I even provided counts of how many divisions each transformation >> avoids: >>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026. >> I don't doubt that it happens in real code. There's a reason why MIPS >> IV added recip and rsqrt 20 years ago. Our ability to exploit them has >> always been fairly limited though. >> >> What I'm trying to avoid is a transformation where the two forms are >> roughly equal in terms of what they expose vs what they hide. >> >> If pulling the 1/c out is consistently better, then that's obviously >> good. If it's essentially a toss-up because of the other interactions >> with CSE, then we need to think about it more deeply. >> >> I _suspect_ that pulling the 1/c out is generally better, but something >> more concrete than my intuition would be helpful. >> >> >> >> >>> >>> Note this transformation is a canonicalization - if you can't merge a >>> constant somehow, moving it out to the LHS can expose more >>> opportunities, like in (C1 * x) / (C2 * y) -> (C1 * x * C2) / y -> >> (C3 * x) / y. >>> Same for negation as it behaves like * -1. >> FWIW, I generally agree. >> >>> >>> The key here is that it is at least an order of magnitude worse if >> you have >>> to execute an extra division than if you have an extra multiply. >> No doubt. I'll trade a division for a multiply any day. Similarly 1/C >> is just a constant, so I consider it essentially free. >> >> >>> >>>> ISTM in the end if y*C or 1/(y*C) is a CSE, then Richi's sequence >> wins. >>>> Similarly if 1/y is a CSE, then yours wins. Is there some reason >> to >>>> believe that one is a more likely CSE than the other? >>> >>> The idea is that 1/y is much more likely a CSE than 1/(y*C). >> Do we have anything other than intuition to back that up? >>> >>> We could make the pattern only fire in single use cases and see >> whether >>> that makes a difference. It would be easy to test old vs new vs >> single-use >>> new and count how many divisions we end up with. We could add 1/ (y * >> C) >>> to the reciprocal phase if it is unacceptable as a canonicalization, >> but then >>> you won't be able to optimize (C1 * x * C2) / y. >> We could. I think the question would then become does restricting to >> the single-use case kill too many opportunities. >> >> Sigh. I think the 4 of us could go round and round on this forever in >> the pursuit of the perfect code. Though in reality I'm happy with a >> clear improvement. > > > Btw reminds me a little bit of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=28417 > which IIRC never was applied. Maybe someone could talk Denys (a colleague of yours nowadays, Jeff) into submitting a real patch? ;) Denys has been a Red Hatter for a long time (approaching 10 years now I think). He's not in the compiler group, but is in the same organization as the compiler group. Tege hasn't worked on GCC regularly in years and Denys hasn't ever really engaged the GCC community all that well. I wouldn't expect 28417 to move forward without something other than Tege and Denys pushing on it. jeff ^ permalink raw reply [flat|nested] 22+ messages in thread

*Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)2017-10-13 18:18 ` Jeff Law@ 2017-10-13 18:53 ` Wilco Dijkstra0 siblings, 0 replies; 22+ messages in thread From: Wilco Dijkstra @ 2017-10-13 18:53 UTC (permalink / raw) To: Jeff Law, Bernhard Reutner-Fischer, gcc-patches, Jackson Woodruff, Richard Biener Cc: kyrylo.tkachov, Joseph S. Myers, nd Jeff Law wrote: >> Btw reminds me a little bit of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=28417 > I wouldn't expect 28417 > to move forward without something other than Tege and Denys pushing on it. Hmm that doesn't look optimal. You can typically do a multiply with the magic constant arranged in such a way that on 32-bit targets you don't need a shift at all. There isn't much to the proof either, code like this is always self-proving: it tries several possible magic constants until a solution is found that has no error for the max input. Given there are usually several valid solutions, you can then select the best shift/magic value combination. Wilco ^ permalink raw reply [flat|nested] 22+ messages in thread

end of thread, other threads:[~2017-10-13 18:52 UTC | newest]Thread overview:22+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-08-10 14:11 [PATCH] Factor out division by squares and remove division around comparisons (1/2) Jackson Woodruff 2017-08-10 15:26 ` Jackson Woodruff 2017-08-11 0:15 ` Joseph Myers 2017-08-15 14:22 ` Richard Biener 2017-08-15 14:43 ` Wilco Dijkstra 2017-08-17 10:20 ` Richard Biener 2017-08-17 10:29 ` Wilco Dijkstra 2017-08-17 10:39 ` Richard Biener 2017-08-17 12:46 ` Joseph Myers 2017-08-24 21:26 ` Jeff Law 2017-08-29 12:13 ` Jackson Woodruff 2017-08-29 13:31 ` Richard Biener 2017-08-30 10:45 ` Jackson Woodruff 2017-08-30 13:26 ` Richard Biener 2017-09-06 9:55 ` Jackson Woodruff 2017-09-13 18:37 ` Jeff Law 2017-09-13 21:20 ` Wilco Dijkstra 2017-09-15 12:07 ` Richard Biener 2017-09-15 16:50 ` Jeff Law 2017-09-16 21:39 ` Bernhard Reutner-Fischer 2017-10-13 18:18 ` Jeff Law 2017-10-13 18:53 ` Wilco Dijkstra

This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).