Re: [PATCH] Transform (m1 > m2) * d into m1> m2 ? d : 0

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* Re: [PATCH] Transform (m1 > m2) * d into m1> m2 ? d : 0
@ 2017-06-29 11:20 Wilco Dijkstra
  2017-06-29 11:41 ` Richard Biener
  2017-06-29 13:43 ` Jeff Law
  0 siblings, 2 replies; 10+ messages in thread
From: Wilco Dijkstra @ 2017-06-29 11:20 UTC (permalink / raw)
  To: Richard Biener, Naveen.Hurugalawadi; +Cc: nd, GCC Patches

Richard Biener wrote:
> Hurugalawadi, Naveen wrote:
> > The code (m1 > m2) * d code should be optimized as m1> m2 ? d : 0.

> What's the reason of this transform?  I expect that the HW multiplier
> is quite fast given one operand is either zero or one and a multiplication
> is a gimple operation that's better handled in optimizations than
> COND_EXPRs which eventually expand to conditional code which
> would be much slower.

Even really fast multipliers have several cycles latency, and this is generally
fixed irrespectively of the inputs. Maybe you were thinking about division?

Additionally integer multiply typically has much lower throughput than other 
ALU operations like conditional move - a modern CPU may have 4 ALUs
but only 1 multiplier, so removing redundant integer multiplies is always good.

Note (m1 > m2) is also a conditional expression which will result in branches
for floating point expressions and on some targets even for integers. Moving
the multiply into the conditional expression generates the best code:

Integer version:
f1:
	cmp    w0, 100
	csel   w0, w1, wzr, gt
	ret
f2:
	cmp    w0, 100
	cset   w0, gt
	mul    w0, w0, w1
	ret

Float version:
f3:
	movi   v1.2s, #0
	cmp    w0, 100
	fcsel  s0, s0, s1, gt
	ret
f4:
	cmp    w0, 100
	bgt    .L8
	movi   v1.2s, #0
	fmul   s0, s0, s1  // eh???
.L8:
	ret

Wilco

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Transform (m1 > m2) * d into m1> m2 ? d : 0
  2017-06-29 11:20 [PATCH] Transform (m1 > m2) * d into m1> m2 ? d : 0 Wilco Dijkstra
@ 2017-06-29 11:41 ` Richard Biener
  2017-06-29 15:10   ` Wilco Dijkstra
  2017-06-29 13:43 ` Jeff Law
  1 sibling, 1 reply; 10+ messages in thread
From: Richard Biener @ 2017-06-29 11:41 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: Naveen.Hurugalawadi, nd, GCC Patches

On Thu, Jun 29, 2017 at 1:20 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> Richard Biener wrote:
>> Hurugalawadi, Naveen wrote:
>> > The code (m1 > m2) * d code should be optimized as m1> m2 ? d : 0.
>
>> What's the reason of this transform?  I expect that the HW multiplier
>> is quite fast given one operand is either zero or one and a multiplication
>> is a gimple operation that's better handled in optimizations than
>> COND_EXPRs which eventually expand to conditional code which
>> would be much slower.
>
> Even really fast multipliers have several cycles latency, and this is generally
> fixed irrespectively of the inputs. Maybe you were thinking about division?
>
> Additionally integer multiply typically has much lower throughput than other
> ALU operations like conditional move - a modern CPU may have 4 ALUs
> but only 1 multiplier, so removing redundant integer multiplies is always good.
>
> Note (m1 > m2) is also a conditional expression which will result in branches
> for floating point expressions and on some targets even for integers. Moving
> the multiply into the conditional expression generates the best code:
>
> Integer version:
> f1:
>         cmp    w0, 100
>         csel   w0, w1, wzr, gt
>         ret
> f2:
>         cmp    w0, 100
>         cset   w0, gt
>         mul    w0, w0, w1
>         ret
>
> Float version:
> f3:
>         movi   v1.2s, #0
>         cmp    w0, 100
>         fcsel  s0, s0, s1, gt
>         ret
> f4:
>         cmp    w0, 100
>         bgt    .L8
>         movi   v1.2s, #0
>         fmul   s0, s0, s1  // eh???
> .L8:
>         ret

But then

int f (int m, int c)
{
  return (m & 1) * c;
}
int g (int m, int c)
{
  if (m & 1 != 0)
    return c;
  return 0;
}

f:
.LFB0:
        .cfi_startproc
        andl    $1, %edi
        movl    %edi, %eax
        imull   %esi, %eax
        ret
g:
.LFB1:
        .cfi_startproc
        movl    %edi, %eax
        andl    $1, %eax
        cmovne  %esi, %eax
        ret

anyway.  As a general comment to the patch please do it as
a pattern in match.pd

(match boolean_value_range_p
 @0
 (if (INTEGRAL_TYPE_P (type)
      && TYPE_PRECISION (type) == 1)))
(match boolean_value_range_p
 INTEGER_CST
 (if (integer_zerop (t) || integer_onep (t))))
(match boolean_value_range_p
 SSA_NAME
 (if (INTEGRAL_TYPE_P (type)
      && ~get_nonzero_bits (t) == 1)))

(simplify
 (mult:c boolean_value_range_p@0 @1)
 (cond @0 @1 @0))

or something like that.

Richard.

> Wilco

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Transform (m1 > m2) * d into m1> m2 ? d : 0
  2017-06-29 11:20 [PATCH] Transform (m1 > m2) * d into m1> m2 ? d : 0 Wilco Dijkstra
  2017-06-29 11:41 ` Richard Biener
@ 2017-06-29 13:43 ` Jeff Law
  1 sibling, 0 replies; 10+ messages in thread
From: Jeff Law @ 2017-06-29 13:43 UTC (permalink / raw)
  To: Wilco Dijkstra, Richard Biener, Naveen.Hurugalawadi; +Cc: nd, GCC Patches

On 06/29/2017 05:20 AM, Wilco Dijkstra wrote:
> Richard Biener wrote:
>> Hurugalawadi, Naveen wrote:
>>> The code (m1 > m2) * d code should be optimized as m1> m2 ? d : 0.
> 
>> What's the reason of this transform?  I expect that the HW multiplier
>> is quite fast given one operand is either zero or one and a multiplication
>> is a gimple operation that's better handled in optimizations than
>> COND_EXPRs which eventually expand to conditional code which
>> would be much slower.
> 
> Even really fast multipliers have several cycles latency, and this is generally
> fixed irrespectively of the inputs. Maybe you were thinking about division?
And on some targets, just getting the arguments into the right register
bank is many cycles.  Think HPPA where integer multiply occurs in the
floating point unit.  Though I don't think that oddity should drive this
discussion.


> 
> Additionally integer multiply typically has much lower throughput than other 
> ALU operations like conditional move - a modern CPU may have 4 ALUs
> but only 1 multiplier, so removing redundant integer multiplies is always good.
I'd tend to agree in general.

jeff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Transform (m1 > m2) * d into m1> m2 ? d : 0
  2017-06-29 11:41 ` Richard Biener
@ 2017-06-29 15:10   ` Wilco Dijkstra
  2017-06-30  8:16     ` Richard Biener
  0 siblings, 1 reply; 10+ messages in thread
From: Wilco Dijkstra @ 2017-06-29 15:10 UTC (permalink / raw)
  To: Richard Biener; +Cc: Naveen.Hurugalawadi, nd, GCC Patches

Richard Biener wrote:

> int f (int m, int c)
> {
>  return (m & 1) * c;
> }

This case (integer[0,1] rather than boolean input) should be transformed into c & -(m & 1).

Wilco

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Transform (m1 > m2) * d into m1> m2 ? d : 0
  2017-06-29 15:10   ` Wilco Dijkstra
@ 2017-06-30  8:16     ` Richard Biener
  2017-07-04 11:13       ` Hurugalawadi, Naveen
  0 siblings, 1 reply; 10+ messages in thread
From: Richard Biener @ 2017-06-30  8:16 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: Naveen.Hurugalawadi, nd, GCC Patches

On Thu, Jun 29, 2017 at 5:10 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> Richard Biener wrote:
>
>> int f (int m, int c)
>> {
>>  return (m & 1) * c;
>> }
>
> This case (integer[0,1] rather than boolean input) should be transformed into c & -(m & 1).

The proposed patch handled both the same.  This means the pattern
shouldn't use range-info
but instead match a more complex

(mult (convert (cmp @0 @1)) @2)

?  Note that from a gimple perspective c & -(m & 1) is more complex
than (m & 1) * c
and thus non-canonical.

> Wilco

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Transform (m1 > m2) * d into m1> m2 ? d : 0
  2017-06-30  8:16     ` Richard Biener
@ 2017-07-04 11:13       ` Hurugalawadi, Naveen
  2017-07-19  2:58         ` [PING} " Hurugalawadi, Naveen
  2017-07-19  6:09         ` Jeff Law
  0 siblings, 2 replies; 10+ messages in thread
From: Hurugalawadi, Naveen @ 2017-07-04 11:13 UTC (permalink / raw)
  To: Richard Biener, Wilco Dijkstra; +Cc: nd, GCC Patches

[-- Attachment #1: Type: text/plain, Size: 610 bytes --]

Hi,

Thanks for the review and comments on the patch.

>> The proposed patch handled both the same.  This means the pattern
>> shouldn't use range-info but instead match a more complex

The patch handles as per the discussion by matching the pattern
in match.pd.

Bootstrapped and Regression tested on AArch64 and X86_64.
Please review the patch and let us know if its okay?

Thanks,
Naveen

2017-07-04  Naveen H.S  <Naveen.Hurugalawadi@cavium.com>

gcc
	* match.pd (((m1 >/</>=/<= m2) * d -> (m1 >/</>=/<= m2) ? d : 0) New
	pattern.

gcc/testsuite
	* gcc.dg/tree-ssa/vrp116.c: New Test.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: mul-match.patch --]
[-- Type: text/x-diff; name="mul-match.patch", Size: 1030 bytes --]

diff --git a/gcc/match.pd b/gcc/match.pd
index 4c64b21..d914db1 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -1088,6 +1088,12 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
       && tree_nop_conversion_p (type, TREE_TYPE (@1)))
   (convert (bit_and (bit_not @1) @0))))
 
+/* (m1 CMP m2) * d -> (m1 CMP m2) ? d : 0  */
+(for cmp (gt lt ge le)
+(simplify
+ (mult (convert (cmp @0 @1)) @2)
+  (cond (cmp @0 @1) @2 { build_zero_cst (type); })))
+
 /* For integral types with undefined overflow and C != 0 fold
    x * C EQ/NE y * C into x EQ/NE y.  */
 (for cmp (eq ne)
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/vrp116.c b/gcc/testsuite/gcc.dg/tree-ssa/vrp116.c
new file mode 100644
index 0000000..d9d7b23
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/vrp116.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-vrp1" } */
+
+int
+f (int m1, int m2, int c)
+{
+  int d = m1 > m2;
+  int e = d * c;
+  return e ? m1 : m2;
+}
+
+/* { dg-final { scan-tree-dump-times "\\? c_\[0-9\]\\(D\\) : 0" 1 "vrp1" } } */

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PING} [PATCH] Transform (m1 > m2) * d into m1> m2 ? d : 0
  2017-07-04 11:13       ` Hurugalawadi, Naveen
@ 2017-07-19  2:58         ` Hurugalawadi, Naveen
  2017-07-19  6:09         ` Jeff Law
  1 sibling, 0 replies; 10+ messages in thread
From: Hurugalawadi, Naveen @ 2017-07-19  2:58 UTC (permalink / raw)
  To: Richard Biener, Wilco Dijkstra; +Cc: nd, GCC Patches

Hi,  

Please consider this as a personal reminder to review the patch
at following link and let me know your comments on the same.  

https://gcc.gnu.org/ml/gcc-patches/2017-07/msg00178.html

Thanks,
Naveen


    

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Transform (m1 > m2) * d into m1> m2 ? d : 0
  2017-07-04 11:13       ` Hurugalawadi, Naveen
  2017-07-19  2:58         ` [PING} " Hurugalawadi, Naveen
@ 2017-07-19  6:09         ` Jeff Law
  1 sibling, 0 replies; 10+ messages in thread
From: Jeff Law @ 2017-07-19  6:09 UTC (permalink / raw)
  To: Hurugalawadi, Naveen, Richard Biener, Wilco Dijkstra; +Cc: nd, GCC Patches

On 07/04/2017 05:13 AM, Hurugalawadi, Naveen wrote:
> Hi,
> 
> Thanks for the review and comments on the patch.
> 
>>> The proposed patch handled both the same.  This means the pattern
>>> shouldn't use range-info but instead match a more complex
> 
> The patch handles as per the discussion by matching the pattern
> in match.pd.
> 
> Bootstrapped and Regression tested on AArch64 and X86_64.
> Please review the patch and let us know if its okay?
> 
> Thanks,
> Naveen
> 
> 2017-07-04  Naveen H.S  <Naveen.Hurugalawadi@cavium.com>
> 
> gcc
> 	* match.pd (((m1 >/</>=/<= m2) * d -> (m1 >/</>=/<= m2) ? d : 0) New
> 	pattern.
> 
> gcc/testsuite
> 	* gcc.dg/tree-ssa/vrp116.c: New Test.
> 
OK for the trunk.  Sorry for the delay.

Presumably EQ/NE are handled by a different pattern?

Jeff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Transform (m1 > m2) * d into m1> m2 ? d : 0
  2017-06-29  5:06 Hurugalawadi, Naveen
@ 2017-06-29  9:08 ` Richard Biener
  0 siblings, 0 replies; 10+ messages in thread
From: Richard Biener @ 2017-06-29  9:08 UTC (permalink / raw)
  To: Hurugalawadi, Naveen; +Cc: gcc-patches

On Thu, Jun 29, 2017 at 7:06 AM, Hurugalawadi, Naveen
<Naveen.Hurugalawadi@cavium.com> wrote:
> Hi,
>
> The code (m1 > m2) * d code should be optimized as m1> m2 ? d : 0.
>
> The patch optimizes it inside tree-vrp.c when simplifying with the range
> inside simplify_stmt_using_ranges. If a multiply is found and either side
> has a range [0,1], then transform it.
>
> Ex:- d * c where d has a range of [0,1] transform it to be
> COND_EXPR(d != 0, c, 0).
> The other optimization passes should prop m1 > m2.
>
> Bootstrapped and Regression tested on AArch64 and X86_64.
> Please review the patch and let us know if its okay?

What's the reason of this transform?  I expect that the HW multiplier
is quite fast given one operand is either zero or one and a multiplication
is a gimple operation that's better handled in optimizations than
COND_EXPRs which eventually expand to conditional code which
would be much slower.

Thanks,
Richard.

> Thanks,
> Naveen
>
> 2017-06-28  Naveen H.S  <Naveen.Hurugalawadi@cavium.com>
>
> gcc
>         * tree-vrp.c (simplify_stmt_using_ranges): Add case for
>         optimizing a case of multiplication.
>         (simplify_mult_ops_using_ranges): New.
>
> gcc/testsuite
>         * gcc.dg/tree-ssa/vrp116.c: New Test.
>
>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH] Transform  (m1 > m2) * d into m1> m2 ? d : 0
@ 2017-06-29  5:06 Hurugalawadi, Naveen
  2017-06-29  9:08 ` Richard Biener
  0 siblings, 1 reply; 10+ messages in thread
From: Hurugalawadi, Naveen @ 2017-06-29  5:06 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 809 bytes --]

Hi, 

The code (m1 > m2) * d code should be optimized as m1> m2 ? d : 0.

The patch optimizes it inside tree-vrp.c when simplifying with the range
inside simplify_stmt_using_ranges. If a multiply is found and either side
has a range [0,1], then transform it.

Ex:- d * c where d has a range of [0,1] transform it to be
COND_EXPR(d != 0, c, 0).
The other optimization passes should prop m1 > m2.

Bootstrapped and Regression tested on AArch64 and X86_64.
Please review the patch and let us know if its okay?

Thanks,
Naveen

2017-06-28  Naveen H.S  <Naveen.Hurugalawadi@cavium.com>

gcc
	* tree-vrp.c (simplify_stmt_using_ranges): Add case for
	optimizing a case of multiplication.
	(simplify_mult_ops_using_ranges): New.   

gcc/testsuite
	* gcc.dg/tree-ssa/vrp116.c: New Test.



[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: mul-vrp.patch --]
[-- Type: text/x-diff; name="mul-vrp.patch", Size: 2329 bytes --]

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/vrp116.c b/gcc/testsuite/gcc.dg/tree-ssa/vrp116.c
new file mode 100644
index 0000000..d9d7b23
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/vrp116.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-vrp1" } */
+
+int
+f (int m1, int m2, int c)
+{
+  int d = m1 > m2;
+  int e = d * c;
+  return e ? m1 : m2;
+}
+
+/* { dg-final { scan-tree-dump-times "\\? c_\[0-9\]\\(D\\) : 0" 1 "vrp1" } } */
diff --git a/gcc/tree-vrp.c b/gcc/tree-vrp.c
index 9ca3924..291b87f 100644
--- a/gcc/tree-vrp.c
+++ b/gcc/tree-vrp.c
@@ -9146,6 +9146,46 @@ vrp_visit_phi_node (gphi *phi)
   return SSA_PROP_NOT_INTERESTING;
 }
 
+static bool
+simplify_mult_ops_using_ranges (gimple_stmt_iterator * gsi, gimple *stmt)
+{
+  enum tree_code rhs_code = gimple_assign_rhs_code (stmt);
+  tree op0, op1, lhs;
+
+  op0 = gimple_assign_rhs1 (stmt);
+  op1 = gimple_assign_rhs2 (stmt);
+  lhs = gimple_assign_lhs (stmt);
+
+  if (!op_with_boolean_value_range_p (op0)
+      && !op_with_boolean_value_range_p (op1))
+    return false;
+
+  if (rhs_code == MULT_EXPR)
+    {
+      if (op_with_boolean_value_range_p (op0))
+	{
+	  tree t = build_int_cst (TREE_TYPE (lhs), 0);
+	  tree tmp = build3 (COND_EXPR, TREE_TYPE (lhs),
+			     build2 (NE_EXPR, boolean_type_node, op0, t),
+			     op1, t);
+	  gimple *new_assign = gimple_build_assign (lhs, tmp);
+	  gsi_replace (gsi, new_assign, true);
+	  return true;
+	}
+      if (op_with_boolean_value_range_p (op1))
+	{
+	  tree t = build_int_cst (TREE_TYPE (lhs), 0);
+	  tree tmp = build3 (COND_EXPR, TREE_TYPE (lhs),
+			     build2 (NE_EXPR, boolean_type_node, op1, t),
+			     op0, t);
+	  gimple *new_assign = gimple_build_assign (lhs, tmp);
+	  gsi_replace (gsi, new_assign, true);
+	  return true;
+	}
+    }
+  return false;
+}
+
 /* Simplify boolean operations if the source is known
    to be already a boolean.  */
 static bool
@@ -10345,6 +10385,11 @@ simplify_stmt_using_ranges (gimple_stmt_iterator *gsi)
 	    return simplify_div_or_mod_using_ranges (gsi, stmt);
 	  break;
 
+	case MULT_EXPR:
+	  if (INTEGRAL_TYPE_P (TREE_TYPE (rhs1)))
+	    return simplify_mult_ops_using_ranges (gsi, stmt);
+	  break;
+
       /* Transform ABS (X) into X or -X as appropriate.  */
 	case ABS_EXPR:
 	  if (TREE_CODE (rhs1) == SSA_NAME

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2017-07-19  6:09 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-29 11:20 [PATCH] Transform (m1 > m2) * d into m1> m2 ? d : 0 Wilco Dijkstra
2017-06-29 11:41 ` Richard Biener
2017-06-29 15:10   ` Wilco Dijkstra
2017-06-30  8:16     ` Richard Biener
2017-07-04 11:13       ` Hurugalawadi, Naveen
2017-07-19  2:58         ` [PING} " Hurugalawadi, Naveen
2017-07-19  6:09         ` Jeff Law
2017-06-29 13:43 ` Jeff Law
  -- strict thread matches above, loose matches on Subject: below --
2017-06-29  5:06 Hurugalawadi, Naveen
2017-06-29  9:08 ` Richard Biener

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).