[Patch/ccmp] Cost instruction sequences to choose better expand order

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [Patch/ccmp] Cost instruction sequences to choose better expand order
@ 2015-09-18 16:08 Jiong Wang
  2015-09-21 11:40 ` Bernd Schmidt
  0 siblings, 1 reply; 5+ messages in thread
From: Jiong Wang @ 2015-09-18 16:08 UTC (permalink / raw)
  To: GCC Patches; +Cc: Richard Henderson

[-- Attachment #1: Type: text/plain, Size: 2299 bytes --]


Current conditional compare (CCMP) support in GCC aim to optimize
short circuit for cascade comparision, given a simple conditional
compare candidate:

  if (a == 17 || a == 32)

it's represented like the following in IR:

  t0 = a == 17
  t1 = a == 32
  t2 = t0 || t1


Normally, CCMP contains two parts, the first part calculate the initial
condition using normal compare instruction like "cmp", then the second
part can use "ccmp" instruction and exected conditionally based on the
condition generated in the first step.

The problem is current implementation always expand t0 first, then
t1. While the expand order need to consider the rtx costs, because "cmp"
and "ccmp" may have different restrictions that the expand order will
result in performance differences.

For example on AArch64, "ccmp" only accept immediate within -31 ~ 31
while "cmp" accept wider range, so if we expand "a == 32" in the second
step, then it will use "ccmp", and thus an extra "mov reg, 32"
instruction is generated because "32" is out of the range. While if we
expand "a == 32" first, then it's fine as "cmp" can encoding it.

Instruction difference for a simple testcase is listed below:

int foo(int a)
{
  if (a == 17 || a == 32)
    return 1;
  else
    return 2;
}

before
===
foo:
        mov     w1, 32
        cmp     w0, 17
        ccmp    w0, w1, 4, ne
        cset    w0, ne
        add     w0, w0, 1
        ret

after
===
foo:
        cmp     w0, 32
        ccmp    w0, 17, 4, ne
        cset    w0, ne
        add     w0, w0, 1
        ret

this patch still haven't fix other complexer situations, for example given:

 if (a == 1 || a = 29 || a == 32)

there is recursive call of expand_ccmp_expr_1 while this patch only
handle the inner most call where the incoming gimple is with both
operands be comparision operations.

NOTE: AArch64 backend can't cost CCMP instruction accurately, so I marked
the testcase as XFAIL which will be removed once we fix the cost issue.

2015-09-18  Jiong Wang  <jiong.wang@arm.com>

gcc/
  * ccmp.c (expand_ccmp_expr_1): Costs the instruction sequences
  generated from different expand order.
  
gcc/testsuite/
  * gcc.target/aarch64/ccmp_1.c: New testcase.
  
-- 
Regards,
Jiong


[-- Attachment #2: ccmp-fix.patch --]
[-- Type: text/x-diff, Size: 3387 bytes --]

diff --git a/gcc/ccmp.c b/gcc/ccmp.c
index 3c3fbcd..dc985f6 100644
--- a/gcc/ccmp.c
+++ b/gcc/ccmp.c
@@ -51,6 +51,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-outof-ssa.h"
 #include "cfgexpand.h"
 #include "ccmp.h"
+#include "predict.h"

 /* The following functions expand conditional compare (CCMP) instructions.
    Here is a short description about the over all algorithm:
@@ -165,6 +166,8 @@ expand_ccmp_next (gimple g, enum tree_code code, rtx prev,
 static rtx
 expand_ccmp_expr_1 (gimple g, rtx *prep_seq, rtx *gen_seq)
 {
+  rtx prep_seq_1, gen_seq_1;
+  rtx prep_seq_2, gen_seq_2;
   tree exp = gimple_assign_rhs_to_tree (g);
   enum tree_code code = TREE_CODE (exp);
   gimple gs0 = get_gimple_for_ssa_name (TREE_OPERAND (exp, 0));
@@ -180,19 +183,62 @@ expand_ccmp_expr_1 (gimple g, rtx *prep_seq, rtx *gen_seq)
     {
       if (TREE_CODE_CLASS (code1) == tcc_comparison)
 	{
-	  int unsignedp0;
-	  enum rtx_code rcode0;
+	  int unsignedp0, unsignedp1;
+	  enum rtx_code rcode0, rcode1;
+	  int speed_p = optimize_insn_for_speed_p ();
+	  rtx tmp2, ret, ret2;
+	  unsigned cost1 = MAX_COST;
+	  unsigned cost2 = MAX_COST;
+	  bool first_only_p = false;
+	  bool second_only_p = false;

 	  unsignedp0 = TYPE_UNSIGNED (TREE_TYPE (gimple_assign_rhs1 (gs0)));
+	  unsignedp1 = TYPE_UNSIGNED (TREE_TYPE (gimple_assign_rhs1 (gs1)));
 	  rcode0 = get_rtx_code (code0, unsignedp0);
-
-	  tmp = targetm.gen_ccmp_first (prep_seq, gen_seq, rcode0,
+	  rcode1 = get_rtx_code (code1, unsignedp1);
+	  tmp = targetm.gen_ccmp_first (&prep_seq_1, &gen_seq_1, rcode0,
 					gimple_assign_rhs1 (gs0),
 					gimple_assign_rhs2 (gs0));
-	  if (!tmp)
-	    return NULL_RTX;
+	  tmp2 = targetm.gen_ccmp_first (&prep_seq_2, &gen_seq_2, rcode1,
+					 gimple_assign_rhs1 (gs1),
+					 gimple_assign_rhs2 (gs1));

-	  return expand_ccmp_next (gs1, code, tmp, prep_seq, gen_seq);
+	  if (! tmp && ! tmp2)
+	    return NULL_RTX;
+	  else if (! tmp)
+	    second_only_p = true;
+	  else if (! tmp2)
+	    first_only_p = true;
+
+
+	  if (! second_only_p)
+	    {
+	      ret = expand_ccmp_next (gs1, code, tmp, &prep_seq_1, &gen_seq_1);
+	      cost1 = seq_cost (safe_as_a <rtx_insn *> (prep_seq_1),
+				speed_p);
+	      cost1 += seq_cost (safe_as_a <rtx_insn *> (gen_seq_1), speed_p);
+	    }
+	  if (! first_only_p)
+	    {
+	      ret2 = expand_ccmp_next (gs0, code, tmp2, &prep_seq_2,
+				       &gen_seq_2);
+	      cost2 = seq_cost (safe_as_a <rtx_insn *> (prep_seq_2),
+				speed_p);
+	      cost2 += seq_cost (safe_as_a <rtx_insn *> (gen_seq_2), speed_p);
+	    }
+
+	  if (cost1 > cost2)
+	    {
+	      *prep_seq = prep_seq_2;
+	      *gen_seq = gen_seq_2;
+	      ret = ret2;
+	    }
+	  else
+	    {
+	      *prep_seq = prep_seq_1;
+	      *gen_seq = gen_seq_1;
+	    }
+	  return ret;
 	}
       else
 	{
diff --git a/gcc/testsuite/gcc.target/aarch64/ccmp_1.c b/gcc/testsuite/gcc.target/aarch64/ccmp_1.c
new file mode 100644
index 0000000..f431af6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ccmp_1.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+int
+foo (int a)
+{
+  if (a == 17 || a == 32)
+    return 1;
+  else
+    return 2;
+}
+
+/* { dg-final { scan-assembler "cmp\t\.*32" { xfail aarch64*-*-* } } } */

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Patch/ccmp] Cost instruction sequences to choose better expand order
  2015-09-18 16:08 [Patch/ccmp] Cost instruction sequences to choose better expand order Jiong Wang
@ 2015-09-21 11:40 ` Bernd Schmidt
  2015-09-21 14:00   ` Jiong Wang
  2015-09-21 14:01   ` pinskia
  0 siblings, 2 replies; 5+ messages in thread
From: Bernd Schmidt @ 2015-09-21 11:40 UTC (permalink / raw)
  To: Jiong Wang, GCC Patches; +Cc: Richard Henderson

On 09/18/2015 05:21 PM, Jiong Wang wrote:
>
> Current conditional compare (CCMP) support in GCC aim to optimize
> short circuit for cascade comparision, given a simple conditional
> compare candidate:
>
>    if (a == 17 || a == 32)
[...]
> The problem is current implementation always expand t0 first, then
> t1. While the expand order need to consider the rtx costs, because "cmp"
> and "ccmp" may have different restrictions that the expand order will
> result in performance differences.
>
> For example on AArch64, "ccmp" only accept immediate within -31 ~ 31
> while "cmp" accept wider range, so if we expand "a == 32" in the second
> step, then it will use "ccmp", and thus an extra "mov reg, 32"
> instruction is generated because "32" is out of the range. While if we
> expand "a == 32" first, then it's fine as "cmp" can encoding it.

I've played with this patch a bit with an aarch64 cross compiler. First 
of all - it doesn't seem to work, I get identical costs and the swapping 
doesn't happen. Did you forget to include a rtx_cost patch?

I was a little worried about whether this would be expensive for longer 
sequences of conditions, but it seems like it looks only at leafs where 
we have two comparisons, so that cost should be minimal. However, it's 
possible there's room for improvement in code generation. I was curious 
and looked at a slightly more complex testcase

int
foo (int a)
{
   if (a == 17 || a == 32 || a == 47 || a == 53 || a == 66 || a == 72)
     return 1;
   else
     return 2;
}

and this doesn't generate a sequence of ccmps as might have been 
expected; we only get pairs of comparisons merged with a bit_ior:

   D.2699 = a == 17;
   D.2700 = a == 32;
   D.2701 = D.2699 | D.2700;
   if (D.2701 != 0) goto <D.2697>; else goto <D.2702>;
   <D.2702>:
   D.2703 = a == 47;
   D.2704 = a == 53;
   D.2705 = D.2703 | D.2704;
   if (D.2705 != 0) goto <D.2697>; else goto <D.2706>;

and the ccmp expander doesn't see the entire thing. I found that a 
little surprising TBH.


Bernd

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Patch/ccmp] Cost instruction sequences to choose better expand order
  2015-09-21 11:40 ` Bernd Schmidt
@ 2015-09-21 14:00   ` Jiong Wang
  2015-09-23 11:08     ` Bernd Schmidt
  2015-09-21 14:01   ` pinskia
  1 sibling, 1 reply; 5+ messages in thread
From: Jiong Wang @ 2015-09-21 14:00 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: GCC Patches, Richard Henderson


Bernd Schmidt writes:

> On 09/18/2015 05:21 PM, Jiong Wang wrote:
>>
>> Current conditional compare (CCMP) support in GCC aim to optimize
>> short circuit for cascade comparision, given a simple conditional
>> compare candidate:
>>
>>    if (a == 17 || a == 32)
> [...]
>> The problem is current implementation always expand t0 first, then
>> t1. While the expand order need to consider the rtx costs, because "cmp"
>> and "ccmp" may have different restrictions that the expand order will
>> result in performance differences.
>>
>> For example on AArch64, "ccmp" only accept immediate within -31 ~ 31
>> while "cmp" accept wider range, so if we expand "a == 32" in the second
>> step, then it will use "ccmp", and thus an extra "mov reg, 32"
>> instruction is generated because "32" is out of the range. While if we
>> expand "a == 32" first, then it's fine as "cmp" can encoding it.
>
> I've played with this patch a bit with an aarch64 cross compiler. First 
> of all - it doesn't seem to work, I get identical costs and the swapping 
> doesn't happen. Did you forget to include a rtx_cost patch?

No. Please see NOTE part of the description. AArch64 doesn't cost ccmp
currently. It will be fixed by a seperate patch later. The testcase is
thus marked as XFAIL.

>
> I was a little worried about whether this would be expensive for longer 
> sequences of conditions, but it seems like it looks only at leafs where 
> we have two comparisons, so that cost should be minimal. However, it's 
> possible there's room for improvement in code generation. I was curious 
> and looked at a slightly more complex testcase
>
> int
> foo (int a)
> {
>    if (a == 17 || a == 32 || a == 47 || a == 53 || a == 66 || a == 72)
>      return 1;
>    else
>      return 2;
> }

Yes, this patch only cost the inner most call of the recursive call of
expand_ccmp_expr_1 which is simple and common situation.
>
> and this doesn't generate a sequence of ccmps as might have been 
> expected; we only get pairs of comparisons merged with a bit_ior:
>
>    D.2699 = a == 17;
>    D.2700 = a == 32;
>    D.2701 = D.2699 | D.2700;
>    if (D.2701 != 0) goto <D.2697>; else goto <D.2702>;
>    <D.2702>:
>    D.2703 = a == 47;
>    D.2704 = a == 53;
>    D.2705 = D.2703 | D.2704;
>    if (D.2705 != 0) goto <D.2697>; else goto <D.2706>;
>
> and the ccmp expander doesn't see the entire thing. I found that a 
> little surprising TBH.

It's because fold_truth_andor has re-written the tree to avoid the use
short circuit given LOGICAL_OP_NON_SHORT_CIRCUIT returns true.

-- 
Regards,
Jiong

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Patch/ccmp] Cost instruction sequences to choose better expand order
  2015-09-21 11:40 ` Bernd Schmidt
  2015-09-21 14:00   ` Jiong Wang
@ 2015-09-21 14:01   ` pinskia
  1 sibling, 0 replies; 5+ messages in thread
From: pinskia @ 2015-09-21 14:01 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Jiong Wang, GCC Patches, Richard Henderson



> On Sep 21, 2015, at 4:39 AM, Bernd Schmidt <bschmidt@redhat.com> wrote:
> 
>> On 09/18/2015 05:21 PM, Jiong Wang wrote:
>> 
>> Current conditional compare (CCMP) support in GCC aim to optimize
>> short circuit for cascade comparision, given a simple conditional
>> compare candidate:
>> 
>>   if (a == 17 || a == 32)
> [...]
>> The problem is current implementation always expand t0 first, then
>> t1. While the expand order need to consider the rtx costs, because "cmp"
>> and "ccmp" may have different restrictions that the expand order will
>> result in performance differences.
>> 
>> For example on AArch64, "ccmp" only accept immediate within -31 ~ 31
>> while "cmp" accept wider range, so if we expand "a == 32" in the second
>> step, then it will use "ccmp", and thus an extra "mov reg, 32"
>> instruction is generated because "32" is out of the range. While if we
>> expand "a == 32" first, then it's fine as "cmp" can encoding it.
> 
> I've played with this patch a bit with an aarch64 cross compiler. First of all - it doesn't seem to work, I get identical costs and the swapping doesn't happen. Did you forget to include a rtx_cost patch?
> 
> I was a little worried about whether this would be expensive for longer sequences of conditions, but it seems like it looks only at leafs where we have two comparisons, so that cost should be minimal. However, it's possible there's room for improvement in code generation. I was curious and looked at a slightly more complex testcase
> 
> int
> foo (int a)
> {
>  if (a == 17 || a == 32 || a == 47 || a == 53 || a == 66 || a == 72)
>    return 1;
>  else
>    return 2;
> }
> 
> and this doesn't generate a sequence of ccmps as might have been expected; we only get pairs of comparisons merged with a bit_ior:
> 
>  D.2699 = a == 17;
>  D.2700 = a == 32;
>  D.2701 = D.2699 | D.2700;
>  if (D.2701 != 0) goto <D.2697>; else goto <D.2702>;
>  <D.2702>:
>  D.2703 = a == 47;
>  D.2704 = a == 53;
>  D.2705 = D.2703 | D.2704;
>  if (D.2705 != 0) goto <D.2697>; else goto <D.2706>;
> 
> and the ccmp expander doesn't see the entire thing. I found that a little surprising TBH.

This is a known issue with fold-const folding too early. Replace || with | and add some parentheses and you should get a string of ccmp's. 

Thanks,
Andrew


> 
> 
> Bernd

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Patch/ccmp] Cost instruction sequences to choose better expand order
  2015-09-21 14:00   ` Jiong Wang
@ 2015-09-23 11:08     ` Bernd Schmidt
  0 siblings, 0 replies; 5+ messages in thread
From: Bernd Schmidt @ 2015-09-23 11:08 UTC (permalink / raw)
  To: Jiong Wang; +Cc: GCC Patches, Richard Henderson

> No. Please see NOTE part of the description. AArch64 doesn't cost ccmp
> currently. It will be fixed by a seperate patch later. The testcase is
> thus marked as XFAIL.

I'd prefer to do things in the right order. Your patch is approved, but 
please commit only after you can remove the xfail from the testcase.


Bernd

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2015-09-23 10:46 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-18 16:08 [Patch/ccmp] Cost instruction sequences to choose better expand order Jiong Wang
2015-09-21 11:40 ` Bernd Schmidt
2015-09-21 14:00   ` Jiong Wang
2015-09-23 11:08     ` Bernd Schmidt
2015-09-21 14:01   ` pinskia

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).