[PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
@ 2014-12-04 11:00 Jiong Wang
  2014-12-04 11:07 ` Richard Biener
  0 siblings, 1 reply; 37+ messages in thread
From: Jiong Wang @ 2014-12-04 11:00 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 2575 bytes --]

For PR62173, the ideal solution is to resolve the problem on tree level ivopt pass.

While, apart from the tree level issue, PR 62173 also exposed another two RTL level issues.
one of them is looks like we could improve RTL level loop invariant hoisting by re-shuffle insns.

for Seb's testcase

void bar(int i) {
   char A[10];
   int d = 0;
   while (i > 0)
   A[d++] = i--;

   while (d > 0)
   foo(A[d--]);
}

the insn sequences to calculate A[I]'s address looks like:

(insn 76 75 77 22 (set (reg/f:DI 109)
   (plus:DI (reg/f:DI 64 sfp)
   (reg:DI 108 [ i ]))) seb-pop.c:8 84 {*adddi3_aarch64}
   (expr_list:REG_DEAD (reg:DI 108 [ i ])
   (nil)))
(insn 77 76 78 22 (set (reg:SI 110 [ D.2633 ])
   (zero_extend:SI (mem/j:QI (plus:DI (reg/f:DI 109)
   (const_int -16 [0xfffffffffffffff0])) [0 A S1 A8]))) seb-pop.c:8 76 {*zero_extendqisi2_aarch64}
   (expr_list:REG_DEAD (reg/f:DI 109)
   (nil)))

while for most RISC archs, reg + reg addressing is typical, so if we re-shuffle
the instruction sequences into the following:

(insn 96 94 97 22 (set (reg/f:DI 129)
   (plus:DI (reg/f:DI 64 sfp)
   (const_int -16 [0xfffffffffffffff0]))) seb-pop.c:8 84 {*adddi3_aarch64}
   (nil))
(insn 97 96 98 22 (set (reg:DI 130 [ i ])
   (sign_extend:DI (reg/v:SI 97 [ i ]))) seb-pop.c:8 70 {*extendsidi2_aarch64}
   (expr_list:REG_DEAD (reg/v:SI 97 [ i ])
   (nil)))
(insn 98 97 99 22 (set (reg:SI 131 [ D.2633 ])
   (zero_extend:SI (mem/j:QI (plus:DI (reg/f:DI 129)
   (reg:DI 130 [ i ])) [0 A S1 A8]))) seb-pop.c:8 76 {*zero_extendqisi2_aarch64}
   (expr_list:REG_DEAD (reg:DI 130 [ i ])
   (expr_list:REG_DEAD (reg/f:DI 129)
   (nil))))

which means re-associate the constant imm with the virtual frame pointer.

transform

      RA <- fixed_reg + RC
      RD <- MEM (RA + const_offset)

   into:

      RA <- fixed_reg + const_offset
      RD <- MEM (RA + RC)

then RA <- fixed_reg + const_offset is actually loop invariant, so the later
RTL GCSE PRE pass could catch it and do the hoisting, and thus ameliorate what tree
level ivopts could not sort out.

and this patch only tries to re-shuffle instructions within single basic block which
is a inner loop which is perf critical.

I am reusing the loop info in fwprop because there is loop info and it's run before
GCSE.

verified on aarch64 and mips64, the array base address hoisted out of loop.

bootstrap ok on x86-64 and aarch64.

comments?

thanks.

gcc/
   PR62173
   fwprop.c (prepare_for_gcse_pre): New function.
   (fwprop_done): Call it.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: loop-reassociate.patch --]
[-- Type: text/x-patch; name=loop-reassociate.patch, Size: 3811 bytes --]

diff --git a/gcc/fwprop.c b/gcc/fwprop.c
index 377b33c..b2a5918 100644
--- a/gcc/fwprop.c
+++ b/gcc/fwprop.c
@@ -1399,6 +1399,133 @@ forward_propagate_into (df_ref use)
   return false;
 }
 
+/* Loop invariant variable hoisting for critical code has
+   important impact on the performance.
+
+   The RTL GCSE PRE pass could detect more hoisting opportunities
+   if we re-shuffle the instructions to associate fixed registers
+   with constant.
+
+   This function try to transform
+
+     RA <- RB_fixed + RC
+     RD <- MEM (RA + const_offset)
+
+  into:
+
+     RA <- RB_fixed + const_offset
+     RD <- MEM (RA + RC)
+
+  If RA is DEAD after the second instruction.
+
+  After this change, the first instruction is loop invariant.  */
+
+static void
+prepare_for_gcse_pre ()
+{
+  struct loop *loop;
+
+  if (! current_loops)
+    return;
+
+  FOR_EACH_LOOP (loop, LI_INCLUDE_ROOT)
+    {
+      if (loop && loop->header && loop->latch
+	  && loop->header->index == loop->latch->index)
+	{
+	  rtx_insn *insn, *next_insn;
+	  rtx single_set1, single_set2, old_dest;
+	  rtx op0, op0_;
+	  rtx op1, op1_;
+	  rtx inner;
+	  rtx *mem_plus_loc;
+
+	  basic_block bb = BASIC_BLOCK_FOR_FN (cfun, loop->header->index);
+
+	  FOR_BB_INSNS (bb, insn)
+	    {
+	      if (! NONDEBUG_INSN_P (insn))
+		continue;
+
+	      single_set1 = single_set (insn);
+
+	      if (! single_set1
+		  || GET_CODE (SET_SRC (single_set1)) != PLUS)
+		continue;
+
+	      old_dest = SET_DEST (single_set1);
+	      op0 = XEXP (SET_SRC (single_set1), 0);
+	      op1 = XEXP (SET_SRC (single_set1), 1);
+
+	      if (op1 == frame_pointer_rtx
+		  || op1 == stack_pointer_rtx
+		  || op1 == virtual_stack_vars_rtx)
+		std::swap (op0, op1);
+
+	      if (! (REG_P (old_dest) && REG_P (op0) && REG_P (op1)
+		     && (op0 == frame_pointer_rtx
+			 || op0 == stack_pointer_rtx
+			 || op0 == virtual_stack_vars_rtx)))
+		continue;
+
+	      if (! (next_insn = next_real_insn (insn)))
+		break;
+
+	      do
+		{
+		  if (DEBUG_INSN_P (next_insn))
+		    continue;
+
+		  single_set2 = single_set (next_insn);
+
+		  if (!single_set2 || ! REG_P (SET_DEST (single_set2)))
+		    continue;
+
+		  inner = SET_SRC (single_set2);
+
+		  if (GET_CODE (inner) == ZERO_EXTEND
+		      || GET_CODE (inner) == SIGN_EXTEND
+		      || GET_CODE (inner) == TRUNCATE)
+		    inner = XEXP (inner, 0);
+
+		  if (! MEM_P (inner)
+		      || GET_CODE (XEXP (inner, 0)) != PLUS)
+		    continue;
+
+		  mem_plus_loc = &XEXP (inner, 0);
+		  op0_ = XEXP (XEXP (inner, 0), 0);
+		  op1_ = XEXP (XEXP (inner, 0), 1);
+
+		  if (REG_P (op0_) && CONST_INT_P (op1_)
+		      && rtx_equal_p (op0_, old_dest)
+		      && GET_MODE (op0_) == GET_MODE (op1))
+		    {
+		      rtx new_src;
+
+		      if (find_regno_note (next_insn, REG_DEAD,
+					   REGNO (old_dest)))
+			{
+			  new_src = plus_constant (GET_MODE (op0), op0,
+						   INTVAL (op1_));
+			  validate_change (insn, &SET_SRC (single_set1),
+					   new_src, 1);
+			  new_src = gen_rtx_PLUS (GET_MODE (op0_), op0_, op1);
+			  validate_change (next_insn, mem_plus_loc, new_src, 1);
+			  if (apply_change_group () && dump_file)
+			    fprintf (dump_file,
+				     "\nRe-associate insn %d and %d for later"
+				     " RTL loop invariant hoisting.\n",
+				     INSN_UID (insn), INSN_UID (next_insn));
+			}
+		      break;
+		    }
+		} while ((next_insn = next_real_insn (next_insn))
+			 && bb == BLOCK_FOR_INSN (next_insn));
+	    }
+	}
+    }
+}
+
 \f
 static void
 fwprop_init (void)
@@ -1424,6 +1551,7 @@ fwprop_init (void)
 static void
 fwprop_done (void)
 {
+  prepare_for_gcse_pre ();
   loop_optimizer_finalize ();
 
   use_def_ref.release ();

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
  2014-12-04 11:07 ` Richard Biener
@ 2014-12-04 11:07   ` Richard Biener
  2014-12-04 19:32     ` Jiong Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Richard Biener @ 2014-12-04 11:07 UTC (permalink / raw)
  To: Jiong Wang; +Cc: gcc-patches

On Thu, Dec 4, 2014 at 12:07 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Thu, Dec 4, 2014 at 12:00 PM, Jiong Wang <jiong.wang@arm.com> wrote:
>> For PR62173, the ideal solution is to resolve the problem on tree level
>> ivopt pass.
>>
>> While, apart from the tree level issue, PR 62173 also exposed another two
>> RTL level issues.
>> one of them is looks like we could improve RTL level loop invariant hoisting
>> by re-shuffle insns.
>>
>> for Seb's testcase
>>
>> void bar(int i) {
>>   char A[10];
>>   int d = 0;
>>   while (i > 0)
>>   A[d++] = i--;
>>
>>   while (d > 0)
>>   foo(A[d--]);
>> }
>>
>> the insn sequences to calculate A[I]'s address looks like:
>>
>> (insn 76 75 77 22 (set (reg/f:DI 109)
>>   (plus:DI (reg/f:DI 64 sfp)
>>   (reg:DI 108 [ i ]))) seb-pop.c:8 84 {*adddi3_aarch64}
>>   (expr_list:REG_DEAD (reg:DI 108 [ i ])
>>   (nil)))
>> (insn 77 76 78 22 (set (reg:SI 110 [ D.2633 ])
>>   (zero_extend:SI (mem/j:QI (plus:DI (reg/f:DI 109)
>>   (const_int -16 [0xfffffffffffffff0])) [0 A S1 A8]))) seb-pop.c:8 76
>> {*zero_extendqisi2_aarch64}
>>   (expr_list:REG_DEAD (reg/f:DI 109)
>>   (nil)))
>>
>> while for most RISC archs, reg + reg addressing is typical, so if we
>> re-shuffle
>> the instruction sequences into the following:
>>
>> (insn 96 94 97 22 (set (reg/f:DI 129)
>>   (plus:DI (reg/f:DI 64 sfp)
>>   (const_int -16 [0xfffffffffffffff0]))) seb-pop.c:8 84 {*adddi3_aarch64}
>>   (nil))
>> (insn 97 96 98 22 (set (reg:DI 130 [ i ])
>>   (sign_extend:DI (reg/v:SI 97 [ i ]))) seb-pop.c:8 70
>> {*extendsidi2_aarch64}
>>   (expr_list:REG_DEAD (reg/v:SI 97 [ i ])
>>   (nil)))
>> (insn 98 97 99 22 (set (reg:SI 131 [ D.2633 ])
>>   (zero_extend:SI (mem/j:QI (plus:DI (reg/f:DI 129)
>>   (reg:DI 130 [ i ])) [0 A S1 A8]))) seb-pop.c:8 76
>> {*zero_extendqisi2_aarch64}
>>   (expr_list:REG_DEAD (reg:DI 130 [ i ])
>>   (expr_list:REG_DEAD (reg/f:DI 129)
>>   (nil))))
>>
>> which means re-associate the constant imm with the virtual frame pointer.
>>
>> transform
>>
>>      RA <- fixed_reg + RC
>>      RD <- MEM (RA + const_offset)
>>
>>   into:
>>
>>      RA <- fixed_reg + const_offset
>>      RD <- MEM (RA + RC)
>>
>> then RA <- fixed_reg + const_offset is actually loop invariant, so the later
>> RTL GCSE PRE pass could catch it and do the hoisting, and thus ameliorate
>> what tree
>> level ivopts could not sort out.
>
> There is a LIM pass after gimple ivopts - if the invariantness is already
> visible there why not handle it there similar to the special-cases in
> rewrite_bittest and rewrite_reciprocal?
>
> And of course similar tricks could be applied on the RTL level to
> RTL invariant motion?

Oh, and the patch misses a testcase.

> Thanks,
> Richard.
>
>> and this patch only tries to re-shuffle instructions within single basic
>> block which
>> is a inner loop which is perf critical.
>>
>> I am reusing the loop info in fwprop because there is loop info and it's run
>> before
>> GCSE.
>>
>> verified on aarch64 and mips64, the array base address hoisted out of loop.
>>
>> bootstrap ok on x86-64 and aarch64.
>>
>> comments?
>>
>> thanks.
>>
>> gcc/
>>   PR62173
>>   fwprop.c (prepare_for_gcse_pre): New function.
>>   (fwprop_done): Call it.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
  2014-12-04 11:00 [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting Jiong Wang
@ 2014-12-04 11:07 ` Richard Biener
  2014-12-04 11:07   ` Richard Biener
  0 siblings, 1 reply; 37+ messages in thread
From: Richard Biener @ 2014-12-04 11:07 UTC (permalink / raw)
  To: Jiong Wang; +Cc: gcc-patches

On Thu, Dec 4, 2014 at 12:00 PM, Jiong Wang <jiong.wang@arm.com> wrote:
> For PR62173, the ideal solution is to resolve the problem on tree level
> ivopt pass.
>
> While, apart from the tree level issue, PR 62173 also exposed another two
> RTL level issues.
> one of them is looks like we could improve RTL level loop invariant hoisting
> by re-shuffle insns.
>
> for Seb's testcase
>
> void bar(int i) {
>   char A[10];
>   int d = 0;
>   while (i > 0)
>   A[d++] = i--;
>
>   while (d > 0)
>   foo(A[d--]);
> }
>
> the insn sequences to calculate A[I]'s address looks like:
>
> (insn 76 75 77 22 (set (reg/f:DI 109)
>   (plus:DI (reg/f:DI 64 sfp)
>   (reg:DI 108 [ i ]))) seb-pop.c:8 84 {*adddi3_aarch64}
>   (expr_list:REG_DEAD (reg:DI 108 [ i ])
>   (nil)))
> (insn 77 76 78 22 (set (reg:SI 110 [ D.2633 ])
>   (zero_extend:SI (mem/j:QI (plus:DI (reg/f:DI 109)
>   (const_int -16 [0xfffffffffffffff0])) [0 A S1 A8]))) seb-pop.c:8 76
> {*zero_extendqisi2_aarch64}
>   (expr_list:REG_DEAD (reg/f:DI 109)
>   (nil)))
>
> while for most RISC archs, reg + reg addressing is typical, so if we
> re-shuffle
> the instruction sequences into the following:
>
> (insn 96 94 97 22 (set (reg/f:DI 129)
>   (plus:DI (reg/f:DI 64 sfp)
>   (const_int -16 [0xfffffffffffffff0]))) seb-pop.c:8 84 {*adddi3_aarch64}
>   (nil))
> (insn 97 96 98 22 (set (reg:DI 130 [ i ])
>   (sign_extend:DI (reg/v:SI 97 [ i ]))) seb-pop.c:8 70
> {*extendsidi2_aarch64}
>   (expr_list:REG_DEAD (reg/v:SI 97 [ i ])
>   (nil)))
> (insn 98 97 99 22 (set (reg:SI 131 [ D.2633 ])
>   (zero_extend:SI (mem/j:QI (plus:DI (reg/f:DI 129)
>   (reg:DI 130 [ i ])) [0 A S1 A8]))) seb-pop.c:8 76
> {*zero_extendqisi2_aarch64}
>   (expr_list:REG_DEAD (reg:DI 130 [ i ])
>   (expr_list:REG_DEAD (reg/f:DI 129)
>   (nil))))
>
> which means re-associate the constant imm with the virtual frame pointer.
>
> transform
>
>      RA <- fixed_reg + RC
>      RD <- MEM (RA + const_offset)
>
>   into:
>
>      RA <- fixed_reg + const_offset
>      RD <- MEM (RA + RC)
>
> then RA <- fixed_reg + const_offset is actually loop invariant, so the later
> RTL GCSE PRE pass could catch it and do the hoisting, and thus ameliorate
> what tree
> level ivopts could not sort out.

There is a LIM pass after gimple ivopts - if the invariantness is already
visible there why not handle it there similar to the special-cases in
rewrite_bittest and rewrite_reciprocal?

And of course similar tricks could be applied on the RTL level to
RTL invariant motion?

Thanks,
Richard.

> and this patch only tries to re-shuffle instructions within single basic
> block which
> is a inner loop which is perf critical.
>
> I am reusing the loop info in fwprop because there is loop info and it's run
> before
> GCSE.
>
> verified on aarch64 and mips64, the array base address hoisted out of loop.
>
> bootstrap ok on x86-64 and aarch64.
>
> comments?
>
> thanks.
>
> gcc/
>   PR62173
>   fwprop.c (prepare_for_gcse_pre): New function.
>   (fwprop_done): Call it.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
  2014-12-04 11:07   ` Richard Biener
@ 2014-12-04 19:32     ` Jiong Wang
  2014-12-15 15:29       ` Jiong Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Jiong Wang @ 2014-12-04 19:32 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

On 04/12/14 11:07, Richard Biener wrote:

> On Thu, Dec 4, 2014 at 12:07 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Thu, Dec 4, 2014 at 12:00 PM, Jiong Wang <jiong.wang@arm.com> wrote:
>>
>>
>> which means re-associate the constant imm with the virtual frame pointer.
>>
>> transform
>>
>>       RA <- fixed_reg + RC
>>       RD <- MEM (RA + const_offset)
>>
>>    into:
>>
>>       RA <- fixed_reg + const_offset
>>       RD <- MEM (RA + RC)
>>
>> then RA <- fixed_reg + const_offset is actually loop invariant, so the later
>> RTL GCSE PRE pass could catch it and do the hoisting, and thus ameliorate
>> what tree
>> level ivopts could not sort out.
>> There is a LIM pass after gimple ivopts - if the invariantness is already
>> visible there why not handle it there similar to the special-cases in
>> rewrite_bittest and rewrite_reciprocal?

maybe, needs further check.

>>
>> And of course similar tricks could be applied on the RTL level to
>> RTL invariant motion?

Thanks. I just checked the code, yes, loop invariant motion pass
is the natural place to integrate such multi-insns invariant analysis trick.

those code could be integrated into loop-invariant.c cleanly, but
then I found although the invariant detected successfully but it's not moved
out of loop because of cost issue, and looks like the patch committed to fix PR33928
is trying to prevent such cheap address be hoisted to reduce register pressure.

  805       /* ??? Try to determine cheapness of address computation.  Unfortunately
  806          the address cost is only a relative measure, we can't really compare
  807          it with any absolute number, but only with other address costs.
  808          But here we don't have any other addresses, so compare with a magic
  809          number anyway.  It has to be large enough to not regress PR33928
  810          (by avoiding to move reg+8,reg+16,reg+24 invariants), but small
  811          enough to not regress 410.bwaves either (by still moving reg+reg
  812          invariants).
  813          See http://gcc.gnu.org/ml/gcc-patches/2009-10/msg01210.html .  */
  814       inv->cheap_address = address_cost (SET_SRC (set), word_mode,
  815                                          ADDR_SPACE_GENERIC, speed) < 3;

I think that maybe necessary for x86 which is short of register, while for RISC,
it may not be that necessary, especially the whole register pressure is not big.

currently, what I could think of is for this transformation below, we should
increase the costs:

A
==
      RA <- virtual_stack_var + RC
      RD <- MEM (RA + const_offset)

   into:

B
==
      RA <- virtual_stack_var + const_offset  <--B
      RD <- MEM (RA + RC)

because the cost is not that cheap, if there is not re-assocation of virtual_stack_var
with const_offset, then lra elimination will create another instruction to hold the
elimination result, so format A will actually be

      RT <- real_stack_pointer + elimination_offset
      RA <- RT + RC
      RD <- MEM (RA + const_offset)

so, the re-assocation and later hoisting of invariant B could actually save two instructions
in the loop, this is why there are 15% perf gap for bzip2 under some situation.

On aarch64, for Seb's testcase,

before IV hoisting
===
         b       .L4
.L37:
         sub     w19, w19, #1
.L4:
	add	x1, x29, 48
	add	x0, x1, x0, sxtw
	ldrb	w0, [x0, -16]
         bl      foo
         mov     w0, w19
         cbnz    w19, .L37
         ldr     x19, [sp, 16]
         ldp     x29, x30, [sp], 48
         ret

after this transformation, and the IV hoisting
(need one extra callee saved, x20)
===
.L3:
         add     x20, x29, 32  <-- IV hoisted
         b       .L4
.L37:
         sub     w19, w19, #1
.L4:
         ldrb    w0, [x20, w0, sxtw]
         bl      foo
         mov     w0, w19
         cbnz    w19, .L37
         ldp     x19, x20, [sp, 16]
         ldp     x29, x30, [sp], 48
         ret

> Oh, and the patch misses a testcase.

It's at the start of the email, will include in the patch.

void bar(int i) {
   char A[10];
   int d = 0;
   while (i > 0)
   A[d++] = i--;

   while (d > 0)
   foo(A[d--]);
}

>
>> Thanks,
>> Richard.
>>
>>> and this patch only tries to re-shuffle instructions within single basic
>>> block which
>>> is a inner loop which is perf critical.
>>>
>>> I am reusing the loop info in fwprop because there is loop info and it's run
>>> before
>>> GCSE.
>>>
>>> verified on aarch64 and mips64, the array base address hoisted out of loop.
>>>
>>> bootstrap ok on x86-64 and aarch64.
>>>
>>> comments?
>>>
>>> thanks.
>>>
>>> gcc/
>>>    PR62173
>>>    fwprop.c (prepare_for_gcse_pre): New function.
>>>    (fwprop_done): Call it.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
  2014-12-04 19:32     ` Jiong Wang
@ 2014-12-15 15:29       ` Jiong Wang
  2014-12-15 15:36         ` Jiong Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Jiong Wang @ 2014-12-15 15:29 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

On 04/12/14 19:32, Jiong Wang wrote:
> On 04/12/14 11:07, Richard Biener wrote:
>
>> On Thu, Dec 4, 2014 at 12:07 PM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Thu, Dec 4, 2014 at 12:00 PM, Jiong Wang <jiong.wang@arm.com> wrote:
>>>
>>>
>>> which means re-associate the constant imm with the virtual frame pointer.
>>>
>>> transform
>>>
>>>        RA <- fixed_reg + RC
>>>        RD <- MEM (RA + const_offset)
>>>
>>>     into:
>>>
>>>        RA <- fixed_reg + const_offset
>>>        RD <- MEM (RA + RC)
>>>
>>> then RA <- fixed_reg + const_offset is actually loop invariant, so the later
>>> RTL GCSE PRE pass could catch it and do the hoisting, and thus ameliorate
>>> what tree
>>> level ivopts could not sort out.
>>> There is a LIM pass after gimple ivopts - if the invariantness is already
>>> visible there why not handle it there similar to the special-cases in
>>> rewrite_bittest and rewrite_reciprocal?
> maybe, needs further check.
>
>>> And of course similar tricks could be applied on the RTL level to
>>> RTL invariant motion?
> Thanks. I just checked the code, yes, loop invariant motion pass
> is the natural place to integrate such multi-insns invariant analysis trick.
>
> those code could be integrated into loop-invariant.c cleanly, but
> then I found although the invariant detected successfully but it's not moved
> out of loop because of cost issue, and looks like the patch committed to fix PR33928
> is trying to prevent such cheap address be hoisted to reduce register pressure.
>
>
>    805       /* ??? Try to determine cheapness of address computation.  Unfortunately
>    806          the address cost is only a relative measure, we can't really compare
>    807          it with any absolute number, but only with other address costs.
>    808          But here we don't have any other addresses, so compare with a magic
>    809          number anyway.  It has to be large enough to not regress PR33928
>    810          (by avoiding to move reg+8,reg+16,reg+24 invariants), but small
>    811          enough to not regress 410.bwaves either (by still moving reg+reg
>    812          invariants).
>    813          See http://gcc.gnu.org/ml/gcc-patches/2009-10/msg01210.html .  */
>    814       inv->cheap_address = address_cost (SET_SRC (set), word_mode,
>    815                                          ADDR_SPACE_GENERIC, speed) < 3;
>
>
> I think that maybe necessary for x86 which is short of register, while for RISC,
> it may not be that necessary, especially the whole register pressure is not big.
>
> currently, what I could think of is for this transformation below, we should
> increase the costs:
>    
> A
> ==
>        RA <- virtual_stack_var + RC
>        RD <- MEM (RA + const_offset)
>
>     into:
>
> B
> ==
>        RA <- virtual_stack_var + const_offset  <--B
>        RD <- MEM (RA + RC)
>
> because the cost is not that cheap, if there is not re-assocation of virtual_stack_var
> with const_offset, then lra elimination will create another instruction to hold the
> elimination result, so format A will actually be
>
>        RT <- real_stack_pointer + elimination_offset
>        RA <- RT + RC
>        RD <- MEM (RA + const_offset)
>
> so, the re-assocation and later hoisting of invariant B could actually save two instructions
> in the loop, this is why there are 15% perf gap for bzip2 under some situation.

updated patch.

moved this instruction shuffling trick to rtl loop invariant pass.
as described above, this patch tries to transform A to B, so that
after the transformation:

   * RA <- virtual_stack_var + const_offset  could be hoisted out of the loop
   * easy the work of lra elimination as virtual_stack_var is associated with const_offset
     that the elimination offset could be combined with const_offset automatically.

current rtl loop invariant pass treat "reg <- reg + off" as cheap address, while although
"reg <- virtual_stack_var + offset" fall into the same format, but it's not that cheap as
we could also save one lra elimination instruction. so this patch will mark
"reg <- virtual_stack_var + offset" transformed from A to be expensive, so that it could be
hoisted later.

after patch, pr62173 is fixed on powerpc64, while *still not on aarch64*. because there are one
glitch in aarch64_legitimize_address which cause unnecessary complex instructions sequences
generated when legitimize some addresses. and if we fix that, we will get cheaper address for
those cases which is generally good, and the cheaper address will cause tree level IVOPT do
more IVOPT optimization which is generally good also, but from the speck2k result, there
are actually around 1% code size regression on two cases, the reason is for target support
post-index address, doing IVOPT may not always be the best choice because we lost the advantage
of using post-index addressing.

on aarch64, for the following testcase, the ivopted version is complexer than not ivopted version.

     while (oargc--) *(nargv++) = *(oargv++);

so, I sent the generic fix here only, as it's an independent patch, and could benefit other targets
like powerpc64 although the issue on aarch64 is still not resolved before we figure out how to let
the correct fix on aarch64_legitimize_address do not cause code size regression on benchmark caused
by sub-optimal tree level IVOPT.

and the testcase is marked to run on powerpc64 only at current time.

bootstrap OK on x86-64, no regression.
bootstrap OK on AArch64, and from speck2k compile dump, there do have a few more RTL loop
invariants get hoisted.

ok for trunk?

gcc/
PR62173
   loop-invariant.c.c (expensive_addr): New hash_table.
   (need_expensive_addr_check_p): New bool.
   (find_exits): Rename to "find_exists_and_reshuffle.
   Support re-shuffle instructions for better loop invariant hoisting.
   (create_new_invariant): Mark address as expensive if it's generated by re-shuffle.
   (init_inv_motion_data): Initialize expensive_addr and need_expensive_addr_check_p.
   (free_inv_motion_data): Release expensive_addr.

gcc/testssuite/
   PR62173
   gcc.dg/pr62173.c: New test.

>> Oh, and the patch misses a testcase.
> It's at the start of the email, will include in the patch.
>
> void bar(int i) {
>     char A[10];
>     int d = 0;
>     while (i > 0)
>     A[d++] = i--;
>
>     while (d > 0)
>     foo(A[d--]);
> }
>
>>> Thanks,
>>> Richard.
>>>
>>>> and this patch only tries to re-shuffle instructions within single basic
>>>> block which
>>>> is a inner loop which is perf critical.
>>>>
>>>> I am reusing the loop info in fwprop because there is loop info and it's run
>>>> before
>>>> GCSE.
>>>>
>>>> verified on aarch64 and mips64, the array base address hoisted out of loop.
>>>>
>>>> bootstrap ok on x86-64 and aarch64.
>>>>
>>>> comments?
>>>>
>>>> thanks.
>>>>
>>>> gcc/
>>>>     PR62173
>>>>     fwprop.c (prepare_for_gcse_pre): New function.
>>>>     (fwprop_done): Call it.
>
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
  2014-12-15 15:29       ` Jiong Wang
@ 2014-12-15 15:36         ` Jiong Wang
  2014-12-17 16:19           ` Richard Biener
  0 siblings, 1 reply; 37+ messages in thread
From: Jiong Wang @ 2014-12-15 15:36 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 7445 bytes --]


On 15/12/14 15:28, Jiong Wang wrote:
> On 04/12/14 19:32, Jiong Wang wrote:
>> On 04/12/14 11:07, Richard Biener wrote:
>>
>>> On Thu, Dec 4, 2014 at 12:07 PM, Richard Biener
>>> <richard.guenther@gmail.com> wrote:
>>>> On Thu, Dec 4, 2014 at 12:00 PM, Jiong Wang <jiong.wang@arm.com> wrote:
>>>>
>>>>
>>>> which means re-associate the constant imm with the virtual frame pointer.
>>>>
>>>> transform
>>>>
>>>>         RA <- fixed_reg + RC
>>>>         RD <- MEM (RA + const_offset)
>>>>
>>>>      into:
>>>>
>>>>         RA <- fixed_reg + const_offset
>>>>         RD <- MEM (RA + RC)
>>>>
>>>> then RA <- fixed_reg + const_offset is actually loop invariant, so the later
>>>> RTL GCSE PRE pass could catch it and do the hoisting, and thus ameliorate
>>>> what tree
>>>> level ivopts could not sort out.
>>>> There is a LIM pass after gimple ivopts - if the invariantness is already
>>>> visible there why not handle it there similar to the special-cases in
>>>> rewrite_bittest and rewrite_reciprocal?
>> maybe, needs further check.
>>
>>>> And of course similar tricks could be applied on the RTL level to
>>>> RTL invariant motion?
>> Thanks. I just checked the code, yes, loop invariant motion pass
>> is the natural place to integrate such multi-insns invariant analysis trick.
>>
>> those code could be integrated into loop-invariant.c cleanly, but
>> then I found although the invariant detected successfully but it's not moved
>> out of loop because of cost issue, and looks like the patch committed to fix PR33928
>> is trying to prevent such cheap address be hoisted to reduce register pressure.
>>
>>
>>     805       /* ??? Try to determine cheapness of address computation.  Unfortunately
>>     806          the address cost is only a relative measure, we can't really compare
>>     807          it with any absolute number, but only with other address costs.
>>     808          But here we don't have any other addresses, so compare with a magic
>>     809          number anyway.  It has to be large enough to not regress PR33928
>>     810          (by avoiding to move reg+8,reg+16,reg+24 invariants), but small
>>     811          enough to not regress 410.bwaves either (by still moving reg+reg
>>     812          invariants).
>>     813          See http://gcc.gnu.org/ml/gcc-patches/2009-10/msg01210.html .  */
>>     814       inv->cheap_address = address_cost (SET_SRC (set), word_mode,
>>     815                                          ADDR_SPACE_GENERIC, speed) < 3;
>>
>>
>> I think that maybe necessary for x86 which is short of register, while for RISC,
>> it may not be that necessary, especially the whole register pressure is not big.
>>
>> currently, what I could think of is for this transformation below, we should
>> increase the costs:
>>     
>> A
>> ==
>>         RA <- virtual_stack_var + RC
>>         RD <- MEM (RA + const_offset)
>>
>>      into:
>>
>> B
>> ==
>>         RA <- virtual_stack_var + const_offset  <--B
>>         RD <- MEM (RA + RC)
>>
>> because the cost is not that cheap, if there is not re-assocation of virtual_stack_var
>> with const_offset, then lra elimination will create another instruction to hold the
>> elimination result, so format A will actually be
>>
>>         RT <- real_stack_pointer + elimination_offset
>>         RA <- RT + RC
>>         RD <- MEM (RA + const_offset)
>>
>> so, the re-assocation and later hoisting of invariant B could actually save two instructions
>> in the loop, this is why there are 15% perf gap for bzip2 under some situation.
> updated patch.
>
> moved this instruction shuffling trick to rtl loop invariant pass.
> as described above, this patch tries to transform A to B, so that
> after the transformation:
>
>     * RA <- virtual_stack_var + const_offset  could be hoisted out of the loop
>     * easy the work of lra elimination as virtual_stack_var is associated with const_offset
>       that the elimination offset could be combined with const_offset automatically.
>
> current rtl loop invariant pass treat "reg <- reg + off" as cheap address, while although
> "reg <- virtual_stack_var + offset" fall into the same format, but it's not that cheap as
> we could also save one lra elimination instruction. so this patch will mark
> "reg <- virtual_stack_var + offset" transformed from A to be expensive, so that it could be
> hoisted later.
>
> after patch, pr62173 is fixed on powerpc64, while *still not on aarch64*. because there are one
> glitch in aarch64_legitimize_address which cause unnecessary complex instructions sequences
> generated when legitimize some addresses. and if we fix that, we will get cheaper address for
> those cases which is generally good, and the cheaper address will cause tree level IVOPT do
> more IVOPT optimization which is generally good also, but from the speck2k result, there
> are actually around 1% code size regression on two cases, the reason is for target support
> post-index address, doing IVOPT may not always be the best choice because we lost the advantage
> of using post-index addressing.
>
> on aarch64, for the following testcase, the ivopted version is complexer than not ivopted version.
>
>       while (oargc--) *(nargv++) = *(oargv++);
>    
> so, I sent the generic fix here only, as it's an independent patch, and could benefit other targets
> like powerpc64 although the issue on aarch64 is still not resolved before we figure out how to let
> the correct fix on aarch64_legitimize_address do not cause code size regression on benchmark caused
> by sub-optimal tree level IVOPT.
>
> and the testcase is marked to run on powerpc64 only at current time.
>
> bootstrap OK on x86-64, no regression.
> bootstrap OK on AArch64, and from speck2k compile dump, there do have a few more RTL loop
> invariants get hoisted.
>
> ok for trunk?
>
> gcc/
> PR62173
>     loop-invariant.c.c (expensive_addr): New hash_table.
>     (need_expensive_addr_check_p): New bool.
>     (find_exits): Rename to "find_exists_and_reshuffle.
>     Support re-shuffle instructions for better loop invariant hoisting.
>     (create_new_invariant): Mark address as expensive if it's generated by re-shuffle.
>     (init_inv_motion_data): Initialize expensive_addr and need_expensive_addr_check_p.
>     (free_inv_motion_data): Release expensive_addr.
>
> gcc/testssuite/
>     PR62173
>     gcc.dg/pr62173.c: New test.

sorry, patch uploaded.

>
>>> Oh, and the patch misses a testcase.
>> It's at the start of the email, will include in the patch.
>>
>> void bar(int i) {
>>      char A[10];
>>      int d = 0;
>>      while (i > 0)
>>      A[d++] = i--;
>>
>>      while (d > 0)
>>      foo(A[d--]);
>> }
>>
>>>> Thanks,
>>>> Richard.
>>>>
>>>>> and this patch only tries to re-shuffle instructions within single basic
>>>>> block which
>>>>> is a inner loop which is perf critical.
>>>>>
>>>>> I am reusing the loop info in fwprop because there is loop info and it's run
>>>>> before
>>>>> GCSE.
>>>>>
>>>>> verified on aarch64 and mips64, the array base address hoisted out of loop.
>>>>>
>>>>> bootstrap ok on x86-64 and aarch64.
>>>>>
>>>>> comments?
>>>>>
>>>>> thanks.
>>>>>
>>>>> gcc/
>>>>>      PR62173
>>>>>      fwprop.c (prepare_for_gcse_pre): New function.
>>>>>      (fwprop_done): Call it.
>>
>
>

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: rtl-iv.patch --]
[-- Type: text/x-patch; name=rtl-iv.patch, Size: 6874 bytes --]

diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
index 19e536f..7a72f10 100644
--- a/gcc/loop-invariant.c
+++ b/gcc/loop-invariant.c
@@ -186,6 +186,8 @@ typedef struct invariant *invariant_p;
 /* The invariants.  */
 
 static vec<invariant_p> invariants;
+static hash_table <pointer_hash <rtx_insn> > *expensive_addr;
+static bool need_expensive_addr_check_p;
 
 /* Check the size of the invariant table and realloc if necessary.  */
 
@@ -576,18 +578,37 @@ compute_always_reached (struct loop *loop, basic_block *body,
 
 /* Finds exits out of the LOOP with body BODY.  Marks blocks in that we may
    exit the loop by cfg edge to HAS_EXIT and MAY_EXIT.  In MAY_EXIT
-   additionally mark blocks that may exit due to a call.  */
+   additionally mark blocks that may exit due to a call.
+
+   This function also try to transform
+
+     RA <- fixed_reg + RC
+     RD <- MEM (RA + const_offset)
+
+  into:
+
+     RA <- fixed_reg + const_offset
+     RD <- MEM (RA + RC) <- pos0
+
+  If RA is DEAD after pos0.
+
+  After this change, the first instruction is loop invariant.  */
 
 static void
-find_exits (struct loop *loop, basic_block *body,
-	    bitmap may_exit, bitmap has_exit)
+find_exits_and_reshuffle (struct loop *loop, basic_block *body,
+			  bitmap may_exit, bitmap has_exit)
 {
   unsigned i;
   edge_iterator ei;
   edge e;
   struct loop *outermost_exit = loop, *aexit;
   bool has_call = false;
-  rtx_insn *insn;
+  rtx_insn *insn, *next_insn;
+  rtx single_set1, single_set2, old_dest;
+  rtx op0, op0_;
+  rtx old_op1, op1, op1_;
+  rtx inner;
+  rtx *mem_plus_loc;
 
   for (i = 0; i < loop->num_nodes; i++)
     {
@@ -597,12 +618,116 @@ find_exits (struct loop *loop, basic_block *body,
 	    {
 	      if (CALL_P (insn)
 		  && (RTL_LOOPING_CONST_OR_PURE_CALL_P (insn)
-		      || !RTL_CONST_OR_PURE_CALL_P (insn)))
+		      || !RTL_CONST_OR_PURE_CALL_P (insn))
+		  && ! has_call)
 		{
 		  has_call = true;
 		  bitmap_set_bit (may_exit, i);
-		  break;
 		}
+
+	      if (! NONDEBUG_INSN_P (insn))
+		continue;
+
+	      single_set1 = single_set (insn);
+
+	      if (! single_set1
+		  || GET_CODE (SET_SRC (single_set1)) != PLUS)
+		continue;
+
+	      old_dest = SET_DEST (single_set1);
+	      op0 = XEXP (SET_SRC (single_set1), 0);
+	      old_op1 = op1 = XEXP (SET_SRC (single_set1), 1);
+
+	      if (op1 == frame_pointer_rtx
+		  || op1 == stack_pointer_rtx
+		  || op1 == virtual_stack_vars_rtx)
+		{
+		  old_op1 = op0;
+		  std::swap (op0, op1);
+		}
+
+	      if (GET_CODE (op1) == SIGN_EXTEND
+		  || GET_CODE (op1) == ZERO_EXTEND)
+		op1 = XEXP (op1, 0);
+
+	      if (! (REG_P (old_dest) && REG_P (op0) && REG_P (op1)
+		     && (op0 == frame_pointer_rtx
+			 || op0 == stack_pointer_rtx
+			 || op0 == virtual_stack_vars_rtx)))
+		continue;
+
+	      if (! (next_insn = next_real_insn (insn)))
+		break;
+
+	      do
+		{
+		  if (DEBUG_INSN_P (next_insn))
+		    continue;
+
+		  single_set2 = single_set (next_insn);
+
+		  if (!single_set2
+		      || (! REG_P (SET_DEST (single_set2))
+			  && ! MEM_P (SET_DEST (single_set2))))
+		    continue;
+
+		  if (REG_P (SET_DEST (single_set2)))
+		    inner = SET_SRC (single_set2);
+		  else
+		    inner = SET_DEST (single_set2);
+
+		  if (GET_CODE (inner) == ZERO_EXTEND
+		      || GET_CODE (inner) == SIGN_EXTEND
+		      || GET_CODE (inner) == TRUNCATE)
+		    inner = XEXP (inner, 0);
+
+		  if (! MEM_P (inner)
+		      || GET_CODE (XEXP (inner, 0)) != PLUS)
+		    continue;
+
+		  mem_plus_loc = &XEXP (inner, 0);
+		  op0_ = XEXP (XEXP (inner, 0), 0);
+		  op1_ = XEXP (XEXP (inner, 0), 1);
+
+		  if (REG_P (op0_) && CONST_INT_P (op1_)
+		      && rtx_equal_p (op0_, old_dest)
+		      && GET_MODE (op0_) == GET_MODE (old_op1))
+		    {
+		      rtx new_src;
+
+		      if (find_regno_note (next_insn, REG_DEAD,
+					   REGNO (old_dest)))
+			{
+			  new_src = plus_constant (GET_MODE (op0), op0,
+						   INTVAL (op1_));
+			  validate_change (insn, &SET_SRC (single_set1),
+					   new_src, 1);
+			  new_src = gen_rtx_PLUS (GET_MODE (op0_), op0_,
+						  old_op1);
+			  validate_change (next_insn, mem_plus_loc, new_src, 1);
+			  if (apply_change_group ())
+			    {
+			      rtx_insn **slot =
+				expensive_addr->find_slot (insn, INSERT);
+
+			      if (*slot)
+				gcc_unreachable ();
+			      else
+				*slot = insn;
+
+			      need_expensive_addr_check_p = true;
+
+			      if (dump_file)
+				fprintf (dump_file,
+					 "\nRe-associate insn %d and %d for later"
+					 " RTL loop invariant hoisting.\n",
+					 INSN_UID (insn), INSN_UID (next_insn));
+			    }
+			}
+		      break;
+		    }
+		} while ((next_insn = next_real_insn (next_insn))
+			 && body[i] == BLOCK_FOR_INSN (next_insn));
 	    }
 
 	  FOR_EACH_EDGE (e, ei, body[i]->succs)
@@ -725,6 +850,9 @@ create_new_invariant (struct def *def, rtx_insn *insn, bitmap depends_on,
 	 See http://gcc.gnu.org/ml/gcc-patches/2009-10/msg01210.html .  */
       inv->cheap_address = address_cost (SET_SRC (set), word_mode,
 					 ADDR_SPACE_GENERIC, speed) < 3;
+      if (need_expensive_addr_check_p
+	  && expensive_addr->find (insn))
+	inv->cheap_address = false;
     }
   else
     {
@@ -1043,7 +1171,7 @@ find_invariants (struct loop *loop)
   bitmap always_executed = BITMAP_ALLOC (NULL);
   basic_block *body = get_loop_body_in_dom_order (loop);
 
-  find_exits (loop, body, may_exit, has_exit);
+  find_exits_and_reshuffle (loop, body, may_exit, has_exit);
   compute_always_reached (loop, body, may_exit, always_reached);
   compute_always_reached (loop, body, has_exit, always_executed);
 
@@ -1628,6 +1756,9 @@ move_invariants (struct loop *loop)
 static void
 init_inv_motion_data (void)
 {
+  need_expensive_addr_check_p = false;
+  expensive_addr = new hash_table <pointer_hash <rtx_insn> > (4);
+
   actual_stamp = 1;
 
   invariants.create (100);
@@ -1663,6 +1794,8 @@ free_inv_motion_data (void)
       free (inv);
     }
   invariants.release ();
+
+  delete expensive_addr;
 }
 
 /* Move the invariants out of the LOOP.  */
diff --git a/gcc/testsuite/gcc.dg/pr62173.c b/gcc/testsuite/gcc.dg/pr62173.c
new file mode 100644
index 0000000..a5a1a86
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr62173.c
@@ -0,0 +1,19 @@
+/* { dg-do compile { target { powerpc64-*-* } } } */
+/* { dg-options "-O2 -fdump-rtl-loop2_invariant" } */
+
+void foo (char);
+
+void
+bar(int i)
+{
+  char A[10];
+  int d = 0;
+  while (i > 0)
+    A[d++] = i--;
+
+  while (d > 0)
+    foo (A[d--]);
+}
+
+/* { dg-final { scan-rtl-dump "Re-associate insn .* loop invariant hoisting" "loop2_invariant" } } */
+/* { dg-final { cleanup-rtl-dump "loop2_invariant" } } */

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
  2014-12-15 15:36         ` Jiong Wang
@ 2014-12-17 16:19           ` Richard Biener
  2014-12-18 17:08             ` Jiong Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Richard Biener @ 2014-12-17 16:19 UTC (permalink / raw)
  To: Jiong Wang; +Cc: gcc-patches

On Mon, Dec 15, 2014 at 4:29 PM, Jiong Wang <jiong.wang@arm.com> wrote:
>
> On 15/12/14 15:28, Jiong Wang wrote:
>>
>> On 04/12/14 19:32, Jiong Wang wrote:
>>>
>>> On 04/12/14 11:07, Richard Biener wrote:
>>>
>>>> On Thu, Dec 4, 2014 at 12:07 PM, Richard Biener
>>>> <richard.guenther@gmail.com> wrote:
>>>>>
>>>>> On Thu, Dec 4, 2014 at 12:00 PM, Jiong Wang <jiong.wang@arm.com> wrote:
>>>>>
>>>>>
>>>>> which means re-associate the constant imm with the virtual frame
>>>>> pointer.
>>>>>
>>>>> transform
>>>>>
>>>>>         RA <- fixed_reg + RC
>>>>>         RD <- MEM (RA + const_offset)
>>>>>
>>>>>      into:
>>>>>
>>>>>         RA <- fixed_reg + const_offset
>>>>>         RD <- MEM (RA + RC)
>>>>>
>>>>> then RA <- fixed_reg + const_offset is actually loop invariant, so the
>>>>> later
>>>>> RTL GCSE PRE pass could catch it and do the hoisting, and thus
>>>>> ameliorate
>>>>> what tree
>>>>> level ivopts could not sort out.
>>>>> There is a LIM pass after gimple ivopts - if the invariantness is
>>>>> already
>>>>> visible there why not handle it there similar to the special-cases in
>>>>> rewrite_bittest and rewrite_reciprocal?
>>>
>>> maybe, needs further check.
>>>
>>>>> And of course similar tricks could be applied on the RTL level to
>>>>> RTL invariant motion?
>>>
>>> Thanks. I just checked the code, yes, loop invariant motion pass
>>> is the natural place to integrate such multi-insns invariant analysis
>>> trick.
>>>
>>> those code could be integrated into loop-invariant.c cleanly, but
>>> then I found although the invariant detected successfully but it's not
>>> moved
>>> out of loop because of cost issue, and looks like the patch committed to
>>> fix PR33928
>>> is trying to prevent such cheap address be hoisted to reduce register
>>> pressure.
>>>
>>>
>>>     805       /* ??? Try to determine cheapness of address computation.
>>> Unfortunately
>>>     806          the address cost is only a relative measure, we can't
>>> really compare
>>>     807          it with any absolute number, but only with other address
>>> costs.
>>>     808          But here we don't have any other addresses, so compare
>>> with a magic
>>>     809          number anyway.  It has to be large enough to not regress
>>> PR33928
>>>     810          (by avoiding to move reg+8,reg+16,reg+24 invariants),
>>> but small
>>>     811          enough to not regress 410.bwaves either (by still moving
>>> reg+reg
>>>     812          invariants).
>>>     813          See
>>> http://gcc.gnu.org/ml/gcc-patches/2009-10/msg01210.html .  */
>>>     814       inv->cheap_address = address_cost (SET_SRC (set),
>>> word_mode,
>>>     815                                          ADDR_SPACE_GENERIC,
>>> speed) < 3;
>>>
>>>
>>> I think that maybe necessary for x86 which is short of register, while
>>> for RISC,
>>> it may not be that necessary, especially the whole register pressure is
>>> not big.
>>>
>>> currently, what I could think of is for this transformation below, we
>>> should
>>> increase the costs:
>>>     A
>>> ==
>>>         RA <- virtual_stack_var + RC
>>>         RD <- MEM (RA + const_offset)
>>>
>>>      into:
>>>
>>> B
>>> ==
>>>         RA <- virtual_stack_var + const_offset  <--B
>>>         RD <- MEM (RA + RC)
>>>
>>> because the cost is not that cheap, if there is not re-assocation of
>>> virtual_stack_var
>>> with const_offset, then lra elimination will create another instruction
>>> to hold the
>>> elimination result, so format A will actually be
>>>
>>>         RT <- real_stack_pointer + elimination_offset
>>>         RA <- RT + RC
>>>         RD <- MEM (RA + const_offset)
>>>
>>> so, the re-assocation and later hoisting of invariant B could actually
>>> save two instructions
>>> in the loop, this is why there are 15% perf gap for bzip2 under some
>>> situation.
>>
>> updated patch.
>>
>> moved this instruction shuffling trick to rtl loop invariant pass.
>> as described above, this patch tries to transform A to B, so that
>> after the transformation:
>>
>>     * RA <- virtual_stack_var + const_offset  could be hoisted out of the
>> loop
>>     * easy the work of lra elimination as virtual_stack_var is associated
>> with const_offset
>>       that the elimination offset could be combined with const_offset
>> automatically.
>>
>> current rtl loop invariant pass treat "reg <- reg + off" as cheap address,
>> while although
>> "reg <- virtual_stack_var + offset" fall into the same format, but it's
>> not that cheap as
>> we could also save one lra elimination instruction. so this patch will
>> mark
>> "reg <- virtual_stack_var + offset" transformed from A to be expensive, so
>> that it could be
>> hoisted later.
>>
>> after patch, pr62173 is fixed on powerpc64, while *still not on aarch64*.
>> because there are one
>> glitch in aarch64_legitimize_address which cause unnecessary complex
>> instructions sequences
>> generated when legitimize some addresses. and if we fix that, we will get
>> cheaper address for
>> those cases which is generally good, and the cheaper address will cause
>> tree level IVOPT do
>> more IVOPT optimization which is generally good also, but from the speck2k
>> result, there
>> are actually around 1% code size regression on two cases, the reason is
>> for target support
>> post-index address, doing IVOPT may not always be the best choice because
>> we lost the advantage
>> of using post-index addressing.
>>
>> on aarch64, for the following testcase, the ivopted version is complexer
>> than not ivopted version.
>>
>>       while (oargc--) *(nargv++) = *(oargv++);
>>    so, I sent the generic fix here only, as it's an independent patch, and
>> could benefit other targets
>> like powerpc64 although the issue on aarch64 is still not resolved before
>> we figure out how to let
>> the correct fix on aarch64_legitimize_address do not cause code size
>> regression on benchmark caused
>> by sub-optimal tree level IVOPT.
>>
>> and the testcase is marked to run on powerpc64 only at current time.
>>
>> bootstrap OK on x86-64, no regression.
>> bootstrap OK on AArch64, and from speck2k compile dump, there do have a
>> few more RTL loop
>> invariants get hoisted.
>>
>> ok for trunk?
>>
>> gcc/
>> PR62173
>>     loop-invariant.c.c (expensive_addr): New hash_table.
>>     (need_expensive_addr_check_p): New bool.
>>     (find_exits): Rename to "find_exists_and_reshuffle.
>>     Support re-shuffle instructions for better loop invariant hoisting.
>>     (create_new_invariant): Mark address as expensive if it's generated by
>> re-shuffle.
>>     (init_inv_motion_data): Initialize expensive_addr and
>> need_expensive_addr_check_p.
>>     (free_inv_motion_data): Release expensive_addr.
>>
>> gcc/testssuite/
>>     PR62173
>>     gcc.dg/pr62173.c: New test.
>
>
> sorry, patch uploaded.

+             do
+               {
...
+               } while ((next_insn = next_real_insn (next_insn))
+                        && body[i] == BLOCK_FOR_INSN (next_insn));

ick.  I realize we don't have SSA form on RTL but doesn't DF provide
at least some help in looking up definition statements for pseudos?
In fact we want to restrict the transform to single-use pseudos, thus
hopefully it can at least tell us that... (maybe not and this is what
LOG_LINKS are for in combine...?)  At least loop-invariant alreadly
computes df_chain with DF_UD_CHAIN which seems exactly what
is needed (apart from maybe getting at single-use info).

The meat of this should be factored out to a separate function I guess.

Otherwise my RTL fu is too weak to review this properly.

Thanks,
Richard.

>
>>
>>>> Oh, and the patch misses a testcase.
>>>
>>> It's at the start of the email, will include in the patch.
>>>
>>> void bar(int i) {
>>>      char A[10];
>>>      int d = 0;
>>>      while (i > 0)
>>>      A[d++] = i--;
>>>
>>>      while (d > 0)
>>>      foo(A[d--]);
>>> }
>>>
>>>>> Thanks,
>>>>> Richard.
>>>>>
>>>>>> and this patch only tries to re-shuffle instructions within single
>>>>>> basic
>>>>>> block which
>>>>>> is a inner loop which is perf critical.
>>>>>>
>>>>>> I am reusing the loop info in fwprop because there is loop info and
>>>>>> it's run
>>>>>> before
>>>>>> GCSE.
>>>>>>
>>>>>> verified on aarch64 and mips64, the array base address hoisted out of
>>>>>> loop.
>>>>>>
>>>>>> bootstrap ok on x86-64 and aarch64.
>>>>>>
>>>>>> comments?
>>>>>>
>>>>>> thanks.
>>>>>>
>>>>>> gcc/
>>>>>>      PR62173
>>>>>>      fwprop.c (prepare_for_gcse_pre): New function.
>>>>>>      (fwprop_done): Call it.
>>>
>>>
>>
>>
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
  2014-12-17 16:19           ` Richard Biener
@ 2014-12-18 17:08             ` Jiong Wang
  2014-12-18 21:16               ` Jiong Wang
  2014-12-18 22:19               ` Segher Boessenkool
  0 siblings, 2 replies; 37+ messages in thread
From: Jiong Wang @ 2014-12-18 17:08 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches


On 17/12/14 15:54, Richard Biener wrote:
> On Mon, Dec 15, 2014 at 4:29 PM, Jiong Wang <jiong.wang@arm.com> wrote:
>> On 15/12/14 15:28, Jiong Wang wrote:
>>> On 04/12/14 19:32, Jiong Wang wrote:
>>>> On 04/12/14 11:07, Richard Biener wrote:
>>>>
>>>>> On Thu, Dec 4, 2014 at 12:07 PM, Richard Biener
>>>>> <richard.guenther@gmail.com> wrote:
>>>>>> On Thu, Dec 4, 2014 at 12:00 PM, Jiong Wang <jiong.wang@arm.com> wrote:
>>>>>>
>>>>>>
>>>>>> which means re-associate the constant imm with the virtual frame
>>>>>> pointer.
>>>>>>
>>>>>> transform
>>>>>>
>>>>>>          RA <- fixed_reg + RC
>>>>>>          RD <- MEM (RA + const_offset)
>>>>>>
>>>>>>       into:
>>>>>>
>>>>>>          RA <- fixed_reg + const_offset
>>>>>>          RD <- MEM (RA + RC)
>>>>>>
>>>>>> then RA <- fixed_reg + const_offset is actually loop invariant, so the
>>>>>> later
>>>>>> RTL GCSE PRE pass could catch it and do the hoisting, and thus
>>>>>> ameliorate
>>>>>> what tree
>>>>>> level ivopts could not sort out.
>>>>>> There is a LIM pass after gimple ivopts - if the invariantness is
>>>>>> already
>>>>>> visible there why not handle it there similar to the special-cases in
>>>>>> rewrite_bittest and rewrite_reciprocal?
>>>> maybe, needs further check.
>>>>
>>>>>> And of course similar tricks could be applied on the RTL level to
>>>>>> RTL invariant motion?
>>>> Thanks. I just checked the code, yes, loop invariant motion pass
>>>> is the natural place to integrate such multi-insns invariant analysis
>>>> trick.
>>>>
>>>> those code could be integrated into loop-invariant.c cleanly, but
>>>> then I found although the invariant detected successfully but it's not
>>>> moved
>>>> out of loop because of cost issue, and looks like the patch committed to
>>>> fix PR33928
>>>> is trying to prevent such cheap address be hoisted to reduce register
>>>> pressure.
>>>>
>>>>
>>>>      805       /* ??? Try to determine cheapness of address computation.
>>>> Unfortunately
>>>>      806          the address cost is only a relative measure, we can't
>>>> really compare
>>>>      807          it with any absolute number, but only with other address
>>>> costs.
>>>>      808          But here we don't have any other addresses, so compare
>>>> with a magic
>>>>      809          number anyway.  It has to be large enough to not regress
>>>> PR33928
>>>>      810          (by avoiding to move reg+8,reg+16,reg+24 invariants),
>>>> but small
>>>>      811          enough to not regress 410.bwaves either (by still moving
>>>> reg+reg
>>>>      812          invariants).
>>>>      813          See
>>>> http://gcc.gnu.org/ml/gcc-patches/2009-10/msg01210.html .  */
>>>>      814       inv->cheap_address = address_cost (SET_SRC (set),
>>>> word_mode,
>>>>      815                                          ADDR_SPACE_GENERIC,
>>>> speed) < 3;
>>>>
>>>>
>>>> I think that maybe necessary for x86 which is short of register, while
>>>> for RISC,
>>>> it may not be that necessary, especially the whole register pressure is
>>>> not big.
>>>>
>>>> currently, what I could think of is for this transformation below, we
>>>> should
>>>> increase the costs:
>>>>      A
>>>> ==
>>>>          RA <- virtual_stack_var + RC
>>>>          RD <- MEM (RA + const_offset)
>>>>
>>>>       into:
>>>>
>>>> B
>>>> ==
>>>>          RA <- virtual_stack_var + const_offset  <--B
>>>>          RD <- MEM (RA + RC)
>>>>
>>>> because the cost is not that cheap, if there is not re-assocation of
>>>> virtual_stack_var
>>>> with const_offset, then lra elimination will create another instruction
>>>> to hold the
>>>> elimination result, so format A will actually be
>>>>
>>>>          RT <- real_stack_pointer + elimination_offset
>>>>          RA <- RT + RC
>>>>          RD <- MEM (RA + const_offset)
>>>>
>>>> so, the re-assocation and later hoisting of invariant B could actually
>>>> save two instructions
>>>> in the loop, this is why there are 15% perf gap for bzip2 under some
>>>> situation.
>>> updated patch.
>>>
>>> moved this instruction shuffling trick to rtl loop invariant pass.
>>> as described above, this patch tries to transform A to B, so that
>>> after the transformation:
>>>
>>>      * RA <- virtual_stack_var + const_offset  could be hoisted out of the
>>> loop
>>>      * easy the work of lra elimination as virtual_stack_var is associated
>>> with const_offset
>>>        that the elimination offset could be combined with const_offset
>>> automatically.
>>>
>>> current rtl loop invariant pass treat "reg <- reg + off" as cheap address,
>>> while although
>>> "reg <- virtual_stack_var + offset" fall into the same format, but it's
>>> not that cheap as
>>> we could also save one lra elimination instruction. so this patch will
>>> mark
>>> "reg <- virtual_stack_var + offset" transformed from A to be expensive, so
>>> that it could be
>>> hoisted later.
>>>
>>> after patch, pr62173 is fixed on powerpc64, while *still not on aarch64*.
>>> because there are one
>>> glitch in aarch64_legitimize_address which cause unnecessary complex
>>> instructions sequences
>>> generated when legitimize some addresses. and if we fix that, we will get
>>> cheaper address for
>>> those cases which is generally good, and the cheaper address will cause
>>> tree level IVOPT do
>>> more IVOPT optimization which is generally good also, but from the speck2k
>>> result, there
>>> are actually around 1% code size regression on two cases, the reason is
>>> for target support
>>> post-index address, doing IVOPT may not always be the best choice because
>>> we lost the advantage
>>> of using post-index addressing.
>>>
>>> on aarch64, for the following testcase, the ivopted version is complexer
>>> than not ivopted version.
>>>
>>>        while (oargc--) *(nargv++) = *(oargv++);
>>>     so, I sent the generic fix here only, as it's an independent patch, and
>>> could benefit other targets
>>> like powerpc64 although the issue on aarch64 is still not resolved before
>>> we figure out how to let
>>> the correct fix on aarch64_legitimize_address do not cause code size
>>> regression on benchmark caused
>>> by sub-optimal tree level IVOPT.
>>>
>>> and the testcase is marked to run on powerpc64 only at current time.
>>>
>>> bootstrap OK on x86-64, no regression.
>>> bootstrap OK on AArch64, and from speck2k compile dump, there do have a
>>> few more RTL loop
>>> invariants get hoisted.
>>>
>>> ok for trunk?
>>>
>>> gcc/
>>> PR62173
>>>      loop-invariant.c.c (expensive_addr): New hash_table.
>>>      (need_expensive_addr_check_p): New bool.
>>>      (find_exits): Rename to "find_exists_and_reshuffle.
>>>      Support re-shuffle instructions for better loop invariant hoisting.
>>>      (create_new_invariant): Mark address as expensive if it's generated by
>>> re-shuffle.
>>>      (init_inv_motion_data): Initialize expensive_addr and
>>> need_expensive_addr_check_p.
>>>      (free_inv_motion_data): Release expensive_addr.
>>>
>>> gcc/testssuite/
>>>      PR62173
>>>      gcc.dg/pr62173.c: New test.
>>
>> sorry, patch uploaded.
> +             do
> +               {
> ...
> +               } while ((next_insn = next_real_insn (next_insn))
> +                        && body[i] == BLOCK_FOR_INSN (next_insn));
>
> ick.  I realize we don't have SSA form on RTL but doesn't DF provide
> at least some help in looking up definition statements for pseudos?
> In fact we want to restrict the transform to single-use pseudos, thus
> hopefully it can at least tell us that... (maybe not and this is what
> LOG_LINKS are for in combine...?)  At least loop-invariant alreadly
> computes df_chain with DF_UD_CHAIN which seems exactly what
> is needed (apart from maybe getting at single-use info).

thanks very much for these inspiring questions.

yes, we want to restrict the transformation on single-use pseudo only,
and it's better the transformation could re-use existed info and helper
function to avoid increase compile time. but I haven't found anything I
can reuse at the stage the transformation happen.

the info similar as LOG_LINKS is what I want, but maybe simpler. I'd study
the code about build LOG_LINKS, and try to see if we can do some factor out.

thanks.

Regards,
Jiong

>
> The meat of this should be factored out to a separate function I guess.
>
> Otherwise my RTL fu is too weak to review this properly.
>
> Thanks,
> Richard.
>
>>>>> Oh, and the patch misses a testcase.
>>>> It's at the start of the email, will include in the patch.
>>>>
>>>> void bar(int i) {
>>>>       char A[10];
>>>>       int d = 0;
>>>>       while (i > 0)
>>>>       A[d++] = i--;
>>>>
>>>>       while (d > 0)
>>>>       foo(A[d--]);
>>>> }
>>>>
>>>>>> Thanks,
>>>>>> Richard.
>>>>>>
>>>>>>> and this patch only tries to re-shuffle instructions within single
>>>>>>> basic
>>>>>>> block which
>>>>>>> is a inner loop which is perf critical.
>>>>>>>
>>>>>>> I am reusing the loop info in fwprop because there is loop info and
>>>>>>> it's run
>>>>>>> before
>>>>>>> GCSE.
>>>>>>>
>>>>>>> verified on aarch64 and mips64, the array base address hoisted out of
>>>>>>> loop.
>>>>>>>
>>>>>>> bootstrap ok on x86-64 and aarch64.
>>>>>>>
>>>>>>> comments?
>>>>>>>
>>>>>>> thanks.
>>>>>>>
>>>>>>> gcc/
>>>>>>>       PR62173
>>>>>>>       fwprop.c (prepare_for_gcse_pre): New function.
>>>>>>>       (fwprop_done): Call it.
>>>>
>>>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
  2014-12-18 17:08             ` Jiong Wang
@ 2014-12-18 21:16               ` Jiong Wang
  2014-12-18 22:19               ` Segher Boessenkool
  1 sibling, 0 replies; 37+ messages in thread
From: Jiong Wang @ 2014-12-18 21:16 UTC (permalink / raw)
  To: Jiong Wang; +Cc: Richard Biener, gcc-patches

2014-12-18 17:00 GMT+00:00 Jiong Wang <jiong.wang@arm.com>:
>>>> ok for trunk?
>>>>
>>>> gcc/
>>>> PR62173
>>>>      loop-invariant.c.c (expensive_addr): New hash_table.
>>>>      (need_expensive_addr_check_p): New bool.
>>>>      (find_exits): Rename to "find_exists_and_reshuffle.
>>>>      Support re-shuffle instructions for better loop invariant hoisting.

another question is, is it safe to re-use REG_DEAD info here without
calling df_note_add_problem and df_analysis first?
am I using those info passed down from the previous pass which
calculated these info and maybe broken?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
  2014-12-18 17:08             ` Jiong Wang
  2014-12-18 21:16               ` Jiong Wang
@ 2014-12-18 22:19               ` Segher Boessenkool
  2014-12-19  4:06                 ` Bin.Cheng
  1 sibling, 1 reply; 37+ messages in thread
From: Segher Boessenkool @ 2014-12-18 22:19 UTC (permalink / raw)
  To: Jiong Wang; +Cc: Richard Biener, gcc-patches

On Thu, Dec 18, 2014 at 05:00:01PM +0000, Jiong Wang wrote:
> On 17/12/14 15:54, Richard Biener wrote:
> >ick.  I realize we don't have SSA form on RTL but doesn't DF provide
> >at least some help in looking up definition statements for pseudos?
> >In fact we want to restrict the transform to single-use pseudos, thus
> >hopefully it can at least tell us that... (maybe not and this is what
> >LOG_LINKS are for in combine...?)  At least loop-invariant alreadly
> >computes df_chain with DF_UD_CHAIN which seems exactly what
> >is needed (apart from maybe getting at single-use info).
> 
> thanks very much for these inspiring questions.
> 
> yes, we want to restrict the transformation on single-use pseudo only,
> and it's better the transformation could re-use existed info and helper
> function to avoid increase compile time. but I haven't found anything I
> can reuse at the stage the transformation happen.
> 
> the info similar as LOG_LINKS is what I want, but maybe simpler. I'd study
> the code about build LOG_LINKS, and try to see if we can do some factor out.

LOG_LINKs in combine are just historical.  combine should be converted
to use DF fully.

LOG_LINKs have nothing to do with single use; they point from the _first_
use to its corresponding def.

You might want to look at what fwprop does instead.


Segher

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
  2014-12-18 22:19               ` Segher Boessenkool
@ 2014-12-19  4:06                 ` Bin.Cheng
  2014-12-19 10:29                   ` Jiong Wang
  2014-12-19 15:21                   ` Segher Boessenkool
  0 siblings, 2 replies; 37+ messages in thread
From: Bin.Cheng @ 2014-12-19  4:06 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: Jiong Wang, Richard Biener, gcc-patches

On Fri, Dec 19, 2014 at 6:09 AM, Segher Boessenkool
<segher@kernel.crashing.org> wrote:
> On Thu, Dec 18, 2014 at 05:00:01PM +0000, Jiong Wang wrote:
>> On 17/12/14 15:54, Richard Biener wrote:
>> >ick.  I realize we don't have SSA form on RTL but doesn't DF provide
>> >at least some help in looking up definition statements for pseudos?
>> >In fact we want to restrict the transform to single-use pseudos, thus
>> >hopefully it can at least tell us that... (maybe not and this is what
>> >LOG_LINKS are for in combine...?)  At least loop-invariant alreadly
>> >computes df_chain with DF_UD_CHAIN which seems exactly what
>> >is needed (apart from maybe getting at single-use info).
>>
>> thanks very much for these inspiring questions.
>>
>> yes, we want to restrict the transformation on single-use pseudo only,
>> and it's better the transformation could re-use existed info and helper
>> function to avoid increase compile time. but I haven't found anything I
>> can reuse at the stage the transformation happen.
>>
>> the info similar as LOG_LINKS is what I want, but maybe simpler. I'd study
>> the code about build LOG_LINKS, and try to see if we can do some factor out.
>
> LOG_LINKs in combine are just historical.  combine should be converted
> to use DF fully.
>
> LOG_LINKs have nothing to do with single use; they point from the _first_
> use to its corresponding def.
>
> You might want to look at what fwprop does instead.
Pass rtl fwprop uses df information in single-definition way, it
doesn't really take into consideration if register is a single use.
This often corrupts other optimizations like post-increment and
load/store pair.  For example:

  add r2, r1, r0
  ldr rx, [r2]
  add r2, r2, #4
is transformed into below form:
  add r2, r1, r0
  ldr rx, [r1, r0]
  add r2, r2, #4

As a result, post-increment opportunity is corrupted, also definition
of r2 can't be deleted because it's not single use.

Thanks,
bin

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
  2014-12-19  4:06                 ` Bin.Cheng
@ 2014-12-19 10:29                   ` Jiong Wang
  2014-12-19 11:45                     ` Richard Biener
  2014-12-19 12:09                     ` Eric Botcazou
  2014-12-19 15:21                   ` Segher Boessenkool
  1 sibling, 2 replies; 37+ messages in thread
From: Jiong Wang @ 2014-12-19 10:29 UTC (permalink / raw)
  To: Bin.Cheng; +Cc: Segher Boessenkool, Jiong Wang, Richard Biener, gcc-patches

2014-12-19 3:51 GMT+00:00 Bin.Cheng <amker.cheng@gmail.com>:
> On Fri, Dec 19, 2014 at 6:09 AM, Segher Boessenkool
> <segher@kernel.crashing.org> wrote:
>> On Thu, Dec 18, 2014 at 05:00:01PM +0000, Jiong Wang wrote:
>>> On 17/12/14 15:54, Richard Biener wrote:
>>> >ick.  I realize we don't have SSA form on RTL but doesn't DF provide
>>> >at least some help in looking up definition statements for pseudos?
>>> >In fact we want to restrict the transform to single-use pseudos, thus
>>> >hopefully it can at least tell us that... (maybe not and this is what
>>> >LOG_LINKS are for in combine...?)  At least loop-invariant alreadly
>>> >computes df_chain with DF_UD_CHAIN which seems exactly what
>>> >is needed (apart from maybe getting at single-use info).
>>>
>>> thanks very much for these inspiring questions.
>>>
>>> yes, we want to restrict the transformation on single-use pseudo only,
>>> and it's better the transformation could re-use existed info and helper
>>> function to avoid increase compile time. but I haven't found anything I
>>> can reuse at the stage the transformation happen.
>>>
>>> the info similar as LOG_LINKS is what I want, but maybe simpler. I'd study
>>> the code about build LOG_LINKS, and try to see if we can do some factor out.
>>
>> LOG_LINKs in combine are just historical.  combine should be converted
>> to use DF fully.
>>
>> LOG_LINKs have nothing to do with single use; they point from the _first_
>> use to its corresponding def.
>>
>> You might want to look at what fwprop does instead.
> Pass rtl fwprop uses df information in single-definition way, it
> doesn't really take into consideration if register is a single use.
> This often corrupts other optimizations like post-increment and
> load/store pair.  For example:
>
>   add r2, r1, r0
>   ldr rx, [r2]
>   add r2, r2, #4
> is transformed into below form:
>   add r2, r1, r0
>   ldr rx, [r1, r0]
>   add r2, r2, #4
>
> As a result, post-increment opportunity is corrupted, also definition
> of r2 can't be deleted because it's not single use.
>
> Thanks,
> bin

thanks for all these suggestion.

Have look at the LOG_LINK build function, a simple reverse scan, while
needs to allocate big map array for all pseudo regs. Haven't looked at
similar code in fwprop,

actually, when found the first matched insn pattern, I just want to
scan several insns next, then abort quickly if nothing meet
requirement. there is no need to build full single-use information.

still can anyone confirm that it is safe to re-use REG_DEAD info there
without calling df_note_add_problem and df_analysis first? or I am
using those info passed down from the previous pass which calculated
these info and maybe broken?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
  2014-12-19 10:29                   ` Jiong Wang
@ 2014-12-19 11:45                     ` Richard Biener
  2014-12-19 15:31                       ` Kenneth Zadeck
  2014-12-19 12:09                     ` Eric Botcazou
  1 sibling, 1 reply; 37+ messages in thread
From: Richard Biener @ 2014-12-19 11:45 UTC (permalink / raw)
  To: Jiong Wang
  Cc: Bin.Cheng, Segher Boessenkool, Jiong Wang, gcc-patches, Kenneth Zadeck

On Fri, Dec 19, 2014 at 11:28 AM, Jiong Wang
<wong.kwongyuan.tools@gmail.com> wrote:
> 2014-12-19 3:51 GMT+00:00 Bin.Cheng <amker.cheng@gmail.com>:
>> On Fri, Dec 19, 2014 at 6:09 AM, Segher Boessenkool
>> <segher@kernel.crashing.org> wrote:
>>> On Thu, Dec 18, 2014 at 05:00:01PM +0000, Jiong Wang wrote:
>>>> On 17/12/14 15:54, Richard Biener wrote:
>>>> >ick.  I realize we don't have SSA form on RTL but doesn't DF provide
>>>> >at least some help in looking up definition statements for pseudos?
>>>> >In fact we want to restrict the transform to single-use pseudos, thus
>>>> >hopefully it can at least tell us that... (maybe not and this is what
>>>> >LOG_LINKS are for in combine...?)  At least loop-invariant alreadly
>>>> >computes df_chain with DF_UD_CHAIN which seems exactly what
>>>> >is needed (apart from maybe getting at single-use info).
>>>>
>>>> thanks very much for these inspiring questions.
>>>>
>>>> yes, we want to restrict the transformation on single-use pseudo only,
>>>> and it's better the transformation could re-use existed info and helper
>>>> function to avoid increase compile time. but I haven't found anything I
>>>> can reuse at the stage the transformation happen.
>>>>
>>>> the info similar as LOG_LINKS is what I want, but maybe simpler. I'd study
>>>> the code about build LOG_LINKS, and try to see if we can do some factor out.
>>>
>>> LOG_LINKs in combine are just historical.  combine should be converted
>>> to use DF fully.

As noted loop-invariant already computes DF use-def chains.  The question
is whether it does compute single-use info when doing that (should be trivially
possible).  Kenny?

>>> LOG_LINKs have nothing to do with single use; they point from the _first_
>>> use to its corresponding def.
>>>
>>> You might want to look at what fwprop does instead.
>> Pass rtl fwprop uses df information in single-definition way, it
>> doesn't really take into consideration if register is a single use.
>> This often corrupts other optimizations like post-increment and
>> load/store pair.  For example:
>>
>>   add r2, r1, r0
>>   ldr rx, [r2]
>>   add r2, r2, #4
>> is transformed into below form:
>>   add r2, r1, r0
>>   ldr rx, [r1, r0]
>>   add r2, r2, #4
>>
>> As a result, post-increment opportunity is corrupted, also definition
>> of r2 can't be deleted because it's not single use.
>>
>> Thanks,
>> bin
>
> thanks for all these suggestion.
>
> Have look at the LOG_LINK build function, a simple reverse scan, while
> needs to allocate big map array for all pseudo regs. Haven't looked at
> similar code in fwprop,
>
> actually, when found the first matched insn pattern, I just want to
> scan several insns next, then abort quickly if nothing meet
> requirement. there is no need to build full single-use information.
>
> still can anyone confirm that it is safe to re-use REG_DEAD info there
> without calling df_note_add_problem and df_analysis first? or I am
> using those info passed down from the previous pass which calculated
> these info and maybe broken?

It's not safe to use REG_DEAD info without re-computing it.

Richard.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
  2014-12-19 10:29                   ` Jiong Wang
  2014-12-19 11:45                     ` Richard Biener
@ 2014-12-19 12:09                     ` Eric Botcazou
  1 sibling, 0 replies; 37+ messages in thread
From: Eric Botcazou @ 2014-12-19 12:09 UTC (permalink / raw)
  To: Jiong Wang
  Cc: gcc-patches, Bin.Cheng, Segher Boessenkool, Jiong Wang, Richard Biener

> still can anyone confirm that it is safe to re-use REG_DEAD info there
> without calling df_note_add_problem and df_analysis first? or I am
> using those info passed down from the previous pass which calculated
> these info and maybe broken?

It is generally _not_ safe to consume REG_UNUSED and REG_DEAD notes without 
recomputing them first.  fwprop.c recomputes them on entry through the call to 
build_single_def_use_links (not clear why since it doesn't seem to use them) 
but it can presumably invalidate them before calling fwprop_done.

I think it would be better to avoid using them in fwprop.c if possible.

-- 
Eric Botcazou

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
  2014-12-19  4:06                 ` Bin.Cheng
  2014-12-19 10:29                   ` Jiong Wang
@ 2014-12-19 15:21                   ` Segher Boessenkool
  1 sibling, 0 replies; 37+ messages in thread
From: Segher Boessenkool @ 2014-12-19 15:21 UTC (permalink / raw)
  To: Bin.Cheng; +Cc: Jiong Wang, Richard Biener, gcc-patches

On Fri, Dec 19, 2014 at 11:51:06AM +0800, Bin.Cheng wrote:
> >> yes, we want to restrict the transformation on single-use pseudo only,
> >> and it's better the transformation could re-use existed info and helper
> >> function to avoid increase compile time. but I haven't found anything I
> >> can reuse at the stage the transformation happen.
> >>
> >> the info similar as LOG_LINKS is what I want, but maybe simpler. I'd study
> >> the code about build LOG_LINKS, and try to see if we can do some factor out.
> >
> > LOG_LINKs in combine are just historical.  combine should be converted
> > to use DF fully.
> >
> > LOG_LINKs have nothing to do with single use; they point from the _first_
> > use to its corresponding def.
> >
> > You might want to look at what fwprop does instead.
> Pass rtl fwprop uses df information in single-definition way, it
> doesn't really take into consideration if register is a single use.

Sure, because that's not what fwprop is meant to do.  It has all the
information though, and combine (in the LOG_LINKs) does not: it does
not encode single uses, it does not encode all uses, it does not
encode all defs, it doesn't even look outside of a single basic block.


Segher

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
  2014-12-19 11:45                     ` Richard Biener
@ 2014-12-19 15:31                       ` Kenneth Zadeck
  2015-02-11 11:20                         ` Jiong Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Kenneth Zadeck @ 2014-12-19 15:31 UTC (permalink / raw)
  To: Richard Biener, Jiong Wang
  Cc: Bin.Cheng, Segher Boessenkool, Jiong Wang, gcc-patches

On 12/19/2014 06:26 AM, Richard Biener wrote:
> On Fri, Dec 19, 2014 at 11:28 AM, Jiong Wang
> <wong.kwongyuan.tools@gmail.com> wrote:
>> 2014-12-19 3:51 GMT+00:00 Bin.Cheng <amker.cheng@gmail.com>:
>>> On Fri, Dec 19, 2014 at 6:09 AM, Segher Boessenkool
>>> <segher@kernel.crashing.org> wrote:
>>>> On Thu, Dec 18, 2014 at 05:00:01PM +0000, Jiong Wang wrote:
>>>>> On 17/12/14 15:54, Richard Biener wrote:
>>>>>> ick.  I realize we don't have SSA form on RTL but doesn't DF provide
>>>>>> at least some help in looking up definition statements for pseudos?
>>>>>> In fact we want to restrict the transform to single-use pseudos, thus
>>>>>> hopefully it can at least tell us that... (maybe not and this is what
>>>>>> LOG_LINKS are for in combine...?)  At least loop-invariant alreadly
>>>>>> computes df_chain with DF_UD_CHAIN which seems exactly what
>>>>>> is needed (apart from maybe getting at single-use info).
>>>>> thanks very much for these inspiring questions.
>>>>>
>>>>> yes, we want to restrict the transformation on single-use pseudo only,
>>>>> and it's better the transformation could re-use existed info and helper
>>>>> function to avoid increase compile time. but I haven't found anything I
>>>>> can reuse at the stage the transformation happen.
>>>>>
>>>>> the info similar as LOG_LINKS is what I want, but maybe simpler. I'd study
>>>>> the code about build LOG_LINKS, and try to see if we can do some factor out.
>>>> LOG_LINKs in combine are just historical.  combine should be converted
>>>> to use DF fully.
> As noted loop-invariant already computes DF use-def chains.  The question
> is whether it does compute single-use info when doing that (should be trivially
> possible).  Kenny?
In the US, (where i live), there is something called the "statute of 
limitations" which says that after 7 years you cannot be found 
accountable for anything except the most heinous of crimes.   It has 
been more than 7 years since i did this.

however, since i am a nice person ....

loop-invariant solves the DF_UD_CHAIN which builds a data structure that 
connects each use with all of the defs that reach it.   I believe that 
this is the opposite of what you want here.

if you really need this, you need to also turn on the DF_DU_CHAIN which 
builds the opposite structure.    Both structures can be space hogs, but 
they are both turned on in other places in the compiler so it is 
unlikely to be an issue.


>
>>>> LOG_LINKs have nothing to do with single use; they point from the _first_
>>>> use to its corresponding def.
>>>>
>>>> You might want to look at what fwprop does instead.
>>> Pass rtl fwprop uses df information in single-definition way, it
>>> doesn't really take into consideration if register is a single use.
>>> This often corrupts other optimizations like post-increment and
>>> load/store pair.  For example:
>>>
>>>    add r2, r1, r0
>>>    ldr rx, [r2]
>>>    add r2, r2, #4
>>> is transformed into below form:
>>>    add r2, r1, r0
>>>    ldr rx, [r1, r0]
>>>    add r2, r2, #4
>>>
>>> As a result, post-increment opportunity is corrupted, also definition
>>> of r2 can't be deleted because it's not single use.
>>>
>>> Thanks,
>>> bin
>> thanks for all these suggestion.
>>
>> Have look at the LOG_LINK build function, a simple reverse scan, while
>> needs to allocate big map array for all pseudo regs. Haven't looked at
>> similar code in fwprop,
>>
>> actually, when found the first matched insn pattern, I just want to
>> scan several insns next, then abort quickly if nothing meet
>> requirement. there is no need to build full single-use information.
>>
>> still can anyone confirm that it is safe to re-use REG_DEAD info there
>> without calling df_note_add_problem and df_analysis first? or I am
>> using those info passed down from the previous pass which calculated
>> these info and maybe broken?
> It's not safe to use REG_DEAD info without re-computing it.
not sure that reg_dead is the right answer even if you did re-compute 
it.   I believe you can have two parallel uses (on both sides of an 
if-then-else) for a single def (above the if then else) and have two 
REG_DEAD notes.

> Richard.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
  2014-12-19 15:31                       ` Kenneth Zadeck
@ 2015-02-11 11:20                         ` Jiong Wang
  2015-02-11 14:22                           ` Kenneth Zadeck
  0 siblings, 1 reply; 37+ messages in thread
From: Jiong Wang @ 2015-02-11 11:20 UTC (permalink / raw)
  To: Kenneth Zadeck
  Cc: Richard Biener, Bin.Cheng, Segher Boessenkool, Jiong Wang, gcc-patches

2014-12-19 15:21 GMT+00:00 Kenneth Zadeck <zadeck@naturalbridge.com>:
>
> however, since i am a nice person ....
>
> loop-invariant solves the DF_UD_CHAIN which builds a data structure that
> connects each use with all of the defs that reach it.   I believe that this
> is the opposite of what you want here.
>
> if you really need this, you need to also turn on the DF_DU_CHAIN which
> builds the opposite structure.    Both structures can be space hogs, but
> they are both turned on in other places in the compiler so it is unlikely to
> be an issue.

Exactly, Thanks, Kenneth.

This approach works from my experiment and look much better than
previous REG_NOTE approach.
while it do have one problem. We need the UD/DU chain built before we
do insn re-shuffling.
While after re-shuffling, UD chain needs update, otherwise, the later
"check_dependecies" in find_invariant_insn may fail.

although we have re-shuffle instruction 1 into 2, the later check
still using old UD info, thus
decide instruction 2 is not iv.

1: regA <- vfp + regB
2: regA <- vfp + const

my current fix is to insert those re-shuffled insn into a table named
"vfp_const_iv", then skip those
dependencies check  for them as they don't have any dependencies.

>
>
>
>>
>>>>> LOG_LINKs have nothing to do with single use; they point from the
>>>>> _first_
>>>>> use to its corresponding def.
>>>>>
>>>>> You might want to look at what fwprop does instead.
>>>>
>>>> Pass rtl fwprop uses df information in single-definition way, it
>>>> doesn't really take into consideration if register is a single use.
>>>> This often corrupts other optimizations like post-increment and
>>>> load/store pair.  For example:
>>>>
>>>>    add r2, r1, r0
>>>>    ldr rx, [r2]
>>>>    add r2, r2, #4
>>>> is transformed into below form:
>>>>    add r2, r1, r0
>>>>    ldr rx, [r1, r0]
>>>>    add r2, r2, #4
>>>>
>>>> As a result, post-increment opportunity is corrupted, also definition
>>>> of r2 can't be deleted because it's not single use.
>>>>
>>>> Thanks,
>>>> bin
>>>
>>> thanks for all these suggestion.
>>>
>>> Have look at the LOG_LINK build function, a simple reverse scan, while
>>> needs to allocate big map array for all pseudo regs. Haven't looked at
>>> similar code in fwprop,
>>>
>>> actually, when found the first matched insn pattern, I just want to
>>> scan several insns next, then abort quickly if nothing meet
>>> requirement. there is no need to build full single-use information.
>>>
>>> still can anyone confirm that it is safe to re-use REG_DEAD info there
>>> without calling df_note_add_problem and df_analysis first? or I am
>>> using those info passed down from the previous pass which calculated
>>> these info and maybe broken?
>>
>> It's not safe to use REG_DEAD info without re-computing it.
>
> not sure that reg_dead is the right answer even if you did re-compute it.
> I believe you can have two parallel uses (on both sides of an if-then-else)
> for a single def (above the if then else) and have two REG_DEAD notes.
>
>> Richard.
>
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
  2015-02-11 11:20                         ` Jiong Wang
@ 2015-02-11 14:22                           ` Kenneth Zadeck
  2015-02-11 18:18                             ` Jiong Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Kenneth Zadeck @ 2015-02-11 14:22 UTC (permalink / raw)
  To: Jiong Wang
  Cc: Richard Biener, Bin.Cheng, Segher Boessenkool, Jiong Wang, gcc-patches

On 02/11/2015 06:20 AM, Jiong Wang wrote:
> 2014-12-19 15:21 GMT+00:00 Kenneth Zadeck <zadeck@naturalbridge.com>:
>> however, since i am a nice person ....
>>
>> loop-invariant solves the DF_UD_CHAIN which builds a data structure that
>> connects each use with all of the defs that reach it.   I believe that this
>> is the opposite of what you want here.
>>
>> if you really need this, you need to also turn on the DF_DU_CHAIN which
>> builds the opposite structure.    Both structures can be space hogs, but
>> they are both turned on in other places in the compiler so it is unlikely to
>> be an issue.
> Exactly, Thanks, Kenneth.
>
> This approach works from my experiment and look much better than
> previous REG_NOTE approach.
> while it do have one problem. We need the UD/DU chain built before we
> do insn re-shuffling.
> While after re-shuffling, UD chain needs update, otherwise, the later
> "check_dependecies" in find_invariant_insn may fail.
>
> although we have re-shuffle instruction 1 into 2, the later check
> still using old UD info, thus
> decide instruction 2 is not iv.
>
> 1: regA <- vfp + regB
> 2: regA <- vfp + const
>
> my current fix is to insert those re-shuffled insn into a table named
> "vfp_const_iv", then skip those
> dependencies check  for them as they don't have any dependencies.
You now are beginning to understand why Mark Wegman and I invented SSA 
form almost 30 years ago!!!!  There is no good way to keep the 
information up to data as you are changing the program.    To a large 
extent you are on your own.   If what you are suggesting works, then 
this is likely much faster than rebuilding the chains.

>>
>>
>>>>>> LOG_LINKs have nothing to do with single use; they point from the
>>>>>> _first_
>>>>>> use to its corresponding def.
>>>>>>
>>>>>> You might want to look at what fwprop does instead.
>>>>> Pass rtl fwprop uses df information in single-definition way, it
>>>>> doesn't really take into consideration if register is a single use.
>>>>> This often corrupts other optimizations like post-increment and
>>>>> load/store pair.  For example:
>>>>>
>>>>>     add r2, r1, r0
>>>>>     ldr rx, [r2]
>>>>>     add r2, r2, #4
>>>>> is transformed into below form:
>>>>>     add r2, r1, r0
>>>>>     ldr rx, [r1, r0]
>>>>>     add r2, r2, #4
>>>>>
>>>>> As a result, post-increment opportunity is corrupted, also definition
>>>>> of r2 can't be deleted because it's not single use.
>>>>>
>>>>> Thanks,
>>>>> bin
>>>> thanks for all these suggestion.
>>>>
>>>> Have look at the LOG_LINK build function, a simple reverse scan, while
>>>> needs to allocate big map array for all pseudo regs. Haven't looked at
>>>> similar code in fwprop,
>>>>
>>>> actually, when found the first matched insn pattern, I just want to
>>>> scan several insns next, then abort quickly if nothing meet
>>>> requirement. there is no need to build full single-use information.
>>>>
>>>> still can anyone confirm that it is safe to re-use REG_DEAD info there
>>>> without calling df_note_add_problem and df_analysis first? or I am
>>>> using those info passed down from the previous pass which calculated
>>>> these info and maybe broken?
>>> It's not safe to use REG_DEAD info without re-computing it.
>> not sure that reg_dead is the right answer even if you did re-compute it.
>> I believe you can have two parallel uses (on both sides of an if-then-else)
>> for a single def (above the if then else) and have two REG_DEAD notes.
>>
>>> Richard.
>>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
  2015-02-11 14:22                           ` Kenneth Zadeck
@ 2015-02-11 18:18                             ` Jiong Wang
  2015-04-14 15:06                               ` Jiong Wang
  2015-04-19 16:20                               ` Jiong Wang
  0 siblings, 2 replies; 37+ messages in thread
From: Jiong Wang @ 2015-02-11 18:18 UTC (permalink / raw)
  To: Kenneth Zadeck
  Cc: Jeff Law, Richard Biener, Bin.Cheng, Segher Boessenkool, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 3075 bytes --]

On 11/02/15 14:21, Kenneth Zadeck wrote:
> On 02/11/2015 06:20 AM, Jiong Wang wrote:
>> 2014-12-19 15:21 GMT+00:00 Kenneth Zadeck <zadeck@naturalbridge.com>:
>>> however, since i am a nice person ....
>>>
>>> loop-invariant solves the DF_UD_CHAIN which builds a data structure that
>>> connects each use with all of the defs that reach it.   I believe that this
>>> is the opposite of what you want here.
>>>
>>> if you really need this, you need to also turn on the DF_DU_CHAIN which
>>> builds the opposite structure.    Both structures can be space hogs, but
>>> they are both turned on in other places in the compiler so it is unlikely to
>>> be an issue.
>> Exactly, Thanks, Kenneth.
>>
>> This approach works from my experiment and look much better than
>> previous REG_NOTE approach.
>> while it do have one problem. We need the UD/DU chain built before we
>> do insn re-shuffling.
>> While after re-shuffling, UD chain needs update, otherwise, the later
>> "check_dependecies" in find_invariant_insn may fail.
>>
>> although we have re-shuffle instruction 1 into 2, the later check
>> still using old UD info, thus
>> decide instruction 2 is not iv.
>>
>> 1: regA <- vfp + regB
>> 2: regA <- vfp + const
>>
>> my current fix is to insert those re-shuffled insn into a table named
>> "vfp_const_iv", then skip those
>> dependencies check  for them as they don't have any dependencies.
> You now are beginning to understand why Mark Wegman and I invented SSA
> form almost 30 years ago!!!!

Indeed, thanks.

attachment is the new draft patch, pass x86-84 bootstrap
(actually it will not affect x86-64, because of addressing mode),
  
while failed on aarch64 bootstrap during stage2/3 binary comparision,
there must be some glitches needs to be fixed.

Will clean up later after I finished my upcoming long flight and what I
am seeking now is whether the general thoughts, code logic & framework, in this
patch is acceptable to the community?

personally, I think this version is much better than previous version.
new added code integrated with existed code in a more natural way. no use
of unsafe REG_NOTE info which is not updated in this pass.

and from AArch64 gcc bootstrap, 239 new loop invariant found by this
re-shuffling. so, this small improvement on rtl loop invariant pass do have
some value.

please review and give comments.

Thanks.

2015-02-11  Jiong Wang  <jiong.wang@arm.com>

gcc/
   * loop-invariant.c (find_defs): Enable DF_DU_CHAIN build.
   (vfp_const_iv): New hash table.
   (expensive_addr_check_p): New boolean.
   (init_inv_motion_data): Initialize new variables.
   (free_inv_motion_data): Release hash table.
   (create_new_invariant): Set cheap_address to false for iv in
   vfp_const_iv table.
   (find_invariant_insn): Skip dependencies check for iv in vfp_const_iv table.
   (use_for_single_du): New function.
   (reshuffle_insn_with_vfp): Likewise.
   (find_invariants_bb): Call reshuffle_insn_with_vfp.
  
gcc/testsuite/
   * gcc.dg/pr62173.c: New testcase.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: new-k.patch --]
[-- Type: text/x-patch; name=new-k.patch, Size: 6684 bytes --]

diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
index f79b497..b1c4760 100644
--- a/gcc/loop-invariant.c
+++ b/gcc/loop-invariant.c
@@ -203,6 +203,8 @@ typedef struct invariant *invariant_p;
 /* The invariants.  */
 
 static vec<invariant_p> invariants;
+static hash_table <pointer_hash <rtx_insn> > *vfp_const_iv;
+static bool need_expensive_addr_check_p;
 
 /* Check the size of the invariant table and realloc if necessary.  */
 
@@ -695,7 +697,7 @@ find_defs (struct loop *loop)
 
   df_remove_problem (df_chain);
   df_process_deferred_rescans ();
-  df_chain_add_problem (DF_UD_CHAIN);
+  df_chain_add_problem (DF_UD_CHAIN + DF_DU_CHAIN);
   df_set_flags (DF_RD_PRUNE_DEAD_DEFS);
   df_analyze_loop (loop);
   check_invariant_table_size ();
@@ -742,6 +744,9 @@ create_new_invariant (struct def *def, rtx_insn *insn, bitmap depends_on,
 	 See http://gcc.gnu.org/ml/gcc-patches/2009-10/msg01210.html .  */
       inv->cheap_address = address_cost (SET_SRC (set), word_mode,
 					 ADDR_SPACE_GENERIC, speed) < 3;
+
+      if (need_expensive_addr_check_p && vfp_const_iv->find (insn))
+	inv->cheap_address = false;
     }
   else
     {
@@ -952,7 +957,8 @@ find_invariant_insn (rtx_insn *insn, bool always_reached, bool always_executed)
     return;
 
   depends_on = BITMAP_ALLOC (NULL);
-  if (!check_dependencies (insn, depends_on))
+  if (!vfp_const_iv->find (insn)
+      && !check_dependencies (insn, depends_on))
     {
       BITMAP_FREE (depends_on);
       return;
@@ -1007,6 +1013,143 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed)
   record_uses (insn);
 }

+/*  Find the single use of def of insn. At the same time,
+    make sure def of insn is the single def which reach the use.  */
+    
+static rtx_insn *
+use_for_single_du (rtx_insn *insn)
+{
+  struct df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
+  df_ref def;
+
+  FOR_EACH_INSN_INFO_DEF (def, insn_info)
+    {
+      struct df_link *chain = DF_REF_CHAIN (def);
+
+      if (!chain
+	  || chain->next)
+	return NULL;
+
+      df_ref use = chain->ref;
+      chain = DF_REF_CHAIN (use);
+
+      if (!chain
+	  || chain->next)
+	return NULL;
+
+      return DF_REF_INSN (use);
+    }
+
+  return NULL;
+}
+
+/* This function also try to transform
+
+     RA <- fixed_reg + RC
+     RD <- MEM (RA + const_offset)
+
+   into:
+
+     RA <- fixed_reg + const_offset
+     RD <- MEM (RA + RC) <- pos0
+
+  If use of RA in the second insn is the single use, and the define of
+  RA in the first insn is the single def reach the second insn.
+
+  After this change, the first instruction is loop invariant.  */
+
+static void
+reshuffle_insn_with_vfp (rtx_insn *insn)
+{
+  rtx set = single_set (insn);
+
+  if (!set
+      || GET_CODE (SET_SRC (set)) != PLUS)
+    return;
+
+  rtx dest = SET_DEST (set);
+  rtx op0 = XEXP (SET_SRC (set), 0);
+  rtx op1 = XEXP (SET_SRC (set), 1);
+  rtx non_vfp_op = op1;
+
+  if (op1 == frame_pointer_rtx
+      || op1 == stack_pointer_rtx
+      || op1 == virtual_stack_vars_rtx)
+    {
+      non_vfp_op = op0;
+      std::swap (op0, op1);
+    }
+
+  if (GET_CODE (op1) == SIGN_EXTEND
+      || GET_CODE (op1) == ZERO_EXTEND)
+    op1 = XEXP (op1, 0);
+
+  if (!(REG_P (dest) && REG_P (op0) && REG_P (op1)
+	&& (op0 == frame_pointer_rtx
+	    || op0 == stack_pointer_rtx
+	    || op0 == virtual_stack_vars_rtx)))
+    return;
+
+  rtx_insn *use_insn;
+
+  if (!(use_insn = use_for_single_du (insn)))
+    return;
+
+  rtx u_set = single_set (use_insn);
+
+  if (!(u_set && (REG_P (SET_DEST (u_set)) || MEM_P (SET_DEST (u_set)))))
+    return;
+
+  rtx mem_addr;
+
+  if (REG_P (SET_DEST (u_set)))
+    mem_addr = SET_SRC (u_set);
+  else
+    mem_addr = SET_DEST (u_set);
+
+  if (GET_CODE (mem_addr) == ZERO_EXTEND
+      || GET_CODE (mem_addr) == SIGN_EXTEND
+      || GET_CODE (mem_addr) == TRUNCATE)
+    mem_addr = XEXP (mem_addr, 0);
+
+  if (!MEM_P (mem_addr)
+      || GET_CODE (XEXP (mem_addr, 0)) != PLUS)
+    return;
+
+  rtx *mem_plus_loc = &XEXP (mem_addr, 0);
+  rtx u_op0 = XEXP (XEXP (mem_addr, 0), 0);
+  rtx u_op1 = XEXP (XEXP (mem_addr, 0), 1);
+
+  if (!(REG_P (u_op0) && CONST_INT_P (u_op1)
+	&& REGNO (dest) == REGNO (u_op0)))
+    return;
+
+  rtx new_src = plus_constant (GET_MODE (op0), op0, INTVAL (u_op1));
+  validate_change (insn, &SET_SRC (set), new_src, 1);
+  new_src = gen_rtx_PLUS (GET_MODE (u_op0), u_op0, non_vfp_op);
+  validate_change (use_insn, mem_plus_loc, new_src, 1);
+
+  if (apply_change_group ())
+    {
+      rtx_insn **slot = vfp_const_iv->find_slot (insn, INSERT);
+
+      if (*slot)
+	gcc_unreachable ();
+      else
+	*slot = insn;
+
+      need_expensive_addr_check_p = true;
+
+      if (dump_file)
+	fprintf (dump_file,
+		 "\nRe-associate insn %d and %d for later"
+		 " RTL loop invariant hoisting.\n",
+		 INSN_UID (insn), INSN_UID (use_insn));
+    }
+
+  return;
+}
+
 /* Finds invariants in basic block BB.  ALWAYS_REACHED is true if the
    basic block is always executed.  ALWAYS_EXECUTED is true if the basic
    block is always executed, unless the program ends due to a function
@@ -1022,6 +1147,8 @@ find_invariants_bb (basic_block bb, bool always_reached, bool always_executed)
       if (!NONDEBUG_INSN_P (insn))
 	continue;
 
+      reshuffle_insn_with_vfp (insn);
+
       find_invariants_insn (insn, always_reached, always_executed);
 
       if (always_reached
@@ -1647,6 +1774,9 @@ move_invariants (struct loop *loop)
 static void
 init_inv_motion_data (void)
 {
+  need_expensive_addr_check_p = false;
+  vfp_const_iv = new hash_table <pointer_hash <rtx_insn> > (4);
+
   actual_stamp = 1;
 
   invariants.create (100);
@@ -1682,6 +1812,8 @@ free_inv_motion_data (void)
       free (inv);
     }
   invariants.release ();
+
+  delete vfp_const_iv;
 }
 
 /* Move the invariants out of the LOOP.  */
diff --git a/gcc/testsuite/gcc.dg/pr62173.c b/gcc/testsuite/gcc.dg/pr62173.c
new file mode 100644
index 0000000..f059dee
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr62173.c
@@ -0,0 +1,19 @@
+/* { dg-do compile { target { powerpc64-*-* } } } */
+/* { dg-options "-O2 -fdump-rtl-loop2_invariant" } */
+
+void foo (char);
+
+void
+bar(int i)
+{
+  char A[10];
+  int d = 0;
+  while (i > 0)
+    A[d++] = i--;
+
+  while (d > 0)
+    foo (A[d--]);
+}
+
+/* { dg-final { scan-rtl-dump "Re-associate insn .* loop invariant hoisting" "loop2_invariant" } } */
+/* { dg-final { cleanup-rtl-dump "loop2_invariant" } } */

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
  2015-02-11 18:18                             ` Jiong Wang
@ 2015-04-14 15:06                               ` Jiong Wang
  2015-04-14 16:49                                 ` Steven Bosscher
  2015-04-19 16:20                               ` Jiong Wang
  1 sibling, 1 reply; 37+ messages in thread
From: Jiong Wang @ 2015-04-14 15:06 UTC (permalink / raw)
  To: gcc-patches
  Cc: Kenneth Zadeck, Jeff Law, Richard Biener, Bin.Cheng, Segher Boessenkool

2015-02-11 18:18 GMT+00:00 Jiong Wang <jiong.wang@arm.com>:
> On 11/02/15 14:21, Kenneth Zadeck wrote:
>>
>> On 02/11/2015 06:20 AM, Jiong Wang wrote:
>>>
>>> 2014-12-19 15:21 GMT+00:00 Kenneth Zadeck <zadeck@naturalbridge.com>:
>>>>
>>>> however, since i am a nice person ....
>>>>
>>>> loop-invariant solves the DF_UD_CHAIN which builds a data structure that
>>>> connects each use with all of the defs that reach it.   I believe that
>>>> this
>>>> is the opposite of what you want here.
>>>>
>>>> if you really need this, you need to also turn on the DF_DU_CHAIN which
>>>> builds the opposite structure.    Both structures can be space hogs, but
>>>> they are both turned on in other places in the compiler so it is
>>>> unlikely to
>>>> be an issue.
>>>
>>> Exactly, Thanks, Kenneth.
>>>
>>> This approach works from my experiment and look much better than
>>> previous REG_NOTE approach.
>>> while it do have one problem. We need the UD/DU chain built before we
>>> do insn re-shuffling.
>>> While after re-shuffling, UD chain needs update, otherwise, the later
>>> "check_dependecies" in find_invariant_insn may fail.
>>>
>>> although we have re-shuffle instruction 1 into 2, the later check
>>> still using old UD info, thus
>>> decide instruction 2 is not iv.
>>>
>>> 1: regA <- vfp + regB
>>> 2: regA <- vfp + const
>>>
>>> my current fix is to insert those re-shuffled insn into a table named
>>> "vfp_const_iv", then skip those
>>> dependencies check  for them as they don't have any dependencies.
>>
>> You now are beginning to understand why Mark Wegman and I invented SSA
>> form almost 30 years ago!!!!
>
>
> Indeed, thanks.
>
> attachment is the new draft patch, pass x86-84 bootstrap
> (actually it will not affect x86-64, because of addressing mode),
>  while failed on aarch64 bootstrap during stage2/3 binary comparision,
> there must be some glitches needs to be fixed.
>
> Will clean up later after I finished my upcoming long flight and what I
> am seeking now is whether the general thoughts, code logic & framework, in
> this
> patch is acceptable to the community?
>
> personally, I think this version is much better than previous version.
> new added code integrated with existed code in a more natural way. no use
> of unsafe REG_NOTE info which is not updated in this pass.
>
> and from AArch64 gcc bootstrap, 239 new loop invariant found by this
> re-shuffling. so, this small improvement on rtl loop invariant pass do have
> some value.
>
> please review and give comments.
>
> Thanks.

Ping ~

And for the bootstrap stage2/3 binary comparision failure,  the
different is in ira-costs.o.

because for stage2, we actually add one extra compilation option
"-gtoggle" while not for stage3, as
we want to make sure debug info generation will not affect any real
instruction generation.

but, after some investigation I found gcc actually generate difference
code when debug info enabled because
DEBUG_INSN will affect data flow analysis.

(insn 2556 2555 2776 257 (set (reg/f:DI 1473)
        (plus:DI (reg/f:DI 64 sfp)
            (reg:DI 1474))) ../../gcc/gcc/ira-costs.c:628 87 {*adddi3_aarch64}
     (expr_list:REG_DEAD (reg:DI 1474)
        (nil)))
(debug_insn 2776 2556 2557 257 (var_location:SI D#105 (mem:SI (plus:DI
(reg/f:DI 1473)
            (const_int -240 [0xffffffffffffff10])) [23 classes S4 A32])) -1
     (nil))
(insn 2557 2776 2558 257 (set (reg:SI 591 [ D.69930 ])
        (mem:SI (plus:DI (reg/f:DI 1473)
                (const_int -240 [0xffffffffffffff10])) [23 classes S4
A32])) ../../gcc/gcc/ira-costs.c:628 39 {*movsi_aarch64}
     (expr_list:REG_DEAD (reg/f:DI 1473)
        (nil)))

without "debug_insn 2776", operands of insn 2556 and 2557 can be
shuffled, so that insn 2556 becomes "reg 1473 = sfp  - 240",
thus hoisted out of the loop as invariant.  while with debug_insn
2776,   reg 1473 are used in both insn 2776 and insn 2557, thus
the single def-use analysis returns false, thus we won't do any
transformation on this which is correct.

So I think this stage2/3 binary difference is acceptable?

for compile time, I haven't found difference from -ftime-report. and
ree.c and enabled "DF_UD_CHAIN + DF_DU_CHAIN" also.

>
> 2015-02-11  Jiong Wang  <jiong.wang@arm.com>
>
> gcc/
>   * loop-invariant.c (find_defs): Enable DF_DU_CHAIN build.
>   (vfp_const_iv): New hash table.
>   (expensive_addr_check_p): New boolean.
>   (init_inv_motion_data): Initialize new variables.
>   (free_inv_motion_data): Release hash table.
>   (create_new_invariant): Set cheap_address to false for iv in
>   vfp_const_iv table.
>   (find_invariant_insn): Skip dependencies check for iv in vfp_const_iv
> table.
>   (use_for_single_du): New function.
>   (reshuffle_insn_with_vfp): Likewise.
>   (find_invariants_bb): Call reshuffle_insn_with_vfp.
>  gcc/testsuite/
>   * gcc.dg/pr62173.c: New testcase.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
  2015-04-14 15:06                               ` Jiong Wang
@ 2015-04-14 16:49                                 ` Steven Bosscher
  2015-04-14 17:24                                   ` Jeff Law
  0 siblings, 1 reply; 37+ messages in thread
From: Steven Bosscher @ 2015-04-14 16:49 UTC (permalink / raw)
  To: Jiong Wang
  Cc: gcc-patches, Kenneth Zadeck, Jeff Law, Richard Biener, Bin.Cheng,
	Segher Boessenkool

On Tue, Apr 14, 2015 at 5:06 PM, Jiong Wang wrote:
> but, after some investigation I found gcc actually generate difference
> code when debug info enabled because
> DEBUG_INSN will affect data flow analysis.

It should not, so that's a bug.


> So I think this stage2/3 binary difference is acceptable?

No, they should be identical. If there's a difference, then there's a
bug - which, it seems, you've already found, too.

Ciao!
Steven

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
  2015-04-14 16:49                                 ` Steven Bosscher
@ 2015-04-14 17:24                                   ` Jeff Law
  2015-04-14 21:49                                     ` Jiong Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Jeff Law @ 2015-04-14 17:24 UTC (permalink / raw)
  To: Steven Bosscher, Jiong Wang
  Cc: gcc-patches, Kenneth Zadeck, Richard Biener, Bin.Cheng,
	Segher Boessenkool

On 04/14/2015 10:48 AM, Steven Bosscher wrote:
>> So I think this stage2/3 binary difference is acceptable?
>
> No, they should be identical. If there's a difference, then there's a
> bug - which, it seems, you've already found, too.
RIght.  And so the natural question is how to fix.

At first glance it would seem like having this new code ignore 
dependencies rising from debug insns would work.

Which then begs the question, what happens to the debug insn -- it's 
certainly not going to be correct anymore if the transformation is made.

jeff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
  2015-04-14 17:24                                   ` Jeff Law
@ 2015-04-14 21:49                                     ` Jiong Wang
  2015-04-21 14:43                                       ` Jiong Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Jiong Wang @ 2015-04-14 21:49 UTC (permalink / raw)
  To: Jeff Law
  Cc: Steven Bosscher, gcc-patches, Kenneth Zadeck, Richard Biener,
	Bin.Cheng, Segher Boessenkool

2015-04-14 18:24 GMT+01:00 Jeff Law <law@redhat.com>:
> On 04/14/2015 10:48 AM, Steven Bosscher wrote:
>>>
>>> So I think this stage2/3 binary difference is acceptable?
>>
>>
>> No, they should be identical. If there's a difference, then there's a
>> bug - which, it seems, you've already found, too.
>
> RIght.  And so the natural question is how to fix.
>
> At first glance it would seem like having this new code ignore dependencies
> rising from debug insns would work.
>
> Which then begs the question, what happens to the debug insn -- it's
> certainly not going to be correct anymore if the transformation is made.

Exactly.

The debug_insn 2776 in my example is to record the base address of a
local array. the new code is doing correctly here by not shuffling the
operands of insn 2556 and 2557 as there is additional reference of
reg:1473 from debug insn, although the code will still execute correctly
if we do the transformation.

my understanding to fix this:

  * delete the out-of-date mismatch debug_insn? as there is no guarantee
    to generate accurate debug info under -O2.

    IMO, this debug_insn may affect "DW_AT_location" field for variable
    descrption of "classes" in .debug_info section, but it's omitted in
    the final output already.

    <3><38a4d>: Abbrev Number: 137 (DW_TAG_variable)
    <38a4f>   DW_AT_name : (indirect string, offset: 0x18db): classes
    <38a53>   DW_AT_decl_file   : 1
    <38a54>   DW_AT_decl_line   : 548
    <38a56>   DW_AT_type        : <0x38cb4>

  * update the debug_insn? if the following change is OK with dwarf standard

   from

     insn0: reg0 = fp + reg1
     debug_insn: var_loc = reg0 + const_off
     insn1: reg2 = reg0 + const_off

   to

     insn0: reg0 = fp + const_off
     debug_insn: var_loc = reg0 + reg1
     insn1: reg2 = reg0 + reg1

Thanks,

Regards,
Jiong

>
> jeff
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
  2015-02-11 18:18                             ` Jiong Wang
  2015-04-14 15:06                               ` Jiong Wang
@ 2015-04-19 16:20                               ` Jiong Wang
  1 sibling, 0 replies; 37+ messages in thread
From: Jiong Wang @ 2015-04-19 16:20 UTC (permalink / raw)
  To: Jiong Wang
  Cc: Kenneth Zadeck, Jeff Law, Richard Biener, Bin.Cheng,
	Segher Boessenkool, gcc-patches

2015-02-11 18:18 GMT+00:00 Jiong Wang <jiong.wang@arm.com>:
>
> 2015-02-11  Jiong Wang  <jiong.wang@arm.com>
>
> gcc/
>   * loop-invariant.c (find_defs): Enable DF_DU_CHAIN build.
>   (vfp_const_iv): New hash table.
>   (expensive_addr_check_p): New boolean.
>   (init_inv_motion_data): Initialize new variables.
>   (free_inv_motion_data): Release hash table.
>   (create_new_invariant): Set cheap_address to false for iv in
>   vfp_const_iv table.
>   (find_invariant_insn): Skip dependencies check for iv in vfp_const_iv
> table.
>   (use_for_single_du): New function.
>   (reshuffle_insn_with_vfp): Likewise.
>   (find_invariants_bb): Call reshuffle_insn_with_vfp.

For performance measurement:

on AArch64:
  * this patch finds several hundreds new loop invariants across speck2k6.
  * one bench in spec2k6 float get +4.5% performance improvement.

I guess similar improvements could be achieved on other RISC backend.

Regards,
Jiong

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
  2015-04-14 21:49                                     ` Jiong Wang
@ 2015-04-21 14:43                                       ` Jiong Wang
  2015-04-24  1:55                                         ` Jeff Law
  2015-04-28 12:16                                         ` Jiong Wang
  0 siblings, 2 replies; 37+ messages in thread
From: Jiong Wang @ 2015-04-21 14:43 UTC (permalink / raw)
  To: Jeff Law; +Cc: Steven Bosscher, gcc-patches, Kenneth Zadeck

[-- Attachment #1: Type: text/plain, Size: 3634 bytes --]


Jiong Wang writes:

> 2015-04-14 18:24 GMT+01:00 Jeff Law <law@redhat.com>:
>> On 04/14/2015 10:48 AM, Steven Bosscher wrote:
>>>>
>>>> So I think this stage2/3 binary difference is acceptable?
>>>
>>>
>>> No, they should be identical. If there's a difference, then there's a
>>> bug - which, it seems, you've already found, too.
>>
>> RIght.  And so the natural question is how to fix.
>>
>> At first glance it would seem like having this new code ignore dependencies
>> rising from debug insns would work.
>>
>> Which then begs the question, what happens to the debug insn -- it's
>> certainly not going to be correct anymore if the transformation is made.
>
> Exactly.
>
> The debug_insn 2776 in my example is to record the base address of a
> local array. the new code is doing correctly here by not shuffling the
> operands of insn 2556 and 2557 as there is additional reference of
> reg:1473 from debug insn, although the code will still execute correctly
> if we do the transformation.
>
> my understanding to fix this:
>
>   * delete the out-of-date mismatch debug_insn? as there is no guarantee
>     to generate accurate debug info under -O2.
>
>     IMO, this debug_insn may affect "DW_AT_location" field for variable
>     descrption of "classes" in .debug_info section, but it's omitted in
>     the final output already.
>
>     <3><38a4d>: Abbrev Number: 137 (DW_TAG_variable)
>     <38a4f>   DW_AT_name : (indirect string, offset: 0x18db): classes
>     <38a53>   DW_AT_decl_file   : 1
>     <38a54>   DW_AT_decl_line   : 548
>     <38a56>   DW_AT_type        : <0x38cb4>
>
>   * update the debug_insn? if the following change is OK with dwarf standard
>
>    from
>
>      insn0: reg0 = fp + reg1
>      debug_insn: var_loc = reg0 + const_off
>      insn1: reg2 = reg0 + const_off
>
>    to
>
>      insn0: reg0 = fp + const_off
>      debug_insn: var_loc = reg0 + reg1
>      insn1: reg2 = reg0 + reg1
>
> Thanks,
>

And attachment is the new patch which will update debug_insn as
described in the second solution above.

Now the stage2/3 binary differences on AArch64 gone away. Bootstrap OK.

On AArch64, this patch give 600+ new rtl loop invariants found across
spec2k6 float. +4.5% perf improvement on 436.cactusADM because four new
invariants found in the critical function "regex_compile".

The similar improvements may be achieved on other RISC backends like
powerpc/mips I guess.

One thing to mention, for AArch64, one minor glitch in
aarch64_legitimize_address needs to be fixed to let this patch take
effect, I will send out that patch later as it's a seperate issue.
Powerpc/Mips don't have this glitch in LEGITIMIZE_ADDRESS hook, so
should be OK, and I verified the base address of local array in the
testcase given by Seb on pr62173 do hoisted on ppc64 now. I think
pr62173 is fixed on those 64bit arch by this patch.

Thoughts?

Thanks.

2015-04-21  Jiong Wang  <jiong.wang@arm.com>

gcc/
  * loop-invariant.c (find_defs): Enable DF_DU_CHAIN build.
  (vfp_const_iv): New hash table.
  (expensive_addr_check_p): New boolean.
  (init_inv_motion_data): Initialize new variables.>
  (free_inv_motion_data): Release hash table.
  (create_new_invariant): Set cheap_address to false for iv in
  vfp_const_iv table.
  (find_invariant_insn): Skip dependencies check for iv in vfp_const_iv
  table.
  (use_for_single_du): New function.
  (reshuffle_insn_with_vfp): Likewise.
  (find_invariants_bb): Call reshuffle_insn_with_vfp.

gcc/testsuite/
   * gcc.dg/pr62173.c: New testcase.

-- 
Regards,
Jiong


[-- Attachment #2: fix-dwarf.patch --]
[-- Type: text/x-diff, Size: 7390 bytes --]

diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
index f79b497..f70dfb0 100644
--- a/gcc/loop-invariant.c
+++ b/gcc/loop-invariant.c
@@ -203,6 +203,8 @@ typedef struct invariant *invariant_p;
 /* The invariants.  */
 
 static vec<invariant_p> invariants;
+static hash_table <pointer_hash <rtx_insn> > *vfp_const_iv;
+static bool need_expensive_addr_check_p;
 
 /* Check the size of the invariant table and realloc if necessary.  */
 
@@ -695,7 +697,7 @@ find_defs (struct loop *loop)
 
   df_remove_problem (df_chain);
   df_process_deferred_rescans ();
-  df_chain_add_problem (DF_UD_CHAIN);
+  df_chain_add_problem (DF_UD_CHAIN + DF_DU_CHAIN);
   df_set_flags (DF_RD_PRUNE_DEAD_DEFS);
   df_analyze_loop (loop);
   check_invariant_table_size ();
@@ -742,6 +744,9 @@ create_new_invariant (struct def *def, rtx_insn *insn, bitmap depends_on,
 	 See http://gcc.gnu.org/ml/gcc-patches/2009-10/msg01210.html .  */
       inv->cheap_address = address_cost (SET_SRC (set), word_mode,
 					 ADDR_SPACE_GENERIC, speed) < 3;
+
+      if (need_expensive_addr_check_p && vfp_const_iv->find (insn))
+	inv->cheap_address = false;
     }
   else
     {
@@ -952,7 +957,8 @@ find_invariant_insn (rtx_insn *insn, bool always_reached, bool always_executed)
     return;
 
   depends_on = BITMAP_ALLOC (NULL);
-  if (!check_dependencies (insn, depends_on))
+  if (!vfp_const_iv->find (insn)
+      && !check_dependencies (insn, depends_on))
     {
       BITMAP_FREE (depends_on);
       return;
@@ -1007,6 +1013,180 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed)
   record_uses (insn);
 }
 
+/*  Find the single use of def of insn. At the same time,
+    make sure def of insn is the single def which reach the use.
+    NOTE: debug_insn can affect DF, var_location debug insn may reference
+    the same rtl expr as variable location, such debug_insn should not
+    affect the insn shuffling while we do need to update the loc of the
+    debug_insn after insn shuffling. So here we will record such debug_insn.  */
+
+static rtx_insn *
+use_for_single_du (rtx_insn *insn, rtx_insn **debug_insn, rtx *debug_expr_loc)
+{
+  struct df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
+  df_ref def;
+
+  FOR_EACH_INSN_INFO_DEF (def, insn_info)
+    {
+      struct df_link *chain = DF_REF_CHAIN (def);
+
+      if (!chain)
+	return NULL;
+
+      /* For multi use, only allow the second use be debug_insn.  */
+      if (chain->next
+	  && (chain->next->next
+	      || NONDEBUG_INSN_P (DF_REF_INSN (chain->next->ref))))
+	return NULL;
+
+      if (chain->next)
+	{
+	  rtx_insn *insn = DF_REF_INSN (chain->next->ref);
+	  rtx debug_pat = PATTERN (insn);
+	  if (GET_CODE (debug_pat) != VAR_LOCATION)
+	    return NULL;
+	  else
+	    {
+	      *debug_insn = insn;
+	      *debug_expr_loc = PAT_VAR_LOCATION_LOC (debug_pat);
+	    }
+	}
+
+      df_ref use = chain->ref;
+      chain = DF_REF_CHAIN (use);
+
+      if (!chain
+	  || chain->next)
+	return NULL;
+
+      return DF_REF_INSN (use);
+    }
+
+  return NULL;
+}
+
+/* This function also try to transform
+
+     RA <- fixed_reg + RC
+     RD <- MEM (RA + const_offset)
+
+   into:
+
+     RA <- fixed_reg + const_offset
+     RD <- MEM (RA + RC) <- pos0
+
+  If use of RA in the second insn is the single use, and the define of
+  RA in the first insn is the single def reach the second insn.
+
+  After this change, the first instruction is loop invariant.  */
+
+static void
+reshuffle_insn_with_vfp (rtx_insn *insn)
+{
+  rtx set = single_set (insn);
+
+  if (!set
+      || GET_CODE (SET_SRC (set)) != PLUS)
+    return;
+
+  rtx dest = SET_DEST (set);
+  rtx op0 = XEXP (SET_SRC (set), 0);
+  rtx op1 = XEXP (SET_SRC (set), 1);
+  rtx non_vfp_op = op1;
+
+  if (op1 == frame_pointer_rtx
+      || op1 == stack_pointer_rtx
+      || op1 == virtual_stack_vars_rtx)
+    {
+      non_vfp_op = op0;
+      std::swap (op0, op1);
+    }
+
+  if (GET_CODE (op1) == SIGN_EXTEND
+      || GET_CODE (op1) == ZERO_EXTEND)
+    op1 = XEXP (op1, 0);
+
+  if (!(REG_P (dest) && REG_P (op0) && REG_P (op1)
+	&& (op0 == frame_pointer_rtx
+	    || op0 == stack_pointer_rtx
+	    || op0 == virtual_stack_vars_rtx)))
+    return;
+
+  rtx_insn *use_insn, *debug_insn = NULL;
+  rtx debug_expr_loc;
+
+  if (!(use_insn = use_for_single_du (insn, &debug_insn, &debug_expr_loc)))
+    return;
+
+  rtx u_set = single_set (use_insn);
+
+  if (!(u_set && (REG_P (SET_DEST (u_set)) || MEM_P (SET_DEST (u_set)))))
+    return;
+
+  rtx mem_addr;
+
+  if (REG_P (SET_DEST (u_set)))
+    mem_addr = SET_SRC (u_set);
+  else
+    mem_addr = SET_DEST (u_set);
+
+  if (debug_insn && !rtx_equal_p (debug_expr_loc, mem_addr))
+    return;
+
+  if (GET_CODE (mem_addr) == ZERO_EXTEND
+      || GET_CODE (mem_addr) == SIGN_EXTEND
+      || GET_CODE (mem_addr) == TRUNCATE)
+    mem_addr = XEXP (mem_addr, 0);
+
+  if (!MEM_P (mem_addr)
+      || GET_CODE (XEXP (mem_addr, 0)) != PLUS)
+    return;
+
+  rtx *mem_plus_loc = &XEXP (mem_addr, 0);
+  rtx u_op0 = XEXP (XEXP (mem_addr, 0), 0);
+  rtx u_op1 = XEXP (XEXP (mem_addr, 0), 1);
+
+  if (!(REG_P (u_op0) && CONST_INT_P (u_op1)
+	&& REGNO (dest) == REGNO (u_op0)))
+    return;
+
+  rtx new_src = plus_constant (GET_MODE (op0), op0, INTVAL (u_op1));
+  validate_change (insn, &SET_SRC (set), new_src, 1);
+  new_src = gen_rtx_PLUS (GET_MODE (u_op0), u_op0, non_vfp_op);
+  validate_change (use_insn, mem_plus_loc, new_src, 1);
+  if (debug_insn)
+    validate_change (debug_insn, &XEXP(debug_expr_loc, 0), copy_rtx (new_src),
+		     1);
+
+  if (apply_change_group ())
+    {
+      rtx_insn **slot = vfp_const_iv->find_slot (insn, INSERT);
+
+      if (*slot)
+	gcc_unreachable ();
+      else
+	*slot = insn;
+
+      need_expensive_addr_check_p = true;
+
+      /* We should update REG_DEAD info also.  */
+      rtx note;
+      if ((note = find_reg_note (insn, REG_DEAD, non_vfp_op)))
+	{
+	  remove_note (insn, note);
+	  add_shallow_copy_of_reg_note (use_insn, note);
+	}
+
+      if (dump_file)
+	fprintf (dump_file,
+		 "\nRe-associate insn %d and %d for later"
+		 " RTL loop invariant hoisting.\n",
+		 INSN_UID (insn), INSN_UID (use_insn));
+    }
+
+  return;
+}
+
 /* Finds invariants in basic block BB.  ALWAYS_REACHED is true if the
    basic block is always executed.  ALWAYS_EXECUTED is true if the basic
    block is always executed, unless the program ends due to a function
@@ -1022,6 +1197,8 @@ find_invariants_bb (basic_block bb, bool always_reached, bool always_executed)
       if (!NONDEBUG_INSN_P (insn))
 	continue;
 
+      reshuffle_insn_with_vfp (insn);
+
       find_invariants_insn (insn, always_reached, always_executed);
 
       if (always_reached
@@ -1647,6 +1824,9 @@ move_invariants (struct loop *loop)
 static void
 init_inv_motion_data (void)
 {
+  need_expensive_addr_check_p = false;
+  vfp_const_iv = new hash_table <pointer_hash <rtx_insn> > (4);
+
   actual_stamp = 1;
 
   invariants.create (100);
@@ -1682,6 +1862,8 @@ free_inv_motion_data (void)
       free (inv);
     }
   invariants.release ();
+
+  delete vfp_const_iv;
 }
 
 /* Move the invariants out of the LOOP.  */

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
  2015-04-21 14:43                                       ` Jiong Wang
@ 2015-04-24  1:55                                         ` Jeff Law
  2015-04-24 17:05                                           ` Jiong Wang
  2015-04-28 12:16                                         ` Jiong Wang
  1 sibling, 1 reply; 37+ messages in thread
From: Jeff Law @ 2015-04-24  1:55 UTC (permalink / raw)
  To: Jiong Wang; +Cc: Steven Bosscher, gcc-patches, Kenneth Zadeck

On 04/21/2015 08:24 AM, Jiong Wang wrote:
>
> Jiong Wang writes:
>
>> 2015-04-14 18:24 GMT+01:00 Jeff Law <law@redhat.com>:
>>> On 04/14/2015 10:48 AM, Steven Bosscher wrote:
>>>>>
>>>>> So I think this stage2/3 binary difference is acceptable?
>>>>
>>>>
>>>> No, they should be identical. If there's a difference, then there's a
>>>> bug - which, it seems, you've already found, too.
>>>
>>> RIght.  And so the natural question is how to fix.
>>>
>>> At first glance it would seem like having this new code ignore dependencies
>>> rising from debug insns would work.
>>>
>>> Which then begs the question, what happens to the debug insn -- it's
>>> certainly not going to be correct anymore if the transformation is made.
>>
>> Exactly.
>>
>> The debug_insn 2776 in my example is to record the base address of a
>> local array. the new code is doing correctly here by not shuffling the
>> operands of insn 2556 and 2557 as there is additional reference of
>> reg:1473 from debug insn, although the code will still execute correctly
>> if we do the transformation.
>>
>> my understanding to fix this:
>>
>>    * delete the out-of-date mismatch debug_insn? as there is no guarantee
>>      to generate accurate debug info under -O2.
>>
>>      IMO, this debug_insn may affect "DW_AT_location" field for variable
>>      descrption of "classes" in .debug_info section, but it's omitted in
>>      the final output already.
>>
>>      <3><38a4d>: Abbrev Number: 137 (DW_TAG_variable)
>>      <38a4f>   DW_AT_name : (indirect string, offset: 0x18db): classes
>>      <38a53>   DW_AT_decl_file   : 1
>>      <38a54>   DW_AT_decl_line   : 548
>>      <38a56>   DW_AT_type        : <0x38cb4>
>>
>>    * update the debug_insn? if the following change is OK with dwarf standard
>>
>>     from
>>
>>       insn0: reg0 = fp + reg1
>>       debug_insn: var_loc = reg0 + const_off
>>       insn1: reg2 = reg0 + const_off
>>
>>     to
>>
>>       insn0: reg0 = fp + const_off
>>       debug_insn: var_loc = reg0 + reg1
>>       insn1: reg2 = reg0 + reg1
>>
>> Thanks,
>>
>
> And attachment is the new patch which will update debug_insn as
> described in the second solution above.
>
> Now the stage2/3 binary differences on AArch64 gone away. Bootstrap OK.
>
> On AArch64, this patch give 600+ new rtl loop invariants found across
> spec2k6 float. +4.5% perf improvement on 436.cactusADM because four new
> invariants found in the critical function "regex_compile".
>
> The similar improvements may be achieved on other RISC backends like
> powerpc/mips I guess.
>
> One thing to mention, for AArch64, one minor glitch in
> aarch64_legitimize_address needs to be fixed to let this patch take
> effect, I will send out that patch later as it's a seperate issue.
> Powerpc/Mips don't have this glitch in LEGITIMIZE_ADDRESS hook, so
> should be OK, and I verified the base address of local array in the
> testcase given by Seb on pr62173 do hoisted on ppc64 now. I think
> pr62173 is fixed on those 64bit arch by this patch.
>
> Thoughts?
>
> Thanks.
>
> 2015-04-21  Jiong Wang  <jiong.wang@arm.com>
>
> gcc/
>    * loop-invariant.c (find_defs): Enable DF_DU_CHAIN build.
>    (vfp_const_iv): New hash table.
>    (expensive_addr_check_p): New boolean.
>    (init_inv_motion_data): Initialize new variables.>
>    (free_inv_motion_data): Release hash table.
>    (create_new_invariant): Set cheap_address to false for iv in
>    vfp_const_iv table.
>    (find_invariant_insn): Skip dependencies check for iv in vfp_const_iv
>    table.
>    (use_for_single_du): New function.
>    (reshuffle_insn_with_vfp): Likewise.
>    (find_invariants_bb): Call reshuffle_insn_with_vfp.
>
> gcc/testsuite/
>     * gcc.dg/pr62173.c: New testcase.
So ultimately isn't this just a specialized version of reassociation of 
pointer arithmetic?  And if so, don't we have all the usual problems 
around introducing overflow?

ISTM if this is going to go forward (ie, we somehow convince ourselves 
that this kind of reassociation is OK), then should it be made to apply 
on pointers in general rather than restricting to stack, frame, 
virtual-frame?

I wish I'd looked more closely at the patch the first time around so 
that I could have raised these questions sooner.

Jeff
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
  2015-04-24  1:55                                         ` Jeff Law
@ 2015-04-24 17:05                                           ` Jiong Wang
  2015-05-14 20:04                                             ` Jeff Law
  0 siblings, 1 reply; 37+ messages in thread
From: Jiong Wang @ 2015-04-24 17:05 UTC (permalink / raw)
  To: Jeff Law; +Cc: Steven Bosscher, gcc-patches, Kenneth Zadeck


Jeff Law writes:

> On 04/21/2015 08:24 AM, Jiong Wang wrote:
>>
>> Jiong Wang writes:
>>
>>> 2015-04-14 18:24 GMT+01:00 Jeff Law <law@redhat.com>:
>>>> On 04/14/2015 10:48 AM, Steven Bosscher wrote:
>>>>>>
>>>>>> So I think this stage2/3 binary difference is acceptable?
>>>>>
>>>>>
>>>>> No, they should be identical. If there's a difference, then there's a
>>>>> bug - which, it seems, you've already found, too.
>>>>
>>>> RIght.  And so the natural question is how to fix.
>>>>
>>>> At first glance it would seem like having this new code ignore dependencies
>>>> rising from debug insns would work.
>>>>
>>>> Which then begs the question, what happens to the debug insn -- it's
>>>> certainly not going to be correct anymore if the transformation is made.
>>>
>>> Exactly.
>>>
>>> The debug_insn 2776 in my example is to record the base address of a
>>> local array. the new code is doing correctly here by not shuffling the
>>> operands of insn 2556 and 2557 as there is additional reference of
>>> reg:1473 from debug insn, although the code will still execute correctly
>>> if we do the transformation.
>>>
>>> my understanding to fix this:
>>>
>>>    * delete the out-of-date mismatch debug_insn? as there is no guarantee
>>>      to generate accurate debug info under -O2.
>>>
>>>      IMO, this debug_insn may affect "DW_AT_location" field for variable
>>>      descrption of "classes" in .debug_info section, but it's omitted in
>>>      the final output already.
>>>
>>>      <3><38a4d>: Abbrev Number: 137 (DW_TAG_variable)
>>>      <38a4f>   DW_AT_name : (indirect string, offset: 0x18db): classes
>>>      <38a53>   DW_AT_decl_file   : 1
>>>      <38a54>   DW_AT_decl_line   : 548
>>>      <38a56>   DW_AT_type        : <0x38cb4>
>>>
>>>    * update the debug_insn? if the following change is OK with dwarf standard
>>>
>>>     from
>>>
>>>       insn0: reg0 = fp + reg1
>>>       debug_insn: var_loc = reg0 + const_off
>>>       insn1: reg2 = reg0 + const_off
>>>
>>>     to
>>>
>>>       insn0: reg0 = fp + const_off
>>>       debug_insn: var_loc = reg0 + reg1
>>>       insn1: reg2 = reg0 + reg1
>>>
>>> Thanks,
>>>
>>
>> And attachment is the new patch which will update debug_insn as
>> described in the second solution above.
>>
>> Now the stage2/3 binary differences on AArch64 gone away. Bootstrap OK.
>>
>> On AArch64, this patch give 600+ new rtl loop invariants found across
>> spec2k6 float. +4.5% perf improvement on 436.cactusADM because four new
>> invariants found in the critical function "regex_compile".
>>
>> The similar improvements may be achieved on other RISC backends like
>> powerpc/mips I guess.
>>
>> One thing to mention, for AArch64, one minor glitch in
>> aarch64_legitimize_address needs to be fixed to let this patch take
>> effect, I will send out that patch later as it's a seperate issue.
>> Powerpc/Mips don't have this glitch in LEGITIMIZE_ADDRESS hook, so
>> should be OK, and I verified the base address of local array in the
>> testcase given by Seb on pr62173 do hoisted on ppc64 now. I think
>> pr62173 is fixed on those 64bit arch by this patch.
>>
>> Thoughts?
>>
>> Thanks.
>>
>> 2015-04-21  Jiong Wang  <jiong.wang@arm.com>
>>
>> gcc/
>>    * loop-invariant.c (find_defs): Enable DF_DU_CHAIN build.
>>    (vfp_const_iv): New hash table.
>>    (expensive_addr_check_p): New boolean.
>>    (init_inv_motion_data): Initialize new variables.>
>>    (free_inv_motion_data): Release hash table.
>>    (create_new_invariant): Set cheap_address to false for iv in
>>    vfp_const_iv table.
>>    (find_invariant_insn): Skip dependencies check for iv in vfp_const_iv
>>    table.
>>    (use_for_single_du): New function.
>>    (reshuffle_insn_with_vfp): Likewise.
>>    (find_invariants_bb): Call reshuffle_insn_with_vfp.
>>
>> gcc/testsuite/
>>     * gcc.dg/pr62173.c: New testcase.
> So ultimately isn't this just a specialized version of reassociation of 
> pointer arithmetic?  And if so, don't we have all the usual problems 
> around introducing overflow?
>
>
> ISTM if this is going to go forward (ie, we somehow convince ourselves 
> that this kind of reassociation is OK), then should it be made to apply 
> on pointers in general rather than restricting to stack, frame, 
> virtual-frame?
>

Jeff,

  Thanks for the review.

  This transformation is not reassociation of pointer arithmetic.
  
  The idea of this patch is, given two instructions with variable value,
  we may get new instruction sequences with fixed value by reassociating
  their operands. And currently GCC only generate such instruction
  sequences for local array accessing as far as I known.

  for the statement "D.2707 = A[i]", gcc generate the following
  instruction sequence:

  (insn 92 91 93 6 (set (reg/f:DI 148)
        (plus:DI (reg/f:DI 64 sfp)
            (reg:DI 147 [ i ])))
     (expr_list:REG_DEAD (reg:DI 147 [ i ])
        (nil)))
        
  (insn 93 92 94 6 (set (reg:SI 149 [ D.2707 ])
        (zero_extend:SI (mem/j:QI (plus:DI (reg/f:DI 148)
                    (const_int -16 [0xfffffffffffffff0])) [0 A S1 A8])))
     (expr_list:REG_DEAD (reg/f:DI 148)
        (nil)))


  both insn 92 and insn 93 are with variable value, but
  "(reg/f:DI 64 sfp)" in insn 92 and "const_int -16" in insn 93 are
  fixed value.
  
  So my patch will transform above sequence into the following if
  "apply_change_group" succeed. Then insn 92 is with fixed value and
  thus loop invariant.
  
  (insn 92 88 97 5 (set (reg/f:DI 148)
        (plus:DI (reg/f:DI 64 sfp)
            (const_int -16 [0xfffffffffffffff0])))
     (nil))
     
  (insn 93 131 94 6 (set (reg:SI 149 [ D.2707 ])
        (zero_extend:SI (mem/j:QI (plus:DI (reg/f:DI 148)
                    (reg:DI 147 [ i ])) [0 A S1 A8])))
     (expr_list:REG_DEAD (reg:DI 147 [ i ])
        (expr_list:REG_DEAD (reg/f:DI 148)
            (nil))))

  This tranformation will only be done if the def of r148 in insn 92 is the single
  def for the use of it in insn 93, and the use in insn 93 is also
  single use.

  And if there is overflow, then it will happen in the original insn 93
  also, then it should not be generated in the first place. So I think
  there is no need to do overflow check.

  Does this explanation make sense? 
  
-- 
Regards,
Jiong

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
  2015-04-21 14:43                                       ` Jiong Wang
  2015-04-24  1:55                                         ` Jeff Law
@ 2015-04-28 12:16                                         ` Jiong Wang
  2015-04-28 14:00                                           ` Matthew Fortune
  1 sibling, 1 reply; 37+ messages in thread
From: Jiong Wang @ 2015-04-28 12:16 UTC (permalink / raw)
  To: matthew.fortune; +Cc: Jeff Law, Steven Bosscher, gcc-patches

Hi Matthew,

2015-04-21 15:24 GMT+01:00 Jiong Wang <jiong.wang@arm.com>:

>
> 2015-04-21  Jiong Wang  <jiong.wang@arm.com>
>
> gcc/
>   * loop-invariant.c (find_defs): Enable DF_DU_CHAIN build.
>   (vfp_const_iv): New hash table.
>   (expensive_addr_check_p): New boolean.
>   (init_inv_motion_data): Initialize new variables.>
>   (free_inv_motion_data): Release hash table.
>   (create_new_invariant): Set cheap_address to false for iv in
>   vfp_const_iv table.
>   (find_invariant_insn): Skip dependencies check for iv in vfp_const_iv
>   table.
>   (use_for_single_du): New function.
>   (reshuffle_insn_with_vfp): Likewise.
>   (find_invariants_bb): Call reshuffle_insn_with_vfp.

Is it possible for you to test this patch on spec2k6 on MIPS?
especially 436.cactusADM.
I am interested in the performance impact on other load/store machines.

Thanks.

Regards,
Jiong

^ permalink raw reply	[flat|nested] 37+ messages in thread

* RE: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
  2015-04-28 12:16                                         ` Jiong Wang
@ 2015-04-28 14:00                                           ` Matthew Fortune
  2015-04-28 14:31                                             ` Jiong Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Matthew Fortune @ 2015-04-28 14:00 UTC (permalink / raw)
  To: Jiong Wang; +Cc: Jeff Law, Steven Bosscher, gcc-patches

> Hi Matthew,
> 
> 2015-04-21 15:24 GMT+01:00 Jiong Wang <jiong.wang@arm.com>:
> 
> >
> > 2015-04-21  Jiong Wang  <jiong.wang@arm.com>
> >
> > gcc/
> >   * loop-invariant.c (find_defs): Enable DF_DU_CHAIN build.
> >   (vfp_const_iv): New hash table.
> >   (expensive_addr_check_p): New boolean.
> >   (init_inv_motion_data): Initialize new variables.>
> >   (free_inv_motion_data): Release hash table.
> >   (create_new_invariant): Set cheap_address to false for iv in
> >   vfp_const_iv table.
> >   (find_invariant_insn): Skip dependencies check for iv in
> vfp_const_iv
> >   table.
> >   (use_for_single_du): New function.
> >   (reshuffle_insn_with_vfp): Likewise.
> >   (find_invariants_bb): Call reshuffle_insn_with_vfp.
> 
> Is it possible for you to test this patch on spec2k6 on MIPS?
> especially 436.cactusADM.
> I am interested in the performance impact on other load/store machines.

I will certainly try and fit it in. It may take me a week to get back to
you though as I'm away on vacation for a few days.

Matthew 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
  2015-04-28 14:00                                           ` Matthew Fortune
@ 2015-04-28 14:31                                             ` Jiong Wang
  0 siblings, 0 replies; 37+ messages in thread
From: Jiong Wang @ 2015-04-28 14:31 UTC (permalink / raw)
  To: Matthew Fortune; +Cc: Jeff Law, Steven Bosscher, gcc-patches

2015-04-28 14:56 GMT+01:00 Matthew Fortune <Matthew.Fortune@imgtec.com>:
>> Hi Matthew,
>>
>> 2015-04-21 15:24 GMT+01:00 Jiong Wang <jiong.wang@arm.com>:
>>
>> >
>> > 2015-04-21  Jiong Wang  <jiong.wang@arm.com>
>> >
>> > gcc/
>> >   * loop-invariant.c (find_defs): Enable DF_DU_CHAIN build.
>> >   (vfp_const_iv): New hash table.
>> >   (expensive_addr_check_p): New boolean.
>> >   (init_inv_motion_data): Initialize new variables.>
>> >   (free_inv_motion_data): Release hash table.
>> >   (create_new_invariant): Set cheap_address to false for iv in
>> >   vfp_const_iv table.
>> >   (find_invariant_insn): Skip dependencies check for iv in
>> vfp_const_iv
>> >   table.
>> >   (use_for_single_du): New function.
>> >   (reshuffle_insn_with_vfp): Likewise.
>> >   (find_invariants_bb): Call reshuffle_insn_with_vfp.
>>
>> Is it possible for you to test this patch on spec2k6 on MIPS?
>> especially 436.cactusADM.
>> I am interested in the performance impact on other load/store machines.
>
> I will certainly try and fit it in. It may take me a week to get back to
> you though as I'm away on vacation for a few days.

Thanks very much. I appreciate that.

And you can add "-fdump-rtl-loop2_invariant", then we can see how many
new loop invariants found at RTL level across spec2k6 on MIPS.

Thanks.

Regards,
Jiong

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
  2015-04-24 17:05                                           ` Jiong Wang
@ 2015-05-14 20:04                                             ` Jeff Law
  2015-05-14 22:07                                               ` Jiong Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Jeff Law @ 2015-05-14 20:04 UTC (permalink / raw)
  To: Jiong Wang; +Cc: Steven Bosscher, gcc-patches, Kenneth Zadeck

On 04/24/2015 10:18 AM, Jiong Wang wrote:
>>> 2015-04-21  Jiong Wang  <jiong.wang@arm.com>
>>>
>>> gcc/
>>>     * loop-invariant.c (find_defs): Enable DF_DU_CHAIN build.
>>>     (vfp_const_iv): New hash table.
>>>     (expensive_addr_check_p): New boolean.
>>>     (init_inv_motion_data): Initialize new variables.>
>>>     (free_inv_motion_data): Release hash table.
>>>     (create_new_invariant): Set cheap_address to false for iv in
>>>     vfp_const_iv table.
>>>     (find_invariant_insn): Skip dependencies check for iv in vfp_const_iv
>>>     table.
>>>     (use_for_single_du): New function.
>>>     (reshuffle_insn_with_vfp): Likewise.
>>>     (find_invariants_bb): Call reshuffle_insn_with_vfp.
>>>
>>> gcc/testsuite/
>>>      * gcc.dg/pr62173.c: New testcase.
>> So ultimately isn't this just a specialized version of reassociation of
>> pointer arithmetic?  And if so, don't we have all the usual problems
>> around introducing overflow?
>>
>>
>> ISTM if this is going to go forward (ie, we somehow convince ourselves
>> that this kind of reassociation is OK), then should it be made to apply
>> on pointers in general rather than restricting to stack, frame,
>> virtual-frame?
>>
>
> Jeff,
>
>    Thanks for the review.
>
>    This transformation is not reassociation of pointer arithmetic.
                             ^^^

>
>    The idea of this patch is, given two instructions with variable value,
>    we may get new instruction sequences with fixed value by reassociating
>    their operands. And currently GCC only generate such instruction
>    sequences for local array accessing as far as I known.
Which is precisely reassociation of pointer arithmetic.

Given:
x = a + b
y = *(x + c)

You're generating:
x = a + c
y = *(x + b)

That's reassociation of pointer arithmetic.

For all kinds of reassociation we have to concern ourselves with adding 
overflow where it didn't already occur.  Assuming a 32 bit architecture 
we could get overflow if A is 0x7fffffff, b is -4 and and c = 3

0x7fffffff + -4 = 0x7ffffffb
0x7ffffffb + 3 = 0x7ffffffe

If you make the transformation you're suggesting we get

0x7fffffff + 3 = 0x80000002  OVERFLOW
0x80000002 - 4 = 0x7ffffffe

Now if you always know pointers are unsigned, then the overflow is 
defined and you'd be OK.  But that's a property of the target and one 
that's not well modeled within GCC (we have POINTER_EXTEND_UNSIGNED 
which kind of tells us something in this space).

In addition to worrying about overflow, you have to worry about 
segmented architectures with implicit segment selection -- especially if 
the segment selection comes from the base register than the entire 
effective address.

On such architectures a + b != b + a when used in a memory reference. 
The HPPA and mn103 ports immediately come to mind.  I'm sure there's at 
least one more since I recall helping someone with these issues in the 
past.  Anyway

So given
x = a + b;
y = *(x + c)

Can only be transformed into
x = a + c
y = *(x + b)

If and only if a + b would choose the same segment as a + c.   On the PA 
that wouldn't be true for something like

a = 0x4000000
b = 0x10
c = -4

a + b is 0x40000010
a + c is 0x3ffffffc

Those select different segments.  Even though the full address is the 
same (0x4000000c).

Sadly we don't expose this aspect of target characteristics at all.

To proceed, you'd probably need ways to conditionalize these 
transformations.  ie, target hooks.

Jeff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
  2015-05-14 20:04                                             ` Jeff Law
@ 2015-05-14 22:07                                               ` Jiong Wang
  2015-05-14 22:24                                                 ` Jeff Law
  0 siblings, 1 reply; 37+ messages in thread
From: Jiong Wang @ 2015-05-14 22:07 UTC (permalink / raw)
  To: Jeff Law; +Cc: Steven Bosscher, gcc-patches, Kenneth Zadeck

Jeff Law writes:

> For all kinds of reassociation we have to concern ourselves with adding 
> overflow where it didn't already occur.  Assuming a 32 bit architecture 
> we could get overflow if A is 0x7fffffff, b is -4 and and c = 3
>
> 0x7fffffff + -4 = 0x7ffffffb
> 0x7ffffffb + 3 = 0x7ffffffe
>
>
> If you make the transformation you're suggesting we get
>
> 0x7fffffff + 3 = 0x80000002  OVERFLOW
> 0x80000002 - 4 = 0x7ffffffe
>
> Now if you always know pointers are unsigned, then the overflow is 
> defined and you'd be OK.  But that's a property of the target and one 
> that's not well modeled within GCC (we have POINTER_EXTEND_UNSIGNED 
> which kind of tells us something in this space).

I see, understood, cool! Thanks for such detailed explanation.

Above scenario do may happen for general pointer arith
reassociation.

One thing may make life easier as my reassociation is restricted within
frame pointer. the "(plus (plus fp, index_reg) + const_off)" pattern was
to address some variable on stack. index_reg, const_off were part of
the stack offset of the variable. Reassociate them means reorder two
parts of the stack offset. There may be way to prove the transformation
will not add extra overflow risk, especially when the index_reg is
unsigned.

I understand for general pointer arith reassociation, there do have big
risk, as the involved operands largely come from irrelevant instruction,
no relationship between the values from those operands, we can deduce nothing.

>
> In addition to worrying about overflow, you have to worry about 
> segmented architectures with implicit segment selection -- especially if 
> the segment selection comes from the base register than the entire 
> effective address.
>

Hmm, understood!

This let me recall something as dark as x86 segment descriptor in protecting mode...

>
> Sadly we don't expose this aspect of target characteristics at all.
>
>
> To proceed, you'd probably need ways to conditionalize these 
> transformations.  ie, target hooks.
>
> Jeff

-- 
Regards,
Jiong

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
  2015-05-14 22:07                                               ` Jiong Wang
@ 2015-05-14 22:24                                                 ` Jeff Law
  2015-05-21 21:51                                                   ` Jiong Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Jeff Law @ 2015-05-14 22:24 UTC (permalink / raw)
  To: Jiong Wang; +Cc: Steven Bosscher, gcc-patches, Kenneth Zadeck

On 05/14/2015 03:13 PM, Jiong Wang wrote:
>
> Jeff Law writes:
>
>> For all kinds of reassociation we have to concern ourselves with adding
>> overflow where it didn't already occur.  Assuming a 32 bit architecture
>> we could get overflow if A is 0x7fffffff, b is -4 and and c = 3
>>
>> 0x7fffffff + -4 = 0x7ffffffb
>> 0x7ffffffb + 3 = 0x7ffffffe
>>
>>
>> If you make the transformation you're suggesting we get
>>
>> 0x7fffffff + 3 = 0x80000002  OVERFLOW
>> 0x80000002 - 4 = 0x7ffffffe
>>
>> Now if you always know pointers are unsigned, then the overflow is
>> defined and you'd be OK.  But that's a property of the target and one
>> that's not well modeled within GCC (we have POINTER_EXTEND_UNSIGNED
>> which kind of tells us something in this space).
>
> I see, understood, cool! Thanks for such detailed explanation.
>
> Above scenario do may happen for general pointer arith
> reassociation.
>
> One thing may make life easier as my reassociation is restricted within
> frame pointer. the "(plus (plus fp, index_reg) + const_off)" pattern was
> to address some variable on stack. index_reg, const_off were part of
> the stack offset of the variable. Reassociate them means reorder two
> parts of the stack offset. There may be way to prove the transformation
> will not add extra overflow risk, especially when the index_reg is
> unsigned.
>
> I understand for general pointer arith reassociation, there do have big
> risk, as the involved operands largely come from irrelevant instruction,
> no relationship between the values from those operands, we can deduce nothing.
Given the special status of SP, FP and ARGP and a known constant part, 
we can probably do something here.  More below...



>
>>
>> In addition to worrying about overflow, you have to worry about
>> segmented architectures with implicit segment selection -- especially if
>> the segment selection comes from the base register than the entire
>> effective address.
>>
>
> Hmm, understood!
>
> This let me recall something as dark as x86 segment descriptor in protecting mode...
Possibly, I've actually never studied the segmented aspects of the x86. 
  But I'm painfully familiar with the others mentioned :(

My recollection for the segmented stuff on the PA is we only had a 
single guard page at both ends of the segment.  So we only allowed an 
offset of +-4k when doing address reassociations in legitimize_address. 
  This was possible because we had callouts from the right places in the 
RTL generators/optimizers to allow targets to rewrite address 
arithmetic.  So we could naturally bury the target details away from the 
code generator/optimizers.

So we could possibly parameterize the transformation around similar 
concepts.  The design issue here is it's introducing more target 
dependencies in places where we've really wanted to avoid them.  In 
theory the gimple optimizers are supposed to be target independent. 
Reality is some stuff bleeds into them (the one that's mentioned the 
most often is branch costing, but there's others).

*If* we decide to go forward with using some target hooks here.  I'd be 
tempted to do 2.  One that's effective a tri-state.  Full reassociation, 
limited reassociation, no reassociation.  The second would bound the 
constants in the limited reassociation case.

Thoughts?

Jeff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
  2015-05-14 22:24                                                 ` Jeff Law
@ 2015-05-21 21:51                                                   ` Jiong Wang
  2015-05-27 16:11                                                     ` Jeff Law
  0 siblings, 1 reply; 37+ messages in thread
From: Jiong Wang @ 2015-05-21 21:51 UTC (permalink / raw)
  To: Jeff Law; +Cc: Steven Bosscher, gcc-patches, Kenneth Zadeck


Jeff Law writes:

> On 05/14/2015 03:13 PM, Jiong Wang wrote:
>>
>> Jeff Law writes:
>>
>>> For all kinds of reassociation we have to concern ourselves with adding
>>> overflow where it didn't already occur.  Assuming a 32 bit architecture
>>> we could get overflow if A is 0x7fffffff, b is -4 and and c = 3
>>>
>>> 0x7fffffff + -4 = 0x7ffffffb
>>> 0x7ffffffb + 3 = 0x7ffffffe
>>>
>>>
>>> If you make the transformation you're suggesting we get
>>>
>>> 0x7fffffff + 3 = 0x80000002  OVERFLOW
>>> 0x80000002 - 4 = 0x7ffffffe
>>>
>>> Now if you always know pointers are unsigned, then the overflow is
>>> defined and you'd be OK.  But that's a property of the target and one
>>> that's not well modeled within GCC (we have POINTER_EXTEND_UNSIGNED
>>> which kind of tells us something in this space).
>>
>> I see, understood, cool! Thanks for such detailed explanation.
>>
>> Above scenario do may happen for general pointer arith
>> reassociation.
>>
>> One thing may make life easier as my reassociation is restricted within
>> frame pointer. the "(plus (plus fp, index_reg) + const_off)" pattern was
>> to address some variable on stack. index_reg, const_off were part of
>> the stack offset of the variable. Reassociate them means reorder two
>> parts of the stack offset. There may be way to prove the transformation
>> will not add extra overflow risk, especially when the index_reg is
>> unsigned.
>>
>> I understand for general pointer arith reassociation, there do have big
>> risk, as the involved operands largely come from irrelevant instruction,
>> no relationship between the values from those operands, we can deduce nothing.
> Given the special status of SP, FP and ARGP and a known constant part, 
> we can probably do something here.  More below...
>
>
>
>>
>>>
>>> In addition to worrying about overflow, you have to worry about
>>> segmented architectures with implicit segment selection -- especially if
>>> the segment selection comes from the base register than the entire
>>> effective address.
>>>
>>
>> Hmm, understood!
>>
>> This let me recall something as dark as x86 segment descriptor in protecting mode...
> Possibly, I've actually never studied the segmented aspects of the x86. 
>   But I'm painfully familiar with the others mentioned :(
>
> My recollection for the segmented stuff on the PA is we only had a 
> single guard page at both ends of the segment.  So we only allowed an 
> offset of +-4k when doing address reassociations in legitimize_address. 
>   This was possible because we had callouts from the right places in the 
> RTL generators/optimizers to allow targets to rewrite address 
> arithmetic.  So we could naturally bury the target details away from the 
> code generator/optimizers.
>
> So we could possibly parameterize the transformation around similar 
> concepts.  The design issue here is it's introducing more target 
> dependencies in places where we've really wanted to avoid them.  In 
> theory the gimple optimizers are supposed to be target independent. 
> Reality is some stuff bleeds into them (the one that's mentioned the 
> most often is branch costing, but there's others).
>
> *If* we decide to go forward with using some target hooks here.  I'd be 
> tempted to do 2.  One that's effective a tri-state.  Full reassociation, 
> limited reassociation, no reassociation.  The second would bound the 
> constants in the limited reassociation case.
>
> Thoughts?

Thanks for these thoughts.

I tried but still can't prove this transformation will not introduce
extra pointer overflow even given it's reassociation with vfp, although
my first impression is it do will not introduce extra risk in real
application.

Have done a quick check on hppa's legitimize_address. I see for (plus
sym_ref, const_int), if const_int is beyond +-4K, then that hook will
force them into register, then (plus reg, reg) is always OK.

So for target hooks,  my understanding of your idea is something like:

 new hook targetm.pointer_arith_reassociate (), if return -1 then
 support full reassociation, 0 for limited, 1 for should not do any
 reassociation. the default version return -1 as most targets are OK to
 do reassociation given we can prove there is no introducing of overflow
 risk. While for target like HPPA, we should define this hook to return
 0 for limited support.

 Then, if targetm.pointer_arith_reassociate () return 1, we should
 further invoke the second hook targetm.limited_reassociate_p (rtx x),
 to check the reassociated rtx 'x' meets any restrictions, for example
 for HPPA, constants part shouldn't beyond +-4K.

not sure whether my understanding is correct.
 
-- 
Regards,
Jiong

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
  2015-05-21 21:51                                                   ` Jiong Wang
@ 2015-05-27 16:11                                                     ` Jeff Law
  2015-09-02 13:49                                                       ` Jiong Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Jeff Law @ 2015-05-27 16:11 UTC (permalink / raw)
  To: Jiong Wang; +Cc: Steven Bosscher, gcc-patches, Kenneth Zadeck

On 05/21/2015 02:46 PM, Jiong Wang wrote:
>
> Thanks for these thoughts.
>
> I tried but still can't prove this transformation will not introduce
> extra pointer overflow even given it's reassociation with vfp, although
> my first impression is it do will not introduce extra risk in real
> application.
>
> Have done a quick check on hppa's legitimize_address. I see for (plus
> sym_ref, const_int), if const_int is beyond +-4K, then that hook will
> force them into register, then (plus reg, reg) is always OK.
I'm virtually certain the PA's legitimize_address is not overflow safe. 
  It was written long before we started worrying about overflows in 
address computations.  It was mostly concerned with trying generate good 
addressing modes without running afoul of the implicit space register 
selection issues.

A SYMBOL_REF is always a valid base register.  However, as the comment 
in hppa_legitimize_address notes, we might be given a MEM for something 
like:  x[n-100000].

We don't want to rewrite that as (x-100000) + n, even though doing so 
would be beneficial for LICM.


>
> So for target hooks,  my understanding of your idea is something like:
>
>   new hook targetm.pointer_arith_reassociate (), if return -1 then
>   support full reassociation, 0 for limited, 1 for should not do any
>   reassociation. the default version return -1 as most targets are OK to
>   do reassociation given we can prove there is no introducing of overflow
>   risk. While for target like HPPA, we should define this hook to return
>   0 for limited support.
Right.  Rather than use magic constants, I'd suggest an enum for the 
tri-state.  FULL_PTR_REASSOCIATION, PARTIAL_PTR_REASSOCIATION, 
NO_PTR_REASSOCIATION.


>
>   Then, if targetm.pointer_arith_reassociate () return 1, we should
>   further invoke the second hook targetm.limited_reassociate_p (rtx x),
>   to check the reassociated rtx 'x' meets any restrictions, for example
>   for HPPA, constants part shouldn't beyond +-4K.
Right.

Jeff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
  2015-05-27 16:11                                                     ` Jeff Law
@ 2015-09-02 13:49                                                       ` Jiong Wang
  2015-09-02 20:52                                                         ` Jeff Law
  0 siblings, 1 reply; 37+ messages in thread
From: Jiong Wang @ 2015-09-02 13:49 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jeff Law


Jeff Law writes:

> On 05/21/2015 02:46 PM, Jiong Wang wrote:
>>
>> Thanks for these thoughts.
>>
>> I tried but still can't prove this transformation will not introduce
>> extra pointer overflow even given it's reassociation with vfp, although
>> my first impression is it do will not introduce extra risk in real
>> application.
>>
>> Have done a quick check on hppa's legitimize_address. I see for (plus
>> sym_ref, const_int), if const_int is beyond +-4K, then that hook will
>> force them into register, then (plus reg, reg) is always OK.
> I'm virtually certain the PA's legitimize_address is not overflow safe. 
>   It was written long before we started worrying about overflows in 
> address computations.  It was mostly concerned with trying generate good 
> addressing modes without running afoul of the implicit space register 
> selection issues.
>
> A SYMBOL_REF is always a valid base register.  However, as the comment 
> in hppa_legitimize_address notes, we might be given a MEM for something 
> like:  x[n-100000].
>
> We don't want to rewrite that as (x-100000) + n, even though doing so 
> would be beneficial for LICM.
>
>
>>
>> So for target hooks,  my understanding of your idea is something like:
>>
>>   new hook targetm.pointer_arith_reassociate (), if return -1 then
>>   support full reassociation, 0 for limited, 1 for should not do any
>>   reassociation. the default version return -1 as most targets are OK to
>>   do reassociation given we can prove there is no introducing of overflow
>>   risk. While for target like HPPA, we should define this hook to return
>>   0 for limited support.
> Right.  Rather than use magic constants, I'd suggest an enum for the 
> tri-state.  FULL_PTR_REASSOCIATION, PARTIAL_PTR_REASSOCIATION, 
> NO_PTR_REASSOCIATION.
>
>
>>
>>   Then, if targetm.pointer_arith_reassociate () return 1, we should
>>   further invoke the second hook targetm.limited_reassociate_p (rtx x),
>>   to check the reassociated rtx 'x' meets any restrictions, for example
>>   for HPPA, constants part shouldn't beyond +-4K.
> Right.
>
> Jeff

For the record, after Bin's recent tree-ssa-ivopt improvement originated
from PR62173, this patch is not benefitial anymore.

I can't see such re-shuffling opportunites in RTL level anymore. This
patch was trying to hoist those RTL sequences generated for local array
base address which haven't been hoisted out of the loop at tree level,
while now they are handled quite well by tree-ssa-ivopt.

During both aarch64 and mips64 bootstrapping, optimization in this patch
haven't been triggered while there were quite a few before Bin's tree level fix.

I have stopped working on this patch. Thanks for those time spent on
reviewing and discussing on this.

-- 
Regards,
Jiong

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
  2015-09-02 13:49                                                       ` Jiong Wang
@ 2015-09-02 20:52                                                         ` Jeff Law
  0 siblings, 0 replies; 37+ messages in thread
From: Jeff Law @ 2015-09-02 20:52 UTC (permalink / raw)
  To: Jiong Wang, gcc-patches

On 09/02/2015 07:36 AM, Jiong Wang wrote:

> For the record, after Bin's recent tree-ssa-ivopt improvement originated
> from PR62173, this patch is not benefitial anymore.
It happens sometimes.

>
> I have stopped working on this patch. Thanks for those time spent on
> reviewing and discussing on this.
No problem, thanks for testing after Bin's changes and letting me know 
that this has become a non-issue.

Jeff

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2015-09-02 20:51 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-12-04 11:00 [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting Jiong Wang
2014-12-04 11:07 ` Richard Biener
2014-12-04 11:07   ` Richard Biener
2014-12-04 19:32     ` Jiong Wang
2014-12-15 15:29       ` Jiong Wang
2014-12-15 15:36         ` Jiong Wang
2014-12-17 16:19           ` Richard Biener
2014-12-18 17:08             ` Jiong Wang
2014-12-18 21:16               ` Jiong Wang
2014-12-18 22:19               ` Segher Boessenkool
2014-12-19  4:06                 ` Bin.Cheng
2014-12-19 10:29                   ` Jiong Wang
2014-12-19 11:45                     ` Richard Biener
2014-12-19 15:31                       ` Kenneth Zadeck
2015-02-11 11:20                         ` Jiong Wang
2015-02-11 14:22                           ` Kenneth Zadeck
2015-02-11 18:18                             ` Jiong Wang
2015-04-14 15:06                               ` Jiong Wang
2015-04-14 16:49                                 ` Steven Bosscher
2015-04-14 17:24                                   ` Jeff Law
2015-04-14 21:49                                     ` Jiong Wang
2015-04-21 14:43                                       ` Jiong Wang
2015-04-24  1:55                                         ` Jeff Law
2015-04-24 17:05                                           ` Jiong Wang
2015-05-14 20:04                                             ` Jeff Law
2015-05-14 22:07                                               ` Jiong Wang
2015-05-14 22:24                                                 ` Jeff Law
2015-05-21 21:51                                                   ` Jiong Wang
2015-05-27 16:11                                                     ` Jeff Law
2015-09-02 13:49                                                       ` Jiong Wang
2015-09-02 20:52                                                         ` Jeff Law
2015-04-28 12:16                                         ` Jiong Wang
2015-04-28 14:00                                           ` Matthew Fortune
2015-04-28 14:31                                             ` Jiong Wang
2015-04-19 16:20                               ` Jiong Wang
2014-12-19 12:09                     ` Eric Botcazou
2014-12-19 15:21                   ` Segher Boessenkool

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).