Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: Jiong Wang <jiong.wang@arm.com>
To: Richard Biener <richard.guenther@gmail.com>
Cc: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting
Date: Mon, 15 Dec 2014 15:36:00 -0000	[thread overview]
Message-ID: <548EFE55.6090901@arm.com> (raw)
In-Reply-To: <548EFE0D.1070808@arm.com>

[-- Attachment #1: Type: text/plain, Size: 7445 bytes --]


On 15/12/14 15:28, Jiong Wang wrote:
> On 04/12/14 19:32, Jiong Wang wrote:
>> On 04/12/14 11:07, Richard Biener wrote:
>>
>>> On Thu, Dec 4, 2014 at 12:07 PM, Richard Biener
>>> <richard.guenther@gmail.com> wrote:
>>>> On Thu, Dec 4, 2014 at 12:00 PM, Jiong Wang <jiong.wang@arm.com> wrote:
>>>>
>>>>
>>>> which means re-associate the constant imm with the virtual frame pointer.
>>>>
>>>> transform
>>>>
>>>>         RA <- fixed_reg + RC
>>>>         RD <- MEM (RA + const_offset)
>>>>
>>>>      into:
>>>>
>>>>         RA <- fixed_reg + const_offset
>>>>         RD <- MEM (RA + RC)
>>>>
>>>> then RA <- fixed_reg + const_offset is actually loop invariant, so the later
>>>> RTL GCSE PRE pass could catch it and do the hoisting, and thus ameliorate
>>>> what tree
>>>> level ivopts could not sort out.
>>>> There is a LIM pass after gimple ivopts - if the invariantness is already
>>>> visible there why not handle it there similar to the special-cases in
>>>> rewrite_bittest and rewrite_reciprocal?
>> maybe, needs further check.
>>
>>>> And of course similar tricks could be applied on the RTL level to
>>>> RTL invariant motion?
>> Thanks. I just checked the code, yes, loop invariant motion pass
>> is the natural place to integrate such multi-insns invariant analysis trick.
>>
>> those code could be integrated into loop-invariant.c cleanly, but
>> then I found although the invariant detected successfully but it's not moved
>> out of loop because of cost issue, and looks like the patch committed to fix PR33928
>> is trying to prevent such cheap address be hoisted to reduce register pressure.
>>
>>
>>     805       /* ??? Try to determine cheapness of address computation.  Unfortunately
>>     806          the address cost is only a relative measure, we can't really compare
>>     807          it with any absolute number, but only with other address costs.
>>     808          But here we don't have any other addresses, so compare with a magic
>>     809          number anyway.  It has to be large enough to not regress PR33928
>>     810          (by avoiding to move reg+8,reg+16,reg+24 invariants), but small
>>     811          enough to not regress 410.bwaves either (by still moving reg+reg
>>     812          invariants).
>>     813          See http://gcc.gnu.org/ml/gcc-patches/2009-10/msg01210.html .  */
>>     814       inv->cheap_address = address_cost (SET_SRC (set), word_mode,
>>     815                                          ADDR_SPACE_GENERIC, speed) < 3;
>>
>>
>> I think that maybe necessary for x86 which is short of register, while for RISC,
>> it may not be that necessary, especially the whole register pressure is not big.
>>
>> currently, what I could think of is for this transformation below, we should
>> increase the costs:
>>     
>> A
>> ==
>>         RA <- virtual_stack_var + RC
>>         RD <- MEM (RA + const_offset)
>>
>>      into:
>>
>> B
>> ==
>>         RA <- virtual_stack_var + const_offset  <--B
>>         RD <- MEM (RA + RC)
>>
>> because the cost is not that cheap, if there is not re-assocation of virtual_stack_var
>> with const_offset, then lra elimination will create another instruction to hold the
>> elimination result, so format A will actually be
>>
>>         RT <- real_stack_pointer + elimination_offset
>>         RA <- RT + RC
>>         RD <- MEM (RA + const_offset)
>>
>> so, the re-assocation and later hoisting of invariant B could actually save two instructions
>> in the loop, this is why there are 15% perf gap for bzip2 under some situation.
> updated patch.
>
> moved this instruction shuffling trick to rtl loop invariant pass.
> as described above, this patch tries to transform A to B, so that
> after the transformation:
>
>     * RA <- virtual_stack_var + const_offset  could be hoisted out of the loop
>     * easy the work of lra elimination as virtual_stack_var is associated with const_offset
>       that the elimination offset could be combined with const_offset automatically.
>
> current rtl loop invariant pass treat "reg <- reg + off" as cheap address, while although
> "reg <- virtual_stack_var + offset" fall into the same format, but it's not that cheap as
> we could also save one lra elimination instruction. so this patch will mark
> "reg <- virtual_stack_var + offset" transformed from A to be expensive, so that it could be
> hoisted later.
>
> after patch, pr62173 is fixed on powerpc64, while *still not on aarch64*. because there are one
> glitch in aarch64_legitimize_address which cause unnecessary complex instructions sequences
> generated when legitimize some addresses. and if we fix that, we will get cheaper address for
> those cases which is generally good, and the cheaper address will cause tree level IVOPT do
> more IVOPT optimization which is generally good also, but from the speck2k result, there
> are actually around 1% code size regression on two cases, the reason is for target support
> post-index address, doing IVOPT may not always be the best choice because we lost the advantage
> of using post-index addressing.
>
> on aarch64, for the following testcase, the ivopted version is complexer than not ivopted version.
>
>       while (oargc--) *(nargv++) = *(oargv++);
>    
> so, I sent the generic fix here only, as it's an independent patch, and could benefit other targets
> like powerpc64 although the issue on aarch64 is still not resolved before we figure out how to let
> the correct fix on aarch64_legitimize_address do not cause code size regression on benchmark caused
> by sub-optimal tree level IVOPT.
>
> and the testcase is marked to run on powerpc64 only at current time.
>
> bootstrap OK on x86-64, no regression.
> bootstrap OK on AArch64, and from speck2k compile dump, there do have a few more RTL loop
> invariants get hoisted.
>
> ok for trunk?
>
> gcc/
> PR62173
>     loop-invariant.c.c (expensive_addr): New hash_table.
>     (need_expensive_addr_check_p): New bool.
>     (find_exits): Rename to "find_exists_and_reshuffle.
>     Support re-shuffle instructions for better loop invariant hoisting.
>     (create_new_invariant): Mark address as expensive if it's generated by re-shuffle.
>     (init_inv_motion_data): Initialize expensive_addr and need_expensive_addr_check_p.
>     (free_inv_motion_data): Release expensive_addr.
>
> gcc/testssuite/
>     PR62173
>     gcc.dg/pr62173.c: New test.

sorry, patch uploaded.

>
>>> Oh, and the patch misses a testcase.
>> It's at the start of the email, will include in the patch.
>>
>> void bar(int i) {
>>      char A[10];
>>      int d = 0;
>>      while (i > 0)
>>      A[d++] = i--;
>>
>>      while (d > 0)
>>      foo(A[d--]);
>> }
>>
>>>> Thanks,
>>>> Richard.
>>>>
>>>>> and this patch only tries to re-shuffle instructions within single basic
>>>>> block which
>>>>> is a inner loop which is perf critical.
>>>>>
>>>>> I am reusing the loop info in fwprop because there is loop info and it's run
>>>>> before
>>>>> GCSE.
>>>>>
>>>>> verified on aarch64 and mips64, the array base address hoisted out of loop.
>>>>>
>>>>> bootstrap ok on x86-64 and aarch64.
>>>>>
>>>>> comments?
>>>>>
>>>>> thanks.
>>>>>
>>>>> gcc/
>>>>>      PR62173
>>>>>      fwprop.c (prepare_for_gcse_pre): New function.
>>>>>      (fwprop_done): Call it.
>>
>
>

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: rtl-iv.patch --]
[-- Type: text/x-patch; name=rtl-iv.patch, Size: 6874 bytes --]

diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
index 19e536f..7a72f10 100644
--- a/gcc/loop-invariant.c
+++ b/gcc/loop-invariant.c
@@ -186,6 +186,8 @@ typedef struct invariant *invariant_p;
 /* The invariants.  */
 
 static vec<invariant_p> invariants;
+static hash_table <pointer_hash <rtx_insn> > *expensive_addr;
+static bool need_expensive_addr_check_p;
 
 /* Check the size of the invariant table and realloc if necessary.  */
 
@@ -576,18 +578,37 @@ compute_always_reached (struct loop *loop, basic_block *body,
 
 /* Finds exits out of the LOOP with body BODY.  Marks blocks in that we may
    exit the loop by cfg edge to HAS_EXIT and MAY_EXIT.  In MAY_EXIT
-   additionally mark blocks that may exit due to a call.  */
+   additionally mark blocks that may exit due to a call.
+
+   This function also try to transform
+
+     RA <- fixed_reg + RC
+     RD <- MEM (RA + const_offset)
+
+  into:
+
+     RA <- fixed_reg + const_offset
+     RD <- MEM (RA + RC) <- pos0
+
+  If RA is DEAD after pos0.
+
+  After this change, the first instruction is loop invariant.  */
 
 static void
-find_exits (struct loop *loop, basic_block *body,
-	    bitmap may_exit, bitmap has_exit)
+find_exits_and_reshuffle (struct loop *loop, basic_block *body,
+			  bitmap may_exit, bitmap has_exit)
 {
   unsigned i;
   edge_iterator ei;
   edge e;
   struct loop *outermost_exit = loop, *aexit;
   bool has_call = false;
-  rtx_insn *insn;
+  rtx_insn *insn, *next_insn;
+  rtx single_set1, single_set2, old_dest;
+  rtx op0, op0_;
+  rtx old_op1, op1, op1_;
+  rtx inner;
+  rtx *mem_plus_loc;
 
   for (i = 0; i < loop->num_nodes; i++)
     {
@@ -597,12 +618,116 @@ find_exits (struct loop *loop, basic_block *body,
 	    {
 	      if (CALL_P (insn)
 		  && (RTL_LOOPING_CONST_OR_PURE_CALL_P (insn)
-		      || !RTL_CONST_OR_PURE_CALL_P (insn)))
+		      || !RTL_CONST_OR_PURE_CALL_P (insn))
+		  && ! has_call)
 		{
 		  has_call = true;
 		  bitmap_set_bit (may_exit, i);
-		  break;
 		}
+
+	      if (! NONDEBUG_INSN_P (insn))
+		continue;
+
+	      single_set1 = single_set (insn);
+
+	      if (! single_set1
+		  || GET_CODE (SET_SRC (single_set1)) != PLUS)
+		continue;
+
+	      old_dest = SET_DEST (single_set1);
+	      op0 = XEXP (SET_SRC (single_set1), 0);
+	      old_op1 = op1 = XEXP (SET_SRC (single_set1), 1);
+
+	      if (op1 == frame_pointer_rtx
+		  || op1 == stack_pointer_rtx
+		  || op1 == virtual_stack_vars_rtx)
+		{
+		  old_op1 = op0;
+		  std::swap (op0, op1);
+		}
+
+	      if (GET_CODE (op1) == SIGN_EXTEND
+		  || GET_CODE (op1) == ZERO_EXTEND)
+		op1 = XEXP (op1, 0);
+
+	      if (! (REG_P (old_dest) && REG_P (op0) && REG_P (op1)
+		     && (op0 == frame_pointer_rtx
+			 || op0 == stack_pointer_rtx
+			 || op0 == virtual_stack_vars_rtx)))
+		continue;
+
+	      if (! (next_insn = next_real_insn (insn)))
+		break;
+
+	      do
+		{
+		  if (DEBUG_INSN_P (next_insn))
+		    continue;
+
+		  single_set2 = single_set (next_insn);
+
+		  if (!single_set2
+		      || (! REG_P (SET_DEST (single_set2))
+			  && ! MEM_P (SET_DEST (single_set2))))
+		    continue;
+
+		  if (REG_P (SET_DEST (single_set2)))
+		    inner = SET_SRC (single_set2);
+		  else
+		    inner = SET_DEST (single_set2);
+
+		  if (GET_CODE (inner) == ZERO_EXTEND
+		      || GET_CODE (inner) == SIGN_EXTEND
+		      || GET_CODE (inner) == TRUNCATE)
+		    inner = XEXP (inner, 0);
+
+		  if (! MEM_P (inner)
+		      || GET_CODE (XEXP (inner, 0)) != PLUS)
+		    continue;
+
+		  mem_plus_loc = &XEXP (inner, 0);
+		  op0_ = XEXP (XEXP (inner, 0), 0);
+		  op1_ = XEXP (XEXP (inner, 0), 1);
+
+		  if (REG_P (op0_) && CONST_INT_P (op1_)
+		      && rtx_equal_p (op0_, old_dest)
+		      && GET_MODE (op0_) == GET_MODE (old_op1))
+		    {
+		      rtx new_src;
+
+		      if (find_regno_note (next_insn, REG_DEAD,
+					   REGNO (old_dest)))
+			{
+			  new_src = plus_constant (GET_MODE (op0), op0,
+						   INTVAL (op1_));
+			  validate_change (insn, &SET_SRC (single_set1),
+					   new_src, 1);
+			  new_src = gen_rtx_PLUS (GET_MODE (op0_), op0_,
+						  old_op1);
+			  validate_change (next_insn, mem_plus_loc, new_src, 1);
+			  if (apply_change_group ())
+			    {
+			      rtx_insn **slot =
+				expensive_addr->find_slot (insn, INSERT);
+
+			      if (*slot)
+				gcc_unreachable ();
+			      else
+				*slot = insn;
+
+			      need_expensive_addr_check_p = true;
+
+			      if (dump_file)
+				fprintf (dump_file,
+					 "\nRe-associate insn %d and %d for later"
+					 " RTL loop invariant hoisting.\n",
+					 INSN_UID (insn), INSN_UID (next_insn));
+			    }
+			}
+		      break;
+		    }
+		} while ((next_insn = next_real_insn (next_insn))
+			 && body[i] == BLOCK_FOR_INSN (next_insn));
 	    }
 
 	  FOR_EACH_EDGE (e, ei, body[i]->succs)
@@ -725,6 +850,9 @@ create_new_invariant (struct def *def, rtx_insn *insn, bitmap depends_on,
 	 See http://gcc.gnu.org/ml/gcc-patches/2009-10/msg01210.html .  */
       inv->cheap_address = address_cost (SET_SRC (set), word_mode,
 					 ADDR_SPACE_GENERIC, speed) < 3;
+      if (need_expensive_addr_check_p
+	  && expensive_addr->find (insn))
+	inv->cheap_address = false;
     }
   else
     {
@@ -1043,7 +1171,7 @@ find_invariants (struct loop *loop)
   bitmap always_executed = BITMAP_ALLOC (NULL);
   basic_block *body = get_loop_body_in_dom_order (loop);
 
-  find_exits (loop, body, may_exit, has_exit);
+  find_exits_and_reshuffle (loop, body, may_exit, has_exit);
   compute_always_reached (loop, body, may_exit, always_reached);
   compute_always_reached (loop, body, has_exit, always_executed);
 
@@ -1628,6 +1756,9 @@ move_invariants (struct loop *loop)
 static void
 init_inv_motion_data (void)
 {
+  need_expensive_addr_check_p = false;
+  expensive_addr = new hash_table <pointer_hash <rtx_insn> > (4);
+
   actual_stamp = 1;
 
   invariants.create (100);
@@ -1663,6 +1794,8 @@ free_inv_motion_data (void)
       free (inv);
     }
   invariants.release ();
+
+  delete expensive_addr;
 }
 
 /* Move the invariants out of the LOOP.  */
diff --git a/gcc/testsuite/gcc.dg/pr62173.c b/gcc/testsuite/gcc.dg/pr62173.c
new file mode 100644
index 0000000..a5a1a86
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr62173.c
@@ -0,0 +1,19 @@
+/* { dg-do compile { target { powerpc64-*-* } } } */
+/* { dg-options "-O2 -fdump-rtl-loop2_invariant" } */
+
+void foo (char);
+
+void
+bar(int i)
+{
+  char A[10];
+  int d = 0;
+  while (i > 0)
+    A[d++] = i--;
+
+  while (d > 0)
+    foo (A[d--]);
+}
+
+/* { dg-final { scan-rtl-dump "Re-associate insn .* loop invariant hoisting" "loop2_invariant" } } */
+/* { dg-final { cleanup-rtl-dump "loop2_invariant" } } */

next prev parent reply	other threads:[~2014-12-15 15:29 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-12-04 11:00 Jiong Wang
2014-12-04 11:07 ` Richard Biener
2014-12-04 11:07   ` Richard Biener
2014-12-04 19:32     ` Jiong Wang
2014-12-15 15:29       ` Jiong Wang
2014-12-15 15:36         ` Jiong Wang [this message]
2014-12-17 16:19           ` Richard Biener
2014-12-18 17:08             ` Jiong Wang
2014-12-18 21:16               ` Jiong Wang
2014-12-18 22:19               ` Segher Boessenkool
2014-12-19  4:06                 ` Bin.Cheng
2014-12-19 10:29                   ` Jiong Wang
2014-12-19 11:45                     ` Richard Biener
2014-12-19 15:31                       ` Kenneth Zadeck
2015-02-11 11:20                         ` Jiong Wang
2015-02-11 14:22                           ` Kenneth Zadeck
2015-02-11 18:18                             ` Jiong Wang
2015-04-14 15:06                               ` Jiong Wang
2015-04-14 16:49                                 ` Steven Bosscher
2015-04-14 17:24                                   ` Jeff Law
2015-04-14 21:49                                     ` Jiong Wang
2015-04-21 14:43                                       ` Jiong Wang
2015-04-24  1:55                                         ` Jeff Law
2015-04-24 17:05                                           ` Jiong Wang
2015-05-14 20:04                                             ` Jeff Law
2015-05-14 22:07                                               ` Jiong Wang
2015-05-14 22:24                                                 ` Jeff Law
2015-05-21 21:51                                                   ` Jiong Wang
2015-05-27 16:11                                                     ` Jeff Law
2015-09-02 13:49                                                       ` Jiong Wang
2015-09-02 20:52                                                         ` Jeff Law
2015-04-28 12:16                                         ` Jiong Wang
2015-04-28 14:00                                           ` Matthew Fortune
2015-04-28 14:31                                             ` Jiong Wang
2015-04-19 16:20                               ` Jiong Wang
2014-12-19 12:09                     ` Eric Botcazou
2014-12-19 15:21                   ` Segher Boessenkool

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=548EFE55.6090901@arm.com \
    --to=jiong.wang@arm.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=richard.guenther@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).