* [RFC] Run pass_sink_code once more after ivopts/fre
From: Xiong Hu Luo @ 2020-12-21 9:03 UTC
To: gcc-patches
Cc: segher, dje.gcc, wschmidt, guojiufu, linkw, rguenther, Xiong Hu Luo

Here comes another case that requires running a pass once more.  Since
this is not the commonly suggested direction for solving problems, I am
not quite sure whether it is a reasonable fix here.  The source code is
something like:

ref = ip + *hslot;
while (ip < in_end - 2) {
  unsigned int len = 2;
  len++;
  for () {
    do len++;
    while (len < maxlen && ref[len] == ip[len]); //sink code here.
    break;
  }
  len -= 2;
  ip++;
  ip += len + 1;
  if (ip >= in_end - 2)
    break;
}

Before ivopts, the gimple for the inner while loop is (xxx.c.172t.slp1):

  <bb 31> [local count: 75120046]:
  # len_160 = PHI <len_161(30), len_189(58)>
  len_189 = len_160 + 1;
  _423 = (sizetype) len_189;
  _424 = ip_229 + _423;
  if (maxlen_186 > len_189)
    goto <bb 32>; [94.50%]
  else
    goto <bb 33>; [5.50%]

  <bb 32> [local count: 70988443]:
  _84 = *_424;
  _86 = ref_182 + _423;
  _87 = *_86;
  if (_84 == _87)
    goto <bb 58>; [94.50%]
  else
    goto <bb 33>; [5.50%]

  <bb 58> [local count: 67084079]:
  goto <bb 31>; [100.00%]

  <bb 33> [local count: 14847855]:
  # len_263 = PHI <len_160(32), len_160(31)>
  # _262 = PHI <_423(32), _423(31)>
  # _264 = PHI <_424(32), _424(31)>
  len_190 = len_263 + 4294967295;
  if (len_190 <= 6)
    goto <bb 34>; [0.00%]
  else
    goto <bb 36>; [100.00%]

Then in ivopts the instructions are updated to (xxx.c.174t.ivopts):

  <bb 31> [local count: 75120046]:
  # ivtmp.30_29 = PHI <ivtmp.30_32(30), ivtmp.30_31(58)>
  _34 = (unsigned int) ivtmp.30_29;
  len_160 = _34 + 4294967295;
  _423 = ivtmp.30_29;
  _35 = (unsigned long) ip_229;
  _420 = ivtmp.30_29 + _35;
  _419 = (uint8_t *) _420;
  _424 = _419;
  len_418 = (unsigned int) ivtmp.30_29;
  if (maxlen_186 > len_418)
    goto <bb 32>; [94.50%]
  else
    goto <bb 33>; [5.50%]

  <bb 32> [local count: 70988443]:
  _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_29 * 1];
  ivtmp.30_31 = ivtmp.30_29 + 1;
  _417 = ref_182 + 18446744073709551615;
  _87 = MEM[(uint8_t *)_417 + ivtmp.30_31 * 1];
  if (_84 == _87)
    goto <bb 58>; [94.50%]
  else
    goto <bb 33>; [5.50%]

  <bb 58> [local count: 67084079]:
  goto <bb 31>; [100.00%]

  <bb 33> [local count: 14847855]:
  # len_263 = PHI <len_160(32), len_160(31)>
  # _262 = PHI <_423(32), _423(31)>
  # _264 = PHI <_424(32), _424(31)>
  len_190 = len_263 + 4294967295;
  if (len_190 <= 6)
    goto <bb 34>; [0.00%]
  else
    goto <bb 36>; [100.00%]

Some instructions in BB 31 are not used inside the loop and could be sunk
out of the loop to reduce the computation, but they are never sunk by any
of the later passes.  Running the sink_code pass once more, at least after
fre5, improves this typical case's performance by 23% because fewer
instructions are executed in the loop.  xxx.c.209t.sink2:

Sinking _419 = (uint8_t *) _420;
 from bb 31 to bb 89
Sinking _420 = ivtmp.30_29 + _35;
 from bb 31 to bb 89
Sinking _35 = (unsigned long) ip_229;
 from bb 31 to bb 89
Sinking len_160 = _34 + 4294967295;
 from bb 31 to bb 33

I also tested SPEC2017 performance on P8LE: 544.nab_r is improved by
2.43%, with no big changes to the other cases; GEOMEAN improves by a
small 0.25%.

The reason it should run after fre5 is that fre does some phi
optimization that exposes the opportunity.  The patch puts it after
pass_modref based on my guess that some gimple optimizations like
thread_jumps, dse, dce etc. could provide more opportunities for sinking
code; I am not sure it is the correct place to put it.  I also verified
that this issue exists on both X86 and ARM64.
Any comments?  Thanks.

---
 gcc/passes.def      | 1 +
 gcc/tree-ssa-sink.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/gcc/passes.def b/gcc/passes.def
index 21b2e2af0f7..69106615729 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -355,6 +355,7 @@ along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_uncprop);
       NEXT_PASS (pass_local_pure_const);
       NEXT_PASS (pass_modref);
+      NEXT_PASS (pass_sink_code);
  POP_INSERT_PASSES ()
  NEXT_PASS (pass_all_optimizations_g);
  PUSH_INSERT_PASSES_WITHIN (pass_all_optimizations_g)
diff --git a/gcc/tree-ssa-sink.c b/gcc/tree-ssa-sink.c
index b0abf4147d6..824659f3919 100644
--- a/gcc/tree-ssa-sink.c
+++ b/gcc/tree-ssa-sink.c
@@ -819,6 +819,7 @@ public:
   /* opt_pass methods: */
   virtual bool gate (function *) { return flag_tree_sink != 0; }
   virtual unsigned int execute (function *);
+  opt_pass *clone (void) { return new pass_sink_code (m_ctxt); }
 
 }; // class pass_sink_code
-- 
2.27.0.90.geebb51ba8c
* Re: [RFC] Run pass_sink_code once more after ivopts/fre 2020-12-21 9:03 [RFC] Run pass_sink_code once more after ivopts/fre Xiong Hu Luo @ 2020-12-22 16:53 ` Richard Biener 2021-03-23 3:07 ` Xionghu Luo 0 siblings, 1 reply; 20+ messages in thread From: Richard Biener @ 2020-12-22 16:53 UTC (permalink / raw) To: Xiong Hu Luo, gcc-patches; +Cc: segher, dje.gcc, wschmidt, guojiufu, linkw On December 21, 2020 10:03:43 AM GMT+01:00, Xiong Hu Luo <luoxhu@linux.ibm.com> wrote: >Here comes another case that requires run a pass once more, as this is >not the common suggested direction to solve problems, not quite sure >whether it is still a reasonble fix here. Source code is something >like: > >ref = ip + *hslot; >while (ip < in_end - 2) { > unsigned int len = 2; > len++; > for () { > do len++; > while (len < maxlen && ref[len] == ip[len]); //sink code here. > break; > } > len -= 2; > ip++; > ip += len + 1; > if (ip >= in_end - 2) > break; >} > >Before ivopts, the gimple for inner while loop is xxx.c.172t.slp1: > > <bb 31> [local count: 75120046]: > # len_160 = PHI <len_161(30), len_189(58)> > len_189 = len_160 + 1; > _423 = (sizetype) len_189; > _424 = ip_229 + _423; > if (maxlen_186 > len_189) > goto <bb 32>; [94.50%] > else > goto <bb 33>; [5.50%] > > <bb 32> [local count: 70988443]: > _84 = *_424; > _86 = ref_182 + _423; > _87 = *_86; > if (_84 == _87) > goto <bb 58>; [94.50%] > else > goto <bb 33>; [5.50%] > > <bb 58> [local count: 67084079]: > goto <bb 31>; [100.00%] > > <bb 33> [local count: 14847855]: > # len_263 = PHI <len_160(32), len_160(31)> > # _262 = PHI <_423(32), _423(31)> > # _264 = PHI <_424(32), _424(31)> > len_190 = len_263 + 4294967295; > if (len_190 <= 6) > goto <bb 34>; [0.00%] > else > goto <bb 36>; [100.00%] > >Then in ivopts, instructions are updated to xxx.c.174t.ivopts: > > <bb 31> [local count: 75120046]: > # ivtmp.30_29 = PHI <ivtmp.30_32(30), ivtmp.30_31(58)> > _34 = (unsigned int) ivtmp.30_29; > len_160 = _34 + 4294967295; > _423 = 
ivtmp.30_29; > _35 = (unsigned long) ip_229; > _420 = ivtmp.30_29 + _35; > _419 = (uint8_t *) _420; > _424 = _419; > len_418 = (unsigned int) ivtmp.30_29; > if (maxlen_186 > len_418) > goto <bb 32>; [94.50%] > else > goto <bb 33>; [5.50%] > > <bb 32> [local count: 70988443]: > _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_29 * 1]; > ivtmp.30_31 = ivtmp.30_29 + 1; > _417 = ref_182 + 18446744073709551615; > _87 = MEM[(uint8_t *)_417 + ivtmp.30_31 * 1]; > if (_84 == _87) > goto <bb 58>; [94.50%] > else > goto <bb 33>; [5.50%] > > <bb 58> [local count: 67084079]: > goto <bb 31>; [100.00%] > > <bb 33> [local count: 14847855]: > # len_263 = PHI <len_160(32), len_160(31)> > # _262 = PHI <_423(32), _423(31)> > # _264 = PHI <_424(32), _424(31)> > len_190 = len_263 + 4294967295; > if (len_190 <= 6) > goto <bb 34>; [0.00%] > else > goto <bb 36>; [100.00%] > >Some instructions in BB 31 are not used in the loop and could be sinked >out of loop to reduce the computation, but they are not sinked >throughout all passes later. Run the sink_code pass once more at least >after fre5 could improve this typical case performance 23% due to few >instructions exausted in loop. >xxx.c.209t.sink2: > >Sinking _419 = (uint8_t *) _420; > from bb 31 to bb 89 >Sinking _420 = ivtmp.30_29 + _35; > from bb 31 to bb 89 >Sinking _35 = (unsigned long) ip_229; > from bb 31 to bb 89 >Sinking len_160 = _34 + 4294967295; > from bb 31 to bb 33 > >I also tested the SPEC2017 performance on P8LE, 544.nab_r is improved >by 2.43%, but no big changes to other cases, GEOMEAN is improved quite >small with 0.25%. > >The reason why it should be run after fre5 is fre would do some phi >optimization to expose the optimization. The patch put it after >pass_modref is due to my guess that some gimple optimizations like >thread_jumps, dse, dce etc. could provide more opportunities for >sinking code. Not sure it is the correct place to put. I also >verified this issue exists in both X86 and ARM64. >Any comments? Thanks. 
It definitely should be before uncprop (but context stops there).  And
yes, re-running passes isn't the very, very best thing to do without
explaining why it cannot be done in other ways.  Not for late stage 3
anyway.

Richard.

>---
> gcc/passes.def      | 1 +
> gcc/tree-ssa-sink.c | 1 +
> 2 files changed, 2 insertions(+)
>
>diff --git a/gcc/passes.def b/gcc/passes.def
>index 21b2e2af0f7..69106615729 100644
>--- a/gcc/passes.def
>+++ b/gcc/passes.def
>@@ -355,6 +355,7 @@ along with GCC; see the file COPYING3.  If not see
>       NEXT_PASS (pass_uncprop);
>       NEXT_PASS (pass_local_pure_const);
>       NEXT_PASS (pass_modref);
>+      NEXT_PASS (pass_sink_code);
>  POP_INSERT_PASSES ()
>  NEXT_PASS (pass_all_optimizations_g);
>  PUSH_INSERT_PASSES_WITHIN (pass_all_optimizations_g)
>diff --git a/gcc/tree-ssa-sink.c b/gcc/tree-ssa-sink.c
>index b0abf4147d6..824659f3919 100644
>--- a/gcc/tree-ssa-sink.c
>+++ b/gcc/tree-ssa-sink.c
>@@ -819,6 +819,7 @@ public:
>   /* opt_pass methods: */
>   virtual bool gate (function *) { return flag_tree_sink != 0; }
>   virtual unsigned int execute (function *);
>+  opt_pass *clone (void) { return new pass_sink_code (m_ctxt); }
>
> }; // class pass_sink_code
* Re: [RFC] Run pass_sink_code once more after ivopts/fre 2020-12-22 16:53 ` Richard Biener @ 2021-03-23 3:07 ` Xionghu Luo 2021-03-23 8:50 ` Richard Biener 0 siblings, 1 reply; 20+ messages in thread From: Xionghu Luo @ 2021-03-23 3:07 UTC (permalink / raw) To: Richard Biener, gcc-patches; +Cc: segher, dje.gcc, wschmidt, guojiufu, linkw [-- Attachment #1: Type: text/plain, Size: 4361 bytes --] On 2020/12/23 00:53, Richard Biener wrote: > On December 21, 2020 10:03:43 AM GMT+01:00, Xiong Hu Luo <luoxhu@linux.ibm.com> wrote: >> Here comes another case that requires run a pass once more, as this is >> not the common suggested direction to solve problems, not quite sure >> whether it is still a reasonble fix here. Source code is something >> like: >> >> ref = ip + *hslot; >> while (ip < in_end - 2) { >> unsigned int len = 2; >> len++; >> for () { >> do len++; >> while (len < maxlen && ref[len] == ip[len]); //sink code here. >> break; >> } >> len -= 2; >> ip++; >> ip += len + 1; >> if (ip >= in_end - 2) >> break; >> } >> >> Before ivopts, the gimple for inner while loop is xxx.c.172t.slp1: >> >> <bb 31> [local count: 75120046]: >> # len_160 = PHI <len_161(30), len_189(58)> >> len_189 = len_160 + 1; >> _423 = (sizetype) len_189; >> _424 = ip_229 + _423; >> if (maxlen_186 > len_189) >> goto <bb 32>; [94.50%] >> else >> goto <bb 33>; [5.50%] >> >> <bb 32> [local count: 70988443]: >> _84 = *_424; >> _86 = ref_182 + _423; >> _87 = *_86; >> if (_84 == _87) >> goto <bb 58>; [94.50%] >> else >> goto <bb 33>; [5.50%] >> >> <bb 58> [local count: 67084079]: >> goto <bb 31>; [100.00%] >> >> <bb 33> [local count: 14847855]: >> # len_263 = PHI <len_160(32), len_160(31)> >> # _262 = PHI <_423(32), _423(31)> >> # _264 = PHI <_424(32), _424(31)> >> len_190 = len_263 + 4294967295; >> if (len_190 <= 6) >> goto <bb 34>; [0.00%] >> else >> goto <bb 36>; [100.00%] >> >> Then in ivopts, instructions are updated to xxx.c.174t.ivopts: >> >> <bb 31> [local count: 75120046]: >> # ivtmp.30_29 = 
PHI <ivtmp.30_32(30), ivtmp.30_31(58)> >> _34 = (unsigned int) ivtmp.30_29; >> len_160 = _34 + 4294967295; >> _423 = ivtmp.30_29; >> _35 = (unsigned long) ip_229; >> _420 = ivtmp.30_29 + _35; >> _419 = (uint8_t *) _420; >> _424 = _419; >> len_418 = (unsigned int) ivtmp.30_29; >> if (maxlen_186 > len_418) >> goto <bb 32>; [94.50%] >> else >> goto <bb 33>; [5.50%] >> >> <bb 32> [local count: 70988443]: >> _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_29 * 1]; >> ivtmp.30_31 = ivtmp.30_29 + 1; >> _417 = ref_182 + 18446744073709551615; >> _87 = MEM[(uint8_t *)_417 + ivtmp.30_31 * 1]; >> if (_84 == _87) >> goto <bb 58>; [94.50%] >> else >> goto <bb 33>; [5.50%] >> >> <bb 58> [local count: 67084079]: >> goto <bb 31>; [100.00%] >> >> <bb 33> [local count: 14847855]: >> # len_263 = PHI <len_160(32), len_160(31)> >> # _262 = PHI <_423(32), _423(31)> >> # _264 = PHI <_424(32), _424(31)> >> len_190 = len_263 + 4294967295; >> if (len_190 <= 6) >> goto <bb 34>; [0.00%] >> else >> goto <bb 36>; [100.00%] >> >> Some instructions in BB 31 are not used in the loop and could be sinked >> out of loop to reduce the computation, but they are not sinked >> throughout all passes later. Run the sink_code pass once more at least >> after fre5 could improve this typical case performance 23% due to few >> instructions exausted in loop. >> xxx.c.209t.sink2: >> >> Sinking _419 = (uint8_t *) _420; >> from bb 31 to bb 89 >> Sinking _420 = ivtmp.30_29 + _35; >> from bb 31 to bb 89 >> Sinking _35 = (unsigned long) ip_229; >> from bb 31 to bb 89 >> Sinking len_160 = _34 + 4294967295; >> from bb 31 to bb 33 >> >> I also tested the SPEC2017 performance on P8LE, 544.nab_r is improved >> by 2.43%, but no big changes to other cases, GEOMEAN is improved quite >> small with 0.25%. >> >> The reason why it should be run after fre5 is fre would do some phi >> optimization to expose the optimization. 
The patch put it after >> pass_modref is due to my guess that some gimple optimizations like >> thread_jumps, dse, dce etc. could provide more opportunities for >> sinking code. Not sure it is the correct place to put. I also >> verified this issue exists in both X86 and ARM64. >> Any comments? Thanks. > > It definitely should be before uncprop (but context stops there). And yes, re-running passes isn't the very, very best thing to do without explaining it cannot be done in other ways. Not for late stage 3 anyway. > > Richard. > Thanks. Also tried to implement this in a seperate RTL pass, which would be better? I guess this would also be stage1 issues... Xionghu [-- Attachment #2: 0001-RTL-loop-sink-pass.patch --] [-- Type: text/plain, Size: 18658 bytes --] From ac6e161a592ea259f2807f40d1021ccbfc3a965f Mon Sep 17 00:00:00 2001 From: Xiong Hu Luo <luoxhu@linux.ibm.com> Date: Thu, 4 Mar 2021 05:05:19 -0600 Subject: [PATCH] RTL loop sink pass This is a rtl loop sink pass that check loop header instructions, if the instruction's dest is not used in loop and it's source is not updated after current instruction, then this instruction's result is not used but calculated in loop each itration, sink it out of loop could reduce executions theoretically(register pressure is not considered yet). Number of Instructions sank out of loop when running on P8LE: 1. SPEC2017 int: 371 2. SPEC2017 float: 949 3. bootstrap: 402 4. stage1 libraries: 115 5. regression tests: 4533 Though no obvious performance change for SPEC2017, rtl-sink-2.c could achieve 10% performance improvement by sinking 4 instructions out of loop header. This is a bit like run gimple sink pass twice I pasted several months ago, is implementing this in RTL better? https://gcc.gnu.org/pipermail/gcc-patches/2020-December/562352.html gcc/ChangeLog: * Makefile.in: * dbgcnt.def (DEBUG_COUNTER): * passes.def: * timevar.def (TV_TREE_SINK): (TV_SINK): * tree-pass.h (make_pass_rtl_sink): * sink.c: New file. 
gcc/testsuite/ChangeLog: * gcc.dg/rtl-sink-1.c: New test. * gcc.dg/rtl-sink-2.c: New test. --- gcc/Makefile.in | 1 + gcc/dbgcnt.def | 1 + gcc/passes.def | 1 + gcc/sink.c | 350 ++++++++++++++++++++++++++++++ gcc/testsuite/gcc.dg/rtl-sink-1.c | 19 ++ gcc/testsuite/gcc.dg/rtl-sink-2.c | 200 +++++++++++++++++ gcc/timevar.def | 3 +- gcc/tree-pass.h | 1 + 8 files changed, 575 insertions(+), 1 deletion(-) create mode 100644 gcc/sink.c create mode 100644 gcc/testsuite/gcc.dg/rtl-sink-1.c create mode 100644 gcc/testsuite/gcc.dg/rtl-sink-2.c diff --git a/gcc/Makefile.in b/gcc/Makefile.in index a63c5d9cab6..b7ec7970aac 100644 --- a/gcc/Makefile.in +++ b/gcc/Makefile.in @@ -1334,6 +1334,7 @@ OBJS = \ cppbuiltin.o \ cppdefault.o \ cprop.o \ + sink.o \ cse.o \ cselib.o \ data-streamer.o \ diff --git a/gcc/dbgcnt.def b/gcc/dbgcnt.def index 93e7b4fd30e..c0702650ad3 100644 --- a/gcc/dbgcnt.def +++ b/gcc/dbgcnt.def @@ -197,6 +197,7 @@ DEBUG_COUNTER (sched_region) DEBUG_COUNTER (sel_sched_cnt) DEBUG_COUNTER (sel_sched_insn_cnt) DEBUG_COUNTER (sel_sched_region_cnt) +DEBUG_COUNTER (sink) DEBUG_COUNTER (sms_sched_loop) DEBUG_COUNTER (split_for_sched2) DEBUG_COUNTER (store_merging) diff --git a/gcc/passes.def b/gcc/passes.def index e9ed3c7bc57..820b54155c5 100644 --- a/gcc/passes.def +++ b/gcc/passes.def @@ -436,6 +436,7 @@ along with GCC; see the file COPYING3. If not see /* Perform loop optimizations. It might be better to do them a bit sooner, but we want the profile feedback to work more efficiently. */ + NEXT_PASS (pass_rtl_sink); NEXT_PASS (pass_loop2); PUSH_INSERT_PASSES_WITHIN (pass_loop2) NEXT_PASS (pass_rtl_loop_init); diff --git a/gcc/sink.c b/gcc/sink.c new file mode 100644 index 00000000000..0c33f0e7c8b --- /dev/null +++ b/gcc/sink.c @@ -0,0 +1,350 @@ +/* Code sink for RTL. + Copyright (C) 1997-2020 Free Software Foundation, Inc. + +This file is part of GCC. 
+ +GCC is free software; you can redistribute it and/or modify it under +the terms of the GNU General Public License as published by the Free +Software Foundation; either version 3, or (at your option) any later +version. + +GCC is distributed in the hope that it will be useful, but WITHOUT ANY +WARRANTY; without even the implied warranty of MERCHANTABILITY or +FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License +for more details. + +You should have received a copy of the GNU General Public License +along with GCC; see the file COPYING3. If not see +<http://www.gnu.org/licenses/>. */ + +#include "config.h" +#include "system.h" +#include "coretypes.h" +#include "backend.h" +#include "target.h" +#include "rtl.h" +#include "cfghooks.h" +#include "df.h" +#include "insn-config.h" +#include "memmodel.h" +#include "emit-rtl.h" +#include "recog.h" +#include "diagnostic-core.h" +#include "toplev.h" +#include "cfgrtl.h" +#include "cfganal.h" +#include "lcm.h" +#include "cfgcleanup.h" +#include "cselib.h" +#include "intl.h" +#include "tree-pass.h" +#include "dbgcnt.h" +#include "cfgloop.h" +#include "gcse.h" +#include "loop-unroll.h" + +/* Check whether the instruction could be sunk out of loop by checking dest's + uss and source's def. 
*/ + +static bool +can_sink_reg_in_loop (loop *loop, rtx_insn *insn, rtx dest, rtx reg) +{ + df_ref def, use; + unsigned int src_regno, dest_regno, defs_in_loop_count = 0; + basic_block bb = BLOCK_FOR_INSN (insn); + basic_block use_bb; + + while (GET_CODE (reg) == ZERO_EXTEND || GET_CODE (reg) == SIGN_EXTEND) + reg = XEXP (reg, 0); + + if (GET_CODE (reg) == SUBREG) + reg = SUBREG_REG (reg); + + if (UNARY_P (reg)) + return can_sink_reg_in_loop (loop, insn, dest, XEXP (reg, 0)); + + if (BINARY_P (reg)) + { + rtx src0 = XEXP (reg, 0); + rtx src1 = XEXP (reg, 1); + + if (CONST_INT_P (src1)) + return can_sink_reg_in_loop (loop, insn, dest, src0); + else + return can_sink_reg_in_loop (loop, insn, dest, src0) + && can_sink_reg_in_loop (loop, insn, dest, src1); + } + + if (!REG_P (reg) || HARD_REGISTER_P (reg)) + return false; + + src_regno = REGNO (reg); + dest_regno = REGNO (dest); + rtx_insn *use_insn; + + for (use = DF_REG_USE_CHAIN (dest_regno); use; use = DF_REF_NEXT_REG (use)) + { + if (!DF_REF_INSN_INFO (use)) + continue; + + use_insn = DF_REF_INSN (use); + use_bb = BLOCK_FOR_INSN (use_insn); + + /* Ignore instruction considered for moving. */ + if (use_insn == insn) + return false; + + /* Don't consider uses in loop. */ + if (!use_bb->loop_father + || (NONDEBUG_INSN_P (use_insn) + && flow_bb_inside_loop_p (loop, use_bb))) + return false; + } + + rtx_insn *def_insn; + basic_block def_bb; + /* Check for other defs. Any other def in the loop might reach a use + currently reached by the def in insn. */ + for (def = DF_REG_DEF_CHAIN (dest_regno); def; def = DF_REF_NEXT_REG (def)) + { + def_bb = DF_REF_BB (def); + + /* Defs in exit block cannot reach a use they weren't already. 
*/ + if (single_succ_p (def_bb)) + { + basic_block def_bb_succ; + + def_bb_succ = single_succ (def_bb); + if (!flow_bb_inside_loop_p (loop, def_bb_succ)) + continue; + } + + if (flow_bb_inside_loop_p (loop, def_bb) && ++defs_in_loop_count > 1) + return false; + } + + for (def = DF_REG_DEF_CHAIN (src_regno); def; def = DF_REF_NEXT_REG (def)) + { + def_bb = DF_REF_BB (def); + def_insn = DF_REF_INSN (def); + if (!flow_bb_inside_loop_p (loop, def_bb)) + continue; + + if (def_bb == bb && DF_INSN_LUID (insn) <= DF_INSN_LUID (def_insn)) + return false; + + if (def_bb != bb && def_bb != loop->latch) + return false; + } + + auto_vec<edge> edges = get_loop_exit_edges (loop); + unsigned j; + edge e; + + FOR_EACH_VEC_ELT (edges, j, e) + if (!dominated_by_p (CDI_DOMINATORS, e->src, bb)) + return false; + else + continue; + return true; +} + +/* If the instruction's dest is only used by debug instruction in the loop, it + need also be sunk out of loop to preserve the debug information. */ + +rtx_insn * +sink_dest_in_debug_insn (rtx dest, rtx_insn *insn_sink_to, basic_block prev_bb) +{ + df_ref use; + rtx_insn *use_insn; + rtx pat; + basic_block use_bb; + + unsigned int dest_regno = REGNO (dest); + for (use = DF_REG_USE_CHAIN (dest_regno); use; use = DF_REF_NEXT_REG (use)) + { + if (!DF_REF_INSN_INFO (use)) + continue; + + use_insn = DF_REF_INSN (use); + + if (NONDEBUG_INSN_P (use_insn)) + continue; + + use_bb = BLOCK_FOR_INSN (use_insn); + + if (use_bb != prev_bb) + continue; + + pat = PATTERN (use_insn); + emit_debug_insn_after_noloc (copy_insn (pat), insn_sink_to); + return use_insn; + } + return NULL; +} + +/* If the instructions' dest is not used in loop and the source is not + re-defined after current instructions, sink it from loop header to every loop + exits. 
*/ + +static void +sink_set_reg_in_loop (loop *loop) +{ + unsigned i,j; + basic_block *body = get_loop_body_in_dom_order (loop); + rtx_insn *insn, *insn_sink_to, *dbg_insn; + edge e; + basic_block bb, new_bb; + rtx set, src, dest, pat; + auto_vec<edge> edges = get_loop_exit_edges (loop); + + FOR_EACH_VEC_ELT (edges, j, e) + if (has_abnormal_or_eh_outgoing_edge_p (e->src)) + return; + + for (i = 0; i < loop->num_nodes; i++) + { + bb = body[i]; + if (bb->loop_father == loop && bb_loop_header_p (bb)) + { + FOR_BB_INSNS_REVERSE (bb, insn) + { + if (!NONDEBUG_INSN_P (insn)) + continue; + + if (CALL_P (insn)) + continue; + + set = single_set (insn); + if (!set || side_effects_p (set)) + continue; + + src = SET_SRC (set); + dest = SET_DEST (set); + pat = PATTERN (insn); + + if (!REG_P (dest)) + continue; + + if (!can_sink_reg_in_loop (loop, insn, dest, src)) + continue; + + FOR_EACH_VEC_ELT (edges, j, e) + { + new_bb = e->dest; + if (!single_pred_p (new_bb)) + new_bb = split_edge (e); + + if (dump_file) + { + fprintf (dump_file, "Loop %d: sinking ", loop->num); + print_rtl (dump_file, set); + fprintf (dump_file, " from bb %d to bb %d \n", bb->index, + new_bb->index); + } + + insn_sink_to = BB_HEAD (new_bb); + while (!NOTE_INSN_BASIC_BLOCK_P (insn_sink_to)) + insn_sink_to = NEXT_INSN (insn_sink_to); + emit_insn_after_noloc (copy_rtx_if_shared (pat), insn_sink_to, + new_bb); + + dbg_insn = sink_dest_in_debug_insn (dest, NEXT_INSN (insn_sink_to), bb); + } + +#if 1 + static unsigned long counts = 0; + FILE *f = fopen ("/home3/luoxhu/workspace/gcc-git/gcc-master_build/sink.out", "a"); + fprintf (f, " %ld \n", counts); + counts++; + fclose (f); +#endif + delete_insn (insn); + if (dbg_insn) + { + delete_insn (dbg_insn); + dbg_insn = NULL; + } + } + } + } +} + +/* Sink the set instructins out of the loops. 
*/ + +static unsigned int +sink_insn_in_loop (void) +{ + class loop *loop; + FOR_EACH_LOOP (loop, LI_FROM_INNERMOST) + { + if (loop->num_nodes <= 1000) + sink_set_reg_in_loop (loop); + } + return 0; +} + +static unsigned int +execute_rtl_sink (void) +{ + delete_unreachable_blocks (); + df_set_flags (DF_LR_RUN_DCE); + df_note_add_problem (); + df_analyze (); + + calculate_dominance_info (CDI_POST_DOMINATORS); + calculate_dominance_info (CDI_DOMINATORS); + + loop_optimizer_init (LOOPS_NORMAL | LOOPS_HAVE_RECORDED_EXITS); + + if (dump_file) + fprintf (dump_file, "rtl sink begin:\n"); + + if (number_of_loops (cfun) > 1) + sink_insn_in_loop (); + + if (dump_file) + fprintf (dump_file, "rtl sink end:\n"); + + loop_optimizer_finalize (); + free_dominance_info (CDI_DOMINATORS); + free_dominance_info (CDI_POST_DOMINATORS); + + return 0; +} + +namespace { + +const pass_data pass_data_rtl_sink = { + RTL_PASS, /* type */ + "sink", /* name */ + OPTGROUP_NONE, /* optinfo_flags */ + TV_SINK, /* tv_id */ + PROP_cfglayout, /* properties_required */ + 0, /* properties_provided */ + 0, /* properties_destroyed */ + 0, /* todo_flags_start */ + TODO_df_finish, /* todo_flags_finish */ +}; + +class pass_rtl_sink : public rtl_opt_pass +{ +public: + pass_rtl_sink (gcc::context *ctxt) : rtl_opt_pass (pass_data_rtl_sink, ctxt) + {} + + /* opt_pass methods: */ + opt_pass *clone () { return new pass_rtl_sink (m_ctxt); } + virtual bool gate (function *) { return optimize > 0 && dbg_cnt (sink); } + + virtual unsigned int execute (function *) { return execute_rtl_sink (); } + +}; // class pass_rtl_sink + +} // namespace + +rtl_opt_pass * +make_pass_rtl_sink (gcc::context *ctxt) +{ + return new pass_rtl_sink (ctxt); +} diff --git a/gcc/testsuite/gcc.dg/rtl-sink-1.c b/gcc/testsuite/gcc.dg/rtl-sink-1.c new file mode 100644 index 00000000000..0065c874cef --- /dev/null +++ b/gcc/testsuite/gcc.dg/rtl-sink-1.c @@ -0,0 +1,19 @@ +/* { dg-do compile } */ +/* { dg-additional-options " -O2 -fdump-rtl-sink" } 
*/ + +int +liveloop (int n, int *x, int *y) +{ + int i; + int ret; + + for (i = 0; i < n; ++i) + { + ret = x[i] + 5; + y[i] = ret; + } + return ret; +} + +/* { dg-final { scan-rtl-dump-times "sinking" 1 "sink" } } */ + diff --git a/gcc/testsuite/gcc.dg/rtl-sink-2.c b/gcc/testsuite/gcc.dg/rtl-sink-2.c new file mode 100644 index 00000000000..9888070e8b5 --- /dev/null +++ b/gcc/testsuite/gcc.dg/rtl-sink-2.c @@ -0,0 +1,200 @@ +/* { dg-do compile } */ +/* { dg-additional-options " -O3 -fdump-rtl-sink" } */ + +#include <stdio.h> +#include <stdlib.h> +#include <stdint.h> + +# define HLOG 16 +#define MAX_LIT (1 << 5) +typedef uint8_t *LZF_HSLOT; +typedef LZF_HSLOT LZF_STATE[1 << (HLOG)]; + +int +compute_on_bytes (uint8_t *in_data, int in_len, uint8_t *out_data, int out_len) +{ + LZF_STATE htab; + + uint8_t *ip = in_data; + uint8_t *op = out_data; + uint8_t *in_end = ip + in_len; + uint8_t *out_end = op + out_len; + uint8_t *ref; + + unsigned long off; + unsigned int hval; + int lit; + + if (!in_len || !out_len) + return 0; + + lit = 0; + op++; + hval = (((ip[0]) << 8) | ip[1]); + + while (ip < in_end - 2) + { + uint8_t *hslot; + + hval = (((hval) << 8) | ip[2]); + hslot = (uint8_t *)(htab + (((hval >> (3 * 8 - 16)) - hval * 5) & ((1 << (16)) - 1))); + + ref = *hslot + in_data; + *hslot = ip - in_data; + + if (1 && (off = ip - ref - 1) < (1 << 13) && ref > in_data + && ref[2] == ip[2] + && ((ref[1] << 8) | ref[0]) == ((ip[1] << 8) | ip[0])) + { +#if 1 + unsigned int len = 2; +#else + unsigned long len = 2; +#endif + unsigned int maxlen = in_end - ip - len; + maxlen + = maxlen > ((1 << 8) + (1 << 3)) ? 
((1 << 8) + (1 << 3)) : maxlen; + + if ((op + 3 + 1 >= out_end) != 0) + if (op - !lit + 3 + 1 >= out_end) + return 0; + + op[-lit - 1] = lit - 1; + op -= !lit; + + for (;;) + { + if (maxlen > 16) + { + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + } + do + { + len++; + } + while (len < maxlen && ip[len] == ref[len]); + break; + } + + len -= 2; + ip++; + + if (len < 7) + { + *op++ = (off >> 8) + (len << 5); + } + else + { + *op++ = (off >> 8) + (7 << 5); + *op++ = len - 7; + } + *op++ = off; + lit = 0; + op++; + ip += len + 1; + + if (ip >= in_end - 2) + break; + + --ip; + --ip; + + hval = (((ip[0]) << 8) | ip[1]); + hval = (((hval) << 8) | ip[2]); + htab[(((hval >> (3 * 8 - 16)) - hval * 5) & ((1 << (16)) - 1))] + = (LZF_HSLOT)(ip - in_data); + ip++; + + hval = (((hval) << 8) | ip[2]); + htab[(((hval >> (3 * 8 - 16)) - hval * 5) & ((1 << (16)) - 1))] + = (LZF_HSLOT)(ip - in_data); + ip++; + } + else + { + if (op >= out_end) + return 0; + + lit++; + *op++ = *ip++; + + if (lit == (1 << 5)) + { + op[-lit - 1] = lit - 1; + lit = 0; + op++; + } + } + } + if (op + 3 > out_end) /* at most 3 bytes can be missing here */ + return 0; + + while (ip < in_end) + { + lit++; + *op++ = *ip++; + if (lit == MAX_LIT) + { + op[-lit - 1] = lit - 1; /* stop run */ + lit = 0; + op++; /* start run */ + } + } + + op[-lit - 
1] = lit - 1; /* end run */ + op -= !lit; /* undo run if length is zero */ + + return op - out_data; +} + +/* { dg-final { scan-rtl-dump-times "sinking" 8 "sink" } } */ diff --git a/gcc/timevar.def b/gcc/timevar.def index 63c0b3306de..d6854a299ab 100644 --- a/gcc/timevar.def +++ b/gcc/timevar.def @@ -176,7 +176,7 @@ DEFTIMEVAR (TV_TREE_SPLIT_EDGES , "tree split crit edges") DEFTIMEVAR (TV_TREE_REASSOC , "tree reassociation") DEFTIMEVAR (TV_TREE_PRE , "tree PRE") DEFTIMEVAR (TV_TREE_FRE , "tree FRE") -DEFTIMEVAR (TV_TREE_SINK , "tree code sinking") +DEFTIMEVAR (TV_TREE_SINK , "tree code sinking") DEFTIMEVAR (TV_TREE_PHIOPT , "tree linearize phis") DEFTIMEVAR (TV_TREE_BACKPROP , "tree backward propagate") DEFTIMEVAR (TV_TREE_FORWPROP , "tree forward propagate") @@ -248,6 +248,7 @@ DEFTIMEVAR (TV_LOOP_UNROLL , "loop unrolling") DEFTIMEVAR (TV_LOOP_DOLOOP , "loop doloop") DEFTIMEVAR (TV_LOOP_FINI , "loop fini") DEFTIMEVAR (TV_CPROP , "CPROP") +DEFTIMEVAR (TV_SINK , "SINK") DEFTIMEVAR (TV_PRE , "PRE") DEFTIMEVAR (TV_HOIST , "code hoisting") DEFTIMEVAR (TV_LSM , "LSM") diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h index 15693fee150..fb4a5aaefda 100644 --- a/gcc/tree-pass.h +++ b/gcc/tree-pass.h @@ -548,6 +548,7 @@ extern rtl_opt_pass *make_pass_rtl_dse1 (gcc::context *ctxt); extern rtl_opt_pass *make_pass_rtl_dse2 (gcc::context *ctxt); extern rtl_opt_pass *make_pass_rtl_dse3 (gcc::context *ctxt); extern rtl_opt_pass *make_pass_rtl_cprop (gcc::context *ctxt); +extern rtl_opt_pass *make_pass_rtl_sink (gcc::context *ctxt); extern rtl_opt_pass *make_pass_rtl_pre (gcc::context *ctxt); extern rtl_opt_pass *make_pass_rtl_hoist (gcc::context *ctxt); extern rtl_opt_pass *make_pass_rtl_store_motion (gcc::context *ctxt); -- 2.27.0.90.geebb51ba8c ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC] Run pass_sink_code once more after ivopts/fre 2021-03-23 3:07 ` Xionghu Luo @ 2021-03-23 8:50 ` Richard Biener 2021-03-26 7:35 ` Xionghu Luo 0 siblings, 1 reply; 20+ messages in thread From: Richard Biener @ 2021-03-23 8:50 UTC (permalink / raw) To: Xionghu Luo; +Cc: gcc-patches, segher, dje.gcc, wschmidt, guojiufu, linkw On Tue, 23 Mar 2021, Xionghu Luo wrote: > > > On 2020/12/23 00:53, Richard Biener wrote: > > On December 21, 2020 10:03:43 AM GMT+01:00, Xiong Hu Luo > > <luoxhu@linux.ibm.com> wrote: > >> Here comes another case that requires run a pass once more, as this is > >> not the common suggested direction to solve problems, not quite sure > >> whether it is still a reasonble fix here. Source code is something > >> like: > >> > >> ref = ip + *hslot; > >> while (ip < in_end - 2) { > >> unsigned int len = 2; > >> len++; > >> for () { > >> do len++; > >> while (len < maxlen && ref[len] == ip[len]); //sink code here. > >> break; > >> } > >> len -= 2; > >> ip++; > >> ip += len + 1; > >> if (ip >= in_end - 2) > >> break; > >> } > >> > >> Before ivopts, the gimple for inner while loop is xxx.c.172t.slp1: > >> > >> <bb 31> [local count: 75120046]: > >> # len_160 = PHI <len_161(30), len_189(58)> > >> len_189 = len_160 + 1; > >> _423 = (sizetype) len_189; > >> _424 = ip_229 + _423; > >> if (maxlen_186 > len_189) > >> goto <bb 32>; [94.50%] > >> else > >> goto <bb 33>; [5.50%] > >> > >> <bb 32> [local count: 70988443]: > >> _84 = *_424; > >> _86 = ref_182 + _423; > >> _87 = *_86; > >> if (_84 == _87) > >> goto <bb 58>; [94.50%] > >> else > >> goto <bb 33>; [5.50%] > >> > >> <bb 58> [local count: 67084079]: > >> goto <bb 31>; [100.00%] > >> > >> <bb 33> [local count: 14847855]: > >> # len_263 = PHI <len_160(32), len_160(31)> > >> # _262 = PHI <_423(32), _423(31)> > >> # _264 = PHI <_424(32), _424(31)> > >> len_190 = len_263 + 4294967295; > >> if (len_190 <= 6) > >> goto <bb 34>; [0.00%] > >> else > >> goto <bb 36>; [100.00%] > >> > >> Then in ivopts, 
instructions are updated to xxx.c.174t.ivopts: > >> > >> <bb 31> [local count: 75120046]: > >> # ivtmp.30_29 = PHI <ivtmp.30_32(30), ivtmp.30_31(58)> > >> _34 = (unsigned int) ivtmp.30_29; > >> len_160 = _34 + 4294967295; > >> _423 = ivtmp.30_29; > >> _35 = (unsigned long) ip_229; > >> _420 = ivtmp.30_29 + _35; > >> _419 = (uint8_t *) _420; > >> _424 = _419; > >> len_418 = (unsigned int) ivtmp.30_29; > >> if (maxlen_186 > len_418) > >> goto <bb 32>; [94.50%] > >> else > >> goto <bb 33>; [5.50%] > >> > >> <bb 32> [local count: 70988443]: > >> _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_29 * 1]; > >> ivtmp.30_31 = ivtmp.30_29 + 1; > >> _417 = ref_182 + 18446744073709551615; > >> _87 = MEM[(uint8_t *)_417 + ivtmp.30_31 * 1]; > >> if (_84 == _87) > >> goto <bb 58>; [94.50%] > >> else > >> goto <bb 33>; [5.50%] > >> > >> <bb 58> [local count: 67084079]: > >> goto <bb 31>; [100.00%] > >> > >> <bb 33> [local count: 14847855]: > >> # len_263 = PHI <len_160(32), len_160(31)> > >> # _262 = PHI <_423(32), _423(31)> > >> # _264 = PHI <_424(32), _424(31)> > >> len_190 = len_263 + 4294967295; > >> if (len_190 <= 6) > >> goto <bb 34>; [0.00%] > >> else > >> goto <bb 36>; [100.00%] > >> > >> Some instructions in BB 31 are not used in the loop and could be sinked > >> out of loop to reduce the computation, but they are not sinked > >> throughout all passes later. Run the sink_code pass once more at least > >> after fre5 could improve this typical case performance 23% due to few > >> instructions exausted in loop. > >> xxx.c.209t.sink2: > >> > >> Sinking _419 = (uint8_t *) _420; > >> from bb 31 to bb 89 > >> Sinking _420 = ivtmp.30_29 + _35; > >> from bb 31 to bb 89 > >> Sinking _35 = (unsigned long) ip_229; > >> from bb 31 to bb 89 > >> Sinking len_160 = _34 + 4294967295; > >> from bb 31 to bb 33 > >> > >> I also tested the SPEC2017 performance on P8LE, 544.nab_r is improved > >> by 2.43%, but no big changes to other cases, GEOMEAN is improved quite > >> small with 0.25%. 
> >> > >> The reason why it should be run after fre5 is fre would do some phi > >> optimization to expose the optimization. The patch put it after > >> pass_modref is due to my guess that some gimple optimizations like > >> thread_jumps, dse, dce etc. could provide more opportunities for > >> sinking code. Not sure it is the correct place to put. I also > >> verified this issue exists in both X86 and ARM64. > >> Any comments? Thanks. > > > > It definitely should be before uncprop (but context stops there). And yes, > > re-running passes isn't the very, very best thing to do without explaining > > it cannot be done in other ways. Not for late stage 3 anyway. > > > > Richard. > > > > Thanks. Also tried to implement this in a seperate RTL pass, which > would be better? I guess this would also be stage1 issues... Yes, that's also for stage1 obviously. Can you check how the number of sink opportunities of your RTL pass changes if you add the 2nd GIMPLE sinking pass? I'll note that you are not sinking later stmts first (you only walk insns reverse but not BBs). GIMPLE sinking performs a domwalk over post dominators (but it has an easier job because of PHIs). I guess you'd want to walk starting from loop exit blocks (from innermost loops as you do) in reverse program order. I'll also note that you don't have to check whether stmts you want to sink have their uses set after it - you can emit copies to a new pseudo at the original insn location and use those after the loop (that of course comes at some cost). Also we already have a sinking pass on RTL which even computes a proper PRE on the reverse graph - -fgcse-sm aka store-motion.c. I'm not sure whether this deals with non-stores but the LCM machinery definitely can handle arbitrary expressions. I wonder if it makes more sense to extend this rather than inventing a new ad-hoc sinking pass? Richard. 
> > Xionghu > -- Richard Biener <rguenther@suse.de> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)
* Re: [RFC] Run pass_sink_code once more after ivopts/fre 2021-03-23 8:50 ` Richard Biener @ 2021-03-26 7:35 ` Xionghu Luo 2021-04-14 1:51 ` Xionghu Luo 0 siblings, 1 reply; 20+ messages in thread From: Xionghu Luo @ 2021-03-26 7:35 UTC (permalink / raw) To: Richard Biener; +Cc: gcc-patches, segher, dje.gcc, wschmidt, guojiufu, linkw Hi, sorry for the late response, On 2021/3/23 16:50, Richard Biener wrote: >>> It definitely should be before uncprop (but context stops there). And yes, >>> re-running passes isn't the very, very best thing to do without explaining >>> it cannot be done in other ways. Not for late stage 3 anyway. >>> >>> Richard. >>> >> Thanks. Also tried to implement this in a seperate RTL pass, which >> would be better? I guess this would also be stage1 issues... > Yes, that's also for stage1 obviously. Can you check how the number > of sink opportunities of your RTL pass changes if you add the > 2nd GIMPLE sinking pass? Number of instructions sunk out of loops when running (no sink2 -> sink2): 1. SPEC2017 int: 371 -> 142 2. SPEC2017 float: 949 -> 343 3. bootstrap: 402 -> 229 4. stage1 libraries: 115 -> 68 5. regression tests: 4533 -> 2948 The case I used at the beginning can now be fully optimized by the gimple sink2 pass instead, but many instructions are still sunk by the RTL pass even when gimple sink2 is added; I guess most of them are produced by the expand pass. One example (it was after #38 in 262r.reginfo; note that a new block is created between exit->src and exit->dst so that other bbs jumping into exit->dst don't cause execution errors due to r132:DI being updated unexpectedly). Sometimes an extra extension in a loop like this can cause a serious performance issue: vect-live-4.ltrans0.ltrans.263r.sink: ... Loop 2: sinking (set (reg/v:DI 132 [ <retval> ]) (sign_extend:DI (reg:SI 144))) from bb 7 to bb 11 ...
44: L44: 35: NOTE_INSN_BASIC_BLOCK 7 36: r127:DI=r127:DI+0x4 37: r145:SI=[r127:DI] 38: r144:SI=r145:SI+0x5 REG_DEAD r145:SI 40: r125:DI=r125:DI+0x4 41: [r125:DI]=r144:SI REG_DEAD r144:SI 42: r146:SI=r126:DI#0-0x1 REG_DEAD r126:DI 43: r126:DI=zero_extend(r146:SI) REG_DEAD r146:SI 45: r147:CC=cmp(r126:DI,0) 46: pc={(r147:CC!=0)?L65:pc} REG_DEAD r147:CC REG_BR_PROB 941032164 68: NOTE_INSN_BASIC_BLOCK 11 69: r132:DI=sign_extend(r144:SI) ; pc falls through to BB 8 65: L65: 64: NOTE_INSN_BASIC_BLOCK 9 ; pc falls through to BB 7 47: L47: 48: NOTE_INSN_BASIC_BLOCK 8 > > I'll note that you are not sinking later stmts first (you only > walk insns reverse but not BBs). GIMPLE sinking performs a > domwalk over post dominators (but it has an easier job because > of PHIs). I guess you'd want to walk starting from loop exit > blocks (from innermost loops as you do) in reverse program order. Yes, this rtl-sink pass can only sink instructions from the *loop header* to the loop exit blocks in reverse order; it is a bit strict, since this is a first step to see whether it is reasonable to add such a pass at all. For example, if an instruction is in a loop body block, sinking it out can sometimes cause an execution error, because the pass has no information about whether the body block will actually be executed: if the loop jumps from the header directly to the exit, instructions sunk from the body to the exit would change register values unexpectedly. It seems always_reached and always_executed in loop-invariant.c could be reused here to detect such a situation? > > I'll also note that you don't have to check whether stmts you > want to sink have their uses set after it - you can emit copies > to a new pseudo at the original insn location and use those > after the loop (that of course comes at some cost). Not sure whether I understood your point correctly, but with that approach the instruction would still be executed niter times inside the loop?
What I am trying to do is move r132:DI=sign_extend(r144:SI) out of the loop: if it was executed 100 times in the loop, but r132 is not used inside the loop and r144 is not updated after the current instruction, then moving it to the loop exit lets it execute only once. > > Also we already have a sinking pass on RTL which even computes > a proper PRE on the reverse graph - -fgcse-sm aka store-motion.c. > I'm not sure whether this deals with non-stores but the > LCM machinery definitely can handle arbitrary expressions. I wonder > if it makes more sense to extend this rather than inventing a new > ad-hoc sinking pass? Judging by its name, my pass doesn't handle or process store instructions the way store-motion does. Thanks, will check it. -- Thanks, Xionghu
* Re: [RFC] Run pass_sink_code once more after ivopts/fre 2021-03-26 7:35 ` Xionghu Luo @ 2021-04-14 1:51 ` Xionghu Luo 2021-04-14 6:41 ` Richard Biener 0 siblings, 1 reply; 20+ messages in thread From: Xionghu Luo @ 2021-04-14 1:51 UTC (permalink / raw) To: Richard Biener; +Cc: segher, wschmidt, linkw, gcc-patches, dje.gcc Hi, On 2021/3/26 15:35, Xionghu Luo via Gcc-patches wrote: >> Also we already have a sinking pass on RTL which even computes >> a proper PRE on the reverse graph - -fgcse-sm aka store-motion.c. >> I'm not sure whether this deals with non-stores but the >> LCM machinery definitely can handle arbitrary expressions. I wonder >> if it makes more sense to extend this rather than inventing a new >> ad-hoc sinking pass? > From the literal, my pass doesn't handle or process store instructions > like store-motion.. Thanks, will check it. Store motion only processes store instructions, through data flow equations: it generates 4 inputs (st_kill, st_avloc, st_antloc, st_transp) and solves them globally via the Lazy Code Motion API (5 DF compute calls) with 2 outputs (st_delete_map, st_insert_map); each store place is independently represented in the input bitmap vectors. The output says which stores should be deleted and where to insert; the current code does what you said ("emit copies to a new pseudo at the original insn location and use it in the following bb"), so it is actually "store replacement" instead of "store move". Why not save one pseudo by moving the store instruction to the target edge directly? There are many differences between the newly added rtl-sink pass and the store-motion pass. 1. Store motion moves only store instructions; rtl-sink ignores store instructions. 2. Store motion solves a global DF problem; rtl-sink only processes the loop header in reverse, with a dependency check inside the loop. Take the RTL below as an example: "#538,#235,#234,#233" will all be sunk from bb 35 to bb 37 by rtl-sink, but it moves #538 first, then #235; there is a strong dependency here.
It seems unlike the LCM framework, which could solve all of them and do the delete-insert in one iteration. However, there are still some common methods that could be shared, like the def-use check (though store-motion is per bb, rtl-sink is per loop), insert_store, commit_edge_insertions, etc. 508: L508: 507: NOTE_INSN_BASIC_BLOCK 34 12: r139:DI=r140:DI REG_DEAD r140:DI 240: L240: 231: NOTE_INSN_BASIC_BLOCK 35 232: r142:DI=zero_extend(r139:DI#0) 233: r371:SI=r142:DI#0-0x1 234: r243:DI=zero_extend(r371:SI) REG_DEAD r371:SI 235: r452:DI=r262:DI+r139:DI 538: r194:DI=r452:DI 236: r372:CCUNS=cmp(r142:DI#0,r254:DI#0) 237: pc={(geu(r372:CCUNS,0))?L246:pc} REG_DEAD r372:CCUNS REG_BR_PROB 59055804 238: NOTE_INSN_BASIC_BLOCK 36 239: r140:DI=r139:DI+0x1 241: r373:DI=r251:DI-0x1 242: r374:SI=zero_extend([r262:DI+r139:DI]) REG_DEAD r139:DI 243: r375:SI=zero_extend([r373:DI+r140:DI]) REG_DEAD r373:DI 244: r376:CC=cmp(r374:SI,r375:SI) REG_DEAD r375:SI REG_DEAD r374:SI 245: pc={(r376:CC==0)?L508:pc} REG_DEAD r376:CC REG_BR_PROB 1014686028 246: L246: 247: NOTE_INSN_BASIC_BLOCK 37 248: r377:SI=r142:DI#0-0x2 REG_DEAD r142:DI 249: r256:DI=zero_extend(r377:SI) -- Thanks, Xionghu
* Re: [RFC] Run pass_sink_code once more after ivopts/fre 2021-04-14 1:51 ` Xionghu Luo @ 2021-04-14 6:41 ` Richard Biener 2021-04-15 6:20 ` Xionghu Luo 2021-04-24 3:44 ` [RFC] Run pass_sink_code once more after ivopts/fre Jeff Law 0 siblings, 2 replies; 20+ messages in thread From: Richard Biener @ 2021-04-14 6:41 UTC (permalink / raw) To: Xionghu Luo; +Cc: segher, wschmidt, linkw, gcc-patches, dje.gcc On Wed, 14 Apr 2021, Xionghu Luo wrote: > Hi, > > On 2021/3/26 15:35, Xionghu Luo via Gcc-patches wrote: > >> Also we already have a sinking pass on RTL which even computes > >> a proper PRE on the reverse graph - -fgcse-sm aka store-motion.c. > >> I'm not sure whether this deals with non-stores but the > >> LCM machinery definitely can handle arbitrary expressions. I wonder > >> if it makes more sense to extend this rather than inventing a new > >> ad-hoc sinking pass? > > From the literal, my pass doesn't handle or process store instructions > > like store-motion.. Thanks, will check it. > > Store motion only processes store instructions with data flow equations, > generating 4 inputs(st_kill, st_avloc, st_antloc, st_transp) and solve it > by Lazy Code Motion API(5 DF compute call) with 2 outputs (st_delete_map, > st_insert_map) globally, each store place is independently represented in > the input bitmap vectors. Output is which should be delete and where to > insert, current code does what you said "emit copies to a new pseudo at > the original insn location and use it in followed bb", actually it is > "store replacement" instead of "store move", why not save one pseudo by > moving the store instruction to target edge directly? It probably simply saves the pass from doing analysis whether the stored value is clobbered on the sinking path, enabling more store sinking. For stores that might be even beneficial, for non-stores it becomes more of a cost issue, yes. > There are many differences between the newly added rtl-sink pass and > store-motion pass. > 1. 
Store motion moves only store instructions, rtl-sink ignores store > instructions. > 2. Store motion is a global DF problem solving, rtl-sink only processes > loop header reversely with dependency check in loop, take the below RTL > as example, > "#538,#235,#234,#233" will all be sunk from bb 35 to bb 37 by rtl-sink, > but it moves #538 first, then #235, there is strong dependency here. It > seemsdoesn't like the LCM framework that could solve all and do the > delete-insert in one iteration. So my question was whether we want to do both within the LCM store sinking framework. The LCM dataflow is also used by RTL PRE which handles both loads and non-loads so in principle it should be able to handle stores and non-stores for the sinking case (PRE on the reverse CFG). A global dataflow is more powerful than any local ad-hoc method. Richard. > However, there are still some common methods could be shared, like the > def-use check(though store-motion is per bb, rtl-sink is per loop), > insert_store, commit_edge_insertions etc. 
> > > 508: L508: > 507: NOTE_INSN_BASIC_BLOCK 34 > 12: r139:DI=r140:DI > REG_DEAD r140:DI > 240: L240: > 231: NOTE_INSN_BASIC_BLOCK 35 > 232: r142:DI=zero_extend(r139:DI#0) > 233: r371:SI=r142:DI#0-0x1 > 234: r243:DI=zero_extend(r371:SI) > REG_DEAD r371:SI > 235: r452:DI=r262:DI+r139:DI > 538: r194:DI=r452:DI > 236: r372:CCUNS=cmp(r142:DI#0,r254:DI#0) > 237: pc={(geu(r372:CCUNS,0))?L246:pc} > REG_DEAD r372:CCUNS > REG_BR_PROB 59055804 > 238: NOTE_INSN_BASIC_BLOCK 36 > 239: r140:DI=r139:DI+0x1 > 241: r373:DI=r251:DI-0x1 > 242: r374:SI=zero_extend([r262:DI+r139:DI]) > REG_DEAD r139:DI > 243: r375:SI=zero_extend([r373:DI+r140:DI]) > REG_DEAD r373:DI > 244: r376:CC=cmp(r374:SI,r375:SI) > REG_DEAD r375:SI > REG_DEAD r374:SI > 245: pc={(r376:CC==0)?L508:pc} > REG_DEAD r376:CC > REG_BR_PROB 1014686028 > 246: L246: > 247: NOTE_INSN_BASIC_BLOCK 37 > 248: r377:SI=r142:DI#0-0x2 > REG_DEAD r142:DI > 249: r256:DI=zero_extend(r377:SI) > > > -- Richard Biener <rguenther@suse.de> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg) ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC] Run pass_sink_code once more after ivopts/fre 2021-04-14 6:41 ` Richard Biener @ 2021-04-15 6:20 ` Xionghu Luo 2021-04-15 11:34 ` Richard Biener 0 siblings, 1 reply; 20+ messages in thread From: Xionghu Luo @ 2021-04-15 6:20 UTC (permalink / raw) To: Richard Biener; +Cc: segher, wschmidt, linkw, gcc-patches, dje.gcc Thanks, On 2021/4/14 14:41, Richard Biener wrote: >> "#538,#235,#234,#233" will all be sunk from bb 35 to bb 37 by rtl-sink, >> but it moves #538 first, then #235, there is strong dependency here. It >> seemsdoesn't like the LCM framework that could solve all and do the >> delete-insert in one iteration. > So my question was whether we want to do both within the LCM store > sinking framework. The LCM dataflow is also used by RTL PRE which > handles both loads and non-loads so in principle it should be able > to handle stores and non-stores for the sinking case (PRE on the > reverse CFG). > > A global dataflow is more powerful than any local ad-hoc method. My biggest concern is whether the LCM DF framework can support sinking *multiple* reverse-dependent non-store instructions together in *one* run of the LCM DF. If it cannot, we would need to run LCM repeatedly until nothing changes, which would obviously be time consuming (unless compile time is not important here).
>> >> >> 508: L508: >> 507: NOTE_INSN_BASIC_BLOCK 34 >> 12: r139:DI=r140:DI >> REG_DEAD r140:DI >> 240: L240: >> 231: NOTE_INSN_BASIC_BLOCK 35 >> 232: r142:DI=zero_extend(r139:DI#0) >> 233: r371:SI=r142:DI#0-0x1 >> 234: r243:DI=zero_extend(r371:SI) >> REG_DEAD r371:SI >> 235: r452:DI=r262:DI+r139:DI >> 538: r194:DI=r452:DI >> 236: r372:CCUNS=cmp(r142:DI#0,r254:DI#0) Like here: each instruction's dest reg is computed into the input vector bitmap. After solving the equations by calling pre_edge_rev_lcm, #538 is moved out of the loop by the first call, then #235 by a second call... 4 repeated calls would be needed in total here; is the LCM framework smart enough to move all 4 instructions within one iteration? I am worried that the input vector bitmap couldn't solve the dependency problem for two back-chained instructions. -- Thanks, Xionghu
* Re: [RFC] Run pass_sink_code once more after ivopts/fre 2021-04-15 6:20 ` Xionghu Luo @ 2021-04-15 11:34 ` Richard Biener 2021-04-20 9:23 ` Xionghu Luo 0 siblings, 1 reply; 20+ messages in thread From: Richard Biener @ 2021-04-15 11:34 UTC (permalink / raw) To: Xionghu Luo; +Cc: segher, wschmidt, linkw, gcc-patches, dje.gcc On Thu, 15 Apr 2021, Xionghu Luo wrote: > Thanks, > > On 2021/4/14 14:41, Richard Biener wrote: > >> "#538,#235,#234,#233" will all be sunk from bb 35 to bb 37 by rtl-sink, > >> but it moves #538 first, then #235, there is strong dependency here. It > >> seemsdoesn't like the LCM framework that could solve all and do the > >> delete-insert in one iteration. > > So my question was whether we want to do both within the LCM store > > sinking framework. The LCM dataflow is also used by RTL PRE which > > handles both loads and non-loads so in principle it should be able > > to handle stores and non-stores for the sinking case (PRE on the > > reverse CFG). > > > > A global dataflow is more powerful than any local ad-hoc method. > > My biggest concern is whether the LCM DF framework could support sinking > *multiple* reverse-dependent non-store instructions together by *one* > calling of LCM DF. If this is not supported, we need run multiple LCM > until no new changes, it would be time consuming obviously (unless > compiling time is not important here). As said it is used for PRE and there it most definitely can do that. > > > > > Richard. > > > >> However, there are still some common methods could be shared, like the > >> def-use check(though store-motion is per bb, rtl-sink is per loop), > >> insert_store, commit_edge_insertions etc. 
> >> > >> > >> 508: L508: > >> 507: NOTE_INSN_BASIC_BLOCK 34 > >> 12: r139:DI=r140:DI > >> REG_DEAD r140:DI > >> 240: L240: > >> 231: NOTE_INSN_BASIC_BLOCK 35 > >> 232: r142:DI=zero_extend(r139:DI#0) > >> 233: r371:SI=r142:DI#0-0x1 > >> 234: r243:DI=zero_extend(r371:SI) > >> REG_DEAD r371:SI > >> 235: r452:DI=r262:DI+r139:DI > >> 538: r194:DI=r452:DI > >> 236: r372:CCUNS=cmp(r142:DI#0,r254:DI#0) > > > Like here, Each instruction's dest reg is calculated in the input vector > bitmap, after solving the equations by calling pre_edge_rev_lcm, > move #538 out of loop for the first call, then move #235 out of loop > after a second call... 4 repeat calls needed in total here, is the LCM > framework smart enough to move the all 4 instruction within one iteration? > I am worried that the input vector bitmap couldn't solve the dependency > problem for two back chained instructions. > > > -- Richard Biener <rguenther@suse.de> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg) ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC] Run pass_sink_code once more after ivopts/fre 2021-04-15 11:34 ` Richard Biener @ 2021-04-20 9:23 ` Xionghu Luo 2021-04-21 11:54 ` Richard Biener 0 siblings, 1 reply; 20+ messages in thread From: Xionghu Luo @ 2021-04-20 9:23 UTC (permalink / raw) To: Richard Biener; +Cc: segher, wschmidt, linkw, gcc-patches, dje.gcc [-- Attachment #1: Type: text/plain, Size: 5192 bytes --] On 2021/4/15 19:34, Richard Biener wrote: > On Thu, 15 Apr 2021, Xionghu Luo wrote: > >> Thanks, >> >> On 2021/4/14 14:41, Richard Biener wrote: >>>> "#538,#235,#234,#233" will all be sunk from bb 35 to bb 37 by rtl-sink, >>>> but it moves #538 first, then #235, there is strong dependency here. It >>>> seemsdoesn't like the LCM framework that could solve all and do the >>>> delete-insert in one iteration. >>> So my question was whether we want to do both within the LCM store >>> sinking framework. The LCM dataflow is also used by RTL PRE which >>> handles both loads and non-loads so in principle it should be able >>> to handle stores and non-stores for the sinking case (PRE on the >>> reverse CFG). >>> >>> A global dataflow is more powerful than any local ad-hoc method. >> >> My biggest concern is whether the LCM DF framework could support sinking >> *multiple* reverse-dependent non-store instructions together by *one* >> calling of LCM DF. If this is not supported, we need run multiple LCM >> until no new changes, it would be time consuming obviously (unless >> compiling time is not important here). > > As said it is used for PRE and there it most definitely can do that. I did some investigation of PRE and attached a case to show how it works; it is quite like store-motion, and there is also an rtl-hoist pass in gcse.c, which only runs when optimizing for code size. All of them leverage the LCM framework to move instructions upward or downward.
PRE and rtl-hoist move instructions upward: they analyze/hash the SOURCE exprs and call pre_edge_lcm. Store-motion and rtl-sink move instructions downward, so they analyze/hash the DEST exprs and call pre_edge_rev_lcm. All four problems are converted to the LCM DF problem, with four n_basic_blocks * m_exprs matrices (antic, transp, avail, kill) as input and two outputs saying where to insert/delete. PRE scans each instruction and hashes the SRC into the table without *checking the relationship between instructions*: in the attached case, BB 37, BB 38 and BB 41 all contain the SOURCE expr "r262:DI+r139:DI", but BB 37 and BB 41 save it at index 106 while BB 38 saves it at index 110. After finishing this pass, "r262:DI+r139:DI" in BB 41 is replaced with "r194:DI=r452:DI", then exprs are inserted into BB 75~BB 80 to create full redundancies from partial redundancies, and finally the instruction in BB 37 is updated. Issues witnessed in the PRE run: 1) "r262:DI+r139:DI" in BLOCK 38 is not replaced; 2) PRE also uses a pseudo register to store the result, like store-motion and even rtl-hoist. Actually we need a real "move" instead of a "replace" for rtl-sink to improve performance, though that brings a potential register pressure issue like rtl-hoist? 3) The SRC expression is scanned in hash_scan_set WITHOUT checking back chains across instructions, so it couldn't handle the case below even though a+c is the same as b+d. So I am afraid a single run couldn't solve the instruction dependency issue and sink multiple chained instructions out, as rtl-sink did before.
BB1: a = b; c = d; s = a + c; BB2: s = b + d; gcse.c: changed = one_pre_gcse_pass () alloc_hash_table (&expr_hash_table); compute_hash_table (&expr_hash_table); compute_hash_table_work (table); FOR_EACH_BB_FN (current_bb, cfun) FOR_BB_INSNS (current_bb, insn) hash_scan_insn (insn, table); hash_scan_set (pat, insn, table); if REG_P (dest) insert_expr_in_table (src, GET_MODE (dest), insn, antic_p, avail_p, max_distance, table); hash = hash_expr (x, mode, &do_not_record_p, table->size); dump_hash_table (dump_file, "Expression", &expr_hash_table); edge_list = compute_pre_data (); compute_local_properties (transp, comp, antloc, &expr_hash_table); edge_list = pre_edge_lcm (expr_hash_table.n_elems, transp, comp, antloc, ae_kill, &pre_insert_map, &pre_delete_map); changed |= pre_gcse (edge_list); changed = pre_delete (); /* Create a pseudo-reg to store the result of reaching expressions into. */ did_insert = pre_edge_insert (edge_list, index_map); > >> >>> >>> Richard. >>> >>>> However, there are still some common methods could be shared, like the >>>> def-use check(though store-motion is per bb, rtl-sink is per loop), >>>> insert_store, commit_edge_insertions etc. >>>> >>>> >>>> 508: L508: >>>> 507: NOTE_INSN_BASIC_BLOCK 34 >>>> 12: r139:DI=r140:DI >>>> REG_DEAD r140:DI >>>> 240: L240: >>>> 231: NOTE_INSN_BASIC_BLOCK 35 >>>> 232: r142:DI=zero_extend(r139:DI#0) >>>> 233: r371:SI=r142:DI#0-0x1 >>>> 234: r243:DI=zero_extend(r371:SI) >>>> REG_DEAD r371:SI >>>> 235: r452:DI=r262:DI+r139:DI >>>> 538: r194:DI=r452:DI >>>> 236: r372:CCUNS=cmp(r142:DI#0,r254:DI#0) >> >> >> Like here, Each instruction's dest reg is calculated in the input vector >> bitmap, after solving the equations by calling pre_edge_rev_lcm, >> move #538 out of loop for the first call, then move #235 out of loop >> after a second call... 4 repeat calls needed in total here, is the LCM >> framework smart enough to move the all 4 instruction within one iteration? 
>> I am worried that the input vector bitmap couldn't solve the dependency >> problem for two back chained instructions. >> >> >> > -- Thanks, Xionghu [-- Attachment #2: Screen Shot 2021-04-20 at 17.20.47.png --] [-- Type: image/png, Size: 244493 bytes --] [-- Attachment #3: test_gcse.c.255r.pre --] [-- Type: text/plain, Size: 48715 bytes --] ;; Function compute_on_bytes (compute_on_bytes, funcdef_no=24, decl_uid=4337, cgraph_uid=25, symbol_order=24) starting the processing of deferred insns ending the processing of deferred insns df_analyze called df_worklist_dataflow_doublequeue: n_basic_blocks 79 n_edges 117 count 89 ( 1.1) df_worklist_dataflow_doublequeue: n_basic_blocks 79 n_edges 117 count 97 ( 1.2) Expression hash table (131 buckets, 189 entries) Index 0 (hash value 104; max distance 0) (compare:CC (reg/v:DI 274 [ in_len ]) (const_int 0 [0])) Index 1 (hash value 106; max distance 0) (compare:CC (reg/v:DI 276 [ out_len ]) (const_int 0 [0])) Index 2 (hash value 97; max distance 0) (plus:DI (reg/v/f:DI 273 [ in_data ]) (reg/v:DI 274 [ in_len ])) Index 3 (hash value 101; max distance 0) (plus:DI (reg/v/f:DI 275 [ out_data ]) (reg/v:DI 276 [ out_len ])) Index 4 (hash value 115; max distance 0) (plus:DI (reg/v/f:DI 275 [ out_data ]) (const_int 1 [0x1])) Index 5 (hash value 59; max distance 0) (bswap:HI (mem:HI (reg/v/f:DI 273 [ in_data ]) [0 MEM <short unsigned int> [(uint8_t *)in_data_169(D)]+0 S2 A8])) Index 6 (hash value 4; max distance 0) (zero_extend:SI (reg:HI 279)) Index 7 (hash value 6; max distance 0) (zero_extend:DI (reg:SI 280)) Index 8 (hash value 85; max distance 0) (plus:DI (reg/v/f:DI 248 [ in_end ]) (const_int -2 [0xfffffffffffffffe])) Index 9 (hash value 79; max distance 0) (compare:CCUNS (reg/v/f:DI 273 [ in_data ]) (reg/f:DI 264 [ _254 ])) Index 10 (hash value 85; max distance 0) (ashift:SI (subreg/s/v:SI (reg/v:DI 229 [ hval ]) 0) (const_int 8 [0x8])) Index 11 (hash value 3; max distance 0) (zero_extend:DI (mem:QI (plus:DI (reg/v/f:DI 262 [ ip 
]) (const_int 2 [0x2])) [0 MEM[(uint8_t *)ip_229 + 2B]+0 S1 A8])) Index 12 (hash value 92; max distance 0) (ior:SI (subreg:SI (reg:DI 123 [ _10 ]) 0) (reg:SI 282)) Index 13 (hash value 9; max distance 0) (zero_extend:DI (reg:SI 283)) Index 14 (hash value 17; max distance 0) (lshiftrt:SI (reg:SI 283) (const_int 8 [0x8])) Index 15 (hash value 8; max distance 0) (ashift:SI (reg:SI 283) (const_int 2 [0x2])) Index 16 (hash value 129; max distance 0) (mult:SI (reg:SI 283) (const_int 5 [0x5])) Index 17 (hash value 122; max distance 0) (minus:SI (reg:SI 284) (reg:SI 288)) Index 18 (hash value 43; max distance 0) (and:SI (reg:SI 289) (const_int 65535 [0xffff])) Index 19 (hash value 16; max distance 0) (zero_extend:DI (reg:SI 290)) Index 20 (hash value 18; max distance 0) (ashift:DI (reg:DI 291) (const_int 3 [0x3])) Index 21 (hash value 112; max distance 0) (plus:DI (reg/f:DI 110 sfp) (const_int 32 [0x20])) Index 22 (hash value 29; max distance 0) (plus:DI (reg/f:DI 449) (reg:DI 292)) Index 23 (hash value 24; max distance 0) (zero_extend:DI (mem:QI (reg/v/f:DI 250 [ hslot ]) [0 *hslot_181+0 S1 A64])) Index 24 (hash value 116; max distance 0) (plus:DI (reg/v/f:DI 273 [ in_data ]) (reg:DI 293 [ *hslot_181 ])) Index 25 (hash value 86; max distance 0) (minus:DI (reg/v/f:DI 262 [ ip ]) (reg/v/f:DI 273 [ in_data ])) Index 26 (hash value 64; max distance 0) (minus:DI (reg/v/f:DI 262 [ ip ]) (reg/v/f:DI 251 [ ref ])) Index 27 (hash value 2; max distance 0) (plus:DI (reg:DI 295) (const_int -1 [0xffffffffffffffff])) Index 28 (hash value 35; max distance 0) (compare:CCUNS (reg:DI 135 [ _22 ]) (const_int 8191 [0x1fff])) Index 29 (hash value 66; max distance 0) (compare:CCUNS (reg/v/f:DI 273 [ in_data ]) (reg/v/f:DI 251 [ ref ])) Index 30 (hash value 122; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 251 [ ref ]) (const_int 2 [0x2])) [0 MEM[(uint8_t *)ref_182 + 2B]+0 S1 A8])) Index 31 (hash value 87; max distance 0) (compare:CC (reg:SI 298 [ MEM[(uint8_t *)ref_182 + 2B] ]) 
(subreg:SI (reg:DI 123 [ _10 ]) 0)) Index 32 (hash value 24; max distance 0) (zero_extend:SI (mem:HI (reg/v/f:DI 251 [ ref ]) [0 MEM <short unsigned int> [(uint8_t *)ref_182]+0 S2 A8])) Index 33 (hash value 35; max distance 0) (zero_extend:SI (mem:HI (reg/v/f:DI 262 [ ip ]) [0 MEM <short unsigned int> [(uint8_t *)ip_229]+0 S2 A8])) Index 34 (hash value 11; max distance 0) (compare:CC (reg:SI 300 [ MEM <short unsigned int> [(uint8_t *)ref_182] ]) (reg:SI 301 [ MEM <short unsigned int> [(uint8_t *)ip_229] ])) Index 35 (hash value 61; max distance 0) (minus:DI (reg/v/f:DI 248 [ in_end ]) (reg/v/f:DI 262 [ ip ])) Index 36 (hash value 2; max distance 0) (plus:SI (subreg:SI (reg:DI 303) 0) (const_int -2 [0xfffffffffffffffe])) Index 37 (hash value 69; max distance 0) (plus:DI (reg/v/f:DI 226 [ op ]) (const_int 4 [0x4])) Index 38 (hash value 96; max distance 0) (compare:CCUNS (reg/v/f:DI 249 [ out_end ]) (reg/f:DI 305)) Index 39 (hash value 31; max distance 0) (compare:CC (reg/v:DI 201 [ lit ]) (const_int 0 [0])) Index 40 (hash value 68; max distance 0) (plus:DI (reg/v/f:DI 226 [ op ]) (const_int 3 [0x3])) Index 41 (hash value 99; max distance 0) (compare:CCUNS (reg/v/f:DI 249 [ out_end ]) (reg/f:DI 308)) Index 42 (hash value 3; max distance 0) (neg:SI (subreg/s/u:SI (reg/v:DI 201 [ lit ]) 0)) Index 43 (hash value 35; max distance 0) (sign_extend:DI (reg:SI 310)) Index 44 (hash value 87; max distance 0) (plus:DI (reg/v/f:DI 226 [ op ]) (reg:DI 311)) Index 45 (hash value 32; max distance 0) (plus:SI (subreg:SI (reg/v:DI 201 [ lit ]) 0) (const_int -1 [0xffffffffffffffff])) Index 46 (hash value 65; max distance 0) (eq:SI (subreg/s/u:SI (reg/v:DI 201 [ lit ]) 0) (const_int 0 [0])) Index 47 (hash value 42; max distance 0) (zero_extend:DI (reg:SI 316)) Index 48 (hash value 92; max distance 0) (minus:DI (reg/v/f:DI 226 [ op ]) (reg:DI 315)) Index 49 (hash value 20; max distance 0) (compare:CCUNS (reg:SI 304) (const_int 16 [0x10])) Index 50 (hash value 123; max distance 0) 
(zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 251 [ ref ]) (const_int 3 [0x3])) [0 MEM[(uint8_t *)ref_182 + 3B]+0 S1 A8])) Index 51 (hash value 3; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 262 [ ip ]) (const_int 3 [0x3])) [0 MEM[(uint8_t *)ip_229 + 3B]+0 S1 A8])) Index 52 (hash value 47; max distance 0) (compare:CC (reg:SI 318 [ MEM[(uint8_t *)ref_182 + 3B] ]) (reg:SI 319 [ MEM[(uint8_t *)ip_229 + 3B] ])) Index 53 (hash value 124; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 251 [ ref ]) (const_int 4 [0x4])) [0 MEM[(uint8_t *)ref_182 + 4B]+0 S1 A8])) Index 54 (hash value 4; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 262 [ ip ]) (const_int 4 [0x4])) [0 MEM[(uint8_t *)ip_229 + 4B]+0 S1 A8])) Index 55 (hash value 53; max distance 0) (compare:CC (reg:SI 321 [ MEM[(uint8_t *)ref_182 + 4B] ]) (reg:SI 322 [ MEM[(uint8_t *)ip_229 + 4B] ])) Index 56 (hash value 125; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 251 [ ref ]) (const_int 5 [0x5])) [0 MEM[(uint8_t *)ref_182 + 5B]+0 S1 A8])) Index 57 (hash value 5; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 262 [ ip ]) (const_int 5 [0x5])) [0 MEM[(uint8_t *)ip_229 + 5B]+0 S1 A8])) Index 58 (hash value 59; max distance 0) (compare:CC (reg:SI 324 [ MEM[(uint8_t *)ref_182 + 5B] ]) (reg:SI 325 [ MEM[(uint8_t *)ip_229 + 5B] ])) Index 59 (hash value 126; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 251 [ ref ]) (const_int 6 [0x6])) [0 MEM[(uint8_t *)ref_182 + 6B]+0 S1 A8])) Index 60 (hash value 6; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 262 [ ip ]) (const_int 6 [0x6])) [0 MEM[(uint8_t *)ip_229 + 6B]+0 S1 A8])) Index 61 (hash value 65; max distance 0) (compare:CC (reg:SI 327 [ MEM[(uint8_t *)ref_182 + 6B] ]) (reg:SI 328 [ MEM[(uint8_t *)ip_229 + 6B] ])) Index 62 (hash value 127; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 251 [ ref ]) (const_int 7 [0x7])) [0 MEM[(uint8_t *)ref_182 + 7B]+0 S1 A8])) 
Index 63 (hash value 7; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 262 [ ip ]) (const_int 7 [0x7])) [0 MEM[(uint8_t *)ip_229 + 7B]+0 S1 A8])) Index 64 (hash value 71; max distance 0) (compare:CC (reg:SI 330 [ MEM[(uint8_t *)ref_182 + 7B] ]) (reg:SI 331 [ MEM[(uint8_t *)ip_229 + 7B] ])) Index 65 (hash value 128; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 251 [ ref ]) (const_int 8 [0x8])) [0 MEM[(uint8_t *)ref_182 + 8B]+0 S1 A8])) Index 66 (hash value 8; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 262 [ ip ]) (const_int 8 [0x8])) [0 MEM[(uint8_t *)ip_229 + 8B]+0 S1 A8])) Index 67 (hash value 77; max distance 0) (compare:CC (reg:SI 333 [ MEM[(uint8_t *)ref_182 + 8B] ]) (reg:SI 334 [ MEM[(uint8_t *)ip_229 + 8B] ])) Index 68 (hash value 129; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 251 [ ref ]) (const_int 9 [0x9])) [0 MEM[(uint8_t *)ref_182 + 9B]+0 S1 A8])) Index 69 (hash value 9; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 262 [ ip ]) (const_int 9 [0x9])) [0 MEM[(uint8_t *)ip_229 + 9B]+0 S1 A8])) Index 70 (hash value 83; max distance 0) (compare:CC (reg:SI 336 [ MEM[(uint8_t *)ref_182 + 9B] ]) (reg:SI 337 [ MEM[(uint8_t *)ip_229 + 9B] ])) Index 71 (hash value 130; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 251 [ ref ]) (const_int 10 [0xa])) [0 MEM[(uint8_t *)ref_182 + 10B]+0 S1 A8])) Index 72 (hash value 10; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 262 [ ip ]) (const_int 10 [0xa])) [0 MEM[(uint8_t *)ip_229 + 10B]+0 S1 A8])) Index 73 (hash value 89; max distance 0) (compare:CC (reg:SI 339 [ MEM[(uint8_t *)ref_182 + 10B] ]) (reg:SI 340 [ MEM[(uint8_t *)ip_229 + 10B] ])) Index 74 (hash value 0; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 251 [ ref ]) (const_int 11 [0xb])) [0 MEM[(uint8_t *)ref_182 + 11B]+0 S1 A8])) Index 75 (hash value 11; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 262 [ ip ]) (const_int 11 
[0xb])) [0 MEM[(uint8_t *)ip_229 + 11B]+0 S1 A8])) Index 76 (hash value 95; max distance 0) (compare:CC (reg:SI 342 [ MEM[(uint8_t *)ref_182 + 11B] ]) (reg:SI 343 [ MEM[(uint8_t *)ip_229 + 11B] ])) Index 77 (hash value 1; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 251 [ ref ]) (const_int 12 [0xc])) [0 MEM[(uint8_t *)ref_182 + 12B]+0 S1 A8])) Index 78 (hash value 12; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 262 [ ip ]) (const_int 12 [0xc])) [0 MEM[(uint8_t *)ip_229 + 12B]+0 S1 A8])) Index 79 (hash value 101; max distance 0) (compare:CC (reg:SI 345 [ MEM[(uint8_t *)ref_182 + 12B] ]) (reg:SI 346 [ MEM[(uint8_t *)ip_229 + 12B] ])) Index 80 (hash value 2; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 251 [ ref ]) (const_int 13 [0xd])) [0 MEM[(uint8_t *)ref_182 + 13B]+0 S1 A8])) Index 81 (hash value 13; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 262 [ ip ]) (const_int 13 [0xd])) [0 MEM[(uint8_t *)ip_229 + 13B]+0 S1 A8])) Index 82 (hash value 107; max distance 0) (compare:CC (reg:SI 348 [ MEM[(uint8_t *)ref_182 + 13B] ]) (reg:SI 349 [ MEM[(uint8_t *)ip_229 + 13B] ])) Index 83 (hash value 3; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 251 [ ref ]) (const_int 14 [0xe])) [0 MEM[(uint8_t *)ref_182 + 14B]+0 S1 A8])) Index 84 (hash value 14; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 262 [ ip ]) (const_int 14 [0xe])) [0 MEM[(uint8_t *)ip_229 + 14B]+0 S1 A8])) Index 85 (hash value 113; max distance 0) (compare:CC (reg:SI 351 [ MEM[(uint8_t *)ref_182 + 14B] ]) (reg:SI 352 [ MEM[(uint8_t *)ip_229 + 14B] ])) Index 86 (hash value 4; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 251 [ ref ]) (const_int 15 [0xf])) [0 MEM[(uint8_t *)ref_182 + 15B]+0 S1 A8])) Index 87 (hash value 15; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 262 [ ip ]) (const_int 15 [0xf])) [0 MEM[(uint8_t *)ip_229 + 15B]+0 S1 A8])) Index 88 (hash value 119; max distance 0) 
(compare:CC (reg:SI 354 [ MEM[(uint8_t *)ref_182 + 15B] ]) (reg:SI 355 [ MEM[(uint8_t *)ip_229 + 15B] ])) Index 89 (hash value 5; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 251 [ ref ]) (const_int 16 [0x10])) [0 MEM[(uint8_t *)ref_182 + 16B]+0 S1 A8])) Index 90 (hash value 16; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 262 [ ip ]) (const_int 16 [0x10])) [0 MEM[(uint8_t *)ip_229 + 16B]+0 S1 A8])) Index 91 (hash value 125; max distance 0) (compare:CC (reg:SI 357 [ MEM[(uint8_t *)ref_182 + 16B] ]) (reg:SI 358 [ MEM[(uint8_t *)ip_229 + 16B] ])) Index 92 (hash value 6; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 251 [ ref ]) (const_int 17 [0x11])) [0 MEM[(uint8_t *)ref_182 + 17B]+0 S1 A8])) Index 93 (hash value 17; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 262 [ ip ]) (const_int 17 [0x11])) [0 MEM[(uint8_t *)ip_229 + 17B]+0 S1 A8])) Index 94 (hash value 0; max distance 0) (compare:CC (reg:SI 360 [ MEM[(uint8_t *)ref_182 + 17B] ]) (reg:SI 361 [ MEM[(uint8_t *)ip_229 + 17B] ])) Index 95 (hash value 7; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 251 [ ref ]) (const_int 18 [0x12])) [0 MEM[(uint8_t *)ref_182 + 18B]+0 S1 A8])) Index 96 (hash value 18; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 262 [ ip ]) (const_int 18 [0x12])) [0 MEM[(uint8_t *)ip_229 + 18B]+0 S1 A8])) Index 97 (hash value 6; max distance 0) (compare:CC (reg:SI 363 [ MEM[(uint8_t *)ref_182 + 18B] ]) (reg:SI 364 [ MEM[(uint8_t *)ip_229 + 18B] ])) Index 98 (hash value 6; max distance 0) (compare:CCUNS (reg:SI 304) (const_int 264 [0x108])) Index 99 (hash value 61; max distance 0) (if_then_else:SI (gtu (reg:CCUNS 368) (const_int 0 [0])) (reg:SI 367) (reg:SI 304)) Index 100 (hash value 92; max distance 0) (zero_extend:DI (reg:SI 366)) Index 101 (hash value 77; max distance 0) (plus:SI (subreg/s/v:SI (reg/v:DI 244 [ len ]) 0) (const_int 1 [0x1])) Index 102 (hash value 96; max distance 0) (zero_extend:DI 
(reg:SI 370)) Index 103 (hash value 121; max distance 0) (zero_extend:DI (subreg:SI (reg:DI 139 [ ivtmp.25 ]) 0)) Index 104 (hash value 104; max distance 0) (plus:SI (subreg/s/v:SI (reg:DI 142 [ _34 ]) 0) (const_int -1 [0xffffffffffffffff])) Index 105 (hash value 97; max distance 0) (zero_extend:DI (reg:SI 371)) Index 106 (hash value 82; max distance 0) (plus:DI (reg/v/f:DI 262 [ ip ]) (reg:DI 139 [ ivtmp.25 ])) Index 107 (hash value 57; max distance 0) (compare:CCUNS (subreg/s/v:SI (reg:DI 142 [ _34 ]) 0) (subreg/s/v:SI (reg/v:DI 254 [ maxlen ]) 0)) Index 108 (hash value 110; max distance 0) (plus:DI (reg:DI 139 [ ivtmp.25 ]) (const_int 1 [0x1])) Index 109 (hash value 89; max distance 0) (plus:DI (reg/v/f:DI 251 [ ref ]) (const_int -1 [0xffffffffffffffff])) Index 110 (hash value 112; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 262 [ ip ]) (reg:DI 139 [ ivtmp.25 ])) [0 MEM[(uint8_t *)ip_229 + ivtmp.25_29 * 1]+0 S1 A8])) Index 111 (hash value 93; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/f:DI 373) (reg:DI 140 [ ivtmp.25 ])) [0 MEM[(uint8_t *)_417 + ivtmp.25_31 * 1]+0 S1 A8])) Index 112 (hash value 28; max distance 0) (compare:CC (reg:SI 374 [ MEM[(uint8_t *)ip_229 + ivtmp.25_29 * 1] ]) (reg:SI 375 [ MEM[(uint8_t *)_417 + ivtmp.25_31 * 1] ])) Index 113 (hash value 103; max distance 0) (plus:SI (subreg/s/v:SI (reg:DI 142 [ _34 ]) 0) (const_int -2 [0xfffffffffffffffe])) Index 114 (hash value 103; max distance 0) (zero_extend:DI (reg:SI 377)) Index 115 (hash value 83; max distance 0) (compare:CCUNS (reg:SI 377) (const_int 6 [0x6])) Index 116 (hash value 105; max distance 0) (ashift:SI (reg:SI 377) (const_int 5 [0x5])) Index 117 (hash value 100; max distance 0) (zero_extend:DI (subreg:QI (reg:SI 380) 0)) Index 118 (hash value 95; max distance 0) (plus:DI (reg/v/f:DI 255 [ op ]) (const_int 1 [0x1])) Index 119 (hash value 1; max distance 0) (lshiftrt:DI (reg:DI 135 [ _22 ]) (const_int 8 [0x8])) Index 120 (hash value 54; max distance 0) (plus:SI 
(subreg:SI (reg:DI 381) 0) (subreg:SI (reg:DI 267 [ prephitmp_368 ]) 0)) Index 121 (hash value 103; max distance 0) (zero_extend:DI (subreg:QI (reg:SI 383) 0)) Index 122 (hash value 53; max distance 0) (plus:SI (subreg:SI (reg:DI 384) 0) (const_int -32 [0xffffffffffffffe0])) Index 123 (hash value 106; max distance 0) (zero_extend:DI (subreg:QI (reg:SI 386) 0)) Index 124 (hash value 96; max distance 0) (plus:DI (reg/v/f:DI 255 [ op ]) (const_int 2 [0x2])) Index 125 (hash value 81; max distance 0) (plus:SI (subreg:SI (reg/v:DI 256 [ len ]) 0) (const_int -7 [0xfffffffffffffff9])) Index 126 (hash value 82; max distance 0) (plus:DI (reg/v/f:DI 241 [ op ]) (const_int 2 [0x2])) Index 127 (hash value 0; max distance 0) (compare:CCUNS (reg/f:DI 264 [ _254 ]) (reg/v/f:DI 194 [ ip ])) Index 128 (hash value 61; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 194 [ ip ]) (const_int -2 [0xfffffffffffffffe])) [0 MEM[(uint8_t *)prephitmp_385 + -2B]+0 S1 A8])) Index 129 (hash value 121; max distance 0) (ashift:SI (reg:SI 390 [ MEM[(uint8_t *)prephitmp_385 + -2B] ]) (const_int 8 [0x8])) Index 130 (hash value 62; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 194 [ ip ]) (const_int -1 [0xffffffffffffffff])) [0 MEM[(uint8_t *)prephitmp_385 + -1B]+0 S1 A8])) Index 131 (hash value 83; max distance 0) (ior:SI (reg:SI 391) (reg:SI 392 [ MEM[(uint8_t *)prephitmp_385 + -1B] ])) Index 132 (hash value 124; max distance 0) (ashift:SI (reg:SI 393 [ hval ]) (const_int 8 [0x8])) Index 133 (hash value 98; max distance 0) (zero_extend:SI (mem:QI (reg/v/f:DI 194 [ ip ]) [0 *prephitmp_385+0 S1 A8])) Index 134 (hash value 89; max distance 0) (ior:SI (reg:SI 395 [ *prephitmp_385 ]) (reg:SI 394)) Index 135 (hash value 130; max distance 0) (lshiftrt:SI (reg:SI 396) (const_int 8 [0x8])) Index 136 (hash value 121; max distance 0) (ashift:SI (reg:SI 396) (const_int 2 [0x2])) Index 137 (hash value 111; max distance 0) (mult:SI (reg:SI 396) (const_int 5 [0x5])) Index 138 (hash value 
86; max distance 0) (minus:SI (reg:SI 397) (reg:SI 401)) Index 139 (hash value 25; max distance 0) (and:SI (reg:SI 402) (const_int 65535 [0xffff])) Index 140 (hash value 129; max distance 0) (zero_extend:DI (reg:SI 403)) Index 141 (hash value 0; max distance 0) (ashift:DI (reg:DI 404) (const_int 3 [0x3])) Index 142 (hash value 12; max distance 0) (plus:DI (reg/f:DI 450) (reg:DI 405)) Index 143 (hash value 31; max distance 0) (plus:DI (reg/v/f:DI 194 [ ip ]) (const_int -2 [0xfffffffffffffffe])) Index 144 (hash value 100; max distance 0) (minus:DI (reg/f:DI 407 [ ip ]) (reg/v/f:DI 273 [ in_data ])) Index 145 (hash value 127; max distance 0) (ashift:SI (reg:SI 396) (const_int 8 [0x8])) Index 146 (hash value 64; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 194 [ ip ]) (const_int 1 [0x1])) [0 MEM[(uint8_t *)prephitmp_385 + 1B]+0 S1 A8])) Index 147 (hash value 119; max distance 0) (ior:SI (reg:SI 410 [ MEM[(uint8_t *)prephitmp_385 + 1B] ]) (reg:SI 409)) Index 148 (hash value 6; max distance 0) (zero_extend:DI (reg:SI 411)) Index 149 (hash value 14; max distance 0) (lshiftrt:SI (reg:SI 411) (const_int 8 [0x8])) Index 150 (hash value 5; max distance 0) (ashift:SI (reg:SI 411) (const_int 2 [0x2])) Index 151 (hash value 126; max distance 0) (mult:SI (reg:SI 411) (const_int 5 [0x5])) Index 152 (hash value 116; max distance 0) (minus:SI (reg:SI 412) (reg:SI 416)) Index 153 (hash value 40; max distance 0) (and:SI (reg:SI 417) (const_int 65535 [0xffff])) Index 154 (hash value 13; max distance 0) (zero_extend:DI (reg:SI 418)) Index 155 (hash value 15; max distance 0) (ashift:DI (reg:DI 419) (const_int 3 [0x3])) Index 156 (hash value 27; max distance 0) (plus:DI (reg/f:DI 450) (reg:DI 420)) Index 157 (hash value 80; max distance 0) (plus:DI (reg/v/f:DI 262 [ ip ]) (reg:DI 268 [ prephitmp_370 ])) Index 158 (hash value 115; max distance 0) (minus:DI (reg:DI 422 [ ip ]) (reg/v/f:DI 273 [ in_data ])) Index 159 (hash value 17; max distance 0) (compare:CCUNS (reg/v/f:DI 
249 [ out_end ]) (reg/v/f:DI 226 [ op ])) Index 160 (hash value 34; max distance 0) (plus:SI (subreg/s/u:SI (reg/v:DI 201 [ lit ]) 0) (const_int 1 [0x1])) Index 161 (hash value 19; max distance 0) (sign_extend:DI (reg:SI 425)) Index 162 (hash value 102; max distance 0) (plus:DI (reg/v/f:DI 262 [ ip ]) (const_int 1 [0x1])) Index 163 (hash value 36; max distance 0) (zero_extend:DI (mem:QI (reg/v/f:DI 262 [ ip ]) [0 *ip_229+0 S1 A8])) Index 164 (hash value 25; max distance 0) (compare:CC (reg:SI 425) (const_int 32 [0x20])) Index 165 (hash value 66; max distance 0) (plus:DI (reg/v/f:DI 226 [ op ]) (const_int 1 [0x1])) Index 166 (hash value 67; max distance 0) (plus:DI (reg/v/f:DI 226 [ op ]) (const_int 2 [0x2])) Index 167 (hash value 0; max distance 0) (compare:CCUNS (reg/v/f:DI 194 [ ip ]) (reg/f:DI 264 [ _254 ])) Index 168 (hash value 89; max distance 0) (compare:CCUNS (reg/v/f:DI 249 [ out_end ]) (reg/f:DI 429)) Index 169 (hash value 115; max distance 0) (compare:CCUNS (reg/v/f:DI 194 [ ip ]) (reg/v/f:DI 248 [ in_end ])) Index 170 (hash value 124; max distance 0) (minus:DI (reg/v/f:DI 248 [ in_end ]) (reg/v/f:DI 194 [ ip ])) Index 171 (hash value 32; max distance 0) (plus:DI (reg/v/f:DI 194 [ ip ]) (const_int -1 [0xffffffffffffffff])) Index 172 (hash value 26; max distance 0) (sign_extend:DI (reg:SI 432)) Index 173 (hash value 92; max distance 0) (plus:DI (reg:DI 121 [ ivtmp.19 ]) (const_int 1 [0x1])) Index 174 (hash value 26; max distance 0) (zero_extend:DI (mem:QI (reg:DI 121 [ ivtmp.19 ]) [0 MEM[(uint8_t *)_28]+0 S1 A8])) Index 175 (hash value 32; max distance 0) (compare:CC (reg:SI 432) (const_int 32 [0x20])) Index 176 (hash value 28; max distance 0) (plus:DI (reg:DI 190 [ doloop.16 ]) (const_int -1 [0xffffffffffffffff])) Index 177 (hash value 20; max distance 0) (compare:CC (reg:DI 190 [ doloop.16 ]) (const_int 0 [0])) Index 178 (hash value 31; max distance 0) (sign_extend:DI (reg:SI 437)) Index 179 (hash value 83; max distance 0) (plus:DI (reg/v/f:DI 226 [ op 
]) (reg:DI 438))
Index 180 (hash value 38; max distance 0)
  (zero_extend:DI (reg:SI 443))
Index 181 (hash value 88; max distance 0)
  (minus:DI (reg/v/f:DI 226 [ op ]) (reg:DI 442))
Index 182 (hash value 8; max distance 0)
  (minus:DI (reg:DI 444 [ op ]) (reg/v/f:DI 275 [ out_data ]))
Index 183 (hash value 33; max distance 0)
  (sign_extend:DI (subreg:SI (reg:DI 445) 0))
Index 184 (hash value 95; max distance 0)
  (plus:SI (subreg/s/v:SI (reg/v:DI 265 [ len ]) 0) (const_int -2 [0xfffffffffffffffe]))
Index 185 (hash value 42; max distance 0)
  (zero_extend:DI (reg:SI 447))
Index 186 (hash value 96; max distance 0)
  (plus:SI (subreg/s/v:SI (reg/v:DI 265 [ len ]) 0) (const_int -1 [0xffffffffffffffff]))
Index 187 (hash value 43; max distance 0)
  (zero_extend:DI (reg:SI 448))
Index 188 (hash value 77; max distance 0)
  (plus:DI (reg/v/f:DI 262 [ ip ]) (reg/v:DI 265 [ len ]))
scanning new insn with uid = 528.
deleting insn with uid = 265.
PRE: redundant insn 265 (expression 106) in bb 41, reaching reg is 452
scanning new insn with uid = 529.
deleting insn with uid = 517.
PRE: redundant insn 517 (expression 21) in bb 44, reaching reg is 453
scanning new insn with uid = 530.
deleting insn with uid = 516.
PRE: redundant insn 516 (expression 21) in bb 9, reaching reg is 453
PRE: edge (8,9), copy expression 21
PRE: edge (75,41), copy expression 106
PRE: edge (76,41), copy expression 106
PRE: edge (77,41), copy expression 106
PRE: edge (78,41), copy expression 106
PRE: edge (79,41), copy expression 106
PRE: edge (80,41), copy expression 106
rescanning insn with uid = 235.
scanning new insn with uid = 538.
PRE: bb 37, insn 538, copy expression 106 in insn 235 to reg 452
scanning new insn with uid = 531.
scanning new insn with uid = 532.
scanning new insn with uid = 533.
scanning new insn with uid = 534.
scanning new insn with uid = 535.
scanning new insn with uid = 536.
scanning new insn with uid = 537.
PRE GCSE of compute_on_bytes, 79 basic blocks, 46096 bytes needed, 3 substs, 8 insns created
try_optimize_cfg iteration 1
starting the processing of deferred insns
ending the processing of deferred insns

compute_on_bytes

Dataflow summary:
;; fully invalidated by EH  0 [0] 3 [3] 4 [4] 5 [5] 6 [6] 7 [7] 8 [8] 9 [9] 10 [10] 11 [11] 12 [12] 13 [13] 32 [0] 33 [1] 34 [2] 35 [3] 36 [4] 37 [5] 38 [6] 39 [7] 40 [8] 41 [9] 42 [10] 43 [11] 44 [12] 45 [13] 64 [0] 65 [1] 66 [2] 67 [3] 68 [4] 69 [5] 70 [6] 71 [7] 72 [8] 73 [9] 74 [10] 75 [11] 76 [12] 77 [13] 78 [14] 79 [15] 80 [16] 81 [17] 82 [18] 83 [19] 96 [lr] 97 [ctr] 98 [ca] 100 [0] 101 [1] 105 [5] 106 [6] 107 [7] 109 [vscr]
;; hardware regs used  1 [1] 2 [2] 99 [ap] 109 [vscr] 110 [sfp]
;; regular block artificial uses  1 [1] 2 [2] 31 [31] 99 [ap] 110 [sfp]
;; eh block artificial uses  1 [1] 2 [2] 31 [31] 99 [ap] 110 [sfp]
;; entry block defs  1 [1] 2 [2] 3 [3] 4 [4] 5 [5] 6 [6] 7 [7] 8 [8] 9 [9] 10 [10] 31 [31] 33 [1] 34 [2] 35 [3] 36 [4] 37 [5] 38 [6] 39 [7] 40 [8] 41 [9] 42 [10] 43 [11] 44 [12] 45 [13] 66 [2] 67 [3] 68 [4] 69 [5] 70 [6] 71 [7] 72 [8] 73 [9] 74 [10] 75 [11] 76 [12] 77 [13] 96 [lr] 99 [ap] 109 [vscr] 110 [sfp]
;; exit block uses  1 [1] 2 [2] 3 [3] 31 [31] 108 [vrsave] 109 [vscr] 110 [sfp]
;; regs ever live  3 [3] 4 [4] 5 [5] 6 [6]
;; ref usage  r1={1d,78u} r2={1d,78u} r3={2d,3u} r4={1d,1u} r5={1d,1u} r6={1d,1u} r7={1d} r8={1d} r9={1d} r10={1d} r31={1d,78u} r33={1d} r34={1d} r35={1d} r36={1d} r37={1d} r38={1d} r39={1d} r40={1d} r41={1d} r42={1d} r43={1d} r44={1d} r45={1d} r66={1d} r67={1d} r68={1d} r69={1d} r70={1d} r71={1d} r72={1d} r73={1d} r74={1d} r75={1d} r76={1d} r77={1d} r96={1d} r99={1d,77u} r108={1u} r109={1d,1u} r110={1d,79u,2e} r121={2d,2u} r123={1d,2u} r135={1d,4u} r139={8d,10u,1e} r140={1d,2u} r142={1d,3u} r190={3d,4u} r194={5d,11u} r195={2d,1u} r201={9d,9u} r226={6d,16u} r227={1d,1u} r229={3d,1u} r230={1d,1u} r241={2d,2u} r243={2d,2u} r244={2d,1u} r248={1d,4u} r249={1d,4u} r250={1d,2u} r251={1d,21u}
r254={1d,1u} r255={1d,4u} r256={2d,1u} r262={2d,33u,1e} r264={1d,3u} r265={10d,3u} r267={7d,1u} r268={8d,1u} r272={2d,1u} r273={1d,10u} r274={1d,2u} r275={1d,3u} r276={1d,2u} r277={1d,1u} r278={1d,1u} r279={1d,1u} r280={1d,1u} r281={1d,1u} r282={1d,1u} r283={1d,4u,1e} r284={1d,1u} r287={1d,1u} r288={1d,1u} r289={1d,1u} r290={1d,1u} r291={1d,1u} r292={1d,1u} r293={1d,1u} r294={1d,1u} r295={1d,1u} r296={1d,1u} r297={1d,1u} r298={1d,1u} r299={1d,1u} r300={1d,1u} r301={1d,1u} r302={1d,1u} r303={1d,1u} r304={1d,3u,1e} r305={1d,1u} r306={1d,1u} r307={1d,1u} r308={1d,1u} r309={1d,1u} r310={1d,1u} r311={1d,1u} r312={1d,1u} r314={1d,1u} r315={1d,1u} r316={1d,1u} r317={1d,1u} r318={1d,1u} r319={1d,1u} r320={1d,1u} r321={1d,1u} r322={1d,1u} r323={1d,1u} r324={1d,1u} r325={1d,1u} r326={1d,1u} r327={1d,1u} r328={1d,1u} r329={1d,1u} r330={1d,1u} r331={1d,1u} r332={1d,1u} r333={1d,1u} r334={1d,1u} r335={1d,1u} r336={1d,1u} r337={1d,1u} r338={1d,1u} r339={1d,1u} r340={1d,1u} r341={1d,1u} r342={1d,1u} r343={1d,1u} r344={1d,1u} r345={1d,1u} r346={1d,1u} r347={1d,1u} r348={1d,1u} r349={1d,1u} r350={1d,1u} r351={1d,1u} r352={1d,1u} r353={1d,1u} r354={1d,1u} r355={1d,1u} r356={1d,1u} r357={1d,1u} r358={1d,1u} r359={1d,1u} r360={1d,1u} r361={1d,1u} r362={1d,1u} r363={1d,1u} r364={1d,1u} r365={1d,1u} r366={1d,1u} r367={1d,1u} r368={1d,1u,1e} r370={1d,1u} r371={1d,1u} r372={1d,1u} r373={1d,1u} r374={1d,1u} r375={1d,1u} r376={1d,1u} r377={1d,3u} r378={1d,1u} r380={1d,1u} r381={1d,1u} r383={1d,1u} r384={1d,1u} r386={1d,1u} r388={1d,1u} r389={1d,1u} r390={1d,1u} r391={1d,1u} r392={1d,1u} r393={1d,1u} r394={1d,1u} r395={1d,1u} r396={1d,4u,1e} r397={1d,1u} r400={1d,1u} r401={1d,1u} r402={1d,1u} r403={1d,1u} r404={1d,1u} r405={1d,1u} r406={1d,1u} r407={1d,1u} r408={1d,1u} r409={1d,1u} r410={1d,1u} r411={1d,4u,1e} r412={1d,1u} r415={1d,1u} r416={1d,1u} r417={1d,1u} r418={1d,1u} r419={1d,1u} r420={1d,1u} r421={1d,1u} r422={1d,1u} r423={1d,1u} r424={1d,1u} r425={1d,2u} r426={1d,1u} r427={1d,1u} 
r428={1d,1u} r429={1d,1u} r430={1d,1u} r431={1d,1u} r432={1d,2u} r433={1d,1u} r434={1d,1u} r435={1d,1u} r436={1d,1u} r437={1d,1u} r438={1d,1u} r439={1d,1u} r441={1d,1u} r442={1d,1u} r443={1d,1u} r444={1d,1u} r445={1d,1u} r447={1d,1u} r448={1d,1u} r449={1d,1u} r450={1d,2u} r452={7d,2u} r453={1d,2u} ;; total ref usage 1056{301d,746u,9e} in 313{313 regular + 0 call} insns. 49: NOTE_INSN_BASIC_BLOCK 2 2: r273:DI=%3:DI REG_DEAD %3:DI 3: r274:DI=%4:DI REG_DEAD %4:DI 4: r275:DI=%5:DI REG_DEAD %5:DI 5: r276:DI=%6:DI REG_DEAD %6:DI 6: NOTE_INSN_FUNCTION_BEG 51: r277:CC=cmp(r274:DI,0) 52: pc={(r277:CC!=0)?L56:pc} REG_DEAD r277:CC REG_BR_PROB 708669604 58: L58: 53: NOTE_INSN_BASIC_BLOCK 3 20: r272:DI=0 ; pc falls through to BB 78 56: L56: 57: NOTE_INSN_BASIC_BLOCK 4 59: r278:CC=cmp(r276:DI,0) 60: pc={(r278:CC==0)?L58:pc} REG_DEAD r278:CC REG_BR_PROB 365072228 61: NOTE_INSN_BASIC_BLOCK 5 62: r248:DI=r273:DI+r274:DI REG_DEAD r274:DI 63: r249:DI=r275:DI+r276:DI REG_DEAD r276:DI 64: r226:DI=r275:DI+0x1 65: r279:HI=bswap([r273:DI]) 66: r280:SI=zero_extend(r279:HI) REG_DEAD r279:HI 67: r229:DI=zero_extend(r280:SI) REG_DEAD r280:SI 68: r264:DI=r248:DI-0x2 69: r281:CCUNS=cmp(r273:DI,r264:DI) 70: pc={(geu(r281:CCUNS,0))?L439:pc} REG_DEAD r281:CCUNS REG_BR_PROB 29527908 437: NOTE_INSN_BASIC_BLOCK 6 8: r262:DI=r273:DI 9: r201:DI=0 531: r453:DI=sfp:DI+0x20 358: L358: 71: NOTE_INSN_BASIC_BLOCK 7 72: r282:SI=r229:DI#0<<0x8 REG_DEAD r229:DI 74: r123:DI=zero_extend([r262:DI+0x2]) 75: r283:SI=r123:DI#0|r282:SI 76: r229:DI=zero_extend(r283:SI) REG_DEAD r283:SI 77: r284:SI=r283:SI 0>>0x8 80: r287:SI=r283:SI<<0x2 82: r288:SI=r287:SI+r283:SI REG_EQUAL r283:SI*0x5 83: r289:SI=r284:SI-r288:SI REG_DEAD r288:SI REG_DEAD r284:SI 84: r290:SI=r289:SI&0xffff REG_DEAD r289:SI 85: r291:DI=zero_extend(r290:SI) REG_DEAD r290:SI 86: r292:DI=r291:DI<<0x3 REG_DEAD r291:DI 530: r449:DI=r453:DI REG_EQUAL sfp:DI+0x20 87: r250:DI=r449:DI+r292:DI REG_DEAD r449:DI REG_DEAD r292:DI 88: r293:DI=zero_extend([r250:DI]) 
89: r251:DI=r273:DI+r293:DI REG_DEAD r293:DI 90: r294:DI=r262:DI-r273:DI 91: [r250:DI]=r294:DI#0 REG_DEAD r294:DI REG_DEAD r250:DI 92: r295:DI=r262:DI-r251:DI 93: r135:DI=r295:DI-0x1 REG_DEAD r295:DI 97: r296:CCUNS=cmp(r135:DI,0x1fff) 98: pc={(gtu(r296:CCUNS,0))?L331:pc} REG_DEAD r296:CCUNS REG_BR_PROB 536870916 99: NOTE_INSN_BASIC_BLOCK 8 100: r297:CCUNS=cmp(r273:DI,r251:DI) 101: pc={(geu(r297:CCUNS,0))?L331:pc} REG_DEAD r297:CCUNS REG_BR_PROB 536870916 102: NOTE_INSN_BASIC_BLOCK 9 103: r298:SI=zero_extend([r251:DI+0x2]) 104: r299:CC=cmp(r298:SI,r123:DI#0) REG_DEAD r298:SI REG_DEAD r123:DI 105: pc={(r299:CC!=0)?L331:pc} REG_DEAD r299:CC REG_BR_PROB 708669604 106: NOTE_INSN_BASIC_BLOCK 10 107: r300:SI=zero_extend([r251:DI]) 108: r301:SI=zero_extend([r262:DI]) 109: r302:CC=cmp(r300:SI,r301:SI) REG_DEAD r301:SI REG_DEAD r300:SI 110: pc={(r302:CC!=0)?L331:pc} REG_DEAD r302:CC REG_BR_PROB 708669604 111: NOTE_INSN_BASIC_BLOCK 11 112: r303:DI=r248:DI-r262:DI 113: r304:SI=r303:DI#0-0x2 REG_DEAD r303:DI 115: r305:DI=r226:DI+0x4 116: r306:CCUNS=cmp(r249:DI,r305:DI) REG_DEAD r305:DI 117: pc={(gtu(r306:CCUNS,0))?L125:pc} REG_DEAD r306:CCUNS REG_BR_PROB 536870916 118: NOTE_INSN_BASIC_BLOCK 12 119: r307:CC=cmp(r201:DI,0) 120: pc={(r307:CC!=0)?L58:pc} REG_DEAD r307:CC REG_BR_PROB 536870916 121: NOTE_INSN_BASIC_BLOCK 13 122: r308:DI=r226:DI+0x3 123: r309:CCUNS=cmp(r249:DI,r308:DI) REG_DEAD r308:DI 124: pc={(leu(r309:CCUNS,0))?L58:pc} REG_DEAD r309:CCUNS REG_BR_PROB 4 125: L125: 126: NOTE_INSN_BASIC_BLOCK 14 127: r310:SI=-r201:DI#0 128: r311:DI=sign_extend(r310:SI) REG_DEAD r310:SI 129: r312:DI=r226:DI+r311:DI REG_DEAD r311:DI 131: r314:SI=r201:DI#0-0x1 132: [r312:DI-0x1]=r314:SI#0 REG_DEAD r314:SI REG_DEAD r312:DI 133: {r316:SI=r201:DI#0==0;clobber scratch;clobber scratch;} REG_DEAD r201:DI 134: r315:DI=zero_extend(r316:SI) REG_DEAD r316:SI 135: r255:DI=r226:DI-r315:DI REG_DEAD r315:DI REG_DEAD r226:DI 138: r317:CCUNS=cmp(r304:SI,0x10) 139: pc={(leu(r317:CCUNS,0))?L444:pc} 
REG_DEAD r317:CCUNS REG_BR_PROB 289910300 140: NOTE_INSN_BASIC_BLOCK 15 141: r318:SI=zero_extend([r251:DI+0x3]) 142: r319:SI=zero_extend([r262:DI+0x3]) 143: r320:CC=cmp(r318:SI,r319:SI) REG_DEAD r319:SI REG_DEAD r318:SI 144: pc={(r320:CC!=0)?L446:pc} REG_DEAD r320:CC REG_BR_PROB 708669604 145: NOTE_INSN_BASIC_BLOCK 16 146: r321:SI=zero_extend([r251:DI+0x4]) 147: r322:SI=zero_extend([r262:DI+0x4]) 148: r323:CC=cmp(r321:SI,r322:SI) REG_DEAD r322:SI REG_DEAD r321:SI 149: pc={(r323:CC!=0)?L450:pc} REG_DEAD r323:CC REG_BR_PROB 708669604 150: NOTE_INSN_BASIC_BLOCK 17 151: r324:SI=zero_extend([r251:DI+0x5]) 152: r325:SI=zero_extend([r262:DI+0x5]) 153: r326:CC=cmp(r324:SI,r325:SI) REG_DEAD r325:SI REG_DEAD r324:SI 154: pc={(r326:CC!=0)?L454:pc} REG_DEAD r326:CC REG_BR_PROB 708669604 155: NOTE_INSN_BASIC_BLOCK 18 156: r327:SI=zero_extend([r251:DI+0x6]) 157: r328:SI=zero_extend([r262:DI+0x6]) 158: r329:CC=cmp(r327:SI,r328:SI) REG_DEAD r328:SI REG_DEAD r327:SI 159: pc={(r329:CC!=0)?L458:pc} REG_DEAD r329:CC REG_BR_PROB 708669604 160: NOTE_INSN_BASIC_BLOCK 19 161: r330:SI=zero_extend([r251:DI+0x7]) 162: r331:SI=zero_extend([r262:DI+0x7]) 163: r332:CC=cmp(r330:SI,r331:SI) REG_DEAD r331:SI REG_DEAD r330:SI 164: pc={(r332:CC!=0)?L462:pc} REG_DEAD r332:CC REG_BR_PROB 708669604 165: NOTE_INSN_BASIC_BLOCK 20 166: r333:SI=zero_extend([r251:DI+0x8]) 167: r334:SI=zero_extend([r262:DI+0x8]) 168: r335:CC=cmp(r333:SI,r334:SI) REG_DEAD r334:SI REG_DEAD r333:SI 169: pc={(r335:CC!=0)?L466:pc} REG_DEAD r335:CC REG_BR_PROB 708669604 170: NOTE_INSN_BASIC_BLOCK 21 171: r336:SI=zero_extend([r251:DI+0x9]) 172: r337:SI=zero_extend([r262:DI+0x9]) 173: r338:CC=cmp(r336:SI,r337:SI) REG_DEAD r337:SI REG_DEAD r336:SI 174: pc={(r338:CC!=0)?L468:pc} REG_DEAD r338:CC REG_BR_PROB 708669604 175: NOTE_INSN_BASIC_BLOCK 22 176: r339:SI=zero_extend([r251:DI+0xa]) 177: r340:SI=zero_extend([r262:DI+0xa]) 178: r341:CC=cmp(r339:SI,r340:SI) REG_DEAD r340:SI REG_DEAD r339:SI 179: pc={(r341:CC!=0)?L472:pc} REG_DEAD 
r341:CC REG_BR_PROB 708669604 180: NOTE_INSN_BASIC_BLOCK 23 181: r342:SI=zero_extend([r251:DI+0xb]) 182: r343:SI=zero_extend([r262:DI+0xb]) 183: r344:CC=cmp(r342:SI,r343:SI) REG_DEAD r343:SI REG_DEAD r342:SI 184: pc={(r344:CC!=0)?L476:pc} REG_DEAD r344:CC REG_BR_PROB 708669604 185: NOTE_INSN_BASIC_BLOCK 24 186: r345:SI=zero_extend([r251:DI+0xc]) 187: r346:SI=zero_extend([r262:DI+0xc]) 188: r347:CC=cmp(r345:SI,r346:SI) REG_DEAD r346:SI REG_DEAD r345:SI 189: pc={(r347:CC!=0)?L480:pc} REG_DEAD r347:CC REG_BR_PROB 708669604 190: NOTE_INSN_BASIC_BLOCK 25 191: r348:SI=zero_extend([r251:DI+0xd]) 192: r349:SI=zero_extend([r262:DI+0xd]) 193: r350:CC=cmp(r348:SI,r349:SI) REG_DEAD r349:SI REG_DEAD r348:SI 194: pc={(r350:CC!=0)?L484:pc} REG_DEAD r350:CC REG_BR_PROB 708669604 195: NOTE_INSN_BASIC_BLOCK 26 196: r351:SI=zero_extend([r251:DI+0xe]) 197: r352:SI=zero_extend([r262:DI+0xe]) 198: r353:CC=cmp(r351:SI,r352:SI) REG_DEAD r352:SI REG_DEAD r351:SI 199: pc={(r353:CC!=0)?L488:pc} REG_DEAD r353:CC REG_BR_PROB 708669604 200: NOTE_INSN_BASIC_BLOCK 27 201: r354:SI=zero_extend([r251:DI+0xf]) 202: r355:SI=zero_extend([r262:DI+0xf]) 203: r356:CC=cmp(r354:SI,r355:SI) REG_DEAD r355:SI REG_DEAD r354:SI 204: pc={(r356:CC!=0)?L492:pc} REG_DEAD r356:CC REG_BR_PROB 708669604 205: NOTE_INSN_BASIC_BLOCK 28 206: r357:SI=zero_extend([r251:DI+0x10]) 207: r358:SI=zero_extend([r262:DI+0x10]) 208: r359:CC=cmp(r357:SI,r358:SI) REG_DEAD r358:SI REG_DEAD r357:SI 209: pc={(r359:CC!=0)?L496:pc} REG_DEAD r359:CC REG_BR_PROB 708669604 210: NOTE_INSN_BASIC_BLOCK 29 211: r360:SI=zero_extend([r251:DI+0x11]) 212: r361:SI=zero_extend([r262:DI+0x11]) 213: r362:CC=cmp(r360:SI,r361:SI) REG_DEAD r361:SI REG_DEAD r360:SI 214: pc={(r362:CC!=0)?L500:pc} REG_DEAD r362:CC REG_BR_PROB 708669604 215: NOTE_INSN_BASIC_BLOCK 30 216: r363:SI=zero_extend([r251:DI+0x12]) 217: r364:SI=zero_extend([r262:DI+0x12]) 218: r365:CC=cmp(r363:SI,r364:SI) REG_DEAD r364:SI REG_DEAD r363:SI 219: pc={(r365:CC!=0)?L504:pc} REG_DEAD r365:CC 
REG_BR_PROB 901943140 440: NOTE_INSN_BASIC_BLOCK 31 10: r244:DI=0x12 ; pc falls through to BB 33 444: L444: 443: NOTE_INSN_BASIC_BLOCK 32 11: r244:DI=0x2 220: L220: 221: NOTE_INSN_BASIC_BLOCK 33 224: r367:SI=0x108 225: r368:CCUNS=cmp(r304:SI,0x108) 227: r366:SI={(gtu(r368:CCUNS,0))?r367:SI:r304:SI} REG_EQUAL {(gtu(r368:CCUNS,0))?0x108:r304:SI} 228: r254:DI=zero_extend(r366:SI) REG_DEAD r366:SI 229: r370:SI=r244:DI#0+0x1 REG_DEAD r244:DI 230: r139:DI=zero_extend(r370:SI) REG_DEAD r370:SI ; pc falls through to BB 35 508: L508: 507: NOTE_INSN_BASIC_BLOCK 34 12: r139:DI=r140:DI REG_DEAD r140:DI 240: L240: 231: NOTE_INSN_BASIC_BLOCK 35 232: r142:DI=zero_extend(r139:DI#0) 233: r371:SI=r142:DI#0-0x1 234: r243:DI=zero_extend(r371:SI) REG_DEAD r371:SI 235: r452:DI=r262:DI+r139:DI 538: r194:DI=r452:DI 236: r372:CCUNS=cmp(r142:DI#0,r254:DI#0) 237: pc={(geu(r372:CCUNS,0))?L246:pc} REG_DEAD r372:CCUNS REG_BR_PROB 59055804 238: NOTE_INSN_BASIC_BLOCK 36 239: r140:DI=r139:DI+0x1 241: r373:DI=r251:DI-0x1 242: r374:SI=zero_extend([r262:DI+r139:DI]) 243: r375:SI=zero_extend([r373:DI+r140:DI]) REG_DEAD r373:DI 244: r376:CC=cmp(r374:SI,r375:SI) REG_DEAD r375:SI REG_DEAD r374:SI 245: pc={(r376:CC==0)?L508:pc} REG_DEAD r376:CC REG_BR_PROB 1014686028 246: L246: 247: NOTE_INSN_BASIC_BLOCK 37 248: r377:SI=r142:DI#0-0x2 REG_DEAD r142:DI 249: r256:DI=zero_extend(r377:SI) REG_DEAD r377:SI 252: r378:CCUNS=cmp(r377:SI,0x6) 253: pc={(gtu(r378:CCUNS,0))?L268:pc} REG_DEAD r378:CCUNS REG_BR_PROB 1073741828 254: NOTE_INSN_BASIC_BLOCK 38 255: r268:DI=r243:DI REG_DEAD r243:DI 257: r380:SI=r377:SI<<0x5 258: r267:DI=zero_extend(r380:SI#0) REG_DEAD r380:SI 430: L430: 259: NOTE_INSN_BASIC_BLOCK 39 260: r241:DI=r255:DI+0x1 261: r381:DI=r135:DI 0>>0x8 263: r383:SI=r381:DI#0+r267:DI#0 264: r195:DI=zero_extend(r383:SI#0) REG_DEAD r383:SI 528: r194:DI=r452:DI REG_EQUAL r262:DI+r139:DI ; pc falls through to BB 41 268: L268: 269: NOTE_INSN_BASIC_BLOCK 40 270: r384:DI=r135:DI 0>>0x8 272: r386:SI=r384:DI#0-0x20 
273: r195:DI=zero_extend(r386:SI#0) REG_DEAD r386:SI 274: r241:DI=r255:DI+0x2 276: r388:SI=r256:DI#0-0x7 277: [r255:DI+0x1]=r388:SI#0 REG_DEAD r388:SI 278: r268:DI=r243:DI REG_DEAD r243:DI 279: L279: 280: NOTE_INSN_BASIC_BLOCK 41 281: [r255:DI]=r195:DI#0 REG_DEAD r255:DI REG_DEAD r195:DI 282: [r241:DI]=r135:DI#0 REG_DEAD r135:DI 283: r226:DI=r241:DI+0x2 REG_DEAD r241:DI 284: r389:CCUNS=cmp(r264:DI,r194:DI) 285: pc={(leu(r389:CCUNS,0))?L512:pc} REG_DEAD r389:CCUNS REG_BR_PROB 29527908 286: NOTE_INSN_BASIC_BLOCK 42 287: r390:SI=zero_extend([r194:DI-0x2]) 288: r391:SI=r390:SI<<0x8 REG_DEAD r390:SI 289: r392:SI=zero_extend([r194:DI-0x1]) 290: r393:SI=r391:SI|r392:SI REG_DEAD r392:SI REG_DEAD r391:SI 291: r394:SI=r393:SI<<0x8 REG_DEAD r393:SI 293: r395:SI=zero_extend([r194:DI]) 294: r396:SI=r395:SI|r394:SI 296: r397:SI=r396:SI 0>>0x8 299: r400:SI=r396:SI<<0x2 301: r401:SI=r400:SI+r396:SI REG_EQUAL r396:SI*0x5 302: r402:SI=r397:SI-r401:SI REG_DEAD r401:SI REG_DEAD r397:SI 303: r403:SI=r402:SI&0xffff REG_DEAD r402:SI 304: r404:DI=zero_extend(r403:SI) REG_DEAD r403:SI 305: r405:DI=r404:DI<<0x3 REG_DEAD r404:DI 529: r450:DI=r453:DI REG_EQUAL sfp:DI+0x20 306: r406:DI=r450:DI+r405:DI REG_DEAD r450:DI REG_DEAD r405:DI 307: r407:DI=r194:DI-0x2 308: r408:DI=r407:DI-r273:DI REG_DEAD r407:DI 309: [r406:DI]=r408:DI REG_DEAD r408:DI REG_DEAD r406:DI 310: r409:SI=r396:SI<<0x8 312: r410:SI=zero_extend([r194:DI+0x1]) 313: r411:SI=r410:SI|r409:SI 314: r229:DI=zero_extend(r411:SI) REG_DEAD r411:SI 315: r412:SI=r411:SI 0>>0x8 318: r415:SI=r411:SI<<0x2 320: r416:SI=r415:SI+r411:SI REG_EQUAL r411:SI*0x5 321: r417:SI=r412:SI-r416:SI REG_DEAD r416:SI REG_DEAD r412:SI 322: r418:SI=r417:SI&0xffff REG_DEAD r417:SI 323: r419:DI=zero_extend(r418:SI) REG_DEAD r418:SI 324: r420:DI=r419:DI<<0x3 REG_DEAD r419:DI 325: r421:DI=r450:DI+r420:DI REG_DEAD r451:DI REG_DEAD r420:DI 326: r422:DI=r262:DI+r268:DI REG_DEAD r268:DI REG_DEAD r262:DI 327: r423:DI=r422:DI-r273:DI REG_DEAD r422:DI 328: 
[r421:DI]=r423:DI REG_DEAD r423:DI REG_DEAD r421:DI 14: r201:DI=0 ; pc falls through to BB 48 331: L331: 332: NOTE_INSN_BASIC_BLOCK 43 333: r424:CCUNS=cmp(r249:DI,r226:DI) 334: pc={(leu(r424:CCUNS,0))?L58:pc} REG_DEAD r424:CCUNS REG_BR_PROB 29527908 335: NOTE_INSN_BASIC_BLOCK 44 336: r425:SI=r201:DI#0+0x1 REG_DEAD r201:DI 337: r201:DI=sign_extend(r425:SI) REG_DEAD r425:SI 338: r194:DI=r262:DI+0x1 339: r227:DI=zero_extend([r262:DI]) REG_DEAD r262:DI 340: [r226:DI]=r227:DI#0 REG_DEAD r227:DI 341: r426:CC=cmp(r425:SI,0x20) 342: pc={(r426:CC==0)?L347:pc} REG_DEAD r426:CC REG_BR_PROB 365072228 343: NOTE_INSN_BASIC_BLOCK 45 344: r226:DI=r226:DI+0x1 ; pc falls through to BB 47 347: L347: 348: NOTE_INSN_BASIC_BLOCK 46 349: r427:QI=0x1f 350: [r226:DI-0x20]=r427:QI REG_DEAD r427:QI 351: r226:DI=r226:DI+0x2 13: r201:DI=0 352: L352: 353: NOTE_INSN_BASIC_BLOCK 47 354: r428:CCUNS=cmp(r194:DI,r264:DI) 355: pc={(geu(r428:CCUNS,0))?L361:pc} REG_DEAD r428:CCUNS REG_BR_PROB 30394580 356: L356: 357: NOTE_INSN_BASIC_BLOCK 48 7: r262:DI=r194:DI REG_DEAD r194:DI ; pc falls through to BB 7 439: L439: 438: NOTE_INSN_BASIC_BLOCK 49 16: r194:DI=r273:DI REG_DEAD r273:DI 17: r201:DI=0 ; pc falls through to BB 51 512: L512: 511: NOTE_INSN_BASIC_BLOCK 50 15: r201:DI=0 361: L361: 362: NOTE_INSN_BASIC_BLOCK 51 363: r429:DI=r226:DI+0x3 364: r430:CCUNS=cmp(r249:DI,r429:DI) REG_DEAD r429:DI REG_DEAD r249:DI 365: pc={(ltu(r430:CCUNS,0))?L58:pc} REG_DEAD r430:CCUNS REG_BR_PROB 365072228 366: NOTE_INSN_BASIC_BLOCK 52 367: r431:CCUNS=cmp(r194:DI,r248:DI) 368: pc={(geu(r431:CCUNS,0))?L401:pc} REG_DEAD r431:CCUNS REG_BR_PROB 118111604 369: NOTE_INSN_BASIC_BLOCK 53 370: r190:DI=r248:DI-r194:DI REG_DEAD r248:DI 371: r121:DI=r194:DI-0x1 REG_DEAD r194:DI 398: L398: 372: NOTE_INSN_BASIC_BLOCK 54 373: r432:SI=r201:DI#0+0x1 REG_DEAD r201:DI 374: r201:DI=sign_extend(r432:SI) REG_DEAD r432:SI 375: r121:DI=r121:DI+0x1 376: r230:DI=zero_extend([r121:DI]) 377: [r226:DI]=r230:DI#0 REG_DEAD r230:DI 378: 
r433:CC=cmp(r432:SI,0x20) 379: pc={(r433:CC==0)?L388:pc} REG_DEAD r433:CC REG_BR_PROB 365072228 380: NOTE_INSN_BASIC_BLOCK 55 381: r226:DI=r226:DI+0x1 382: r190:DI=r190:DI-0x1 383: r434:CC=cmp(r190:DI,0) 384: pc={(r434:CC!=0)?L396:pc} REG_DEAD r434:CC REG_BR_PROB 955630228 ; pc falls through to BB 60 388: L388: 389: NOTE_INSN_BASIC_BLOCK 56 390: r435:QI=0x1f 391: [r226:DI-0x20]=r435:QI REG_DEAD r435:QI 392: r226:DI=r226:DI+0x2 393: r190:DI=r190:DI-0x1 394: r436:CC=cmp(r190:DI,0) 395: pc={(r436:CC==0)?L515:pc} REG_DEAD r436:CC REG_BR_PROB 118111604 513: NOTE_INSN_BASIC_BLOCK 57 18: r201:DI=0 396: L396: 397: NOTE_INSN_BASIC_BLOCK 58 ; pc falls through to BB 54 515: L515: 514: NOTE_INSN_BASIC_BLOCK 59 19: r201:DI=0 401: L401: 402: NOTE_INSN_BASIC_BLOCK 60 403: r437:SI=-r201:DI#0 404: r438:DI=sign_extend(r437:SI) REG_DEAD r437:SI 405: r439:DI=r226:DI+r438:DI REG_DEAD r438:DI 407: r441:SI=r201:DI#0-0x1 408: [r439:DI-0x1]=r441:SI#0 REG_DEAD r441:SI REG_DEAD r439:DI 409: {r443:SI=r201:DI#0==0;clobber scratch;clobber scratch;} REG_DEAD r201:DI 410: r442:DI=zero_extend(r443:SI) REG_DEAD r443:SI 411: r444:DI=r226:DI-r442:DI REG_DEAD r442:DI REG_DEAD r226:DI 412: r445:DI=r444:DI-r275:DI REG_DEAD r444:DI REG_DEAD r275:DI 413: r272:DI=sign_extend(r445:DI#0) REG_DEAD r445:DI ; pc falls through to BB 78 468: L468: 467: NOTE_INSN_BASIC_BLOCK 61 30: r265:DI=0x9 ; pc falls through to BB 71 472: L472: 471: NOTE_INSN_BASIC_BLOCK 62 29: r265:DI=0xa ; pc falls through to BB 71 476: L476: 475: NOTE_INSN_BASIC_BLOCK 63 28: r265:DI=0xb ; pc falls through to BB 71 480: L480: 479: NOTE_INSN_BASIC_BLOCK 64 27: r265:DI=0xc ; pc falls through to BB 71 484: L484: 483: NOTE_INSN_BASIC_BLOCK 65 26: r265:DI=0xd ; pc falls through to BB 71 488: L488: 487: NOTE_INSN_BASIC_BLOCK 66 25: r265:DI=0xe ; pc falls through to BB 71 492: L492: 491: NOTE_INSN_BASIC_BLOCK 67 24: r265:DI=0xf ; pc falls through to BB 71 496: L496: 495: NOTE_INSN_BASIC_BLOCK 68 23: r265:DI=0x10 ; pc falls through to BB 71 500: 
L500: 499: NOTE_INSN_BASIC_BLOCK 69 22: r265:DI=0x11 ; pc falls through to BB 71 504: L504: 503: NOTE_INSN_BASIC_BLOCK 70 21: r265:DI=0x12 419: L419: 420: NOTE_INSN_BASIC_BLOCK 71 421: r447:SI=r265:DI#0-0x2 422: r256:DI=zero_extend(r447:SI) REG_DEAD r447:SI 423: r448:SI=r265:DI#0-0x1 424: r243:DI=zero_extend(r448:SI) REG_DEAD r448:SI 425: r194:DI=r262:DI+r265:DI REG_DEAD r265:DI ; pc falls through to BB 40 446: L446: 445: NOTE_INSN_BASIC_BLOCK 72 46: r139:DI=0x3 47: r268:DI=0x2 48: r267:DI=0x20 532: r452:DI=r262:DI+r139:DI ; pc falls through to BB 39 450: L450: 449: NOTE_INSN_BASIC_BLOCK 73 43: r139:DI=0x4 44: r268:DI=0x3 45: r267:DI=0x40 533: r452:DI=r262:DI+r139:DI ; pc falls through to BB 39 454: L454: 453: NOTE_INSN_BASIC_BLOCK 74 40: r139:DI=0x5 41: r268:DI=0x4 42: r267:DI=0x60 534: r452:DI=r262:DI+r139:DI ; pc falls through to BB 39 458: L458: 457: NOTE_INSN_BASIC_BLOCK 75 37: r139:DI=0x6 38: r268:DI=0x5 39: r267:DI=0x80 535: r452:DI=r262:DI+r139:DI ; pc falls through to BB 39 462: L462: 461: NOTE_INSN_BASIC_BLOCK 76 34: r139:DI=0x7 35: r268:DI=0x6 36: r267:DI=0xa0 536: r452:DI=r262:DI+r139:DI ; pc falls through to BB 39 466: L466: 465: NOTE_INSN_BASIC_BLOCK 77 31: r139:DI=0x8 32: r268:DI=0x7 33: r267:DI=0xc0 537: r452:DI=r262:DI+r139:DI ; pc falls through to BB 39 436: NOTE_INSN_BASIC_BLOCK 78 434: %3:DI=r272:DI REG_DEAD r272:DI 435: use %3:DI ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC] Run pass_sink_code once more after ivopts/fre 2021-04-20 9:23 ` Xionghu Luo @ 2021-04-21 11:54 ` Richard Biener 2021-05-14 7:10 ` Xionghu Luo 0 siblings, 1 reply; 20+ messages in thread From: Richard Biener @ 2021-04-21 11:54 UTC (permalink / raw) To: Xionghu Luo; +Cc: segher, wschmidt, linkw, gcc-patches, dje.gcc On Tue, 20 Apr 2021, Xionghu Luo wrote: > > > On 2021/4/15 19:34, Richard Biener wrote: > > On Thu, 15 Apr 2021, Xionghu Luo wrote: > > > >> Thanks, > >> > >> On 2021/4/14 14:41, Richard Biener wrote: > >>>> "#538,#235,#234,#233" will all be sunk from bb 35 to bb 37 by rtl-sink, > >>>> but it moves #538 first, then #235, there is strong dependency here. It > >>>> seemsdoesn't like the LCM framework that could solve all and do the > >>>> delete-insert in one iteration. > >>> So my question was whether we want to do both within the LCM store > >>> sinking framework. The LCM dataflow is also used by RTL PRE which > >>> handles both loads and non-loads so in principle it should be able > >>> to handle stores and non-stores for the sinking case (PRE on the > >>> reverse CFG). > >>> > >>> A global dataflow is more powerful than any local ad-hoc method. > >> > >> My biggest concern is whether the LCM DF framework could support sinking > >> *multiple* reverse-dependent non-store instructions together by *one* > >> calling of LCM DF. If this is not supported, we need run multiple LCM > >> until no new changes, it would be time consuming obviously (unless > >> compiling time is not important here). > > > > As said it is used for PRE and there it most definitely can do that. > > I did some investigation about PRE and attached a case to show how it > works, it is quite like store-motion, and actually there is a rtl-hoist > pass in gcse.c which only works for code size. All of them are > leveraging the LCM framework to move instructions upward or downward. 
> > PRE and rtl-hoist move instructions upward, they analyze/hash the SOURCE > exprs and call pre_edge_lcm, store-motion and rtl-sink move instructions > downward, so they analyze/hash the DEST exprs and call pre_edge_rev_lcm. > The four problems are all converted to the LCM DF problem with > n_basic_blocks * m_exprs of 4 matrix (antic, transp, avail, kill) as input > and two outputs of where to insert/delete. > > PRE scan each instruction and hash the SRC to table without *checking the > relationship between instructions*, for the case attached, BB 37, BB 38 > and BB 41 both contains SOURCE expr "r262:DI+r139:DI", but BB 37 and BB 41 > save it to index 106, BB 38 save it to index 110. After finishing this pass, > "r262:DI+r139:DI" BB41 is replaced with "r194:DI=r452:DI", then insert > expr to BB 75~BB 80 to create full redundancies from partial redundancies, > finally update instruction in BB 37. I'm not familiar with the actual PRE code but reading the toplevel comment it seems that indeed it can only handle expressions contained in a single insn unless a REG_EQUAL note provides a short-hand for the larger one. That of course means it would need to mark things as not transparent for correctness where they'd be if moved together. Now, nothing prevents you changing the granularity of what you feed LCM. So originally we arrived at looking into LCM because there's already a (store) sinking pass on RTL (using LCM) so adding another (loop-special) one didn't look like the best obvious solution. That said, LCM would work for single-instruction expressions. Alternatively a greedy algorithm like you prototyped could be used. Another pass to look at would be RTL invariant motion which seems to compute some kind of dependency graph - not sure if that would be adaptable for the reverse CFG problem. 
> Issues witnessed in the PRE run: > 1) "r262:DI+r139:DI" in BLOCK 38 is not replaced; > 2) PRE also use pseudo register to store the result like store-motion and > even rtl-hoist. Actually we need real "move" instead of "replace" for > rtl-sink to improve performance though also potential register pressure issue > like rtl-hoist? > 3) SRC instruction is scanned WITHOUT back chain check cross instructions > in hash_scan_set, it couldn't handle below case though a+c is same with b+d. > So I am afraid single run couldn't solve the instruction dependency issue > to sink multiple instructions out as before for rtl-sink. > > BB1: > a = b; > c = d; > s = a + c; > BB2: > s = b + d; > > > gcse.c: > changed = one_pre_gcse_pass () > alloc_hash_table (&expr_hash_table); > compute_hash_table (&expr_hash_table); > compute_hash_table_work (table); > FOR_EACH_BB_FN (current_bb, cfun) > FOR_BB_INSNS (current_bb, insn) > hash_scan_insn (insn, table); > hash_scan_set (pat, insn, table); > if REG_P (dest) > insert_expr_in_table (src, GET_MODE (dest), insn, antic_p, avail_p, > max_distance, table); > hash = hash_expr (x, mode, &do_not_record_p, table->size); > dump_hash_table (dump_file, "Expression", &expr_hash_table); > edge_list = compute_pre_data (); > compute_local_properties (transp, comp, antloc, &expr_hash_table); > edge_list = pre_edge_lcm (expr_hash_table.n_elems, transp, comp, antloc, > ae_kill, &pre_insert_map, &pre_delete_map); > changed |= pre_gcse (edge_list); > changed = pre_delete (); /* Create a pseudo-reg to store the result of > reaching expressions into. */ > did_insert = pre_edge_insert (edge_list, index_map); > > > > >> > >>> > >>> Richard. > >>> > >>>> However, there are still some common methods could be shared, like the > >>>> def-use check(though store-motion is per bb, rtl-sink is per loop), > >>>> insert_store, commit_edge_insertions etc. 
> >>>> > >>>> > >>>> 508: L508: > >>>> 507: NOTE_INSN_BASIC_BLOCK 34 > >>>> 12: r139:DI=r140:DI > >>>> REG_DEAD r140:DI > >>>> 240: L240: > >>>> 231: NOTE_INSN_BASIC_BLOCK 35 > >>>> 232: r142:DI=zero_extend(r139:DI#0) > >>>> 233: r371:SI=r142:DI#0-0x1 > >>>> 234: r243:DI=zero_extend(r371:SI) > >>>> REG_DEAD r371:SI > >>>> 235: r452:DI=r262:DI+r139:DI > >>>> 538: r194:DI=r452:DI > >>>> 236: r372:CCUNS=cmp(r142:DI#0,r254:DI#0) > >> > >> > >> Like here, Each instruction's dest reg is calculated in the input vector > >> bitmap, after solving the equations by calling pre_edge_rev_lcm, > >> move #538 out of loop for the first call, then move #235 out of loop > >> after a second call... 4 repeat calls needed in total here, is the LCM > >> framework smart enough to move the all 4 instruction within one iteration? > >> I am worried that the input vector bitmap couldn't solve the dependency > >> problem for two back chained instructions. > >> > >> > >> > > > > -- Richard Biener <rguenther@suse.de> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg) ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC] Run pass_sink_code once more after ivopts/fre 2021-04-21 11:54 ` Richard Biener @ 2021-05-14 7:10 ` Xionghu Luo 2021-05-17 8:11 ` Richard Biener 0 siblings, 1 reply; 20+ messages in thread From: Xionghu Luo @ 2021-05-14 7:10 UTC (permalink / raw) To: Richard Biener; +Cc: segher, wschmidt, linkw, gcc-patches, dje.gcc Hi Richi, On 2021/4/21 19:54, Richard Biener wrote: > On Tue, 20 Apr 2021, Xionghu Luo wrote: > >> >> >> On 2021/4/15 19:34, Richard Biener wrote: >>> On Thu, 15 Apr 2021, Xionghu Luo wrote: >>> >>>> Thanks, >>>> >>>> On 2021/4/14 14:41, Richard Biener wrote: >>>>>> "#538,#235,#234,#233" will all be sunk from bb 35 to bb 37 by rtl-sink, >>>>>> but it moves #538 first, then #235, there is strong dependency here. It >>>>>> seemsdoesn't like the LCM framework that could solve all and do the >>>>>> delete-insert in one iteration. >>>>> So my question was whether we want to do both within the LCM store >>>>> sinking framework. The LCM dataflow is also used by RTL PRE which >>>>> handles both loads and non-loads so in principle it should be able >>>>> to handle stores and non-stores for the sinking case (PRE on the >>>>> reverse CFG). >>>>> >>>>> A global dataflow is more powerful than any local ad-hoc method. >>>> >>>> My biggest concern is whether the LCM DF framework could support sinking >>>> *multiple* reverse-dependent non-store instructions together by *one* >>>> calling of LCM DF. If this is not supported, we need run multiple LCM >>>> until no new changes, it would be time consuming obviously (unless >>>> compiling time is not important here). >>> >>> As said it is used for PRE and there it most definitely can do that. >> >> I did some investigation about PRE and attached a case to show how it >> works, it is quite like store-motion, and actually there is a rtl-hoist >> pass in gcse.c which only works for code size. All of them are >> leveraging the LCM framework to move instructions upward or downward. 
>> >> PRE and rtl-hoist move instructions upward, they analyze/hash the SOURCE >> exprs and call pre_edge_lcm, store-motion and rtl-sink move instructions >> downward, so they analyze/hash the DEST exprs and call pre_edge_rev_lcm. >> The four problems are all converted to the LCM DF problem with >> n_basic_blocks * m_exprs of 4 matrix (antic, transp, avail, kill) as input >> and two outputs of where to insert/delete. >> >> PRE scan each instruction and hash the SRC to table without *checking the >> relationship between instructions*, for the case attached, BB 37, BB 38 >> and BB 41 both contains SOURCE expr "r262:DI+r139:DI", but BB 37 and BB 41 >> save it to index 106, BB 38 save it to index 110. After finishing this pass, >> "r262:DI+r139:DI" BB41 is replaced with "r194:DI=r452:DI", then insert >> expr to BB 75~BB 80 to create full redundancies from partial redundancies, >> finally update instruction in BB 37. > > I'm not familiar with the actual PRE code but reading the toplevel comment > it seems that indeed it can only handle expressions contained in a single > insn unless a REG_EQUAL note provides a short-hand for the larger one. > > That of course means it would need to mark things as not transparent > for correctness where they'd be if moved together. Now, nothing > prevents you changing the granularity of what you feed LCM. > > So originally we arrived at looking into LCM because there's already > a (store) sinking pass on RTL (using LCM) so adding another (loop-special) > one didn't look like the best obvious solution. > > That said, LCM would work for single-instruction expressions. > Alternatively a greedy algorithm like you prototyped could be used. > Another pass to look at would be RTL invariant motion which seems to > compute some kind of dependency graph - not sure if that would be > adaptable for the reverse CFG problem. 
> Actually my RTL sinking pass patch is borrowed from RTL loop invariant motion; it is quite limited since it only moves instructions from the loop header to the loop exits, though it could be refined with various algorithms. Compared to the initial method of running the gimple sink pass once more, it seems much more complicated and limited without gaining an obvious performance benefit. Shall we go back to considering a gimple sink2 pass as originally proposed, since we are in stage1 now? Thanks, Xionghu ^ permalink raw reply [flat|nested] 20+ messages in thread
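To make the chained-dependency point in the thread concrete, here is a stand-alone toy model (hypothetical representation and names, not GCC code) of a greedy per-sweep sinker like the prototyped RTL pass: with insns visited top-down, a def-use chain such as #235 -> #538 from the dump (r452 = r262 + r139; r194 = r452) needs one sweep per link of the chain before everything leaves the loop.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model of loop-sinking with def-use dependencies (hypothetical,
// not GCC's IR): each insn defines one pseudo and reads at most one.
struct insn
{
  std::string def;   // pseudo defined by this insn
  std::string use;   // pseudo read by this insn ("" if none)
  bool in_loop;      // still inside the loop body?
};

// Does any insn remaining inside the loop still read DEF?
static bool used_in_loop (const std::vector<insn> &insns,
                          const std::string &def)
{
  for (const insn &i : insns)
    if (i.in_loop && i.use == def)
      return true;
  return false;
}

// One greedy top-down sweep: sink every in-loop insn whose result is
// no longer needed inside the loop.  Returns how many insns moved.
static int sink_once (std::vector<insn> &insns)
{
  int moved = 0;
  for (insn &i : insns)
    if (i.in_loop && !used_in_loop (insns, i.def))
      {
        i.in_loop = false;
        ++moved;
      }
  return moved;
}

// Iterate sweeps to a fixed point; returns the number of sweeps taken.
int sink_to_fixpoint (std::vector<insn> &insns)
{
  int sweeps = 0;
  while (sink_once (insns) > 0)
    ++sweeps;
  return sweeps;
}
```

Visiting the insns in reverse order within a sweep, or using a global dataflow such as LCM, would handle the whole chain in one pass; that iteration-count trade-off is exactly what the thread is weighing.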
* Re: [RFC] Run pass_sink_code once more after ivopts/fre 2021-05-14 7:10 ` Xionghu Luo @ 2021-05-17 8:11 ` Richard Biener 2021-05-18 5:20 ` [PATCH] Run pass_sink_code once more before store_mergin Xionghu Luo 0 siblings, 1 reply; 20+ messages in thread From: Richard Biener @ 2021-05-17 8:11 UTC (permalink / raw) To: Xionghu Luo; +Cc: segher, wschmidt, linkw, gcc-patches, dje.gcc On Fri, 14 May 2021, Xionghu Luo wrote: > Hi Richi, > > On 2021/4/21 19:54, Richard Biener wrote: > > On Tue, 20 Apr 2021, Xionghu Luo wrote: > > > >> > >> > >> On 2021/4/15 19:34, Richard Biener wrote: > >>> On Thu, 15 Apr 2021, Xionghu Luo wrote: > >>> > >>>> Thanks, > >>>> > >>>> On 2021/4/14 14:41, Richard Biener wrote: > >>>>>> "#538,#235,#234,#233" will all be sunk from bb 35 to bb 37 by rtl-sink, > >>>>>> but it moves #538 first, then #235, there is strong dependency here. It > >>>>>> seemsdoesn't like the LCM framework that could solve all and do the > >>>>>> delete-insert in one iteration. > >>>>> So my question was whether we want to do both within the LCM store > >>>>> sinking framework. The LCM dataflow is also used by RTL PRE which > >>>>> handles both loads and non-loads so in principle it should be able > >>>>> to handle stores and non-stores for the sinking case (PRE on the > >>>>> reverse CFG). > >>>>> > >>>>> A global dataflow is more powerful than any local ad-hoc method. > >>>> > >>>> My biggest concern is whether the LCM DF framework could support sinking > >>>> *multiple* reverse-dependent non-store instructions together by *one* > >>>> calling of LCM DF. If this is not supported, we need run multiple LCM > >>>> until no new changes, it would be time consuming obviously (unless > >>>> compiling time is not important here). > >>> > >>> As said it is used for PRE and there it most definitely can do that. 
> >> > >> I did some investigation about PRE and attached a case to show how it > >> works, it is quite like store-motion, and actually there is a rtl-hoist > >> pass in gcse.c which only works for code size. All of them are > >> leveraging the LCM framework to move instructions upward or downward. > >> > >> PRE and rtl-hoist move instructions upward, they analyze/hash the SOURCE > >> exprs and call pre_edge_lcm, store-motion and rtl-sink move instructions > >> downward, so they analyze/hash the DEST exprs and call pre_edge_rev_lcm. > >> The four problems are all converted to the LCM DF problem with > >> n_basic_blocks * m_exprs of 4 matrix (antic, transp, avail, kill) as input > >> and two outputs of where to insert/delete. > >> > >> PRE scan each instruction and hash the SRC to table without *checking the > >> relationship between instructions*, for the case attached, BB 37, BB 38 > >> and BB 41 both contains SOURCE expr "r262:DI+r139:DI", but BB 37 and BB 41 > >> save it to index 106, BB 38 save it to index 110. After finishing this pass, > >> "r262:DI+r139:DI" BB41 is replaced with "r194:DI=r452:DI", then insert > >> expr to BB 75~BB 80 to create full redundancies from partial redundancies, > >> finally update instruction in BB 37. > > > > I'm not familiar with the actual PRE code but reading the toplevel comment > > it seems that indeed it can only handle expressions contained in a single > > insn unless a REG_EQUAL note provides a short-hand for the larger one. > > > > That of course means it would need to mark things as not transparent > > for correctness where they'd be if moved together. Now, nothing > > prevents you changing the granularity of what you feed LCM. > > > > So originally we arrived at looking into LCM because there's already > > a (store) sinking pass on RTL (using LCM) so adding another (loop-special) > > one didn't look like the best obvious solution. > > > > That said, LCM would work for single-instruction expressions. 
> > Alternatively a greedy algorithm like you prototyped could be used. > > Another pass to look at would be RTL invariant motion which seems to > > compute some kind of dependency graph - not sure if that would be > > adaptable for the reverse CFG problem. > > > > Actually my RTL sinking pass patch is borrowed from RTL loop invariant > motion, it is quite limited since only moves instructions from loop header > to loop exits, though it could be refined with various of algorithms. > Compared to the initial method of running gimple sink pass once more, > it seems much more complicated and limited without gaining obvious performance > benefit, shall we turn back to consider gimple sink2 pass from original since > we are in stage1 now? OK, so while there might be new sinking opportunities exposed during RTL expansion and early RTL opts we can consider adding another sink pass on GIMPLE. Since it's basically a scheduling optimization placement shouldn't matter much but I suppose we should run it before store merging, so anywhere between cd_dce and that. Richard. ^ permalink raw reply [flat|nested] 20+ messages in thread
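Mechanically, listing the same pass a second time in passes.def requires the pass to implement the clone () hook; a pass whose default clone () is used in that situation makes GCC report an internal error. The sketch below is a simplified stand-alone mock of that protocol: the class names mirror gcc/tree-pass.h, but these definitions are stand-ins for illustration, not the real GCC classes.

```cpp
#include <cassert>

// Simplified stand-ins for GCC's pass infrastructure (cf. gcc/tree-pass.h).
namespace gcc { struct context {}; }

struct opt_pass
{
  gcc::context *m_ctxt;
  explicit opt_pass (gcc::context *ctxt) : m_ctxt (ctxt) {}
  virtual ~opt_pass () {}
  // Default: the pass does not support cloning, so it may appear only
  // once in the pipeline (real GCC raises an internal error here).
  virtual opt_pass *clone () { return nullptr; }
};

struct pass_sink_code : opt_pass
{
  explicit pass_sink_code (gcc::context *ctxt) : opt_pass (ctxt) {}
  // The kind of one-line addition needed in tree-ssa-sink.c: allow a
  // second instance of the pass, sharing the same compiler context.
  opt_pass *clone () override { return new pass_sink_code (m_ctxt); }
};
```

With this hook in place, adding `NEXT_PASS (pass_sink_code)` a second time in passes.def simply instantiates another copy of the pass.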
* Re: [PATCH] Run pass_sink_code once more before store_mergin 2021-05-17 8:11 ` Richard Biener @ 2021-05-18 5:20 ` Xionghu Luo 2021-05-18 7:02 ` Richard Biener 2021-09-02 15:45 ` Martin Jambor 0 siblings, 2 replies; 20+ messages in thread From: Xionghu Luo @ 2021-05-18 5:20 UTC (permalink / raw) To: Richard Biener; +Cc: segher, wschmidt, linkw, gcc-patches, dje.gcc [-- Attachment #1: Type: text/plain, Size: 4731 bytes --] Hi, On 2021/5/17 16:11, Richard Biener wrote: > On Fri, 14 May 2021, Xionghu Luo wrote: > >> Hi Richi, >> >> On 2021/4/21 19:54, Richard Biener wrote: >>> On Tue, 20 Apr 2021, Xionghu Luo wrote: >>> >>>> >>>> >>>> On 2021/4/15 19:34, Richard Biener wrote: >>>>> On Thu, 15 Apr 2021, Xionghu Luo wrote: >>>>> >>>>>> Thanks, >>>>>> >>>>>> On 2021/4/14 14:41, Richard Biener wrote: >>>>>>>> "#538,#235,#234,#233" will all be sunk from bb 35 to bb 37 by rtl-sink, >>>>>>>> but it moves #538 first, then #235, there is strong dependency here. It >>>>>>>> seemsdoesn't like the LCM framework that could solve all and do the >>>>>>>> delete-insert in one iteration. >>>>>>> So my question was whether we want to do both within the LCM store >>>>>>> sinking framework. The LCM dataflow is also used by RTL PRE which >>>>>>> handles both loads and non-loads so in principle it should be able >>>>>>> to handle stores and non-stores for the sinking case (PRE on the >>>>>>> reverse CFG). >>>>>>> >>>>>>> A global dataflow is more powerful than any local ad-hoc method. >>>>>> >>>>>> My biggest concern is whether the LCM DF framework could support sinking >>>>>> *multiple* reverse-dependent non-store instructions together by *one* >>>>>> calling of LCM DF. If this is not supported, we need run multiple LCM >>>>>> until no new changes, it would be time consuming obviously (unless >>>>>> compiling time is not important here). >>>>> >>>>> As said it is used for PRE and there it most definitely can do that. 
>>>> >>>> I did some investigation about PRE and attached a case to show how it >>>> works, it is quite like store-motion, and actually there is a rtl-hoist >>>> pass in gcse.c which only works for code size. All of them are >>>> leveraging the LCM framework to move instructions upward or downward. >>>> >>>> PRE and rtl-hoist move instructions upward, they analyze/hash the SOURCE >>>> exprs and call pre_edge_lcm, store-motion and rtl-sink move instructions >>>> downward, so they analyze/hash the DEST exprs and call pre_edge_rev_lcm. >>>> The four problems are all converted to the LCM DF problem with >>>> n_basic_blocks * m_exprs of 4 matrix (antic, transp, avail, kill) as input >>>> and two outputs of where to insert/delete. >>>> >>>> PRE scan each instruction and hash the SRC to table without *checking the >>>> relationship between instructions*, for the case attached, BB 37, BB 38 >>>> and BB 41 both contains SOURCE expr "r262:DI+r139:DI", but BB 37 and BB 41 >>>> save it to index 106, BB 38 save it to index 110. After finishing this pass, >>>> "r262:DI+r139:DI" BB41 is replaced with "r194:DI=r452:DI", then insert >>>> expr to BB 75~BB 80 to create full redundancies from partial redundancies, >>>> finally update instruction in BB 37. >>> >>> I'm not familiar with the actual PRE code but reading the toplevel comment >>> it seems that indeed it can only handle expressions contained in a single >>> insn unless a REG_EQUAL note provides a short-hand for the larger one. >>> >>> That of course means it would need to mark things as not transparent >>> for correctness where they'd be if moved together. Now, nothing >>> prevents you changing the granularity of what you feed LCM. >>> >>> So originally we arrived at looking into LCM because there's already >>> a (store) sinking pass on RTL (using LCM) so adding another (loop-special) >>> one didn't look like the best obvious solution. >>> >>> That said, LCM would work for single-instruction expressions. 
>>> Alternatively a greedy algorithm like you prototyped could be used. >>> Another pass to look at would be RTL invariant motion which seems to >>> compute some kind of dependency graph - not sure if that would be >>> adaptable for the reverse CFG problem. >>> >> >> Actually my RTL sinking pass patch is borrowed from RTL loop invariant >> motion, it is quite limited since only moves instructions from loop header >> to loop exits, though it could be refined with various of algorithms. >> Compared to the initial method of running gimple sink pass once more, >> it seems much more complicated and limited without gaining obvious performance >> benefit, shall we turn back to consider gimple sink2 pass from original since >> we are in stage1 now? > > OK, so while there might be new sinking opportunities exposed during > RTL expansion and early RTL opts we can consider adding another sink pass > on GIMPLE. Since it's basically a scheduling optimization placement > shouldn't matter much but I suppose we should run it before store > merging, so anywhere between cd_dce and that. > > Richard. > Attached is the patch as discussed; putting it before store_merging is fine. Regression testing passed on P8LE. OK for trunk? :) Thanks, Xionghu [-- Attachment #2: 0001-Run-pass_sink_code-once-more-before-store_merging.patch --] [-- Type: text/plain, Size: 14563 bytes --] From 7fcc6ca9ef3b6acbfbcbd3da4be1d1c0eef4be80 Mon Sep 17 00:00:00 2001 From: Xiong Hu Luo <luoxhu@linux.ibm.com> Date: Mon, 17 May 2021 20:46:15 -0500 Subject: [PATCH] Run pass_sink_code once more before store_merging The gimple sink code pass runs quite early, and some new opportunities may be exposed by later gimple optimization passes, so this patch runs the sink code pass once more before store_merging.
For detailed discussion, please refer to: https://gcc.gnu.org/pipermail/gcc-patches/2020-December/562352.html Tested the SPEC2017 performance on P8LE: 544.nab_r is improved by 2.43%, with no big changes to the other cases; GEOMEAN is improved only slightly, by 0.25%. gcc/ChangeLog: * passes.def: Add sink_code before store_merging. * tree-ssa-sink.c (pass_sink_code::clone): New. gcc/testsuite/ChangeLog: * gcc.dg/tree-ssa/ssa-sink-1.c: Adjust. * gcc.dg/tree-ssa/ssa-sink-2.c: Ditto. * gcc.dg/tree-ssa/ssa-sink-3.c: Ditto. * gcc.dg/tree-ssa/ssa-sink-4.c: Ditto. * gcc.dg/tree-ssa/ssa-sink-5.c: Ditto. * gcc.dg/tree-ssa/ssa-sink-6.c: Ditto. * gcc.dg/tree-ssa/ssa-sink-7.c: Ditto. * gcc.dg/tree-ssa/ssa-sink-8.c: Ditto. * gcc.dg/tree-ssa/ssa-sink-9.c: Ditto. * gcc.dg/tree-ssa/ssa-sink-10.c: Ditto. * gcc.dg/tree-ssa/ssa-sink-13.c: Ditto. * gcc.dg/tree-ssa/ssa-sink-14.c: Ditto. * gcc.dg/tree-ssa/ssa-sink-16.c: Ditto. * gcc.dg/tree-ssa/ssa-sink-17.c: Ditto. * gcc.dg/tree-ssa/ssa-sink-18.c: New. --- gcc/passes.def | 1 + gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-1.c | 2 +- gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-10.c | 2 +- gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-13.c | 4 +- gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-14.c | 4 +- gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-16.c | 2 +- gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-17.c | 2 +- gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-18.c | 196 ++++++++++++++++++++ gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-2.c | 2 +- gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-3.c | 2 +- gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-4.c | 2 +- gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-5.c | 2 +- gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-6.c | 2 +- gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-7.c | 2 +- gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-8.c | 2 +- gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-9.c | 2 +- gcc/tree-ssa-sink.c | 1 + 17 files changed, 214 insertions(+), 16 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-18.c diff --git a/gcc/passes.def index
de39fa48b3d..945d2bc797c 100644 --- a/gcc/passes.def +++ b/gcc/passes.def @@ -348,6 +348,7 @@ along with GCC; see the file COPYING3. If not see NEXT_PASS (pass_phiopt, false /* early_p */); NEXT_PASS (pass_fold_builtins); NEXT_PASS (pass_optimize_widening_mul); + NEXT_PASS (pass_sink_code); NEXT_PASS (pass_store_merging); NEXT_PASS (pass_tail_calls); /* If DCE is not run before checking for uninitialized uses, diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-1.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-1.c index 411585a6dc4..57b501681f3 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-1.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-1.c @@ -7,4 +7,4 @@ foo (int a, int b, int c) return c ? x : a; } /* We should sink the x = a * b calculation into the branch that returns x. */ -/* { dg-final { scan-tree-dump-times "Sunk statements: 1" 1 "sink" } } */ +/* { dg-final { scan-tree-dump-times "Sunk statements: 1" 1 "sink1" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-10.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-10.c index 37e4d2fe687..535cb3208f5 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-10.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-10.c @@ -16,4 +16,4 @@ void foo (void) } } -/* { dg-final { scan-tree-dump-times "Sinking # VUSE" 4 "sink" } } */ +/* { dg-final { scan-tree-dump-times "Sinking # VUSE" 4 "sink1" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-13.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-13.c index a65ba35d4ba..584fd91f43a 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-13.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-13.c @@ -21,5 +21,5 @@ void test () /* We should sink/merge all stores and end up with a single BB. 
*/ -/* { dg-final { scan-tree-dump-times "MEM\[^\n\r\]* = 0;" 3 "sink" } } */ -/* { dg-final { scan-tree-dump-times "<bb " 1 "sink" } } */ +/* { dg-final { scan-tree-dump-times "MEM\[^\n\r\]* = 0;" 3 "sink1" } } */ +/* { dg-final { scan-tree-dump-times "<bb " 1 "sink1" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-14.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-14.c index 771cd4420c4..f5418b06deb 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-14.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-14.c @@ -13,5 +13,5 @@ void foo (int b) /* We should have sunk the store and inserted a PHI to merge the stored values. */ -/* { dg-final { scan-tree-dump-times " = PHI" 1 "sink" } } */ -/* { dg-final { scan-tree-dump-times "x = " 1 "sink" } } */ +/* { dg-final { scan-tree-dump-times " = PHI" 1 "sink1" } } */ +/* { dg-final { scan-tree-dump-times "x = " 1 "sink1" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-16.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-16.c index 610c8d60ebe..012b165fbab 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-16.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-16.c @@ -10,5 +10,5 @@ int f(int n) return j; } -/* { dg-final { scan-tree-dump "Sinking j_. = __builtin_ffs" "sink" } } */ +/* { dg-final { scan-tree-dump "Sinking j_. = __builtin_ffs" "sink1" } } */ /* { dg-final { scan-tree-dump "return 2;" "optimized" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-17.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-17.c index cf2e2a0f766..d0aeeb312cc 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-17.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-17.c @@ -12,4 +12,4 @@ int my_f(int a, int b) } /* We should sink the call to pure_f to the if block. 
*/ -/* { dg-final { scan-tree-dump "Sinking # VUSE" "sink" } } */ +/* { dg-final { scan-tree-dump "Sinking # VUSE" "sink1" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-18.c new file mode 100644 index 00000000000..b2f5ca9c847 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-18.c @@ -0,0 +1,196 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-sink-stats" } */ + +#include <stdint.h> + +#define HLOG 16 +#define MAX_LIT (1 << 5) +typedef const uint8_t *LZF_HSLOT; +typedef LZF_HSLOT LZF_STATE[1 << (HLOG)]; + +int +compute_on_bytes (uint8_t *in_data, int in_len, uint8_t *out_data, int out_len) +{ + LZF_STATE htab; + + uint8_t *ip = in_data; + uint8_t *op = out_data; + uint8_t *in_end = ip + in_len; + uint8_t *out_end = op + out_len; + uint8_t *ref; + + unsigned long off; + unsigned int hval; + int lit; + + if (!in_len || !out_len) + return 0; + + lit = 0; + op++; + hval = (((ip[0]) << 8) | ip[1]); + + while (ip < in_end - 2) + { + uint8_t *hslot; + + hval = (((hval) << 8) | ip[2]); + hslot = (uint8_t*)(htab + (((hval >> (3 * 8 - 16)) - hval * 5) & ((1 << (16)) - 1))); + + ref = *hslot + in_data; + *hslot = ip - in_data; + + if (1 && (off = ip - ref - 1) < (1 << 13) && ref > in_data + && ref[2] == ip[2] + && ((ref[1] << 8) | ref[0]) == ((ip[1] << 8) | ip[0])) + { + unsigned int len = 2; + unsigned int maxlen = in_end - ip - len; + maxlen + = maxlen > ((1 << 8) + (1 << 3)) ? 
((1 << 8) + (1 << 3)) : maxlen; + + if ((op + 3 + 1 >= out_end) != 0) + if (op - !lit + 3 + 1 >= out_end) + return 0; + + op[-lit - 1] = lit - 1; + op -= !lit; + + for (;;) + { + if (maxlen > 16) + { + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + } + + do + { + len++; + } + while (len < maxlen && ip[len] == ref[len]); + + break; + } + + len -= 2; + ip++; + + if (len < 7) + { + *op++ = (off >> 8) + (len << 5); + } + else + { + *op++ = (off >> 8) + (7 << 5); + *op++ = len - 7; + } + *op++ = off; + lit = 0; + op++; + ip += len + 1; + + if (ip >= in_end - 2) + break; + + --ip; + --ip; + + hval = (((ip[0]) << 8) | ip[1]); + hval = (((hval) << 8) | ip[2]); + htab[(((hval >> (3 * 8 - 16)) - hval * 5) & ((1 << (16)) - 1))] + = (LZF_HSLOT)(ip - in_data); + ip++; + + hval = (((hval) << 8) | ip[2]); + htab[(((hval >> (3 * 8 - 16)) - hval * 5) & ((1 << (16)) - 1))] + = (LZF_HSLOT)(ip - in_data); + ip++; + } + else + { + if (op >= out_end) + return 0; + + lit++; + *op++ = *ip++; + + if (lit == (1 << 5)) + { + op[-lit - 1] = lit - 1; + lit = 0; + op++; + } + } + } + if (op + 3 > out_end) /* at most 3 bytes can be missing here */ + return 0; + + while (ip < in_end) + { + lit++; + *op++ = *ip++; + if (lit == MAX_LIT) + { + op[-lit - 1] = lit - 1; /* stop run */ + lit = 0; + op++; /* start run */ + } + } + + 
op[-lit - 1] = lit - 1; /* end run */ + op -= !lit; /* undo run if length is zero */ + + return op - out_data; + } + +/* { dg-final { scan-tree-dump-times "Sunk statements: 4" 1 "sink2" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-2.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-2.c index 6aa5a182a3a..a0b4734b1e0 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-2.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-2.c @@ -9,4 +9,4 @@ bar (int a, int b, int c) return y; } /* We should sink the x = a * b calculation into the else branch */ -/* { dg-final { scan-tree-dump-times "Sunk statements: 1" 1 "sink" } } */ +/* { dg-final { scan-tree-dump-times "Sunk statements: 1" 1 "sink1" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-3.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-3.c index 599997e0e6b..ad88ccc4f5b 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-3.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-3.c @@ -12,4 +12,4 @@ main (int argc) } } /* We should sink the a = argc + 1 calculation into the if branch */ -/* { dg-final { scan-tree-dump-times "Sunk statements: 1" 1 "sink" } } */ +/* { dg-final { scan-tree-dump-times "Sunk statements: 1" 1 "sink1" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-4.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-4.c index 784edd2fc87..1e3cfa93fa8 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-4.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-4.c @@ -17,4 +17,4 @@ main (int argc) foo2 (a); } /* We should sink the first a = b + c calculation into the else branch */ -/* { dg-final { scan-tree-dump-times "Sunk statements: 1" 1 "sink" } } */ +/* { dg-final { scan-tree-dump-times "Sunk statements: 1" 1 "sink1" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-5.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-5.c index dbdde39add6..f04da5da9b0 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-5.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-5.c @@ -44,4 +44,4 @@ void foo(int16_t runs[], uint8_t 
alpha[], int x, int count) } /* We should not sink the next_runs = runs + x calculation after the loop. */ -/* { dg-final { scan-tree-dump-times "Sunk statements:" 0 "sink" } } */ +/* { dg-final { scan-tree-dump-times "Sunk statements:" 0 "sink1" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-6.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-6.c index 1abae9f7943..31f5af330f9 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-6.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-6.c @@ -14,4 +14,4 @@ int foo(int *a, int r) /* *a = 1 should be sunk to the else block. */ -/* { dg-final { scan-tree-dump-times "Sinking" 1 "sink" } } */ +/* { dg-final { scan-tree-dump-times "Sinking" 1 "sink1" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-7.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-7.c index ec3288f4e69..bd748442edc 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-7.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-7.c @@ -15,4 +15,4 @@ int foo(int *a, int r, short *b) /* *a = 1 should be sunk to the else block. */ -/* { dg-final { scan-tree-dump-times "Sinking" 1 "sink" } } */ +/* { dg-final { scan-tree-dump-times "Sinking" 1 "sink1" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-8.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-8.c index 48af4218fc0..4b23b567fd0 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-8.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-8.c @@ -24,4 +24,4 @@ int foo(int *a, int r, short *b) /* *a = 1 should be sunk into the default case. */ -/* { dg-final { scan-tree-dump-times "Sinking" 1 "sink" } } */ +/* { dg-final { scan-tree-dump-times "Sinking" 1 "sink1" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-9.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-9.c index 509a76330a4..32bfc81741a 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-9.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-9.c @@ -15,4 +15,4 @@ int foo(int *a, int r, int *b) /* *a = 1 should be sunk to the else block. 
*/ -/* { dg-final { scan-tree-dump-times "Sinking" 1 "sink" } } */ +/* { dg-final { scan-tree-dump-times "Sinking" 1 "sink1" } } */ diff --git a/gcc/tree-ssa-sink.c b/gcc/tree-ssa-sink.c index d33e56ef915..d252cbb5c51 100644 --- a/gcc/tree-ssa-sink.c +++ b/gcc/tree-ssa-sink.c @@ -819,6 +819,7 @@ public: /* opt_pass methods: */ virtual bool gate (function *) { return flag_tree_sink != 0; } virtual unsigned int execute (function *); + opt_pass *clone (void) { return new pass_sink_code (m_ctxt); } }; // class pass_sink_code -- 2.27.0.90.geebb51ba8c ^ permalink raw reply [flat|nested] 20+ messages in thread
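For readers reconstructing what the new gcc.dg/tree-ssa/ssa-sink-18.c testcase exercises: the hot inner loop computes values on every iteration whose results are only consumed once the loop exits, and those are the statements sink2 moves to the exit blocks. A reduced sketch of that pattern (the function name and shape are illustrative, not taken from the patch):

```c
#include <stdint.h>
#include <stddef.h>

/* Reduced sketch of the ssa-sink-18.c pattern: 'end' is recomputed on
   every iteration, but only its final value is live after the loop, so
   a sinking pass may legally move that address computation to the loop
   exit, like the "Sinking _419 = (uint8_t *) _420;" lines quoted in
   the dumps above.  */
size_t match_len (const uint8_t *ip, const uint8_t *ref, size_t maxlen)
{
  size_t len = 0;
  const uint8_t *end = ip;       /* value only used after the loop */
  while (len < maxlen && ip[len] == ref[len])
    {
      len++;
      end = ip + len;            /* sinkable: no use inside the loop */
    }
  return (size_t) (end - ip);
}
```

Compiling something like this with -O2 and inspecting the sink dump, one would expect the 'end = ip + len' computation to be reported as sunk out of the loop body.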
* Re: [PATCH] Run pass_sink_code once more before store_mergin 2021-05-18 5:20 ` [PATCH] Run pass_sink_code once more before store_mergin Xionghu Luo @ 2021-05-18 7:02 ` Richard Biener 2021-05-18 9:00 ` Xionghu Luo 2021-09-02 15:45 ` Martin Jambor 1 sibling, 1 reply; 20+ messages in thread From: Richard Biener @ 2021-05-18 7:02 UTC (permalink / raw) To: Xionghu Luo; +Cc: segher, wschmidt, linkw, gcc-patches, dje.gcc On Tue, 18 May 2021, Xionghu Luo wrote: > Hi, > > On 2021/5/17 16:11, Richard Biener wrote: > > On Fri, 14 May 2021, Xionghu Luo wrote: > > > >> Hi Richi, > >> > >> On 2021/4/21 19:54, Richard Biener wrote: > >>> On Tue, 20 Apr 2021, Xionghu Luo wrote: > >>> > >>>> > >>>> > >>>> On 2021/4/15 19:34, Richard Biener wrote: > >>>>> On Thu, 15 Apr 2021, Xionghu Luo wrote: > >>>>> > >>>>>> Thanks, > >>>>>> > >>>>>> On 2021/4/14 14:41, Richard Biener wrote: > >>>>>>>> "#538,#235,#234,#233" will all be sunk from bb 35 to bb 37 by > >>>>>>>> rtl-sink, > >>>>>>>> but it moves #538 first, then #235, there is strong dependency here. > >>>>>>>> It > >>>>>>>> seemsdoesn't like the LCM framework that could solve all and do the > >>>>>>>> delete-insert in one iteration. > >>>>>>> So my question was whether we want to do both within the LCM store > >>>>>>> sinking framework. The LCM dataflow is also used by RTL PRE which > >>>>>>> handles both loads and non-loads so in principle it should be able > >>>>>>> to handle stores and non-stores for the sinking case (PRE on the > >>>>>>> reverse CFG). > >>>>>>> > >>>>>>> A global dataflow is more powerful than any local ad-hoc method. > >>>>>> > >>>>>> My biggest concern is whether the LCM DF framework could support > >>>>>> sinking > >>>>>> *multiple* reverse-dependent non-store instructions together by *one* > >>>>>> calling of LCM DF. If this is not supported, we need run multiple LCM > >>>>>> until no new changes, it would be time consuming obviously (unless > >>>>>> compiling time is not important here). 
> >>>>> > >>>>> As said it is used for PRE and there it most definitely can do that. > >>>> > >>>> I did some investigation about PRE and attached a case to show how it > >>>> works, it is quite like store-motion, and actually there is a rtl-hoist > >>>> pass in gcse.c which only works for code size. All of them are > >>>> leveraging the LCM framework to move instructions upward or downward. > >>>> > >>>> PRE and rtl-hoist move instructions upward, they analyze/hash the SOURCE > >>>> exprs and call pre_edge_lcm, store-motion and rtl-sink move instructions > >>>> downward, so they analyze/hash the DEST exprs and call pre_edge_rev_lcm. > >>>> The four problems are all converted to the LCM DF problem with > >>>> n_basic_blocks * m_exprs of 4 matrix (antic, transp, avail, kill) as > >>>> input > >>>> and two outputs of where to insert/delete. > >>>> > >>>> PRE scan each instruction and hash the SRC to table without *checking the > >>>> relationship between instructions*, for the case attached, BB 37, BB 38 > >>>> and BB 41 both contains SOURCE expr "r262:DI+r139:DI", but BB 37 and BB > >>>> 41 > >>>> save it to index 106, BB 38 save it to index 110. After finishing this > >>>> pass, > >>>> "r262:DI+r139:DI" BB41 is replaced with "r194:DI=r452:DI", then insert > >>>> expr to BB 75~BB 80 to create full redundancies from partial > >>>> redundancies, > >>>> finally update instruction in BB 37. > >>> > >>> I'm not familiar with the actual PRE code but reading the toplevel comment > >>> it seems that indeed it can only handle expressions contained in a single > >>> insn unless a REG_EQUAL note provides a short-hand for the larger one. > >>> > >>> That of course means it would need to mark things as not transparent > >>> for correctness where they'd be if moved together. Now, nothing > >>> prevents you changing the granularity of what you feed LCM. 
> >>> > >>> So originally we arrived at looking into LCM because there's already > >>> a (store) sinking pass on RTL (using LCM) so adding another (loop-special) > >>> one didn't look like the best obvious solution. > >>> > >>> That said, LCM would work for single-instruction expressions. > >>> Alternatively a greedy algorithm like you prototyped could be used. > >>> Another pass to look at would be RTL invariant motion which seems to > >>> compute some kind of dependency graph - not sure if that would be > >>> adaptable for the reverse CFG problem. > >>> > >> > >> Actually my RTL sinking pass patch is borrowed from RTL loop invariant > >> motion, it is quite limited since only moves instructions from loop header > >> to loop exits, though it could be refined with various of algorithms. > >> Compared to the initial method of running gimple sink pass once more, > >> it seems much more complicated and limited without gaining obvious > >> performance > >> benefit, shall we turn back to consider gimple sink2 pass from original > >> since > >> we are in stage1 now? > > > > OK, so while there might be new sinking opportunities exposed during > > RTL expansion and early RTL opts we can consider adding another sink pass > > on GIMPLE. Since it's basically a scheduling optimization placement > > shouldn't matter much but I suppose we should run it before store > > merging, so anywhere between cd_dce and that. > > > > Richard. > > > > Attached the patch as discussed, put it before store_merging is fine. > Regression tested pass on P8LE, OK for trunk? :) Can you, for the new gcc.dg/tree-ssa/ssa-sink-18.c testcase, add a comment explaining what operations we expect to sink? The testcase is likely somewhat fragile in the exact number of sinkings (can you check on some other target and maybe P8BE with -m32 for example?), so for future adjustments it would be nice to easiliy figure out what we expect to happen. OK with that change. Richard. 
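To make the placement rationale above concrete: one reason the cloned pass sits just before store_merging is that sinking stores out of conditional arms can create the adjacent same-block stores that store_merging then combines (the pattern the updated ssa-sink-13.c checks for, "sink/merge all stores and end up with a single BB"). A minimal, hypothetical example of that interaction:

```c
struct bytes { char a, b, c, d; };

/* Both arms perform the same two stores, so the sink pass can move
   them down to the join point; after that, store_merging sees four
   adjacent byte stores in one block and can emit a single wider
   store.  The type and function here are illustrative only.  */
void clear_all (struct bytes *p, int cond)
{
  if (cond)
    {
      p->a = 0;
      p->b = 0;
    }
  else
    {
      p->a = 0;
      p->b = 0;
    }
  p->c = 0;
  p->d = 0;
}
```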
* Re: [PATCH] Run pass_sink_code once more before store_mergin 2021-05-18 7:02 ` Richard Biener @ 2021-05-18 9:00 ` Xionghu Luo 2021-05-18 10:34 ` Richard Biener 0 siblings, 1 reply; 20+ messages in thread From: Xionghu Luo @ 2021-05-18 9:00 UTC (permalink / raw) To: Richard Biener; +Cc: segher, wschmidt, linkw, gcc-patches, dje.gcc Hi, On 2021/5/18 15:02, Richard Biener wrote: > Can you, for the new gcc.dg/tree-ssa/ssa-sink-18.c testcase, add > a comment explaining what operations we expect to sink? The testcase > is likely somewhat fragile in the exact number of sinkings > (can you check on some other target and maybe P8BE with -m32 for > example?), so for future adjustments it would be nice to easiliy > figure out what we expect to happen. > > OK with that change. Thanks a lot for the reminder! ssa-sink-18.c generates different code for m32 and m64 exactly due to different type size conversion and ivopts selection, since -m32 and -m64 couldn't co-exist in one test file, shall I restrict it to -m64 only or check target lp64/target ilp32? I've verified this case shows same behavior on X86, AArch64 and Power for both m32 and m64. -m32: <bb 29> [local count: 75120046]: # len_155 = PHI <len_127(28), len_182(55)> len_182 = len_155 + 1; _35 = (unsigned int) ip_249; _36 = _35 + len_182; _380 = (uint8_t *) _36; if (maxlen_179 > len_182) goto <bb 30>; [94.50%] else goto <bb 31>; [5.50%] ... Sinking _329 = (uint8_t *) _36; from bb 29 to bb 86 Sinking _36 = _35 + len_182; from bb 29 to bb 86 Sinking _35 = (unsigned int) ip_249; from bb 29 to bb 86 Pass statistics of "sink": ---------------- Sunk statements: 3 -m64: <bb 29> [local count: 75120046]: # ivtmp.23_34 = PHI <ivtmp.23_36(28), ivtmp.23_35(55)> _38 = (unsigned int) ivtmp.23_34; len_161 = _38 + 4294967295; _434 = (unsigned long) ip_254; _433 = ivtmp.23_34 + _434; _438 = (uint8_t *) _433; if (_38 < maxlen_187) goto <bb 30>; [94.50%] else goto <bb 31>; [5.50%] ... 
Sinking _367 = (uint8_t *) _320; from bb 31 to bb 90 Sinking _320 = _321 + ivtmp.25_326; from bb 31 to bb 90 Sinking _321 = (unsigned long) ip_229; from bb 31 to bb 90 Sinking len_158 = _322 + 4294967295; from bb 31 to bb 33 Pass statistics of "sink": ---------------- Sunk statements: 4 Regarding to the comments part: diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-18.c index 52b9a74b65f..5147f7b85cd 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-18.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-18.c @@ -193,16 +193,17 @@ compute_on_bytes (uint8_t *in_data, int in_len, uint8_t *out_data, int out_len) return op - out_data; } + /* For this case, pass sink2 sinks statements from hot loop header to loop + exits after gimple loop optimizations, which generates instructions executed + each iteration in loop, but the results are used outside of loop: + With -m64, + "Sinking _367 = (uint8_t *) _320; + from bb 31 to bb 90 + Sinking _320 = _321 + ivtmp.25_326; + from bb 31 to bb 90 + Sinking _321 = (unsigned long) ip_229; + from bb 31 to bb 90 + Sinking len_158 = _322 + 4294967295; + from bb 31 to bb 33" */ - /* { dg-final { scan-tree-dump-times "Sunk statements: 4" 1 "sink2" } } */ + /* { dg-final { scan-tree-dump-times "Sunk statements: 4" 1 "sink2" { target lp64 } } } */ + /* { dg-final { scan-tree-dump-times "Sunk statements: 3" 1 "sink2" { target ilp32 } } } */ -- Thanks, Xionghu ^ permalink raw reply [flat|nested] 20+ messages in thread
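The -m32/-m64 difference in the dumps above comes from ivopts choosing a pointer-width induction variable: on LP64 targets the counter is 'unsigned long', so the 32-bit 'len' needed after the loop is produced by an extra narrowing statement ('len = (unsigned int) ivtmp + 4294967295', i.e. len - 1 in modular arithmetic), giving one more sinkable statement than on ILP32, where the counter is already 32 bits wide. An illustrative reduction of the loop in question (not the testcase itself):

```c
#include <stdint.h>

/* The post-loop adjustment of 'len' is the statement that exists only
   on LP64 as a separate narrowing + add, and that sink2 moves past
   the exit test; names are illustrative.  */
unsigned int scan (const uint8_t *ip, const uint8_t *ref,
                   unsigned int maxlen)
{
  unsigned int len = 2;
  do
    len++;
  while (len < maxlen && ip[len] == ref[len]);
  return len - 1;   /* shows up as 'len + 4294967295' in the dumps */
}
```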
* Re: [PATCH] Run pass_sink_code once more before store_mergin 2021-05-18 9:00 ` Xionghu Luo @ 2021-05-18 10:34 ` Richard Biener 2021-05-19 8:03 ` Bernd Edlinger 0 siblings, 1 reply; 20+ messages in thread From: Richard Biener @ 2021-05-18 10:34 UTC (permalink / raw) To: Xionghu Luo; +Cc: segher, wschmidt, linkw, gcc-patches, dje.gcc On Tue, 18 May 2021, Xionghu Luo wrote: > Hi, > > On 2021/5/18 15:02, Richard Biener wrote: > > Can you, for the new gcc.dg/tree-ssa/ssa-sink-18.c testcase, add > > a comment explaining what operations we expect to sink? The testcase > > is likely somewhat fragile in the exact number of sinkings > > (can you check on some other target and maybe P8BE with -m32 for > > example?), so for future adjustments it would be nice to easiliy > > figure out what we expect to happen. > > > > OK with that change. > > Thanks a lot for the reminder! ssa-sink-18.c generates different code > for m32 and m64 exactly due to different type size conversion and ivopts > selection, since -m32 and -m64 couldn't co-exist in one test file, shall > I restrict it to -m64 only or check target lp64/target ilp32? > I've verified this case shows same behavior on X86, AArch64 and Power for > both m32 and m64. > > -m32: > <bb 29> [local count: 75120046]: > # len_155 = PHI <len_127(28), len_182(55)> > len_182 = len_155 + 1; > _35 = (unsigned int) ip_249; > _36 = _35 + len_182; > _380 = (uint8_t *) _36; > if (maxlen_179 > len_182) > goto <bb 30>; [94.50%] > else > goto <bb 31>; [5.50%] > ... 
> > Sinking _329 = (uint8_t *) _36; > from bb 29 to bb 86 > Sinking _36 = _35 + len_182; > from bb 29 to bb 86 > Sinking _35 = (unsigned int) ip_249; > from bb 29 to bb 86 > > Pass statistics of "sink": ---------------- > Sunk statements: 3 > > > -m64: > <bb 29> [local count: 75120046]: > # ivtmp.23_34 = PHI <ivtmp.23_36(28), ivtmp.23_35(55)> > _38 = (unsigned int) ivtmp.23_34; > len_161 = _38 + 4294967295; > _434 = (unsigned long) ip_254; > _433 = ivtmp.23_34 + _434; > _438 = (uint8_t *) _433; > if (_38 < maxlen_187) > goto <bb 30>; [94.50%] > else > goto <bb 31>; [5.50%] > ... > > Sinking _367 = (uint8_t *) _320; > from bb 31 to bb 90 > Sinking _320 = _321 + ivtmp.25_326; > from bb 31 to bb 90 > Sinking _321 = (unsigned long) ip_229; > from bb 31 to bb 90 > Sinking len_158 = _322 + 4294967295; > from bb 31 to bb 33 > > Pass statistics of "sink": ---------------- > Sunk statements: 4 > > > Regarding to the comments part: > > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-18.c > index 52b9a74b65f..5147f7b85cd 100644 > --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-18.c > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-18.c > @@ -193,16 +193,17 @@ compute_on_bytes (uint8_t *in_data, int in_len, uint8_t *out_data, int out_len) > return op - out_data; > } > + /* For this case, pass sink2 sinks statements from hot loop header to loop > + exits after gimple loop optimizations, which generates instructions executed > + each iteration in loop, but the results are used outside of loop: > + With -m64, > + "Sinking _367 = (uint8_t *) _320; > + from bb 31 to bb 90 > + Sinking _320 = _321 + ivtmp.25_326; > + from bb 31 to bb 90 > + Sinking _321 = (unsigned long) ip_229; > + from bb 31 to bb 90 > + Sinking len_158 = _322 + 4294967295; > + from bb 31 to bb 33" */ > > - /* { dg-final { scan-tree-dump-times "Sunk statements: 4" 1 "sink2" } } */ > + /* { dg-final { scan-tree-dump-times "Sunk statements: 4" 1 "sink2" { target lp64 } } } 
*/ > + /* { dg-final { scan-tree-dump-times "Sunk statements: 3" 1 "sink2" { target ilp32 } } } */ Yes, that looks good. Thanks, Richard. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Run pass_sink_code once more before store_mergin 2021-05-18 10:34 ` Richard Biener @ 2021-05-19 8:03 ` Bernd Edlinger 0 siblings, 0 replies; 20+ messages in thread From: Bernd Edlinger @ 2021-05-19 8:03 UTC (permalink / raw) To: Richard Biener, Xionghu Luo; +Cc: gcc-patches, dje.gcc, wschmidt, linkw, segher [-- Attachment #1: Type: text/plain, Size: 3875 bytes --] On 5/18/21 12:34 PM, Richard Biener wrote: > On Tue, 18 May 2021, Xionghu Luo wrote: > >> Hi, >> >> On 2021/5/18 15:02, Richard Biener wrote: >>> Can you, for the new gcc.dg/tree-ssa/ssa-sink-18.c testcase, add >>> a comment explaining what operations we expect to sink? The testcase >>> is likely somewhat fragile in the exact number of sinkings >>> (can you check on some other target and maybe P8BE with -m32 for >>> example?), so for future adjustments it would be nice to easiliy >>> figure out what we expect to happen. >>> >>> OK with that change. >> >> Thanks a lot for the reminder! ssa-sink-18.c generates different code >> for m32 and m64 exactly due to different type size conversion and ivopts >> selection, since -m32 and -m64 couldn't co-exist in one test file, shall >> I restrict it to -m64 only or check target lp64/target ilp32? >> I've verified this case shows same behavior on X86, AArch64 and Power for >> both m32 and m64. >> >> -m32: >> <bb 29> [local count: 75120046]: >> # len_155 = PHI <len_127(28), len_182(55)> >> len_182 = len_155 + 1; >> _35 = (unsigned int) ip_249; >> _36 = _35 + len_182; >> _380 = (uint8_t *) _36; >> if (maxlen_179 > len_182) >> goto <bb 30>; [94.50%] >> else >> goto <bb 31>; [5.50%] >> ... 
>> >> Sinking _329 = (uint8_t *) _36; >> from bb 29 to bb 86 >> Sinking _36 = _35 + len_182; >> from bb 29 to bb 86 >> Sinking _35 = (unsigned int) ip_249; >> from bb 29 to bb 86 >> >> Pass statistics of "sink": ---------------- >> Sunk statements: 3 >> >> >> -m64: >> <bb 29> [local count: 75120046]: >> # ivtmp.23_34 = PHI <ivtmp.23_36(28), ivtmp.23_35(55)> >> _38 = (unsigned int) ivtmp.23_34; >> len_161 = _38 + 4294967295; >> _434 = (unsigned long) ip_254; >> _433 = ivtmp.23_34 + _434; >> _438 = (uint8_t *) _433; >> if (_38 < maxlen_187) >> goto <bb 30>; [94.50%] >> else >> goto <bb 31>; [5.50%] >> ... >> >> Sinking _367 = (uint8_t *) _320; >> from bb 31 to bb 90 >> Sinking _320 = _321 + ivtmp.25_326; >> from bb 31 to bb 90 >> Sinking _321 = (unsigned long) ip_229; >> from bb 31 to bb 90 >> Sinking len_158 = _322 + 4294967295; >> from bb 31 to bb 33 >> >> Pass statistics of "sink": ---------------- >> Sunk statements: 4 >> >> >> Regarding to the comments part: >> >> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-18.c >> index 52b9a74b65f..5147f7b85cd 100644 >> --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-18.c >> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-18.c >> @@ -193,16 +193,17 @@ compute_on_bytes (uint8_t *in_data, int in_len, uint8_t *out_data, int out_len) >> return op - out_data; >> } >> + /* For this case, pass sink2 sinks statements from hot loop header to loop >> + exits after gimple loop optimizations, which generates instructions executed >> + each iteration in loop, but the results are used outside of loop: >> + With -m64, >> + "Sinking _367 = (uint8_t *) _320; >> + from bb 31 to bb 90 >> + Sinking _320 = _321 + ivtmp.25_326; >> + from bb 31 to bb 90 >> + Sinking _321 = (unsigned long) ip_229; >> + from bb 31 to bb 90 >> + Sinking len_158 = _322 + 4294967295; >> + from bb 31 to bb 33" */ >> >> - /* { dg-final { scan-tree-dump-times "Sunk statements: 4" 1 "sink2" } } */ >> + /* { dg-final { 
scan-tree-dump-times "Sunk statements: 4" 1 "sink2" { target lp64 } } } */ >> + /* { dg-final { scan-tree-dump-times "Sunk statements: 3" 1 "sink2" { target ilp32 } } } */ > > Yes, that looks good. > > Thanks, > Richard. > Hi, I've noticed the test case gcc.dg/tree-ssa/ssa-sink-3.c was accidentally committed as empty file, and therefore fails: FAIL: gcc.dg/tree-ssa/ssa-sink-3.c (test for excess errors) I've commited as obvious the following fix (which restores the test case and Xionghu Luo's intended change. Thanks Bernd. [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: 0001-Fix-commit-mistake-in-testcase-gcc.dg-tree-ssa-ssa-s.patch --] [-- Type: text/x-patch; name="0001-Fix-commit-mistake-in-testcase-gcc.dg-tree-ssa-ssa-s.patch", Size: 1107 bytes --] From 51cfa55431c38f3c29c7b72833337ad8a2da5c06 Mon Sep 17 00:00:00 2001 From: Bernd Edlinger <bernd.edlinger@hotmail.de> Date: Wed, 19 May 2021 09:51:44 +0200 Subject: [PATCH] Fix commit mistake in testcase gcc.dg/tree-ssa/ssa-sink-3.c the test case was accidenally changed to empty file. 2021-05-19 Bernd Edlinger <bernd.edlinger@hotmail.de> * gcc.dg/tree-ssa/ssa-sink-3.c: Fix test case. --- gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-3.c | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-3.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-3.c index e69de29..ad88ccc 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-3.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-3.c @@ -0,0 +1,15 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-sink-stats" } */ +extern void foo(int a); +int +main (int argc) +{ + int a; + a = argc + 1; + if (argc + 3) + { + foo (a); + } +} +/* We should sink the a = argc + 1 calculation into the if branch */ +/* { dg-final { scan-tree-dump-times "Sunk statements: 1" 1 "sink1" } } */ -- 1.9.1 ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Run pass_sink_code once more before store_mergin 2021-05-18 5:20 ` [PATCH] Run pass_sink_code once more before store_mergin Xionghu Luo 2021-05-18 7:02 ` Richard Biener @ 2021-09-02 15:45 ` Martin Jambor 1 sibling, 0 replies; 20+ messages in thread From: Martin Jambor @ 2021-09-02 15:45 UTC (permalink / raw) To: Xionghu Luo; +Cc: gcc-patches Hi, On Tue, May 18 2021, Xionghu Luo via Gcc-patches wrote: > [...] > From 7fcc6ca9ef3b6acbfbcbd3da4be1d1c0eef4be80 Mon Sep 17 00:00:00 2001 > From: Xiong Hu Luo <luoxhu@linux.ibm.com> > Date: Mon, 17 May 2021 20:46:15 -0500 > Subject: [PATCH] Run pass_sink_code once more before store_merging > > Gimple sink code pass runs quite early, there may be some new > oppertunities exposed by later gimple optmization passes, this patch > runs the sink code pass once more before store_merging. For detailed > discussion, please refer to: > https://gcc.gnu.org/pipermail/gcc-patches/2020-December/562352.html > > Tested the SPEC2017 performance on P8LE, 544.nab_r is improved > by 2.43%, but no big changes to other cases, GEOMEAN is improved quite > small with 0.25%. > > gcc/ChangeLog: > > * passes.def: Add sink_code before store_merging. > * tree-ssa-sink.c (pass_sink_code:clone): New. Unfortunately, this seems to have caused PR 102178 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178) Sorry about the bad news, Martin ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC] Run pass_sink_code once more after ivopts/fre 2021-04-14 6:41 ` Richard Biener 2021-04-15 6:20 ` Xionghu Luo @ 2021-04-24 3:44 ` Jeff Law 1 sibling, 0 replies; 20+ messages in thread From: Jeff Law @ 2021-04-24 3:44 UTC (permalink / raw) To: Richard Biener, Xionghu Luo; +Cc: gcc-patches, dje.gcc, wschmidt, linkw, segher On 4/14/2021 12:41 AM, Richard Biener wrote: > On Wed, 14 Apr 2021, Xionghu Luo wrote: > >> Hi, >> >> On 2021/3/26 15:35, Xionghu Luo via Gcc-patches wrote: >>>> Also we already have a sinking pass on RTL which even computes >>>> a proper PRE on the reverse graph - -fgcse-sm aka store-motion.c. >>>> I'm not sure whether this deals with non-stores but the >>>> LCM machinery definitely can handle arbitrary expressions. I wonder >>>> if it makes more sense to extend this rather than inventing a new >>>> ad-hoc sinking pass? >>> From the literal, my pass doesn't handle or process store instructions >>> like store-motion.. Thanks, will check it. >> Store motion only processes store instructions with data flow equations, >> generating 4 inputs(st_kill, st_avloc, st_antloc, st_transp) and solve it >> by Lazy Code Motion API(5 DF compute call) with 2 outputs (st_delete_map, >> st_insert_map) globally, each store place is independently represented in >> the input bitmap vectors. Output is which should be delete and where to >> insert, current code does what you said "emit copies to a new pseudo at >> the original insn location and use it in followed bb", actually it is >> "store replacement" instead of "store move", why not save one pseudo by >> moving the store instruction to target edge directly? > It probably simply saves the pass from doing analysis whether the > stored value is clobbered on the sinking path, enabling more store > sinking. For stores that might be even beneficial, for non-stores > it becomes more of a cost issue, yes. > >> There are many differences between the newly added rtl-sink pass and >> store-motion pass. >> 1. 
Store motion moves only store instructions, rtl-sink ignores store >> instructions. >> 2. Store motion is a global DF problem solving, rtl-sink only processes >> loop header reversely with dependency check in loop, take the below RTL >> as example, >> "#538,#235,#234,#233" will all be sunk from bb 35 to bb 37 by rtl-sink, >> but it moves #538 first, then #235, there is strong dependency here. It >> seemsdoesn't like the LCM framework that could solve all and do the >> delete-insert in one iteration. > So my question was whether we want to do both within the LCM store > sinking framework. The LCM dataflow is also used by RTL PRE which > handles both loads and non-loads so in principle it should be able > to handle stores and non-stores for the sinking case (PRE on the > reverse CFG). > > A global dataflow is more powerful than any local ad-hoc method. IIRC you can use LCM on stores like this, but you have to run it independently on each store to pick up the secondary effects. I believe the basic concepts are discussed in Morgan's book. That may turn out to be too expensive in practice -- I've never tried it though. jeff ^ permalink raw reply [flat|nested] 20+ messages in thread
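The "secondary effects" Jeff mentions can be seen even at the source level: once a dependent statement is sunk, its input may lose its last use in the original block and become sinkable too, so a local pass must iterate (or process statements in reverse dependency order, the "#538 before #235" ordering described above), whereas a global dataflow solve accounts for the whole chain at once. A hypothetical two-statement chain:

```c
/* 't2' is used only on the taken branch and can be sunk there; only
   after that does 't1' lose its last use in the original block and
   become sinkable as well.  Names and shape are illustrative.  */
int pick (int a, int b, int c)
{
  int t1 = a * b;     /* sinkable only after t2 has been sunk */
  int t2 = t1 + 1;    /* sole use is under the condition: sinkable */
  return c ? t2 : a;
}
```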
end of thread, other threads:[~2021-09-02 15:45 UTC | newest] Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-12-21 9:03 [RFC] Run pass_sink_code once more after ivopts/fre Xiong Hu Luo 2020-12-22 16:53 ` Richard Biener 2021-03-23 3:07 ` Xionghu Luo 2021-03-23 8:50 ` Richard Biener 2021-03-26 7:35 ` Xionghu Luo 2021-04-14 1:51 ` Xionghu Luo 2021-04-14 6:41 ` Richard Biener 2021-04-15 6:20 ` Xionghu Luo 2021-04-15 11:34 ` Richard Biener 2021-04-20 9:23 ` Xionghu Luo 2021-04-21 11:54 ` Richard Biener 2021-05-14 7:10 ` Xionghu Luo 2021-05-17 8:11 ` Richard Biener 2021-05-18 5:20 ` [PATCH] Run pass_sink_code once more before store_mergin Xionghu Luo 2021-05-18 7:02 ` Richard Biener 2021-05-18 9:00 ` Xionghu Luo 2021-05-18 10:34 ` Richard Biener 2021-05-19 8:03 ` Bernd Edlinger 2021-09-02 15:45 ` Martin Jambor 2021-04-24 3:44 ` [RFC] Run pass_sink_code once more after ivopts/fre Jeff Law