* [RFC] Run pass_sink_code once more after ivopts/fre
From: Xiong Hu Luo @ 2020-12-21 9:03 UTC
To: gcc-patches
Cc: segher, dje.gcc, wschmidt, guojiufu, linkw, rguenther, Xiong Hu Luo

Here comes another case that requires running a pass once more.  Since
this is not the commonly suggested direction for solving problems, I am
not quite sure whether it is a reasonable fix here.  The source code is
something like:

ref = ip + *hslot;
while (ip < in_end - 2) {
  unsigned int len = 2;
  len++;
  for () {
    do len++;
    while (len < maxlen && ref[len] == ip[len]); //sink code here.
    break;
  }
  len -= 2;
  ip++;
  ip += len + 1;
  if (ip >= in_end - 2)
    break;
}

Before ivopts, the gimple for the inner while loop is (xxx.c.172t.slp1):

  <bb 31> [local count: 75120046]:
  # len_160 = PHI <len_161(30), len_189(58)>
  len_189 = len_160 + 1;
  _423 = (sizetype) len_189;
  _424 = ip_229 + _423;
  if (maxlen_186 > len_189)
    goto <bb 32>; [94.50%]
  else
    goto <bb 33>; [5.50%]

  <bb 32> [local count: 70988443]:
  _84 = *_424;
  _86 = ref_182 + _423;
  _87 = *_86;
  if (_84 == _87)
    goto <bb 58>; [94.50%]
  else
    goto <bb 33>; [5.50%]

  <bb 58> [local count: 67084079]:
  goto <bb 31>; [100.00%]

  <bb 33> [local count: 14847855]:
  # len_263 = PHI <len_160(32), len_160(31)>
  # _262 = PHI <_423(32), _423(31)>
  # _264 = PHI <_424(32), _424(31)>
  len_190 = len_263 + 4294967295;
  if (len_190 <= 6)
    goto <bb 34>; [0.00%]
  else
    goto <bb 36>; [100.00%]

Then in ivopts the instructions are updated to (xxx.c.174t.ivopts):

  <bb 31> [local count: 75120046]:
  # ivtmp.30_29 = PHI <ivtmp.30_32(30), ivtmp.30_31(58)>
  _34 = (unsigned int) ivtmp.30_29;
  len_160 = _34 + 4294967295;
  _423 = ivtmp.30_29;
  _35 = (unsigned long) ip_229;
  _420 = ivtmp.30_29 + _35;
  _419 = (uint8_t *) _420;
  _424 = _419;
  len_418 = (unsigned int) ivtmp.30_29;
  if (maxlen_186 > len_418)
    goto <bb 32>; [94.50%]
  else
    goto <bb 33>; [5.50%]

  <bb 32> [local count: 70988443]:
  _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_29 * 1];
  ivtmp.30_31 = ivtmp.30_29 + 1;
  _417 = ref_182 + 18446744073709551615;
  _87 = MEM[(uint8_t *)_417 + ivtmp.30_31 * 1];
  if (_84 == _87)
    goto <bb 58>; [94.50%]
  else
    goto <bb 33>; [5.50%]

  <bb 58> [local count: 67084079]:
  goto <bb 31>; [100.00%]

  <bb 33> [local count: 14847855]:
  # len_263 = PHI <len_160(32), len_160(31)>
  # _262 = PHI <_423(32), _423(31)>
  # _264 = PHI <_424(32), _424(31)>
  len_190 = len_263 + 4294967295;
  if (len_190 <= 6)
    goto <bb 34>; [0.00%]
  else
    goto <bb 36>; [100.00%]

Some instructions in BB 31 are not used inside the loop and could be sunk
out of the loop to reduce the computation, but they are never sunk by any
of the later passes.  Running the sink_code pass once more, at least after
fre5, improves this typical case's performance by 23% because fewer
instructions are executed in the loop.  xxx.c.209t.sink2:

Sinking _419 = (uint8_t *) _420;
 from bb 31 to bb 89
Sinking _420 = ivtmp.30_29 + _35;
 from bb 31 to bb 89
Sinking _35 = (unsigned long) ip_229;
 from bb 31 to bb 89
Sinking len_160 = _34 + 4294967295;
 from bb 31 to bb 33

I also tested SPEC2017 performance on P8LE: 544.nab_r is improved by
2.43%, with no big changes to the other cases; GEOMEAN improves by a
small 0.25%.

The reason it should run after fre5 is that fre does some phi
optimization that exposes the opportunity.  The patch puts it after
pass_modref based on my guess that some gimple optimizations like
thread_jumps, dse, dce etc. could provide more opportunities for sinking
code; I am not sure it is the correct place to put it.  I also verified
that this issue exists on both X86 and ARM64.
Any comments?  Thanks.

---
 gcc/passes.def      | 1 +
 gcc/tree-ssa-sink.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/gcc/passes.def b/gcc/passes.def
index 21b2e2af0f7..69106615729 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -355,6 +355,7 @@ along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_uncprop);
       NEXT_PASS (pass_local_pure_const);
       NEXT_PASS (pass_modref);
+      NEXT_PASS (pass_sink_code);
  POP_INSERT_PASSES ()
  NEXT_PASS (pass_all_optimizations_g);
  PUSH_INSERT_PASSES_WITHIN (pass_all_optimizations_g)
diff --git a/gcc/tree-ssa-sink.c b/gcc/tree-ssa-sink.c
index b0abf4147d6..824659f3919 100644
--- a/gcc/tree-ssa-sink.c
+++ b/gcc/tree-ssa-sink.c
@@ -819,6 +819,7 @@ public:
   /* opt_pass methods: */
   virtual bool gate (function *) { return flag_tree_sink != 0; }
   virtual unsigned int execute (function *);
+  opt_pass *clone (void) { return new pass_sink_code (m_ctxt); }
 
 }; // class pass_sink_code
-- 
2.27.0.90.geebb51ba8c
* Re: [RFC] Run pass_sink_code once more after ivopts/fre 2020-12-21 9:03 [RFC] Run pass_sink_code once more after ivopts/fre Xiong Hu Luo @ 2020-12-22 16:53 ` Richard Biener 2021-03-23 3:07 ` Xionghu Luo 0 siblings, 1 reply; 20+ messages in thread From: Richard Biener @ 2020-12-22 16:53 UTC (permalink / raw) To: Xiong Hu Luo, gcc-patches; +Cc: segher, dje.gcc, wschmidt, guojiufu, linkw On December 21, 2020 10:03:43 AM GMT+01:00, Xiong Hu Luo <luoxhu@linux.ibm.com> wrote: >Here comes another case that requires run a pass once more, as this is >not the common suggested direction to solve problems, not quite sure >whether it is still a reasonble fix here. Source code is something >like: > >ref = ip + *hslot; >while (ip < in_end - 2) { > unsigned int len = 2; > len++; > for () { > do len++; > while (len < maxlen && ref[len] == ip[len]); //sink code here. > break; > } > len -= 2; > ip++; > ip += len + 1; > if (ip >= in_end - 2) > break; >} > >Before ivopts, the gimple for inner while loop is xxx.c.172t.slp1: > > <bb 31> [local count: 75120046]: > # len_160 = PHI <len_161(30), len_189(58)> > len_189 = len_160 + 1; > _423 = (sizetype) len_189; > _424 = ip_229 + _423; > if (maxlen_186 > len_189) > goto <bb 32>; [94.50%] > else > goto <bb 33>; [5.50%] > > <bb 32> [local count: 70988443]: > _84 = *_424; > _86 = ref_182 + _423; > _87 = *_86; > if (_84 == _87) > goto <bb 58>; [94.50%] > else > goto <bb 33>; [5.50%] > > <bb 58> [local count: 67084079]: > goto <bb 31>; [100.00%] > > <bb 33> [local count: 14847855]: > # len_263 = PHI <len_160(32), len_160(31)> > # _262 = PHI <_423(32), _423(31)> > # _264 = PHI <_424(32), _424(31)> > len_190 = len_263 + 4294967295; > if (len_190 <= 6) > goto <bb 34>; [0.00%] > else > goto <bb 36>; [100.00%] > >Then in ivopts, instructions are updated to xxx.c.174t.ivopts: > > <bb 31> [local count: 75120046]: > # ivtmp.30_29 = PHI <ivtmp.30_32(30), ivtmp.30_31(58)> > _34 = (unsigned int) ivtmp.30_29; > len_160 = _34 + 4294967295; > _423 = 
ivtmp.30_29; > _35 = (unsigned long) ip_229; > _420 = ivtmp.30_29 + _35; > _419 = (uint8_t *) _420; > _424 = _419; > len_418 = (unsigned int) ivtmp.30_29; > if (maxlen_186 > len_418) > goto <bb 32>; [94.50%] > else > goto <bb 33>; [5.50%] > > <bb 32> [local count: 70988443]: > _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_29 * 1]; > ivtmp.30_31 = ivtmp.30_29 + 1; > _417 = ref_182 + 18446744073709551615; > _87 = MEM[(uint8_t *)_417 + ivtmp.30_31 * 1]; > if (_84 == _87) > goto <bb 58>; [94.50%] > else > goto <bb 33>; [5.50%] > > <bb 58> [local count: 67084079]: > goto <bb 31>; [100.00%] > > <bb 33> [local count: 14847855]: > # len_263 = PHI <len_160(32), len_160(31)> > # _262 = PHI <_423(32), _423(31)> > # _264 = PHI <_424(32), _424(31)> > len_190 = len_263 + 4294967295; > if (len_190 <= 6) > goto <bb 34>; [0.00%] > else > goto <bb 36>; [100.00%] > >Some instructions in BB 31 are not used in the loop and could be sinked >out of loop to reduce the computation, but they are not sinked >throughout all passes later. Run the sink_code pass once more at least >after fre5 could improve this typical case performance 23% due to few >instructions exausted in loop. >xxx.c.209t.sink2: > >Sinking _419 = (uint8_t *) _420; > from bb 31 to bb 89 >Sinking _420 = ivtmp.30_29 + _35; > from bb 31 to bb 89 >Sinking _35 = (unsigned long) ip_229; > from bb 31 to bb 89 >Sinking len_160 = _34 + 4294967295; > from bb 31 to bb 33 > >I also tested the SPEC2017 performance on P8LE, 544.nab_r is improved >by 2.43%, but no big changes to other cases, GEOMEAN is improved quite >small with 0.25%. > >The reason why it should be run after fre5 is fre would do some phi >optimization to expose the optimization. The patch put it after >pass_modref is due to my guess that some gimple optimizations like >thread_jumps, dse, dce etc. could provide more opportunities for >sinking code. Not sure it is the correct place to put. I also >verified this issue exists in both X86 and ARM64. >Any comments? Thanks. 
It definitely should be before uncprop (but context stops there).  And
yes, re-running passes isn't the very, very best thing to do without
explaining why it cannot be done in other ways.  Not for late stage 3
anyway.

Richard.

>---
> gcc/passes.def      | 1 +
> gcc/tree-ssa-sink.c | 1 +
> 2 files changed, 2 insertions(+)
>
>diff --git a/gcc/passes.def b/gcc/passes.def
>index 21b2e2af0f7..69106615729 100644
>--- a/gcc/passes.def
>+++ b/gcc/passes.def
>@@ -355,6 +355,7 @@ along with GCC; see the file COPYING3.  If not see
>       NEXT_PASS (pass_uncprop);
>       NEXT_PASS (pass_local_pure_const);
>       NEXT_PASS (pass_modref);
>+      NEXT_PASS (pass_sink_code);
>  POP_INSERT_PASSES ()
>  NEXT_PASS (pass_all_optimizations_g);
>  PUSH_INSERT_PASSES_WITHIN (pass_all_optimizations_g)
>diff --git a/gcc/tree-ssa-sink.c b/gcc/tree-ssa-sink.c
>index b0abf4147d6..824659f3919 100644
>--- a/gcc/tree-ssa-sink.c
>+++ b/gcc/tree-ssa-sink.c
>@@ -819,6 +819,7 @@ public:
>   /* opt_pass methods: */
>   virtual bool gate (function *) { return flag_tree_sink != 0; }
>   virtual unsigned int execute (function *);
>+  opt_pass *clone (void) { return new pass_sink_code (m_ctxt); }
>
> }; // class pass_sink_code
* Re: [RFC] Run pass_sink_code once more after ivopts/fre 2020-12-22 16:53 ` Richard Biener @ 2021-03-23 3:07 ` Xionghu Luo 2021-03-23 8:50 ` Richard Biener 0 siblings, 1 reply; 20+ messages in thread From: Xionghu Luo @ 2021-03-23 3:07 UTC (permalink / raw) To: Richard Biener, gcc-patches; +Cc: segher, dje.gcc, wschmidt, guojiufu, linkw [-- Attachment #1: Type: text/plain, Size: 4361 bytes --] On 2020/12/23 00:53, Richard Biener wrote: > On December 21, 2020 10:03:43 AM GMT+01:00, Xiong Hu Luo <luoxhu@linux.ibm.com> wrote: >> Here comes another case that requires run a pass once more, as this is >> not the common suggested direction to solve problems, not quite sure >> whether it is still a reasonble fix here. Source code is something >> like: >> >> ref = ip + *hslot; >> while (ip < in_end - 2) { >> unsigned int len = 2; >> len++; >> for () { >> do len++; >> while (len < maxlen && ref[len] == ip[len]); //sink code here. >> break; >> } >> len -= 2; >> ip++; >> ip += len + 1; >> if (ip >= in_end - 2) >> break; >> } >> >> Before ivopts, the gimple for inner while loop is xxx.c.172t.slp1: >> >> <bb 31> [local count: 75120046]: >> # len_160 = PHI <len_161(30), len_189(58)> >> len_189 = len_160 + 1; >> _423 = (sizetype) len_189; >> _424 = ip_229 + _423; >> if (maxlen_186 > len_189) >> goto <bb 32>; [94.50%] >> else >> goto <bb 33>; [5.50%] >> >> <bb 32> [local count: 70988443]: >> _84 = *_424; >> _86 = ref_182 + _423; >> _87 = *_86; >> if (_84 == _87) >> goto <bb 58>; [94.50%] >> else >> goto <bb 33>; [5.50%] >> >> <bb 58> [local count: 67084079]: >> goto <bb 31>; [100.00%] >> >> <bb 33> [local count: 14847855]: >> # len_263 = PHI <len_160(32), len_160(31)> >> # _262 = PHI <_423(32), _423(31)> >> # _264 = PHI <_424(32), _424(31)> >> len_190 = len_263 + 4294967295; >> if (len_190 <= 6) >> goto <bb 34>; [0.00%] >> else >> goto <bb 36>; [100.00%] >> >> Then in ivopts, instructions are updated to xxx.c.174t.ivopts: >> >> <bb 31> [local count: 75120046]: >> # ivtmp.30_29 = 
PHI <ivtmp.30_32(30), ivtmp.30_31(58)> >> _34 = (unsigned int) ivtmp.30_29; >> len_160 = _34 + 4294967295; >> _423 = ivtmp.30_29; >> _35 = (unsigned long) ip_229; >> _420 = ivtmp.30_29 + _35; >> _419 = (uint8_t *) _420; >> _424 = _419; >> len_418 = (unsigned int) ivtmp.30_29; >> if (maxlen_186 > len_418) >> goto <bb 32>; [94.50%] >> else >> goto <bb 33>; [5.50%] >> >> <bb 32> [local count: 70988443]: >> _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_29 * 1]; >> ivtmp.30_31 = ivtmp.30_29 + 1; >> _417 = ref_182 + 18446744073709551615; >> _87 = MEM[(uint8_t *)_417 + ivtmp.30_31 * 1]; >> if (_84 == _87) >> goto <bb 58>; [94.50%] >> else >> goto <bb 33>; [5.50%] >> >> <bb 58> [local count: 67084079]: >> goto <bb 31>; [100.00%] >> >> <bb 33> [local count: 14847855]: >> # len_263 = PHI <len_160(32), len_160(31)> >> # _262 = PHI <_423(32), _423(31)> >> # _264 = PHI <_424(32), _424(31)> >> len_190 = len_263 + 4294967295; >> if (len_190 <= 6) >> goto <bb 34>; [0.00%] >> else >> goto <bb 36>; [100.00%] >> >> Some instructions in BB 31 are not used in the loop and could be sinked >> out of loop to reduce the computation, but they are not sinked >> throughout all passes later. Run the sink_code pass once more at least >> after fre5 could improve this typical case performance 23% due to few >> instructions exausted in loop. >> xxx.c.209t.sink2: >> >> Sinking _419 = (uint8_t *) _420; >> from bb 31 to bb 89 >> Sinking _420 = ivtmp.30_29 + _35; >> from bb 31 to bb 89 >> Sinking _35 = (unsigned long) ip_229; >> from bb 31 to bb 89 >> Sinking len_160 = _34 + 4294967295; >> from bb 31 to bb 33 >> >> I also tested the SPEC2017 performance on P8LE, 544.nab_r is improved >> by 2.43%, but no big changes to other cases, GEOMEAN is improved quite >> small with 0.25%. >> >> The reason why it should be run after fre5 is fre would do some phi >> optimization to expose the optimization. 
The patch put it after >> pass_modref is due to my guess that some gimple optimizations like >> thread_jumps, dse, dce etc. could provide more opportunities for >> sinking code. Not sure it is the correct place to put. I also >> verified this issue exists in both X86 and ARM64. >> Any comments? Thanks. > > It definitely should be before uncprop (but context stops there). And yes, re-running passes isn't the very, very best thing to do without explaining it cannot be done in other ways. Not for late stage 3 anyway. > > Richard. > Thanks. Also tried to implement this in a seperate RTL pass, which would be better? I guess this would also be stage1 issues... Xionghu [-- Attachment #2: 0001-RTL-loop-sink-pass.patch --] [-- Type: text/plain, Size: 18658 bytes --] From ac6e161a592ea259f2807f40d1021ccbfc3a965f Mon Sep 17 00:00:00 2001 From: Xiong Hu Luo <luoxhu@linux.ibm.com> Date: Thu, 4 Mar 2021 05:05:19 -0600 Subject: [PATCH] RTL loop sink pass This is a rtl loop sink pass that check loop header instructions, if the instruction's dest is not used in loop and it's source is not updated after current instruction, then this instruction's result is not used but calculated in loop each itration, sink it out of loop could reduce executions theoretically(register pressure is not considered yet). Number of Instructions sank out of loop when running on P8LE: 1. SPEC2017 int: 371 2. SPEC2017 float: 949 3. bootstrap: 402 4. stage1 libraries: 115 5. regression tests: 4533 Though no obvious performance change for SPEC2017, rtl-sink-2.c could achieve 10% performance improvement by sinking 4 instructions out of loop header. This is a bit like run gimple sink pass twice I pasted several months ago, is implementing this in RTL better? https://gcc.gnu.org/pipermail/gcc-patches/2020-December/562352.html gcc/ChangeLog: * Makefile.in: * dbgcnt.def (DEBUG_COUNTER): * passes.def: * timevar.def (TV_TREE_SINK): (TV_SINK): * tree-pass.h (make_pass_rtl_sink): * sink.c: New file. 
gcc/testsuite/ChangeLog: * gcc.dg/rtl-sink-1.c: New test. * gcc.dg/rtl-sink-2.c: New test. --- gcc/Makefile.in | 1 + gcc/dbgcnt.def | 1 + gcc/passes.def | 1 + gcc/sink.c | 350 ++++++++++++++++++++++++++++++ gcc/testsuite/gcc.dg/rtl-sink-1.c | 19 ++ gcc/testsuite/gcc.dg/rtl-sink-2.c | 200 +++++++++++++++++ gcc/timevar.def | 3 +- gcc/tree-pass.h | 1 + 8 files changed, 575 insertions(+), 1 deletion(-) create mode 100644 gcc/sink.c create mode 100644 gcc/testsuite/gcc.dg/rtl-sink-1.c create mode 100644 gcc/testsuite/gcc.dg/rtl-sink-2.c diff --git a/gcc/Makefile.in b/gcc/Makefile.in index a63c5d9cab6..b7ec7970aac 100644 --- a/gcc/Makefile.in +++ b/gcc/Makefile.in @@ -1334,6 +1334,7 @@ OBJS = \ cppbuiltin.o \ cppdefault.o \ cprop.o \ + sink.o \ cse.o \ cselib.o \ data-streamer.o \ diff --git a/gcc/dbgcnt.def b/gcc/dbgcnt.def index 93e7b4fd30e..c0702650ad3 100644 --- a/gcc/dbgcnt.def +++ b/gcc/dbgcnt.def @@ -197,6 +197,7 @@ DEBUG_COUNTER (sched_region) DEBUG_COUNTER (sel_sched_cnt) DEBUG_COUNTER (sel_sched_insn_cnt) DEBUG_COUNTER (sel_sched_region_cnt) +DEBUG_COUNTER (sink) DEBUG_COUNTER (sms_sched_loop) DEBUG_COUNTER (split_for_sched2) DEBUG_COUNTER (store_merging) diff --git a/gcc/passes.def b/gcc/passes.def index e9ed3c7bc57..820b54155c5 100644 --- a/gcc/passes.def +++ b/gcc/passes.def @@ -436,6 +436,7 @@ along with GCC; see the file COPYING3. If not see /* Perform loop optimizations. It might be better to do them a bit sooner, but we want the profile feedback to work more efficiently. */ + NEXT_PASS (pass_rtl_sink); NEXT_PASS (pass_loop2); PUSH_INSERT_PASSES_WITHIN (pass_loop2) NEXT_PASS (pass_rtl_loop_init); diff --git a/gcc/sink.c b/gcc/sink.c new file mode 100644 index 00000000000..0c33f0e7c8b --- /dev/null +++ b/gcc/sink.c @@ -0,0 +1,350 @@ +/* Code sink for RTL. + Copyright (C) 1997-2020 Free Software Foundation, Inc. + +This file is part of GCC. 
+ +GCC is free software; you can redistribute it and/or modify it under +the terms of the GNU General Public License as published by the Free +Software Foundation; either version 3, or (at your option) any later +version. + +GCC is distributed in the hope that it will be useful, but WITHOUT ANY +WARRANTY; without even the implied warranty of MERCHANTABILITY or +FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License +for more details. + +You should have received a copy of the GNU General Public License +along with GCC; see the file COPYING3. If not see +<http://www.gnu.org/licenses/>. */ + +#include "config.h" +#include "system.h" +#include "coretypes.h" +#include "backend.h" +#include "target.h" +#include "rtl.h" +#include "cfghooks.h" +#include "df.h" +#include "insn-config.h" +#include "memmodel.h" +#include "emit-rtl.h" +#include "recog.h" +#include "diagnostic-core.h" +#include "toplev.h" +#include "cfgrtl.h" +#include "cfganal.h" +#include "lcm.h" +#include "cfgcleanup.h" +#include "cselib.h" +#include "intl.h" +#include "tree-pass.h" +#include "dbgcnt.h" +#include "cfgloop.h" +#include "gcse.h" +#include "loop-unroll.h" + +/* Check whether the instruction could be sunk out of loop by checking dest's + uss and source's def. 
*/ + +static bool +can_sink_reg_in_loop (loop *loop, rtx_insn *insn, rtx dest, rtx reg) +{ + df_ref def, use; + unsigned int src_regno, dest_regno, defs_in_loop_count = 0; + basic_block bb = BLOCK_FOR_INSN (insn); + basic_block use_bb; + + while (GET_CODE (reg) == ZERO_EXTEND || GET_CODE (reg) == SIGN_EXTEND) + reg = XEXP (reg, 0); + + if (GET_CODE (reg) == SUBREG) + reg = SUBREG_REG (reg); + + if (UNARY_P (reg)) + return can_sink_reg_in_loop (loop, insn, dest, XEXP (reg, 0)); + + if (BINARY_P (reg)) + { + rtx src0 = XEXP (reg, 0); + rtx src1 = XEXP (reg, 1); + + if (CONST_INT_P (src1)) + return can_sink_reg_in_loop (loop, insn, dest, src0); + else + return can_sink_reg_in_loop (loop, insn, dest, src0) + && can_sink_reg_in_loop (loop, insn, dest, src1); + } + + if (!REG_P (reg) || HARD_REGISTER_P (reg)) + return false; + + src_regno = REGNO (reg); + dest_regno = REGNO (dest); + rtx_insn *use_insn; + + for (use = DF_REG_USE_CHAIN (dest_regno); use; use = DF_REF_NEXT_REG (use)) + { + if (!DF_REF_INSN_INFO (use)) + continue; + + use_insn = DF_REF_INSN (use); + use_bb = BLOCK_FOR_INSN (use_insn); + + /* Ignore instruction considered for moving. */ + if (use_insn == insn) + return false; + + /* Don't consider uses in loop. */ + if (!use_bb->loop_father + || (NONDEBUG_INSN_P (use_insn) + && flow_bb_inside_loop_p (loop, use_bb))) + return false; + } + + rtx_insn *def_insn; + basic_block def_bb; + /* Check for other defs. Any other def in the loop might reach a use + currently reached by the def in insn. */ + for (def = DF_REG_DEF_CHAIN (dest_regno); def; def = DF_REF_NEXT_REG (def)) + { + def_bb = DF_REF_BB (def); + + /* Defs in exit block cannot reach a use they weren't already. 
*/ + if (single_succ_p (def_bb)) + { + basic_block def_bb_succ; + + def_bb_succ = single_succ (def_bb); + if (!flow_bb_inside_loop_p (loop, def_bb_succ)) + continue; + } + + if (flow_bb_inside_loop_p (loop, def_bb) && ++defs_in_loop_count > 1) + return false; + } + + for (def = DF_REG_DEF_CHAIN (src_regno); def; def = DF_REF_NEXT_REG (def)) + { + def_bb = DF_REF_BB (def); + def_insn = DF_REF_INSN (def); + if (!flow_bb_inside_loop_p (loop, def_bb)) + continue; + + if (def_bb == bb && DF_INSN_LUID (insn) <= DF_INSN_LUID (def_insn)) + return false; + + if (def_bb != bb && def_bb != loop->latch) + return false; + } + + auto_vec<edge> edges = get_loop_exit_edges (loop); + unsigned j; + edge e; + + FOR_EACH_VEC_ELT (edges, j, e) + if (!dominated_by_p (CDI_DOMINATORS, e->src, bb)) + return false; + else + continue; + return true; +} + +/* If the instruction's dest is only used by debug instruction in the loop, it + need also be sunk out of loop to preserve the debug information. */ + +rtx_insn * +sink_dest_in_debug_insn (rtx dest, rtx_insn *insn_sink_to, basic_block prev_bb) +{ + df_ref use; + rtx_insn *use_insn; + rtx pat; + basic_block use_bb; + + unsigned int dest_regno = REGNO (dest); + for (use = DF_REG_USE_CHAIN (dest_regno); use; use = DF_REF_NEXT_REG (use)) + { + if (!DF_REF_INSN_INFO (use)) + continue; + + use_insn = DF_REF_INSN (use); + + if (NONDEBUG_INSN_P (use_insn)) + continue; + + use_bb = BLOCK_FOR_INSN (use_insn); + + if (use_bb != prev_bb) + continue; + + pat = PATTERN (use_insn); + emit_debug_insn_after_noloc (copy_insn (pat), insn_sink_to); + return use_insn; + } + return NULL; +} + +/* If the instructions' dest is not used in loop and the source is not + re-defined after current instructions, sink it from loop header to every loop + exits. 
*/ + +static void +sink_set_reg_in_loop (loop *loop) +{ + unsigned i,j; + basic_block *body = get_loop_body_in_dom_order (loop); + rtx_insn *insn, *insn_sink_to, *dbg_insn; + edge e; + basic_block bb, new_bb; + rtx set, src, dest, pat; + auto_vec<edge> edges = get_loop_exit_edges (loop); + + FOR_EACH_VEC_ELT (edges, j, e) + if (has_abnormal_or_eh_outgoing_edge_p (e->src)) + return; + + for (i = 0; i < loop->num_nodes; i++) + { + bb = body[i]; + if (bb->loop_father == loop && bb_loop_header_p (bb)) + { + FOR_BB_INSNS_REVERSE (bb, insn) + { + if (!NONDEBUG_INSN_P (insn)) + continue; + + if (CALL_P (insn)) + continue; + + set = single_set (insn); + if (!set || side_effects_p (set)) + continue; + + src = SET_SRC (set); + dest = SET_DEST (set); + pat = PATTERN (insn); + + if (!REG_P (dest)) + continue; + + if (!can_sink_reg_in_loop (loop, insn, dest, src)) + continue; + + FOR_EACH_VEC_ELT (edges, j, e) + { + new_bb = e->dest; + if (!single_pred_p (new_bb)) + new_bb = split_edge (e); + + if (dump_file) + { + fprintf (dump_file, "Loop %d: sinking ", loop->num); + print_rtl (dump_file, set); + fprintf (dump_file, " from bb %d to bb %d \n", bb->index, + new_bb->index); + } + + insn_sink_to = BB_HEAD (new_bb); + while (!NOTE_INSN_BASIC_BLOCK_P (insn_sink_to)) + insn_sink_to = NEXT_INSN (insn_sink_to); + emit_insn_after_noloc (copy_rtx_if_shared (pat), insn_sink_to, + new_bb); + + dbg_insn = sink_dest_in_debug_insn (dest, NEXT_INSN (insn_sink_to), bb); + } + +#if 1 + static unsigned long counts = 0; + FILE *f = fopen ("/home3/luoxhu/workspace/gcc-git/gcc-master_build/sink.out", "a"); + fprintf (f, " %ld \n", counts); + counts++; + fclose (f); +#endif + delete_insn (insn); + if (dbg_insn) + { + delete_insn (dbg_insn); + dbg_insn = NULL; + } + } + } + } +} + +/* Sink the set instructins out of the loops. 
*/ + +static unsigned int +sink_insn_in_loop (void) +{ + class loop *loop; + FOR_EACH_LOOP (loop, LI_FROM_INNERMOST) + { + if (loop->num_nodes <= 1000) + sink_set_reg_in_loop (loop); + } + return 0; +} + +static unsigned int +execute_rtl_sink (void) +{ + delete_unreachable_blocks (); + df_set_flags (DF_LR_RUN_DCE); + df_note_add_problem (); + df_analyze (); + + calculate_dominance_info (CDI_POST_DOMINATORS); + calculate_dominance_info (CDI_DOMINATORS); + + loop_optimizer_init (LOOPS_NORMAL | LOOPS_HAVE_RECORDED_EXITS); + + if (dump_file) + fprintf (dump_file, "rtl sink begin:\n"); + + if (number_of_loops (cfun) > 1) + sink_insn_in_loop (); + + if (dump_file) + fprintf (dump_file, "rtl sink end:\n"); + + loop_optimizer_finalize (); + free_dominance_info (CDI_DOMINATORS); + free_dominance_info (CDI_POST_DOMINATORS); + + return 0; +} + +namespace { + +const pass_data pass_data_rtl_sink = { + RTL_PASS, /* type */ + "sink", /* name */ + OPTGROUP_NONE, /* optinfo_flags */ + TV_SINK, /* tv_id */ + PROP_cfglayout, /* properties_required */ + 0, /* properties_provided */ + 0, /* properties_destroyed */ + 0, /* todo_flags_start */ + TODO_df_finish, /* todo_flags_finish */ +}; + +class pass_rtl_sink : public rtl_opt_pass +{ +public: + pass_rtl_sink (gcc::context *ctxt) : rtl_opt_pass (pass_data_rtl_sink, ctxt) + {} + + /* opt_pass methods: */ + opt_pass *clone () { return new pass_rtl_sink (m_ctxt); } + virtual bool gate (function *) { return optimize > 0 && dbg_cnt (sink); } + + virtual unsigned int execute (function *) { return execute_rtl_sink (); } + +}; // class pass_rtl_sink + +} // namespace + +rtl_opt_pass * +make_pass_rtl_sink (gcc::context *ctxt) +{ + return new pass_rtl_sink (ctxt); +} diff --git a/gcc/testsuite/gcc.dg/rtl-sink-1.c b/gcc/testsuite/gcc.dg/rtl-sink-1.c new file mode 100644 index 00000000000..0065c874cef --- /dev/null +++ b/gcc/testsuite/gcc.dg/rtl-sink-1.c @@ -0,0 +1,19 @@ +/* { dg-do compile } */ +/* { dg-additional-options " -O2 -fdump-rtl-sink" } 
*/ + +int +liveloop (int n, int *x, int *y) +{ + int i; + int ret; + + for (i = 0; i < n; ++i) + { + ret = x[i] + 5; + y[i] = ret; + } + return ret; +} + +/* { dg-final { scan-rtl-dump-times "sinking" 1 "sink" } } */ + diff --git a/gcc/testsuite/gcc.dg/rtl-sink-2.c b/gcc/testsuite/gcc.dg/rtl-sink-2.c new file mode 100644 index 00000000000..9888070e8b5 --- /dev/null +++ b/gcc/testsuite/gcc.dg/rtl-sink-2.c @@ -0,0 +1,200 @@ +/* { dg-do compile } */ +/* { dg-additional-options " -O3 -fdump-rtl-sink" } */ + +#include <stdio.h> +#include <stdlib.h> +#include <stdint.h> + +# define HLOG 16 +#define MAX_LIT (1 << 5) +typedef uint8_t *LZF_HSLOT; +typedef LZF_HSLOT LZF_STATE[1 << (HLOG)]; + +int +compute_on_bytes (uint8_t *in_data, int in_len, uint8_t *out_data, int out_len) +{ + LZF_STATE htab; + + uint8_t *ip = in_data; + uint8_t *op = out_data; + uint8_t *in_end = ip + in_len; + uint8_t *out_end = op + out_len; + uint8_t *ref; + + unsigned long off; + unsigned int hval; + int lit; + + if (!in_len || !out_len) + return 0; + + lit = 0; + op++; + hval = (((ip[0]) << 8) | ip[1]); + + while (ip < in_end - 2) + { + uint8_t *hslot; + + hval = (((hval) << 8) | ip[2]); + hslot = (uint8_t *)(htab + (((hval >> (3 * 8 - 16)) - hval * 5) & ((1 << (16)) - 1))); + + ref = *hslot + in_data; + *hslot = ip - in_data; + + if (1 && (off = ip - ref - 1) < (1 << 13) && ref > in_data + && ref[2] == ip[2] + && ((ref[1] << 8) | ref[0]) == ((ip[1] << 8) | ip[0])) + { +#if 1 + unsigned int len = 2; +#else + unsigned long len = 2; +#endif + unsigned int maxlen = in_end - ip - len; + maxlen + = maxlen > ((1 << 8) + (1 << 3)) ? 
((1 << 8) + (1 << 3)) : maxlen; + + if ((op + 3 + 1 >= out_end) != 0) + if (op - !lit + 3 + 1 >= out_end) + return 0; + + op[-lit - 1] = lit - 1; + op -= !lit; + + for (;;) + { + if (maxlen > 16) + { + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + } + do + { + len++; + } + while (len < maxlen && ip[len] == ref[len]); + break; + } + + len -= 2; + ip++; + + if (len < 7) + { + *op++ = (off >> 8) + (len << 5); + } + else + { + *op++ = (off >> 8) + (7 << 5); + *op++ = len - 7; + } + *op++ = off; + lit = 0; + op++; + ip += len + 1; + + if (ip >= in_end - 2) + break; + + --ip; + --ip; + + hval = (((ip[0]) << 8) | ip[1]); + hval = (((hval) << 8) | ip[2]); + htab[(((hval >> (3 * 8 - 16)) - hval * 5) & ((1 << (16)) - 1))] + = (LZF_HSLOT)(ip - in_data); + ip++; + + hval = (((hval) << 8) | ip[2]); + htab[(((hval >> (3 * 8 - 16)) - hval * 5) & ((1 << (16)) - 1))] + = (LZF_HSLOT)(ip - in_data); + ip++; + } + else + { + if (op >= out_end) + return 0; + + lit++; + *op++ = *ip++; + + if (lit == (1 << 5)) + { + op[-lit - 1] = lit - 1; + lit = 0; + op++; + } + } + } + if (op + 3 > out_end) /* at most 3 bytes can be missing here */ + return 0; + + while (ip < in_end) + { + lit++; + *op++ = *ip++; + if (lit == MAX_LIT) + { + op[-lit - 1] = lit - 1; /* stop run */ + lit = 0; + op++; /* start run */ + } + } + + op[-lit - 
1] = lit - 1; /* end run */ + op -= !lit; /* undo run if length is zero */ + + return op - out_data; +} + +/* { dg-final { scan-rtl-dump-times "sinking" 8 "sink" } } */ diff --git a/gcc/timevar.def b/gcc/timevar.def index 63c0b3306de..d6854a299ab 100644 --- a/gcc/timevar.def +++ b/gcc/timevar.def @@ -176,7 +176,7 @@ DEFTIMEVAR (TV_TREE_SPLIT_EDGES , "tree split crit edges") DEFTIMEVAR (TV_TREE_REASSOC , "tree reassociation") DEFTIMEVAR (TV_TREE_PRE , "tree PRE") DEFTIMEVAR (TV_TREE_FRE , "tree FRE") -DEFTIMEVAR (TV_TREE_SINK , "tree code sinking") +DEFTIMEVAR (TV_TREE_SINK , "tree code sinking") DEFTIMEVAR (TV_TREE_PHIOPT , "tree linearize phis") DEFTIMEVAR (TV_TREE_BACKPROP , "tree backward propagate") DEFTIMEVAR (TV_TREE_FORWPROP , "tree forward propagate") @@ -248,6 +248,7 @@ DEFTIMEVAR (TV_LOOP_UNROLL , "loop unrolling") DEFTIMEVAR (TV_LOOP_DOLOOP , "loop doloop") DEFTIMEVAR (TV_LOOP_FINI , "loop fini") DEFTIMEVAR (TV_CPROP , "CPROP") +DEFTIMEVAR (TV_SINK , "SINK") DEFTIMEVAR (TV_PRE , "PRE") DEFTIMEVAR (TV_HOIST , "code hoisting") DEFTIMEVAR (TV_LSM , "LSM") diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h index 15693fee150..fb4a5aaefda 100644 --- a/gcc/tree-pass.h +++ b/gcc/tree-pass.h @@ -548,6 +548,7 @@ extern rtl_opt_pass *make_pass_rtl_dse1 (gcc::context *ctxt); extern rtl_opt_pass *make_pass_rtl_dse2 (gcc::context *ctxt); extern rtl_opt_pass *make_pass_rtl_dse3 (gcc::context *ctxt); extern rtl_opt_pass *make_pass_rtl_cprop (gcc::context *ctxt); +extern rtl_opt_pass *make_pass_rtl_sink (gcc::context *ctxt); extern rtl_opt_pass *make_pass_rtl_pre (gcc::context *ctxt); extern rtl_opt_pass *make_pass_rtl_hoist (gcc::context *ctxt); extern rtl_opt_pass *make_pass_rtl_store_motion (gcc::context *ctxt); -- 2.27.0.90.geebb51ba8c ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC] Run pass_sink_code once more after ivopts/fre 2021-03-23 3:07 ` Xionghu Luo @ 2021-03-23 8:50 ` Richard Biener 2021-03-26 7:35 ` Xionghu Luo 0 siblings, 1 reply; 20+ messages in thread From: Richard Biener @ 2021-03-23 8:50 UTC (permalink / raw) To: Xionghu Luo; +Cc: gcc-patches, segher, dje.gcc, wschmidt, guojiufu, linkw On Tue, 23 Mar 2021, Xionghu Luo wrote: > > > On 2020/12/23 00:53, Richard Biener wrote: > > On December 21, 2020 10:03:43 AM GMT+01:00, Xiong Hu Luo > > <luoxhu@linux.ibm.com> wrote: > >> Here comes another case that requires run a pass once more, as this is > >> not the common suggested direction to solve problems, not quite sure > >> whether it is still a reasonble fix here. Source code is something > >> like: > >> > >> ref = ip + *hslot; > >> while (ip < in_end - 2) { > >> unsigned int len = 2; > >> len++; > >> for () { > >> do len++; > >> while (len < maxlen && ref[len] == ip[len]); //sink code here. > >> break; > >> } > >> len -= 2; > >> ip++; > >> ip += len + 1; > >> if (ip >= in_end - 2) > >> break; > >> } > >> > >> Before ivopts, the gimple for inner while loop is xxx.c.172t.slp1: > >> > >> <bb 31> [local count: 75120046]: > >> # len_160 = PHI <len_161(30), len_189(58)> > >> len_189 = len_160 + 1; > >> _423 = (sizetype) len_189; > >> _424 = ip_229 + _423; > >> if (maxlen_186 > len_189) > >> goto <bb 32>; [94.50%] > >> else > >> goto <bb 33>; [5.50%] > >> > >> <bb 32> [local count: 70988443]: > >> _84 = *_424; > >> _86 = ref_182 + _423; > >> _87 = *_86; > >> if (_84 == _87) > >> goto <bb 58>; [94.50%] > >> else > >> goto <bb 33>; [5.50%] > >> > >> <bb 58> [local count: 67084079]: > >> goto <bb 31>; [100.00%] > >> > >> <bb 33> [local count: 14847855]: > >> # len_263 = PHI <len_160(32), len_160(31)> > >> # _262 = PHI <_423(32), _423(31)> > >> # _264 = PHI <_424(32), _424(31)> > >> len_190 = len_263 + 4294967295; > >> if (len_190 <= 6) > >> goto <bb 34>; [0.00%] > >> else > >> goto <bb 36>; [100.00%] > >> > >> Then in ivopts, 
instructions are updated to xxx.c.174t.ivopts: > >> > >> <bb 31> [local count: 75120046]: > >> # ivtmp.30_29 = PHI <ivtmp.30_32(30), ivtmp.30_31(58)> > >> _34 = (unsigned int) ivtmp.30_29; > >> len_160 = _34 + 4294967295; > >> _423 = ivtmp.30_29; > >> _35 = (unsigned long) ip_229; > >> _420 = ivtmp.30_29 + _35; > >> _419 = (uint8_t *) _420; > >> _424 = _419; > >> len_418 = (unsigned int) ivtmp.30_29; > >> if (maxlen_186 > len_418) > >> goto <bb 32>; [94.50%] > >> else > >> goto <bb 33>; [5.50%] > >> > >> <bb 32> [local count: 70988443]: > >> _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_29 * 1]; > >> ivtmp.30_31 = ivtmp.30_29 + 1; > >> _417 = ref_182 + 18446744073709551615; > >> _87 = MEM[(uint8_t *)_417 + ivtmp.30_31 * 1]; > >> if (_84 == _87) > >> goto <bb 58>; [94.50%] > >> else > >> goto <bb 33>; [5.50%] > >> > >> <bb 58> [local count: 67084079]: > >> goto <bb 31>; [100.00%] > >> > >> <bb 33> [local count: 14847855]: > >> # len_263 = PHI <len_160(32), len_160(31)> > >> # _262 = PHI <_423(32), _423(31)> > >> # _264 = PHI <_424(32), _424(31)> > >> len_190 = len_263 + 4294967295; > >> if (len_190 <= 6) > >> goto <bb 34>; [0.00%] > >> else > >> goto <bb 36>; [100.00%] > >> > >> Some instructions in BB 31 are not used in the loop and could be sinked > >> out of loop to reduce the computation, but they are not sinked > >> throughout all passes later. Run the sink_code pass once more at least > >> after fre5 could improve this typical case performance 23% due to few > >> instructions exausted in loop. > >> xxx.c.209t.sink2: > >> > >> Sinking _419 = (uint8_t *) _420; > >> from bb 31 to bb 89 > >> Sinking _420 = ivtmp.30_29 + _35; > >> from bb 31 to bb 89 > >> Sinking _35 = (unsigned long) ip_229; > >> from bb 31 to bb 89 > >> Sinking len_160 = _34 + 4294967295; > >> from bb 31 to bb 33 > >> > >> I also tested the SPEC2017 performance on P8LE, 544.nab_r is improved > >> by 2.43%, but no big changes to other cases, GEOMEAN is improved quite > >> small with 0.25%. 
> >> > >> The reason why it should be run after fre5 is fre would do some phi > >> optimization to expose the optimization. The patch put it after > >> pass_modref is due to my guess that some gimple optimizations like > >> thread_jumps, dse, dce etc. could provide more opportunities for > >> sinking code. Not sure it is the correct place to put. I also > >> verified this issue exists in both X86 and ARM64. > >> Any comments? Thanks. > > > > It definitely should be before uncprop (but context stops there). And yes, > > re-running passes isn't the very, very best thing to do without explaining > > it cannot be done in other ways. Not for late stage 3 anyway. > > > > Richard. > > > > Thanks. Also tried to implement this in a seperate RTL pass, which > would be better? I guess this would also be stage1 issues... Yes, that's also for stage1 obviously. Can you check how the number of sink opportunities of your RTL pass changes if you add the 2nd GIMPLE sinking pass? I'll note that you are not sinking later stmts first (you only walk insns reverse but not BBs). GIMPLE sinking performs a domwalk over post dominators (but it has an easier job because of PHIs). I guess you'd want to walk starting from loop exit blocks (from innermost loops as you do) in reverse program order. I'll also note that you don't have to check whether stmts you want to sink have their uses set after it - you can emit copies to a new pseudo at the original insn location and use those after the loop (that of course comes at some cost). Also we already have a sinking pass on RTL which even computes a proper PRE on the reverse graph - -fgcse-sm aka store-motion.c. I'm not sure whether this deals with non-stores but the LCM machinery definitely can handle arbitrary expressions. I wonder if it makes more sense to extend this rather than inventing a new ad-hoc sinking pass? Richard. 
> > Xionghu > -- Richard Biener <rguenther@suse.de> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)
* Re: [RFC] Run pass_sink_code once more after ivopts/fre 2021-03-23 8:50 ` Richard Biener @ 2021-03-26 7:35 ` Xionghu Luo 2021-04-14 1:51 ` Xionghu Luo 0 siblings, 1 reply; 20+ messages in thread From: Xionghu Luo @ 2021-03-26 7:35 UTC (permalink / raw) To: Richard Biener; +Cc: gcc-patches, segher, dje.gcc, wschmidt, guojiufu, linkw Hi, sorry for the late response, On 2021/3/23 16:50, Richard Biener wrote: >>> It definitely should be before uncprop (but context stops there). And yes, >>> re-running passes isn't the very, very best thing to do without explaining >>> it cannot be done in other ways. Not for late stage 3 anyway. >>> >>> Richard. >>> >> Thanks. Also tried to implement this in a seperate RTL pass, which >> would be better? I guess this would also be stage1 issues... > Yes, that's also for stage1 obviously. Can you check how the number > of sink opportunities of your RTL pass changes if you add the > 2nd GIMPLE sinking pass? Number of instructions sunk out of loops when running (no sink2 -> sink2): 1. SPEC2017 int: 371 -> 142 2. SPEC2017 float: 949 -> 343 3. bootstrap: 402 -> 229 4. stage1 libraries: 115 -> 68 5. regression tests: 4533 -> 2948 The case I used at the beginning can now be fully optimized by the gimple sink2 pass instead, but many instructions are still sunk by the RTL pass even when gimple sink2 is added; I guess most of them are produced by the expand pass. One example (it was after #38 in 262r.reginfo; note that a new block is created between exit->src and exit->dst so that other bbs jumping into exit->dst don't cause execution errors due to r132:DI being updated unexpectedly). Sometimes an extra extension in a loop like this can cause a serious performance issue: vect-live-4.ltrans0.ltrans.263r.sink: ... Loop 2: sinking (set (reg/v:DI 132 [ <retval> ]) (sign_extend:DI (reg:SI 144))) from bb 7 to bb 11 ...
44: L44: 35: NOTE_INSN_BASIC_BLOCK 7 36: r127:DI=r127:DI+0x4 37: r145:SI=[r127:DI] 38: r144:SI=r145:SI+0x5 REG_DEAD r145:SI 40: r125:DI=r125:DI+0x4 41: [r125:DI]=r144:SI REG_DEAD r144:SI 42: r146:SI=r126:DI#0-0x1 REG_DEAD r126:DI 43: r126:DI=zero_extend(r146:SI) REG_DEAD r146:SI 45: r147:CC=cmp(r126:DI,0) 46: pc={(r147:CC!=0)?L65:pc} REG_DEAD r147:CC REG_BR_PROB 941032164 68: NOTE_INSN_BASIC_BLOCK 11 69: r132:DI=sign_extend(r144:SI) ; pc falls through to BB 8 65: L65: 64: NOTE_INSN_BASIC_BLOCK 9 ; pc falls through to BB 7 47: L47: 48: NOTE_INSN_BASIC_BLOCK 8 > > I'll note that you are not sinking later stmts first (you only > walk insns reverse but not BBs). GIMPLE sinking performs a > domwalk over post dominators (but it has an easier job because > of PHIs). I guess you'd want to walk starting from loop exit > blocks (from innermost loops as you do) in reverse program order. Yes, this rtl-sink pass can only sink instructions from the *loop header* to the loop exit blocks in reverse order; it is a bit strict, since this is a first step to see whether it is reasonable to add such a pass at all. For example, if an instruction is in a loop body block, sinking it out can sometimes cause an execution error, because the pass has no information about whether the body block will actually be executed: if the loop jumps from the header directly to the exit, instructions sunk from the body to the exit would change register values unexpectedly. It seems always_reached and always_executed in loop-invariant.c could be reused here to detect such a situation? > > I'll also note that you don't have to check whether stmts you > want to sink have their uses set after it - you can emit copies > to a new pseudo at the original insn location and use those > after the loop (that of course comes at some cost). Not sure whether I understood your point correctly, but with that approach the instruction would still be executed niter times inside the loop?
What I am trying to do is move r132:DI=sign_extend(r144:SI) out of the loop: if it was executed 100 times in the loop, but r132 is not used inside the loop and r144 is not updated after the current instruction, then moving it to the loop exit lets it execute only once. > > Also we already have a sinking pass on RTL which even computes > a proper PRE on the reverse graph - -fgcse-sm aka store-motion.c. > I'm not sure whether this deals with non-stores but the > LCM machinery definitely can handle arbitrary expressions. I wonder > if it makes more sense to extend this rather than inventing a new > ad-hoc sinking pass? Judging by its name, my pass doesn't handle or process store instructions the way store-motion does. Thanks, will check it. -- Thanks, Xionghu
* Re: [RFC] Run pass_sink_code once more after ivopts/fre 2021-03-26 7:35 ` Xionghu Luo @ 2021-04-14 1:51 ` Xionghu Luo 2021-04-14 6:41 ` Richard Biener 0 siblings, 1 reply; 20+ messages in thread From: Xionghu Luo @ 2021-04-14 1:51 UTC (permalink / raw) To: Richard Biener; +Cc: segher, wschmidt, linkw, gcc-patches, dje.gcc Hi, On 2021/3/26 15:35, Xionghu Luo via Gcc-patches wrote: >> Also we already have a sinking pass on RTL which even computes >> a proper PRE on the reverse graph - -fgcse-sm aka store-motion.c. >> I'm not sure whether this deals with non-stores but the >> LCM machinery definitely can handle arbitrary expressions. I wonder >> if it makes more sense to extend this rather than inventing a new >> ad-hoc sinking pass? > From the literal, my pass doesn't handle or process store instructions > like store-motion.. Thanks, will check it. Store motion only processes store instructions, through data flow equations: it generates 4 inputs (st_kill, st_avloc, st_antloc, st_transp) and solves them globally via the Lazy Code Motion API (5 DF compute calls) with 2 outputs (st_delete_map, st_insert_map); each store place is independently represented in the input bitmap vectors. The output says which stores should be deleted and where to insert; the current code does what you said ("emit copies to a new pseudo at the original insn location and use it in the following bb"), so it is actually "store replacement" instead of "store move". Why not save one pseudo by moving the store instruction to the target edge directly? There are many differences between the newly added rtl-sink pass and the store-motion pass. 1. Store motion moves only store instructions; rtl-sink ignores store instructions. 2. Store motion solves a global DF problem; rtl-sink only processes the loop header in reverse, with a dependency check inside the loop. Take the RTL below as an example: "#538,#235,#234,#233" will all be sunk from bb 35 to bb 37 by rtl-sink, but it moves #538 first, then #235; there is a strong dependency here.
It seems unlike the LCM framework, which could solve all of them and do the delete-insert in one iteration. However, there are still some common methods that could be shared, like the def-use check (though store-motion is per bb, rtl-sink is per loop), insert_store, commit_edge_insertions, etc. 508: L508: 507: NOTE_INSN_BASIC_BLOCK 34 12: r139:DI=r140:DI REG_DEAD r140:DI 240: L240: 231: NOTE_INSN_BASIC_BLOCK 35 232: r142:DI=zero_extend(r139:DI#0) 233: r371:SI=r142:DI#0-0x1 234: r243:DI=zero_extend(r371:SI) REG_DEAD r371:SI 235: r452:DI=r262:DI+r139:DI 538: r194:DI=r452:DI 236: r372:CCUNS=cmp(r142:DI#0,r254:DI#0) 237: pc={(geu(r372:CCUNS,0))?L246:pc} REG_DEAD r372:CCUNS REG_BR_PROB 59055804 238: NOTE_INSN_BASIC_BLOCK 36 239: r140:DI=r139:DI+0x1 241: r373:DI=r251:DI-0x1 242: r374:SI=zero_extend([r262:DI+r139:DI]) REG_DEAD r139:DI 243: r375:SI=zero_extend([r373:DI+r140:DI]) REG_DEAD r373:DI 244: r376:CC=cmp(r374:SI,r375:SI) REG_DEAD r375:SI REG_DEAD r374:SI 245: pc={(r376:CC==0)?L508:pc} REG_DEAD r376:CC REG_BR_PROB 1014686028 246: L246: 247: NOTE_INSN_BASIC_BLOCK 37 248: r377:SI=r142:DI#0-0x2 REG_DEAD r142:DI 249: r256:DI=zero_extend(r377:SI) -- Thanks, Xionghu
* Re: [RFC] Run pass_sink_code once more after ivopts/fre 2021-04-14 1:51 ` Xionghu Luo @ 2021-04-14 6:41 ` Richard Biener 2021-04-15 6:20 ` Xionghu Luo 2021-04-24 3:44 ` [RFC] Run pass_sink_code once more after ivopts/fre Jeff Law 0 siblings, 2 replies; 20+ messages in thread From: Richard Biener @ 2021-04-14 6:41 UTC (permalink / raw) To: Xionghu Luo; +Cc: segher, wschmidt, linkw, gcc-patches, dje.gcc On Wed, 14 Apr 2021, Xionghu Luo wrote: > Hi, > > On 2021/3/26 15:35, Xionghu Luo via Gcc-patches wrote: > >> Also we already have a sinking pass on RTL which even computes > >> a proper PRE on the reverse graph - -fgcse-sm aka store-motion.c. > >> I'm not sure whether this deals with non-stores but the > >> LCM machinery definitely can handle arbitrary expressions. I wonder > >> if it makes more sense to extend this rather than inventing a new > >> ad-hoc sinking pass? > > From the literal, my pass doesn't handle or process store instructions > > like store-motion.. Thanks, will check it. > > Store motion only processes store instructions with data flow equations, > generating 4 inputs(st_kill, st_avloc, st_antloc, st_transp) and solve it > by Lazy Code Motion API(5 DF compute call) with 2 outputs (st_delete_map, > st_insert_map) globally, each store place is independently represented in > the input bitmap vectors. Output is which should be delete and where to > insert, current code does what you said "emit copies to a new pseudo at > the original insn location and use it in followed bb", actually it is > "store replacement" instead of "store move", why not save one pseudo by > moving the store instruction to target edge directly? It probably simply saves the pass from doing analysis whether the stored value is clobbered on the sinking path, enabling more store sinking. For stores that might be even beneficial, for non-stores it becomes more of a cost issue, yes. > There are many differences between the newly added rtl-sink pass and > store-motion pass. > 1. 
Store motion moves only store instructions, rtl-sink ignores store > instructions. > 2. Store motion is a global DF problem solving, rtl-sink only processes > loop header reversely with dependency check in loop, take the below RTL > as example, > "#538,#235,#234,#233" will all be sunk from bb 35 to bb 37 by rtl-sink, > but it moves #538 first, then #235, there is strong dependency here. It > seemsdoesn't like the LCM framework that could solve all and do the > delete-insert in one iteration. So my question was whether we want to do both within the LCM store sinking framework. The LCM dataflow is also used by RTL PRE which handles both loads and non-loads so in principle it should be able to handle stores and non-stores for the sinking case (PRE on the reverse CFG). A global dataflow is more powerful than any local ad-hoc method. Richard. > However, there are still some common methods could be shared, like the > def-use check(though store-motion is per bb, rtl-sink is per loop), > insert_store, commit_edge_insertions etc. 
> > > 508: L508: > 507: NOTE_INSN_BASIC_BLOCK 34 > 12: r139:DI=r140:DI > REG_DEAD r140:DI > 240: L240: > 231: NOTE_INSN_BASIC_BLOCK 35 > 232: r142:DI=zero_extend(r139:DI#0) > 233: r371:SI=r142:DI#0-0x1 > 234: r243:DI=zero_extend(r371:SI) > REG_DEAD r371:SI > 235: r452:DI=r262:DI+r139:DI > 538: r194:DI=r452:DI > 236: r372:CCUNS=cmp(r142:DI#0,r254:DI#0) > 237: pc={(geu(r372:CCUNS,0))?L246:pc} > REG_DEAD r372:CCUNS > REG_BR_PROB 59055804 > 238: NOTE_INSN_BASIC_BLOCK 36 > 239: r140:DI=r139:DI+0x1 > 241: r373:DI=r251:DI-0x1 > 242: r374:SI=zero_extend([r262:DI+r139:DI]) > REG_DEAD r139:DI > 243: r375:SI=zero_extend([r373:DI+r140:DI]) > REG_DEAD r373:DI > 244: r376:CC=cmp(r374:SI,r375:SI) > REG_DEAD r375:SI > REG_DEAD r374:SI > 245: pc={(r376:CC==0)?L508:pc} > REG_DEAD r376:CC > REG_BR_PROB 1014686028 > 246: L246: > 247: NOTE_INSN_BASIC_BLOCK 37 > 248: r377:SI=r142:DI#0-0x2 > REG_DEAD r142:DI > 249: r256:DI=zero_extend(r377:SI) > > > -- Richard Biener <rguenther@suse.de> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg) ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC] Run pass_sink_code once more after ivopts/fre 2021-04-14 6:41 ` Richard Biener @ 2021-04-15 6:20 ` Xionghu Luo 2021-04-15 11:34 ` Richard Biener 0 siblings, 1 reply; 20+ messages in thread From: Xionghu Luo @ 2021-04-15 6:20 UTC (permalink / raw) To: Richard Biener; +Cc: segher, wschmidt, linkw, gcc-patches, dje.gcc Thanks, On 2021/4/14 14:41, Richard Biener wrote: >> "#538,#235,#234,#233" will all be sunk from bb 35 to bb 37 by rtl-sink, >> but it moves #538 first, then #235, there is strong dependency here. It >> seemsdoesn't like the LCM framework that could solve all and do the >> delete-insert in one iteration. > So my question was whether we want to do both within the LCM store > sinking framework. The LCM dataflow is also used by RTL PRE which > handles both loads and non-loads so in principle it should be able > to handle stores and non-stores for the sinking case (PRE on the > reverse CFG). > > A global dataflow is more powerful than any local ad-hoc method. My biggest concern is whether the LCM DF framework can support sinking *multiple* reverse-dependent non-store instructions together in *one* run of the LCM DF. If it cannot, we would need to run LCM repeatedly until nothing changes, which would obviously be time consuming (unless compile time is not important here).
>> >> >> 508: L508: >> 507: NOTE_INSN_BASIC_BLOCK 34 >> 12: r139:DI=r140:DI >> REG_DEAD r140:DI >> 240: L240: >> 231: NOTE_INSN_BASIC_BLOCK 35 >> 232: r142:DI=zero_extend(r139:DI#0) >> 233: r371:SI=r142:DI#0-0x1 >> 234: r243:DI=zero_extend(r371:SI) >> REG_DEAD r371:SI >> 235: r452:DI=r262:DI+r139:DI >> 538: r194:DI=r452:DI >> 236: r372:CCUNS=cmp(r142:DI#0,r254:DI#0) Like here: each instruction's dest reg is computed into the input vector bitmap. After solving the equations by calling pre_edge_rev_lcm, #538 is moved out of the loop by the first call, then #235 by a second call... 4 repeated calls would be needed in total here; is the LCM framework smart enough to move all 4 instructions within one iteration? I am worried that the input vector bitmap couldn't solve the dependency problem for two back-chained instructions. -- Thanks, Xionghu
* Re: [RFC] Run pass_sink_code once more after ivopts/fre 2021-04-15 6:20 ` Xionghu Luo @ 2021-04-15 11:34 ` Richard Biener 2021-04-20 9:23 ` Xionghu Luo 0 siblings, 1 reply; 20+ messages in thread From: Richard Biener @ 2021-04-15 11:34 UTC (permalink / raw) To: Xionghu Luo; +Cc: segher, wschmidt, linkw, gcc-patches, dje.gcc On Thu, 15 Apr 2021, Xionghu Luo wrote: > Thanks, > > On 2021/4/14 14:41, Richard Biener wrote: > >> "#538,#235,#234,#233" will all be sunk from bb 35 to bb 37 by rtl-sink, > >> but it moves #538 first, then #235, there is strong dependency here. It > >> seemsdoesn't like the LCM framework that could solve all and do the > >> delete-insert in one iteration. > > So my question was whether we want to do both within the LCM store > > sinking framework. The LCM dataflow is also used by RTL PRE which > > handles both loads and non-loads so in principle it should be able > > to handle stores and non-stores for the sinking case (PRE on the > > reverse CFG). > > > > A global dataflow is more powerful than any local ad-hoc method. > > My biggest concern is whether the LCM DF framework could support sinking > *multiple* reverse-dependent non-store instructions together by *one* > calling of LCM DF. If this is not supported, we need run multiple LCM > until no new changes, it would be time consuming obviously (unless > compiling time is not important here). As said it is used for PRE and there it most definitely can do that. > > > > > Richard. > > > >> However, there are still some common methods could be shared, like the > >> def-use check(though store-motion is per bb, rtl-sink is per loop), > >> insert_store, commit_edge_insertions etc. 
> >> > >> > >> 508: L508: > >> 507: NOTE_INSN_BASIC_BLOCK 34 > >> 12: r139:DI=r140:DI > >> REG_DEAD r140:DI > >> 240: L240: > >> 231: NOTE_INSN_BASIC_BLOCK 35 > >> 232: r142:DI=zero_extend(r139:DI#0) > >> 233: r371:SI=r142:DI#0-0x1 > >> 234: r243:DI=zero_extend(r371:SI) > >> REG_DEAD r371:SI > >> 235: r452:DI=r262:DI+r139:DI > >> 538: r194:DI=r452:DI > >> 236: r372:CCUNS=cmp(r142:DI#0,r254:DI#0) > > > Like here, Each instruction's dest reg is calculated in the input vector > bitmap, after solving the equations by calling pre_edge_rev_lcm, > move #538 out of loop for the first call, then move #235 out of loop > after a second call... 4 repeat calls needed in total here, is the LCM > framework smart enough to move the all 4 instruction within one iteration? > I am worried that the input vector bitmap couldn't solve the dependency > problem for two back chained instructions. > > > -- Richard Biener <rguenther@suse.de> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg) ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC] Run pass_sink_code once more after ivopts/fre 2021-04-15 11:34 ` Richard Biener @ 2021-04-20 9:23 ` Xionghu Luo 2021-04-21 11:54 ` Richard Biener 0 siblings, 1 reply; 20+ messages in thread From: Xionghu Luo @ 2021-04-20 9:23 UTC (permalink / raw) To: Richard Biener; +Cc: segher, wschmidt, linkw, gcc-patches, dje.gcc [-- Attachment #1: Type: text/plain, Size: 5192 bytes --] On 2021/4/15 19:34, Richard Biener wrote: > On Thu, 15 Apr 2021, Xionghu Luo wrote: > >> Thanks, >> >> On 2021/4/14 14:41, Richard Biener wrote: >>>> "#538,#235,#234,#233" will all be sunk from bb 35 to bb 37 by rtl-sink, >>>> but it moves #538 first, then #235, there is strong dependency here. It >>>> seemsdoesn't like the LCM framework that could solve all and do the >>>> delete-insert in one iteration. >>> So my question was whether we want to do both within the LCM store >>> sinking framework. The LCM dataflow is also used by RTL PRE which >>> handles both loads and non-loads so in principle it should be able >>> to handle stores and non-stores for the sinking case (PRE on the >>> reverse CFG). >>> >>> A global dataflow is more powerful than any local ad-hoc method. >> >> My biggest concern is whether the LCM DF framework could support sinking >> *multiple* reverse-dependent non-store instructions together by *one* >> calling of LCM DF. If this is not supported, we need run multiple LCM >> until no new changes, it would be time consuming obviously (unless >> compiling time is not important here). > > As said it is used for PRE and there it most definitely can do that. I did some investigation of PRE and attached a case to show how it works; it is quite like store-motion, and there is also an rtl-hoist pass in gcse.c, which only runs when optimizing for code size. All of them leverage the LCM framework to move instructions upward or downward.
PRE and rtl-hoist move instructions upward: they analyze/hash the SOURCE exprs and call pre_edge_lcm. Store-motion and rtl-sink move instructions downward, so they analyze/hash the DEST exprs and call pre_edge_rev_lcm. All four problems are converted to the LCM DF problem, with four n_basic_blocks * m_exprs matrices (antic, transp, avail, kill) as input and two outputs saying where to insert/delete. PRE scans each instruction and hashes the SRC into the table without *checking the relationship between instructions*: in the attached case, BB 37, BB 38 and BB 41 all contain the SOURCE expr "r262:DI+r139:DI", but BB 37 and BB 41 save it at index 106 while BB 38 saves it at index 110. After finishing this pass, "r262:DI+r139:DI" in BB 41 is replaced with "r194:DI=r452:DI", then exprs are inserted into BB 75~BB 80 to create full redundancies from partial redundancies, and finally the instruction in BB 37 is updated. Issues witnessed in the PRE run: 1) "r262:DI+r139:DI" in BLOCK 38 is not replaced; 2) PRE also uses a pseudo register to store the result, like store-motion and even rtl-hoist. Actually we need a real "move" instead of a "replace" for rtl-sink to improve performance, though that brings a potential register pressure issue like rtl-hoist? 3) The SRC expression is scanned in hash_scan_set WITHOUT checking back chains across instructions, so it couldn't handle the case below even though a+c is the same as b+d. So I am afraid a single run couldn't solve the instruction dependency issue and sink multiple chained instructions out, as rtl-sink did before.
BB1: a = b; c = d; s = a + c; BB2: s = b + d; gcse.c: changed = one_pre_gcse_pass () alloc_hash_table (&expr_hash_table); compute_hash_table (&expr_hash_table); compute_hash_table_work (table); FOR_EACH_BB_FN (current_bb, cfun) FOR_BB_INSNS (current_bb, insn) hash_scan_insn (insn, table); hash_scan_set (pat, insn, table); if REG_P (dest) insert_expr_in_table (src, GET_MODE (dest), insn, antic_p, avail_p, max_distance, table); hash = hash_expr (x, mode, &do_not_record_p, table->size); dump_hash_table (dump_file, "Expression", &expr_hash_table); edge_list = compute_pre_data (); compute_local_properties (transp, comp, antloc, &expr_hash_table); edge_list = pre_edge_lcm (expr_hash_table.n_elems, transp, comp, antloc, ae_kill, &pre_insert_map, &pre_delete_map); changed |= pre_gcse (edge_list); changed = pre_delete (); /* Create a pseudo-reg to store the result of reaching expressions into. */ did_insert = pre_edge_insert (edge_list, index_map); > >> >>> >>> Richard. >>> >>>> However, there are still some common methods could be shared, like the >>>> def-use check(though store-motion is per bb, rtl-sink is per loop), >>>> insert_store, commit_edge_insertions etc. >>>> >>>> >>>> 508: L508: >>>> 507: NOTE_INSN_BASIC_BLOCK 34 >>>> 12: r139:DI=r140:DI >>>> REG_DEAD r140:DI >>>> 240: L240: >>>> 231: NOTE_INSN_BASIC_BLOCK 35 >>>> 232: r142:DI=zero_extend(r139:DI#0) >>>> 233: r371:SI=r142:DI#0-0x1 >>>> 234: r243:DI=zero_extend(r371:SI) >>>> REG_DEAD r371:SI >>>> 235: r452:DI=r262:DI+r139:DI >>>> 538: r194:DI=r452:DI >>>> 236: r372:CCUNS=cmp(r142:DI#0,r254:DI#0) >> >> >> Like here, Each instruction's dest reg is calculated in the input vector >> bitmap, after solving the equations by calling pre_edge_rev_lcm, >> move #538 out of loop for the first call, then move #235 out of loop >> after a second call... 4 repeat calls needed in total here, is the LCM >> framework smart enough to move the all 4 instruction within one iteration? 
>> I am worried that the input vector bitmap couldn't solve the dependency >> problem for two back chained instructions. >> >> >> > -- Thanks, Xionghu [-- Attachment #2: Screen Shot 2021-04-20 at 17.20.47.png --] [-- Type: image/png, Size: 244493 bytes --] [-- Attachment #3: test_gcse.c.255r.pre --] [-- Type: text/plain, Size: 48715 bytes --] ;; Function compute_on_bytes (compute_on_bytes, funcdef_no=24, decl_uid=4337, cgraph_uid=25, symbol_order=24) starting the processing of deferred insns ending the processing of deferred insns df_analyze called df_worklist_dataflow_doublequeue: n_basic_blocks 79 n_edges 117 count 89 ( 1.1) df_worklist_dataflow_doublequeue: n_basic_blocks 79 n_edges 117 count 97 ( 1.2) Expression hash table (131 buckets, 189 entries) Index 0 (hash value 104; max distance 0) (compare:CC (reg/v:DI 274 [ in_len ]) (const_int 0 [0])) Index 1 (hash value 106; max distance 0) (compare:CC (reg/v:DI 276 [ out_len ]) (const_int 0 [0])) Index 2 (hash value 97; max distance 0) (plus:DI (reg/v/f:DI 273 [ in_data ]) (reg/v:DI 274 [ in_len ])) Index 3 (hash value 101; max distance 0) (plus:DI (reg/v/f:DI 275 [ out_data ]) (reg/v:DI 276 [ out_len ])) Index 4 (hash value 115; max distance 0) (plus:DI (reg/v/f:DI 275 [ out_data ]) (const_int 1 [0x1])) Index 5 (hash value 59; max distance 0) (bswap:HI (mem:HI (reg/v/f:DI 273 [ in_data ]) [0 MEM <short unsigned int> [(uint8_t *)in_data_169(D)]+0 S2 A8])) Index 6 (hash value 4; max distance 0) (zero_extend:SI (reg:HI 279)) Index 7 (hash value 6; max distance 0) (zero_extend:DI (reg:SI 280)) Index 8 (hash value 85; max distance 0) (plus:DI (reg/v/f:DI 248 [ in_end ]) (const_int -2 [0xfffffffffffffffe])) Index 9 (hash value 79; max distance 0) (compare:CCUNS (reg/v/f:DI 273 [ in_data ]) (reg/f:DI 264 [ _254 ])) Index 10 (hash value 85; max distance 0) (ashift:SI (subreg/s/v:SI (reg/v:DI 229 [ hval ]) 0) (const_int 8 [0x8])) Index 11 (hash value 3; max distance 0) (zero_extend:DI (mem:QI (plus:DI (reg/v/f:DI 262 [ ip 
]) (const_int 2 [0x2])) [0 MEM[(uint8_t *)ip_229 + 2B]+0 S1 A8])) Index 12 (hash value 92; max distance 0) (ior:SI (subreg:SI (reg:DI 123 [ _10 ]) 0) (reg:SI 282)) Index 13 (hash value 9; max distance 0) (zero_extend:DI (reg:SI 283)) Index 14 (hash value 17; max distance 0) (lshiftrt:SI (reg:SI 283) (const_int 8 [0x8])) Index 15 (hash value 8; max distance 0) (ashift:SI (reg:SI 283) (const_int 2 [0x2])) Index 16 (hash value 129; max distance 0) (mult:SI (reg:SI 283) (const_int 5 [0x5])) Index 17 (hash value 122; max distance 0) (minus:SI (reg:SI 284) (reg:SI 288)) Index 18 (hash value 43; max distance 0) (and:SI (reg:SI 289) (const_int 65535 [0xffff])) Index 19 (hash value 16; max distance 0) (zero_extend:DI (reg:SI 290)) Index 20 (hash value 18; max distance 0) (ashift:DI (reg:DI 291) (const_int 3 [0x3])) Index 21 (hash value 112; max distance 0) (plus:DI (reg/f:DI 110 sfp) (const_int 32 [0x20])) Index 22 (hash value 29; max distance 0) (plus:DI (reg/f:DI 449) (reg:DI 292)) Index 23 (hash value 24; max distance 0) (zero_extend:DI (mem:QI (reg/v/f:DI 250 [ hslot ]) [0 *hslot_181+0 S1 A64])) Index 24 (hash value 116; max distance 0) (plus:DI (reg/v/f:DI 273 [ in_data ]) (reg:DI 293 [ *hslot_181 ])) Index 25 (hash value 86; max distance 0) (minus:DI (reg/v/f:DI 262 [ ip ]) (reg/v/f:DI 273 [ in_data ])) Index 26 (hash value 64; max distance 0) (minus:DI (reg/v/f:DI 262 [ ip ]) (reg/v/f:DI 251 [ ref ])) Index 27 (hash value 2; max distance 0) (plus:DI (reg:DI 295) (const_int -1 [0xffffffffffffffff])) Index 28 (hash value 35; max distance 0) (compare:CCUNS (reg:DI 135 [ _22 ]) (const_int 8191 [0x1fff])) Index 29 (hash value 66; max distance 0) (compare:CCUNS (reg/v/f:DI 273 [ in_data ]) (reg/v/f:DI 251 [ ref ])) Index 30 (hash value 122; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 251 [ ref ]) (const_int 2 [0x2])) [0 MEM[(uint8_t *)ref_182 + 2B]+0 S1 A8])) Index 31 (hash value 87; max distance 0) (compare:CC (reg:SI 298 [ MEM[(uint8_t *)ref_182 + 2B] ]) 
(subreg:SI (reg:DI 123 [ _10 ]) 0)) Index 32 (hash value 24; max distance 0) (zero_extend:SI (mem:HI (reg/v/f:DI 251 [ ref ]) [0 MEM <short unsigned int> [(uint8_t *)ref_182]+0 S2 A8])) Index 33 (hash value 35; max distance 0) (zero_extend:SI (mem:HI (reg/v/f:DI 262 [ ip ]) [0 MEM <short unsigned int> [(uint8_t *)ip_229]+0 S2 A8])) Index 34 (hash value 11; max distance 0) (compare:CC (reg:SI 300 [ MEM <short unsigned int> [(uint8_t *)ref_182] ]) (reg:SI 301 [ MEM <short unsigned int> [(uint8_t *)ip_229] ])) Index 35 (hash value 61; max distance 0) (minus:DI (reg/v/f:DI 248 [ in_end ]) (reg/v/f:DI 262 [ ip ])) Index 36 (hash value 2; max distance 0) (plus:SI (subreg:SI (reg:DI 303) 0) (const_int -2 [0xfffffffffffffffe])) Index 37 (hash value 69; max distance 0) (plus:DI (reg/v/f:DI 226 [ op ]) (const_int 4 [0x4])) Index 38 (hash value 96; max distance 0) (compare:CCUNS (reg/v/f:DI 249 [ out_end ]) (reg/f:DI 305)) Index 39 (hash value 31; max distance 0) (compare:CC (reg/v:DI 201 [ lit ]) (const_int 0 [0])) Index 40 (hash value 68; max distance 0) (plus:DI (reg/v/f:DI 226 [ op ]) (const_int 3 [0x3])) Index 41 (hash value 99; max distance 0) (compare:CCUNS (reg/v/f:DI 249 [ out_end ]) (reg/f:DI 308)) Index 42 (hash value 3; max distance 0) (neg:SI (subreg/s/u:SI (reg/v:DI 201 [ lit ]) 0)) Index 43 (hash value 35; max distance 0) (sign_extend:DI (reg:SI 310)) Index 44 (hash value 87; max distance 0) (plus:DI (reg/v/f:DI 226 [ op ]) (reg:DI 311)) Index 45 (hash value 32; max distance 0) (plus:SI (subreg:SI (reg/v:DI 201 [ lit ]) 0) (const_int -1 [0xffffffffffffffff])) Index 46 (hash value 65; max distance 0) (eq:SI (subreg/s/u:SI (reg/v:DI 201 [ lit ]) 0) (const_int 0 [0])) Index 47 (hash value 42; max distance 0) (zero_extend:DI (reg:SI 316)) Index 48 (hash value 92; max distance 0) (minus:DI (reg/v/f:DI 226 [ op ]) (reg:DI 315)) Index 49 (hash value 20; max distance 0) (compare:CCUNS (reg:SI 304) (const_int 16 [0x10])) Index 50 (hash value 123; max distance 0) 
(zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 251 [ ref ]) (const_int 3 [0x3])) [0 MEM[(uint8_t *)ref_182 + 3B]+0 S1 A8])) Index 51 (hash value 3; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 262 [ ip ]) (const_int 3 [0x3])) [0 MEM[(uint8_t *)ip_229 + 3B]+0 S1 A8])) Index 52 (hash value 47; max distance 0) (compare:CC (reg:SI 318 [ MEM[(uint8_t *)ref_182 + 3B] ]) (reg:SI 319 [ MEM[(uint8_t *)ip_229 + 3B] ])) Index 53 (hash value 124; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 251 [ ref ]) (const_int 4 [0x4])) [0 MEM[(uint8_t *)ref_182 + 4B]+0 S1 A8])) Index 54 (hash value 4; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 262 [ ip ]) (const_int 4 [0x4])) [0 MEM[(uint8_t *)ip_229 + 4B]+0 S1 A8])) Index 55 (hash value 53; max distance 0) (compare:CC (reg:SI 321 [ MEM[(uint8_t *)ref_182 + 4B] ]) (reg:SI 322 [ MEM[(uint8_t *)ip_229 + 4B] ])) Index 56 (hash value 125; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 251 [ ref ]) (const_int 5 [0x5])) [0 MEM[(uint8_t *)ref_182 + 5B]+0 S1 A8])) Index 57 (hash value 5; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 262 [ ip ]) (const_int 5 [0x5])) [0 MEM[(uint8_t *)ip_229 + 5B]+0 S1 A8])) Index 58 (hash value 59; max distance 0) (compare:CC (reg:SI 324 [ MEM[(uint8_t *)ref_182 + 5B] ]) (reg:SI 325 [ MEM[(uint8_t *)ip_229 + 5B] ])) Index 59 (hash value 126; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 251 [ ref ]) (const_int 6 [0x6])) [0 MEM[(uint8_t *)ref_182 + 6B]+0 S1 A8])) Index 60 (hash value 6; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 262 [ ip ]) (const_int 6 [0x6])) [0 MEM[(uint8_t *)ip_229 + 6B]+0 S1 A8])) Index 61 (hash value 65; max distance 0) (compare:CC (reg:SI 327 [ MEM[(uint8_t *)ref_182 + 6B] ]) (reg:SI 328 [ MEM[(uint8_t *)ip_229 + 6B] ])) Index 62 (hash value 127; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 251 [ ref ]) (const_int 7 [0x7])) [0 MEM[(uint8_t *)ref_182 + 7B]+0 S1 A8])) 
Index 63 (hash value 7; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 262 [ ip ]) (const_int 7 [0x7])) [0 MEM[(uint8_t *)ip_229 + 7B]+0 S1 A8])) Index 64 (hash value 71; max distance 0) (compare:CC (reg:SI 330 [ MEM[(uint8_t *)ref_182 + 7B] ]) (reg:SI 331 [ MEM[(uint8_t *)ip_229 + 7B] ])) Index 65 (hash value 128; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 251 [ ref ]) (const_int 8 [0x8])) [0 MEM[(uint8_t *)ref_182 + 8B]+0 S1 A8])) Index 66 (hash value 8; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 262 [ ip ]) (const_int 8 [0x8])) [0 MEM[(uint8_t *)ip_229 + 8B]+0 S1 A8])) Index 67 (hash value 77; max distance 0) (compare:CC (reg:SI 333 [ MEM[(uint8_t *)ref_182 + 8B] ]) (reg:SI 334 [ MEM[(uint8_t *)ip_229 + 8B] ])) Index 68 (hash value 129; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 251 [ ref ]) (const_int 9 [0x9])) [0 MEM[(uint8_t *)ref_182 + 9B]+0 S1 A8])) Index 69 (hash value 9; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 262 [ ip ]) (const_int 9 [0x9])) [0 MEM[(uint8_t *)ip_229 + 9B]+0 S1 A8])) Index 70 (hash value 83; max distance 0) (compare:CC (reg:SI 336 [ MEM[(uint8_t *)ref_182 + 9B] ]) (reg:SI 337 [ MEM[(uint8_t *)ip_229 + 9B] ])) Index 71 (hash value 130; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 251 [ ref ]) (const_int 10 [0xa])) [0 MEM[(uint8_t *)ref_182 + 10B]+0 S1 A8])) Index 72 (hash value 10; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 262 [ ip ]) (const_int 10 [0xa])) [0 MEM[(uint8_t *)ip_229 + 10B]+0 S1 A8])) Index 73 (hash value 89; max distance 0) (compare:CC (reg:SI 339 [ MEM[(uint8_t *)ref_182 + 10B] ]) (reg:SI 340 [ MEM[(uint8_t *)ip_229 + 10B] ])) Index 74 (hash value 0; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 251 [ ref ]) (const_int 11 [0xb])) [0 MEM[(uint8_t *)ref_182 + 11B]+0 S1 A8])) Index 75 (hash value 11; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 262 [ ip ]) (const_int 11 
[0xb])) [0 MEM[(uint8_t *)ip_229 + 11B]+0 S1 A8])) Index 76 (hash value 95; max distance 0) (compare:CC (reg:SI 342 [ MEM[(uint8_t *)ref_182 + 11B] ]) (reg:SI 343 [ MEM[(uint8_t *)ip_229 + 11B] ])) Index 77 (hash value 1; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 251 [ ref ]) (const_int 12 [0xc])) [0 MEM[(uint8_t *)ref_182 + 12B]+0 S1 A8])) Index 78 (hash value 12; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 262 [ ip ]) (const_int 12 [0xc])) [0 MEM[(uint8_t *)ip_229 + 12B]+0 S1 A8])) Index 79 (hash value 101; max distance 0) (compare:CC (reg:SI 345 [ MEM[(uint8_t *)ref_182 + 12B] ]) (reg:SI 346 [ MEM[(uint8_t *)ip_229 + 12B] ])) Index 80 (hash value 2; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 251 [ ref ]) (const_int 13 [0xd])) [0 MEM[(uint8_t *)ref_182 + 13B]+0 S1 A8])) Index 81 (hash value 13; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 262 [ ip ]) (const_int 13 [0xd])) [0 MEM[(uint8_t *)ip_229 + 13B]+0 S1 A8])) Index 82 (hash value 107; max distance 0) (compare:CC (reg:SI 348 [ MEM[(uint8_t *)ref_182 + 13B] ]) (reg:SI 349 [ MEM[(uint8_t *)ip_229 + 13B] ])) Index 83 (hash value 3; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 251 [ ref ]) (const_int 14 [0xe])) [0 MEM[(uint8_t *)ref_182 + 14B]+0 S1 A8])) Index 84 (hash value 14; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 262 [ ip ]) (const_int 14 [0xe])) [0 MEM[(uint8_t *)ip_229 + 14B]+0 S1 A8])) Index 85 (hash value 113; max distance 0) (compare:CC (reg:SI 351 [ MEM[(uint8_t *)ref_182 + 14B] ]) (reg:SI 352 [ MEM[(uint8_t *)ip_229 + 14B] ])) Index 86 (hash value 4; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 251 [ ref ]) (const_int 15 [0xf])) [0 MEM[(uint8_t *)ref_182 + 15B]+0 S1 A8])) Index 87 (hash value 15; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 262 [ ip ]) (const_int 15 [0xf])) [0 MEM[(uint8_t *)ip_229 + 15B]+0 S1 A8])) Index 88 (hash value 119; max distance 0) 
(compare:CC (reg:SI 354 [ MEM[(uint8_t *)ref_182 + 15B] ]) (reg:SI 355 [ MEM[(uint8_t *)ip_229 + 15B] ])) Index 89 (hash value 5; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 251 [ ref ]) (const_int 16 [0x10])) [0 MEM[(uint8_t *)ref_182 + 16B]+0 S1 A8])) Index 90 (hash value 16; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 262 [ ip ]) (const_int 16 [0x10])) [0 MEM[(uint8_t *)ip_229 + 16B]+0 S1 A8])) Index 91 (hash value 125; max distance 0) (compare:CC (reg:SI 357 [ MEM[(uint8_t *)ref_182 + 16B] ]) (reg:SI 358 [ MEM[(uint8_t *)ip_229 + 16B] ])) Index 92 (hash value 6; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 251 [ ref ]) (const_int 17 [0x11])) [0 MEM[(uint8_t *)ref_182 + 17B]+0 S1 A8])) Index 93 (hash value 17; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 262 [ ip ]) (const_int 17 [0x11])) [0 MEM[(uint8_t *)ip_229 + 17B]+0 S1 A8])) Index 94 (hash value 0; max distance 0) (compare:CC (reg:SI 360 [ MEM[(uint8_t *)ref_182 + 17B] ]) (reg:SI 361 [ MEM[(uint8_t *)ip_229 + 17B] ])) Index 95 (hash value 7; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 251 [ ref ]) (const_int 18 [0x12])) [0 MEM[(uint8_t *)ref_182 + 18B]+0 S1 A8])) Index 96 (hash value 18; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 262 [ ip ]) (const_int 18 [0x12])) [0 MEM[(uint8_t *)ip_229 + 18B]+0 S1 A8])) Index 97 (hash value 6; max distance 0) (compare:CC (reg:SI 363 [ MEM[(uint8_t *)ref_182 + 18B] ]) (reg:SI 364 [ MEM[(uint8_t *)ip_229 + 18B] ])) Index 98 (hash value 6; max distance 0) (compare:CCUNS (reg:SI 304) (const_int 264 [0x108])) Index 99 (hash value 61; max distance 0) (if_then_else:SI (gtu (reg:CCUNS 368) (const_int 0 [0])) (reg:SI 367) (reg:SI 304)) Index 100 (hash value 92; max distance 0) (zero_extend:DI (reg:SI 366)) Index 101 (hash value 77; max distance 0) (plus:SI (subreg/s/v:SI (reg/v:DI 244 [ len ]) 0) (const_int 1 [0x1])) Index 102 (hash value 96; max distance 0) (zero_extend:DI 
(reg:SI 370)) Index 103 (hash value 121; max distance 0) (zero_extend:DI (subreg:SI (reg:DI 139 [ ivtmp.25 ]) 0)) Index 104 (hash value 104; max distance 0) (plus:SI (subreg/s/v:SI (reg:DI 142 [ _34 ]) 0) (const_int -1 [0xffffffffffffffff])) Index 105 (hash value 97; max distance 0) (zero_extend:DI (reg:SI 371)) Index 106 (hash value 82; max distance 0) (plus:DI (reg/v/f:DI 262 [ ip ]) (reg:DI 139 [ ivtmp.25 ])) Index 107 (hash value 57; max distance 0) (compare:CCUNS (subreg/s/v:SI (reg:DI 142 [ _34 ]) 0) (subreg/s/v:SI (reg/v:DI 254 [ maxlen ]) 0)) Index 108 (hash value 110; max distance 0) (plus:DI (reg:DI 139 [ ivtmp.25 ]) (const_int 1 [0x1])) Index 109 (hash value 89; max distance 0) (plus:DI (reg/v/f:DI 251 [ ref ]) (const_int -1 [0xffffffffffffffff])) Index 110 (hash value 112; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 262 [ ip ]) (reg:DI 139 [ ivtmp.25 ])) [0 MEM[(uint8_t *)ip_229 + ivtmp.25_29 * 1]+0 S1 A8])) Index 111 (hash value 93; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/f:DI 373) (reg:DI 140 [ ivtmp.25 ])) [0 MEM[(uint8_t *)_417 + ivtmp.25_31 * 1]+0 S1 A8])) Index 112 (hash value 28; max distance 0) (compare:CC (reg:SI 374 [ MEM[(uint8_t *)ip_229 + ivtmp.25_29 * 1] ]) (reg:SI 375 [ MEM[(uint8_t *)_417 + ivtmp.25_31 * 1] ])) Index 113 (hash value 103; max distance 0) (plus:SI (subreg/s/v:SI (reg:DI 142 [ _34 ]) 0) (const_int -2 [0xfffffffffffffffe])) Index 114 (hash value 103; max distance 0) (zero_extend:DI (reg:SI 377)) Index 115 (hash value 83; max distance 0) (compare:CCUNS (reg:SI 377) (const_int 6 [0x6])) Index 116 (hash value 105; max distance 0) (ashift:SI (reg:SI 377) (const_int 5 [0x5])) Index 117 (hash value 100; max distance 0) (zero_extend:DI (subreg:QI (reg:SI 380) 0)) Index 118 (hash value 95; max distance 0) (plus:DI (reg/v/f:DI 255 [ op ]) (const_int 1 [0x1])) Index 119 (hash value 1; max distance 0) (lshiftrt:DI (reg:DI 135 [ _22 ]) (const_int 8 [0x8])) Index 120 (hash value 54; max distance 0) (plus:SI 
(subreg:SI (reg:DI 381) 0) (subreg:SI (reg:DI 267 [ prephitmp_368 ]) 0)) Index 121 (hash value 103; max distance 0) (zero_extend:DI (subreg:QI (reg:SI 383) 0)) Index 122 (hash value 53; max distance 0) (plus:SI (subreg:SI (reg:DI 384) 0) (const_int -32 [0xffffffffffffffe0])) Index 123 (hash value 106; max distance 0) (zero_extend:DI (subreg:QI (reg:SI 386) 0)) Index 124 (hash value 96; max distance 0) (plus:DI (reg/v/f:DI 255 [ op ]) (const_int 2 [0x2])) Index 125 (hash value 81; max distance 0) (plus:SI (subreg:SI (reg/v:DI 256 [ len ]) 0) (const_int -7 [0xfffffffffffffff9])) Index 126 (hash value 82; max distance 0) (plus:DI (reg/v/f:DI 241 [ op ]) (const_int 2 [0x2])) Index 127 (hash value 0; max distance 0) (compare:CCUNS (reg/f:DI 264 [ _254 ]) (reg/v/f:DI 194 [ ip ])) Index 128 (hash value 61; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 194 [ ip ]) (const_int -2 [0xfffffffffffffffe])) [0 MEM[(uint8_t *)prephitmp_385 + -2B]+0 S1 A8])) Index 129 (hash value 121; max distance 0) (ashift:SI (reg:SI 390 [ MEM[(uint8_t *)prephitmp_385 + -2B] ]) (const_int 8 [0x8])) Index 130 (hash value 62; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 194 [ ip ]) (const_int -1 [0xffffffffffffffff])) [0 MEM[(uint8_t *)prephitmp_385 + -1B]+0 S1 A8])) Index 131 (hash value 83; max distance 0) (ior:SI (reg:SI 391) (reg:SI 392 [ MEM[(uint8_t *)prephitmp_385 + -1B] ])) Index 132 (hash value 124; max distance 0) (ashift:SI (reg:SI 393 [ hval ]) (const_int 8 [0x8])) Index 133 (hash value 98; max distance 0) (zero_extend:SI (mem:QI (reg/v/f:DI 194 [ ip ]) [0 *prephitmp_385+0 S1 A8])) Index 134 (hash value 89; max distance 0) (ior:SI (reg:SI 395 [ *prephitmp_385 ]) (reg:SI 394)) Index 135 (hash value 130; max distance 0) (lshiftrt:SI (reg:SI 396) (const_int 8 [0x8])) Index 136 (hash value 121; max distance 0) (ashift:SI (reg:SI 396) (const_int 2 [0x2])) Index 137 (hash value 111; max distance 0) (mult:SI (reg:SI 396) (const_int 5 [0x5])) Index 138 (hash value 
86; max distance 0) (minus:SI (reg:SI 397) (reg:SI 401)) Index 139 (hash value 25; max distance 0) (and:SI (reg:SI 402) (const_int 65535 [0xffff])) Index 140 (hash value 129; max distance 0) (zero_extend:DI (reg:SI 403)) Index 141 (hash value 0; max distance 0) (ashift:DI (reg:DI 404) (const_int 3 [0x3])) Index 142 (hash value 12; max distance 0) (plus:DI (reg/f:DI 450) (reg:DI 405)) Index 143 (hash value 31; max distance 0) (plus:DI (reg/v/f:DI 194 [ ip ]) (const_int -2 [0xfffffffffffffffe])) Index 144 (hash value 100; max distance 0) (minus:DI (reg/f:DI 407 [ ip ]) (reg/v/f:DI 273 [ in_data ])) Index 145 (hash value 127; max distance 0) (ashift:SI (reg:SI 396) (const_int 8 [0x8])) Index 146 (hash value 64; max distance 0) (zero_extend:SI (mem:QI (plus:DI (reg/v/f:DI 194 [ ip ]) (const_int 1 [0x1])) [0 MEM[(uint8_t *)prephitmp_385 + 1B]+0 S1 A8])) Index 147 (hash value 119; max distance 0) (ior:SI (reg:SI 410 [ MEM[(uint8_t *)prephitmp_385 + 1B] ]) (reg:SI 409)) Index 148 (hash value 6; max distance 0) (zero_extend:DI (reg:SI 411)) Index 149 (hash value 14; max distance 0) (lshiftrt:SI (reg:SI 411) (const_int 8 [0x8])) Index 150 (hash value 5; max distance 0) (ashift:SI (reg:SI 411) (const_int 2 [0x2])) Index 151 (hash value 126; max distance 0) (mult:SI (reg:SI 411) (const_int 5 [0x5])) Index 152 (hash value 116; max distance 0) (minus:SI (reg:SI 412) (reg:SI 416)) Index 153 (hash value 40; max distance 0) (and:SI (reg:SI 417) (const_int 65535 [0xffff])) Index 154 (hash value 13; max distance 0) (zero_extend:DI (reg:SI 418)) Index 155 (hash value 15; max distance 0) (ashift:DI (reg:DI 419) (const_int 3 [0x3])) Index 156 (hash value 27; max distance 0) (plus:DI (reg/f:DI 450) (reg:DI 420)) Index 157 (hash value 80; max distance 0) (plus:DI (reg/v/f:DI 262 [ ip ]) (reg:DI 268 [ prephitmp_370 ])) Index 158 (hash value 115; max distance 0) (minus:DI (reg:DI 422 [ ip ]) (reg/v/f:DI 273 [ in_data ])) Index 159 (hash value 17; max distance 0) (compare:CCUNS (reg/v/f:DI 
249 [ out_end ]) (reg/v/f:DI 226 [ op ])) Index 160 (hash value 34; max distance 0) (plus:SI (subreg/s/u:SI (reg/v:DI 201 [ lit ]) 0) (const_int 1 [0x1])) Index 161 (hash value 19; max distance 0) (sign_extend:DI (reg:SI 425)) Index 162 (hash value 102; max distance 0) (plus:DI (reg/v/f:DI 262 [ ip ]) (const_int 1 [0x1])) Index 163 (hash value 36; max distance 0) (zero_extend:DI (mem:QI (reg/v/f:DI 262 [ ip ]) [0 *ip_229+0 S1 A8])) Index 164 (hash value 25; max distance 0) (compare:CC (reg:SI 425) (const_int 32 [0x20])) Index 165 (hash value 66; max distance 0) (plus:DI (reg/v/f:DI 226 [ op ]) (const_int 1 [0x1])) Index 166 (hash value 67; max distance 0) (plus:DI (reg/v/f:DI 226 [ op ]) (const_int 2 [0x2])) Index 167 (hash value 0; max distance 0) (compare:CCUNS (reg/v/f:DI 194 [ ip ]) (reg/f:DI 264 [ _254 ])) Index 168 (hash value 89; max distance 0) (compare:CCUNS (reg/v/f:DI 249 [ out_end ]) (reg/f:DI 429)) Index 169 (hash value 115; max distance 0) (compare:CCUNS (reg/v/f:DI 194 [ ip ]) (reg/v/f:DI 248 [ in_end ])) Index 170 (hash value 124; max distance 0) (minus:DI (reg/v/f:DI 248 [ in_end ]) (reg/v/f:DI 194 [ ip ])) Index 171 (hash value 32; max distance 0) (plus:DI (reg/v/f:DI 194 [ ip ]) (const_int -1 [0xffffffffffffffff])) Index 172 (hash value 26; max distance 0) (sign_extend:DI (reg:SI 432)) Index 173 (hash value 92; max distance 0) (plus:DI (reg:DI 121 [ ivtmp.19 ]) (const_int 1 [0x1])) Index 174 (hash value 26; max distance 0) (zero_extend:DI (mem:QI (reg:DI 121 [ ivtmp.19 ]) [0 MEM[(uint8_t *)_28]+0 S1 A8])) Index 175 (hash value 32; max distance 0) (compare:CC (reg:SI 432) (const_int 32 [0x20])) Index 176 (hash value 28; max distance 0) (plus:DI (reg:DI 190 [ doloop.16 ]) (const_int -1 [0xffffffffffffffff])) Index 177 (hash value 20; max distance 0) (compare:CC (reg:DI 190 [ doloop.16 ]) (const_int 0 [0])) Index 178 (hash value 31; max distance 0) (sign_extend:DI (reg:SI 437)) Index 179 (hash value 83; max distance 0) (plus:DI (reg/v/f:DI 226 [ op 
]) (reg:DI 438))
Index 180 (hash value 38; max distance 0)
  (zero_extend:DI (reg:SI 443))
Index 181 (hash value 88; max distance 0)
  (minus:DI (reg/v/f:DI 226 [ op ]) (reg:DI 442))
Index 182 (hash value 8; max distance 0)
  (minus:DI (reg:DI 444 [ op ]) (reg/v/f:DI 275 [ out_data ]))
Index 183 (hash value 33; max distance 0)
  (sign_extend:DI (subreg:SI (reg:DI 445) 0))
Index 184 (hash value 95; max distance 0)
  (plus:SI (subreg/s/v:SI (reg/v:DI 265 [ len ]) 0) (const_int -2 [0xfffffffffffffffe]))
Index 185 (hash value 42; max distance 0)
  (zero_extend:DI (reg:SI 447))
Index 186 (hash value 96; max distance 0)
  (plus:SI (subreg/s/v:SI (reg/v:DI 265 [ len ]) 0) (const_int -1 [0xffffffffffffffff]))
Index 187 (hash value 43; max distance 0)
  (zero_extend:DI (reg:SI 448))
Index 188 (hash value 77; max distance 0)
  (plus:DI (reg/v/f:DI 262 [ ip ]) (reg/v:DI 265 [ len ]))
scanning new insn with uid = 528.
deleting insn with uid = 265.
PRE: redundant insn 265 (expression 106) in bb 41, reaching reg is 452
scanning new insn with uid = 529.
deleting insn with uid = 517.
PRE: redundant insn 517 (expression 21) in bb 44, reaching reg is 453
scanning new insn with uid = 530.
deleting insn with uid = 516.
PRE: redundant insn 516 (expression 21) in bb 9, reaching reg is 453
PRE: edge (8,9), copy expression 21
PRE: edge (75,41), copy expression 106
PRE: edge (76,41), copy expression 106
PRE: edge (77,41), copy expression 106
PRE: edge (78,41), copy expression 106
PRE: edge (79,41), copy expression 106
PRE: edge (80,41), copy expression 106
rescanning insn with uid = 235.
scanning new insn with uid = 538.
PRE: bb 37, insn 538, copy expression 106 in insn 235 to reg 452
scanning new insn with uid = 531.
scanning new insn with uid = 532.
scanning new insn with uid = 533.
scanning new insn with uid = 534.
scanning new insn with uid = 535.
scanning new insn with uid = 536.
scanning new insn with uid = 537.
PRE GCSE of compute_on_bytes, 79 basic blocks, 46096 bytes needed, 3 substs, 8 insns created
try_optimize_cfg iteration 1
starting the processing of deferred insns
ending the processing of deferred insns

compute_on_bytes

Dataflow summary:
;; fully invalidated by EH  0 [0] 3 [3] 4 [4] 5 [5] 6 [6] 7 [7] 8 [8] 9 [9] 10 [10] 11 [11] 12 [12] 13 [13] 32 [0] 33 [1] 34 [2] 35 [3] 36 [4] 37 [5] 38 [6] 39 [7] 40 [8] 41 [9] 42 [10] 43 [11] 44 [12] 45 [13] 64 [0] 65 [1] 66 [2] 67 [3] 68 [4] 69 [5] 70 [6] 71 [7] 72 [8] 73 [9] 74 [10] 75 [11] 76 [12] 77 [13] 78 [14] 79 [15] 80 [16] 81 [17] 82 [18] 83 [19] 96 [lr] 97 [ctr] 98 [ca] 100 [0] 101 [1] 105 [5] 106 [6] 107 [7] 109 [vscr]
;; hardware regs used  1 [1] 2 [2] 99 [ap] 109 [vscr] 110 [sfp]
;; regular block artificial uses  1 [1] 2 [2] 31 [31] 99 [ap] 110 [sfp]
;; eh block artificial uses  1 [1] 2 [2] 31 [31] 99 [ap] 110 [sfp]
;; entry block defs  1 [1] 2 [2] 3 [3] 4 [4] 5 [5] 6 [6] 7 [7] 8 [8] 9 [9] 10 [10] 31 [31] 33 [1] 34 [2] 35 [3] 36 [4] 37 [5] 38 [6] 39 [7] 40 [8] 41 [9] 42 [10] 43 [11] 44 [12] 45 [13] 66 [2] 67 [3] 68 [4] 69 [5] 70 [6] 71 [7] 72 [8] 73 [9] 74 [10] 75 [11] 76 [12] 77 [13] 96 [lr] 99 [ap] 109 [vscr] 110 [sfp]
;; exit block uses  1 [1] 2 [2] 3 [3] 31 [31] 108 [vrsave] 109 [vscr] 110 [sfp]
;; regs ever live  3 [3] 4 [4] 5 [5] 6 [6]
;; ref usage  r1={1d,78u} r2={1d,78u} r3={2d,3u} r4={1d,1u} r5={1d,1u} r6={1d,1u} r7={1d} r8={1d} r9={1d} r10={1d} r31={1d,78u} r33={1d} r34={1d} r35={1d} r36={1d} r37={1d} r38={1d} r39={1d} r40={1d} r41={1d} r42={1d} r43={1d} r44={1d} r45={1d} r66={1d} r67={1d} r68={1d} r69={1d} r70={1d} r71={1d} r72={1d} r73={1d} r74={1d} r75={1d} r76={1d} r77={1d} r96={1d} r99={1d,77u} r108={1u} r109={1d,1u} r110={1d,79u,2e} r121={2d,2u} r123={1d,2u} r135={1d,4u} r139={8d,10u,1e} r140={1d,2u} r142={1d,3u} r190={3d,4u} r194={5d,11u} r195={2d,1u} r201={9d,9u} r226={6d,16u} r227={1d,1u} r229={3d,1u} r230={1d,1u} r241={2d,2u} r243={2d,2u} r244={2d,1u} r248={1d,4u} r249={1d,4u} r250={1d,2u} r251={1d,21u}
r254={1d,1u} r255={1d,4u} r256={2d,1u} r262={2d,33u,1e} r264={1d,3u} r265={10d,3u} r267={7d,1u} r268={8d,1u} r272={2d,1u} r273={1d,10u} r274={1d,2u} r275={1d,3u} r276={1d,2u} r277={1d,1u} r278={1d,1u} r279={1d,1u} r280={1d,1u} r281={1d,1u} r282={1d,1u} r283={1d,4u,1e} r284={1d,1u} r287={1d,1u} r288={1d,1u} r289={1d,1u} r290={1d,1u} r291={1d,1u} r292={1d,1u} r293={1d,1u} r294={1d,1u} r295={1d,1u} r296={1d,1u} r297={1d,1u} r298={1d,1u} r299={1d,1u} r300={1d,1u} r301={1d,1u} r302={1d,1u} r303={1d,1u} r304={1d,3u,1e} r305={1d,1u} r306={1d,1u} r307={1d,1u} r308={1d,1u} r309={1d,1u} r310={1d,1u} r311={1d,1u} r312={1d,1u} r314={1d,1u} r315={1d,1u} r316={1d,1u} r317={1d,1u} r318={1d,1u} r319={1d,1u} r320={1d,1u} r321={1d,1u} r322={1d,1u} r323={1d,1u} r324={1d,1u} r325={1d,1u} r326={1d,1u} r327={1d,1u} r328={1d,1u} r329={1d,1u} r330={1d,1u} r331={1d,1u} r332={1d,1u} r333={1d,1u} r334={1d,1u} r335={1d,1u} r336={1d,1u} r337={1d,1u} r338={1d,1u} r339={1d,1u} r340={1d,1u} r341={1d,1u} r342={1d,1u} r343={1d,1u} r344={1d,1u} r345={1d,1u} r346={1d,1u} r347={1d,1u} r348={1d,1u} r349={1d,1u} r350={1d,1u} r351={1d,1u} r352={1d,1u} r353={1d,1u} r354={1d,1u} r355={1d,1u} r356={1d,1u} r357={1d,1u} r358={1d,1u} r359={1d,1u} r360={1d,1u} r361={1d,1u} r362={1d,1u} r363={1d,1u} r364={1d,1u} r365={1d,1u} r366={1d,1u} r367={1d,1u} r368={1d,1u,1e} r370={1d,1u} r371={1d,1u} r372={1d,1u} r373={1d,1u} r374={1d,1u} r375={1d,1u} r376={1d,1u} r377={1d,3u} r378={1d,1u} r380={1d,1u} r381={1d,1u} r383={1d,1u} r384={1d,1u} r386={1d,1u} r388={1d,1u} r389={1d,1u} r390={1d,1u} r391={1d,1u} r392={1d,1u} r393={1d,1u} r394={1d,1u} r395={1d,1u} r396={1d,4u,1e} r397={1d,1u} r400={1d,1u} r401={1d,1u} r402={1d,1u} r403={1d,1u} r404={1d,1u} r405={1d,1u} r406={1d,1u} r407={1d,1u} r408={1d,1u} r409={1d,1u} r410={1d,1u} r411={1d,4u,1e} r412={1d,1u} r415={1d,1u} r416={1d,1u} r417={1d,1u} r418={1d,1u} r419={1d,1u} r420={1d,1u} r421={1d,1u} r422={1d,1u} r423={1d,1u} r424={1d,1u} r425={1d,2u} r426={1d,1u} r427={1d,1u} 
r428={1d,1u} r429={1d,1u} r430={1d,1u} r431={1d,1u} r432={1d,2u} r433={1d,1u} r434={1d,1u} r435={1d,1u} r436={1d,1u} r437={1d,1u} r438={1d,1u} r439={1d,1u} r441={1d,1u} r442={1d,1u} r443={1d,1u} r444={1d,1u} r445={1d,1u} r447={1d,1u} r448={1d,1u} r449={1d,1u} r450={1d,2u} r452={7d,2u} r453={1d,2u} ;; total ref usage 1056{301d,746u,9e} in 313{313 regular + 0 call} insns. 49: NOTE_INSN_BASIC_BLOCK 2 2: r273:DI=%3:DI REG_DEAD %3:DI 3: r274:DI=%4:DI REG_DEAD %4:DI 4: r275:DI=%5:DI REG_DEAD %5:DI 5: r276:DI=%6:DI REG_DEAD %6:DI 6: NOTE_INSN_FUNCTION_BEG 51: r277:CC=cmp(r274:DI,0) 52: pc={(r277:CC!=0)?L56:pc} REG_DEAD r277:CC REG_BR_PROB 708669604 58: L58: 53: NOTE_INSN_BASIC_BLOCK 3 20: r272:DI=0 ; pc falls through to BB 78 56: L56: 57: NOTE_INSN_BASIC_BLOCK 4 59: r278:CC=cmp(r276:DI,0) 60: pc={(r278:CC==0)?L58:pc} REG_DEAD r278:CC REG_BR_PROB 365072228 61: NOTE_INSN_BASIC_BLOCK 5 62: r248:DI=r273:DI+r274:DI REG_DEAD r274:DI 63: r249:DI=r275:DI+r276:DI REG_DEAD r276:DI 64: r226:DI=r275:DI+0x1 65: r279:HI=bswap([r273:DI]) 66: r280:SI=zero_extend(r279:HI) REG_DEAD r279:HI 67: r229:DI=zero_extend(r280:SI) REG_DEAD r280:SI 68: r264:DI=r248:DI-0x2 69: r281:CCUNS=cmp(r273:DI,r264:DI) 70: pc={(geu(r281:CCUNS,0))?L439:pc} REG_DEAD r281:CCUNS REG_BR_PROB 29527908 437: NOTE_INSN_BASIC_BLOCK 6 8: r262:DI=r273:DI 9: r201:DI=0 531: r453:DI=sfp:DI+0x20 358: L358: 71: NOTE_INSN_BASIC_BLOCK 7 72: r282:SI=r229:DI#0<<0x8 REG_DEAD r229:DI 74: r123:DI=zero_extend([r262:DI+0x2]) 75: r283:SI=r123:DI#0|r282:SI 76: r229:DI=zero_extend(r283:SI) REG_DEAD r283:SI 77: r284:SI=r283:SI 0>>0x8 80: r287:SI=r283:SI<<0x2 82: r288:SI=r287:SI+r283:SI REG_EQUAL r283:SI*0x5 83: r289:SI=r284:SI-r288:SI REG_DEAD r288:SI REG_DEAD r284:SI 84: r290:SI=r289:SI&0xffff REG_DEAD r289:SI 85: r291:DI=zero_extend(r290:SI) REG_DEAD r290:SI 86: r292:DI=r291:DI<<0x3 REG_DEAD r291:DI 530: r449:DI=r453:DI REG_EQUAL sfp:DI+0x20 87: r250:DI=r449:DI+r292:DI REG_DEAD r449:DI REG_DEAD r292:DI 88: r293:DI=zero_extend([r250:DI]) 
89: r251:DI=r273:DI+r293:DI REG_DEAD r293:DI 90: r294:DI=r262:DI-r273:DI 91: [r250:DI]=r294:DI#0 REG_DEAD r294:DI REG_DEAD r250:DI 92: r295:DI=r262:DI-r251:DI 93: r135:DI=r295:DI-0x1 REG_DEAD r295:DI 97: r296:CCUNS=cmp(r135:DI,0x1fff) 98: pc={(gtu(r296:CCUNS,0))?L331:pc} REG_DEAD r296:CCUNS REG_BR_PROB 536870916 99: NOTE_INSN_BASIC_BLOCK 8 100: r297:CCUNS=cmp(r273:DI,r251:DI) 101: pc={(geu(r297:CCUNS,0))?L331:pc} REG_DEAD r297:CCUNS REG_BR_PROB 536870916 102: NOTE_INSN_BASIC_BLOCK 9 103: r298:SI=zero_extend([r251:DI+0x2]) 104: r299:CC=cmp(r298:SI,r123:DI#0) REG_DEAD r298:SI REG_DEAD r123:DI 105: pc={(r299:CC!=0)?L331:pc} REG_DEAD r299:CC REG_BR_PROB 708669604 106: NOTE_INSN_BASIC_BLOCK 10 107: r300:SI=zero_extend([r251:DI]) 108: r301:SI=zero_extend([r262:DI]) 109: r302:CC=cmp(r300:SI,r301:SI) REG_DEAD r301:SI REG_DEAD r300:SI 110: pc={(r302:CC!=0)?L331:pc} REG_DEAD r302:CC REG_BR_PROB 708669604 111: NOTE_INSN_BASIC_BLOCK 11 112: r303:DI=r248:DI-r262:DI 113: r304:SI=r303:DI#0-0x2 REG_DEAD r303:DI 115: r305:DI=r226:DI+0x4 116: r306:CCUNS=cmp(r249:DI,r305:DI) REG_DEAD r305:DI 117: pc={(gtu(r306:CCUNS,0))?L125:pc} REG_DEAD r306:CCUNS REG_BR_PROB 536870916 118: NOTE_INSN_BASIC_BLOCK 12 119: r307:CC=cmp(r201:DI,0) 120: pc={(r307:CC!=0)?L58:pc} REG_DEAD r307:CC REG_BR_PROB 536870916 121: NOTE_INSN_BASIC_BLOCK 13 122: r308:DI=r226:DI+0x3 123: r309:CCUNS=cmp(r249:DI,r308:DI) REG_DEAD r308:DI 124: pc={(leu(r309:CCUNS,0))?L58:pc} REG_DEAD r309:CCUNS REG_BR_PROB 4 125: L125: 126: NOTE_INSN_BASIC_BLOCK 14 127: r310:SI=-r201:DI#0 128: r311:DI=sign_extend(r310:SI) REG_DEAD r310:SI 129: r312:DI=r226:DI+r311:DI REG_DEAD r311:DI 131: r314:SI=r201:DI#0-0x1 132: [r312:DI-0x1]=r314:SI#0 REG_DEAD r314:SI REG_DEAD r312:DI 133: {r316:SI=r201:DI#0==0;clobber scratch;clobber scratch;} REG_DEAD r201:DI 134: r315:DI=zero_extend(r316:SI) REG_DEAD r316:SI 135: r255:DI=r226:DI-r315:DI REG_DEAD r315:DI REG_DEAD r226:DI 138: r317:CCUNS=cmp(r304:SI,0x10) 139: pc={(leu(r317:CCUNS,0))?L444:pc} 
REG_DEAD r317:CCUNS REG_BR_PROB 289910300 140: NOTE_INSN_BASIC_BLOCK 15 141: r318:SI=zero_extend([r251:DI+0x3]) 142: r319:SI=zero_extend([r262:DI+0x3]) 143: r320:CC=cmp(r318:SI,r319:SI) REG_DEAD r319:SI REG_DEAD r318:SI 144: pc={(r320:CC!=0)?L446:pc} REG_DEAD r320:CC REG_BR_PROB 708669604 145: NOTE_INSN_BASIC_BLOCK 16 146: r321:SI=zero_extend([r251:DI+0x4]) 147: r322:SI=zero_extend([r262:DI+0x4]) 148: r323:CC=cmp(r321:SI,r322:SI) REG_DEAD r322:SI REG_DEAD r321:SI 149: pc={(r323:CC!=0)?L450:pc} REG_DEAD r323:CC REG_BR_PROB 708669604 150: NOTE_INSN_BASIC_BLOCK 17 151: r324:SI=zero_extend([r251:DI+0x5]) 152: r325:SI=zero_extend([r262:DI+0x5]) 153: r326:CC=cmp(r324:SI,r325:SI) REG_DEAD r325:SI REG_DEAD r324:SI 154: pc={(r326:CC!=0)?L454:pc} REG_DEAD r326:CC REG_BR_PROB 708669604 155: NOTE_INSN_BASIC_BLOCK 18 156: r327:SI=zero_extend([r251:DI+0x6]) 157: r328:SI=zero_extend([r262:DI+0x6]) 158: r329:CC=cmp(r327:SI,r328:SI) REG_DEAD r328:SI REG_DEAD r327:SI 159: pc={(r329:CC!=0)?L458:pc} REG_DEAD r329:CC REG_BR_PROB 708669604 160: NOTE_INSN_BASIC_BLOCK 19 161: r330:SI=zero_extend([r251:DI+0x7]) 162: r331:SI=zero_extend([r262:DI+0x7]) 163: r332:CC=cmp(r330:SI,r331:SI) REG_DEAD r331:SI REG_DEAD r330:SI 164: pc={(r332:CC!=0)?L462:pc} REG_DEAD r332:CC REG_BR_PROB 708669604 165: NOTE_INSN_BASIC_BLOCK 20 166: r333:SI=zero_extend([r251:DI+0x8]) 167: r334:SI=zero_extend([r262:DI+0x8]) 168: r335:CC=cmp(r333:SI,r334:SI) REG_DEAD r334:SI REG_DEAD r333:SI 169: pc={(r335:CC!=0)?L466:pc} REG_DEAD r335:CC REG_BR_PROB 708669604 170: NOTE_INSN_BASIC_BLOCK 21 171: r336:SI=zero_extend([r251:DI+0x9]) 172: r337:SI=zero_extend([r262:DI+0x9]) 173: r338:CC=cmp(r336:SI,r337:SI) REG_DEAD r337:SI REG_DEAD r336:SI 174: pc={(r338:CC!=0)?L468:pc} REG_DEAD r338:CC REG_BR_PROB 708669604 175: NOTE_INSN_BASIC_BLOCK 22 176: r339:SI=zero_extend([r251:DI+0xa]) 177: r340:SI=zero_extend([r262:DI+0xa]) 178: r341:CC=cmp(r339:SI,r340:SI) REG_DEAD r340:SI REG_DEAD r339:SI 179: pc={(r341:CC!=0)?L472:pc} REG_DEAD 
r341:CC REG_BR_PROB 708669604 180: NOTE_INSN_BASIC_BLOCK 23 181: r342:SI=zero_extend([r251:DI+0xb]) 182: r343:SI=zero_extend([r262:DI+0xb]) 183: r344:CC=cmp(r342:SI,r343:SI) REG_DEAD r343:SI REG_DEAD r342:SI 184: pc={(r344:CC!=0)?L476:pc} REG_DEAD r344:CC REG_BR_PROB 708669604 185: NOTE_INSN_BASIC_BLOCK 24 186: r345:SI=zero_extend([r251:DI+0xc]) 187: r346:SI=zero_extend([r262:DI+0xc]) 188: r347:CC=cmp(r345:SI,r346:SI) REG_DEAD r346:SI REG_DEAD r345:SI 189: pc={(r347:CC!=0)?L480:pc} REG_DEAD r347:CC REG_BR_PROB 708669604 190: NOTE_INSN_BASIC_BLOCK 25 191: r348:SI=zero_extend([r251:DI+0xd]) 192: r349:SI=zero_extend([r262:DI+0xd]) 193: r350:CC=cmp(r348:SI,r349:SI) REG_DEAD r349:SI REG_DEAD r348:SI 194: pc={(r350:CC!=0)?L484:pc} REG_DEAD r350:CC REG_BR_PROB 708669604 195: NOTE_INSN_BASIC_BLOCK 26 196: r351:SI=zero_extend([r251:DI+0xe]) 197: r352:SI=zero_extend([r262:DI+0xe]) 198: r353:CC=cmp(r351:SI,r352:SI) REG_DEAD r352:SI REG_DEAD r351:SI 199: pc={(r353:CC!=0)?L488:pc} REG_DEAD r353:CC REG_BR_PROB 708669604 200: NOTE_INSN_BASIC_BLOCK 27 201: r354:SI=zero_extend([r251:DI+0xf]) 202: r355:SI=zero_extend([r262:DI+0xf]) 203: r356:CC=cmp(r354:SI,r355:SI) REG_DEAD r355:SI REG_DEAD r354:SI 204: pc={(r356:CC!=0)?L492:pc} REG_DEAD r356:CC REG_BR_PROB 708669604 205: NOTE_INSN_BASIC_BLOCK 28 206: r357:SI=zero_extend([r251:DI+0x10]) 207: r358:SI=zero_extend([r262:DI+0x10]) 208: r359:CC=cmp(r357:SI,r358:SI) REG_DEAD r358:SI REG_DEAD r357:SI 209: pc={(r359:CC!=0)?L496:pc} REG_DEAD r359:CC REG_BR_PROB 708669604 210: NOTE_INSN_BASIC_BLOCK 29 211: r360:SI=zero_extend([r251:DI+0x11]) 212: r361:SI=zero_extend([r262:DI+0x11]) 213: r362:CC=cmp(r360:SI,r361:SI) REG_DEAD r361:SI REG_DEAD r360:SI 214: pc={(r362:CC!=0)?L500:pc} REG_DEAD r362:CC REG_BR_PROB 708669604 215: NOTE_INSN_BASIC_BLOCK 30 216: r363:SI=zero_extend([r251:DI+0x12]) 217: r364:SI=zero_extend([r262:DI+0x12]) 218: r365:CC=cmp(r363:SI,r364:SI) REG_DEAD r364:SI REG_DEAD r363:SI 219: pc={(r365:CC!=0)?L504:pc} REG_DEAD r365:CC 
REG_BR_PROB 901943140 440: NOTE_INSN_BASIC_BLOCK 31 10: r244:DI=0x12 ; pc falls through to BB 33 444: L444: 443: NOTE_INSN_BASIC_BLOCK 32 11: r244:DI=0x2 220: L220: 221: NOTE_INSN_BASIC_BLOCK 33 224: r367:SI=0x108 225: r368:CCUNS=cmp(r304:SI,0x108) 227: r366:SI={(gtu(r368:CCUNS,0))?r367:SI:r304:SI} REG_EQUAL {(gtu(r368:CCUNS,0))?0x108:r304:SI} 228: r254:DI=zero_extend(r366:SI) REG_DEAD r366:SI 229: r370:SI=r244:DI#0+0x1 REG_DEAD r244:DI 230: r139:DI=zero_extend(r370:SI) REG_DEAD r370:SI ; pc falls through to BB 35 508: L508: 507: NOTE_INSN_BASIC_BLOCK 34 12: r139:DI=r140:DI REG_DEAD r140:DI 240: L240: 231: NOTE_INSN_BASIC_BLOCK 35 232: r142:DI=zero_extend(r139:DI#0) 233: r371:SI=r142:DI#0-0x1 234: r243:DI=zero_extend(r371:SI) REG_DEAD r371:SI 235: r452:DI=r262:DI+r139:DI 538: r194:DI=r452:DI 236: r372:CCUNS=cmp(r142:DI#0,r254:DI#0) 237: pc={(geu(r372:CCUNS,0))?L246:pc} REG_DEAD r372:CCUNS REG_BR_PROB 59055804 238: NOTE_INSN_BASIC_BLOCK 36 239: r140:DI=r139:DI+0x1 241: r373:DI=r251:DI-0x1 242: r374:SI=zero_extend([r262:DI+r139:DI]) 243: r375:SI=zero_extend([r373:DI+r140:DI]) REG_DEAD r373:DI 244: r376:CC=cmp(r374:SI,r375:SI) REG_DEAD r375:SI REG_DEAD r374:SI 245: pc={(r376:CC==0)?L508:pc} REG_DEAD r376:CC REG_BR_PROB 1014686028 246: L246: 247: NOTE_INSN_BASIC_BLOCK 37 248: r377:SI=r142:DI#0-0x2 REG_DEAD r142:DI 249: r256:DI=zero_extend(r377:SI) REG_DEAD r377:SI 252: r378:CCUNS=cmp(r377:SI,0x6) 253: pc={(gtu(r378:CCUNS,0))?L268:pc} REG_DEAD r378:CCUNS REG_BR_PROB 1073741828 254: NOTE_INSN_BASIC_BLOCK 38 255: r268:DI=r243:DI REG_DEAD r243:DI 257: r380:SI=r377:SI<<0x5 258: r267:DI=zero_extend(r380:SI#0) REG_DEAD r380:SI 430: L430: 259: NOTE_INSN_BASIC_BLOCK 39 260: r241:DI=r255:DI+0x1 261: r381:DI=r135:DI 0>>0x8 263: r383:SI=r381:DI#0+r267:DI#0 264: r195:DI=zero_extend(r383:SI#0) REG_DEAD r383:SI 528: r194:DI=r452:DI REG_EQUAL r262:DI+r139:DI ; pc falls through to BB 41 268: L268: 269: NOTE_INSN_BASIC_BLOCK 40 270: r384:DI=r135:DI 0>>0x8 272: r386:SI=r384:DI#0-0x20 
273: r195:DI=zero_extend(r386:SI#0) REG_DEAD r386:SI 274: r241:DI=r255:DI+0x2 276: r388:SI=r256:DI#0-0x7 277: [r255:DI+0x1]=r388:SI#0 REG_DEAD r388:SI 278: r268:DI=r243:DI REG_DEAD r243:DI 279: L279: 280: NOTE_INSN_BASIC_BLOCK 41 281: [r255:DI]=r195:DI#0 REG_DEAD r255:DI REG_DEAD r195:DI 282: [r241:DI]=r135:DI#0 REG_DEAD r135:DI 283: r226:DI=r241:DI+0x2 REG_DEAD r241:DI 284: r389:CCUNS=cmp(r264:DI,r194:DI) 285: pc={(leu(r389:CCUNS,0))?L512:pc} REG_DEAD r389:CCUNS REG_BR_PROB 29527908 286: NOTE_INSN_BASIC_BLOCK 42 287: r390:SI=zero_extend([r194:DI-0x2]) 288: r391:SI=r390:SI<<0x8 REG_DEAD r390:SI 289: r392:SI=zero_extend([r194:DI-0x1]) 290: r393:SI=r391:SI|r392:SI REG_DEAD r392:SI REG_DEAD r391:SI 291: r394:SI=r393:SI<<0x8 REG_DEAD r393:SI 293: r395:SI=zero_extend([r194:DI]) 294: r396:SI=r395:SI|r394:SI 296: r397:SI=r396:SI 0>>0x8 299: r400:SI=r396:SI<<0x2 301: r401:SI=r400:SI+r396:SI REG_EQUAL r396:SI*0x5 302: r402:SI=r397:SI-r401:SI REG_DEAD r401:SI REG_DEAD r397:SI 303: r403:SI=r402:SI&0xffff REG_DEAD r402:SI 304: r404:DI=zero_extend(r403:SI) REG_DEAD r403:SI 305: r405:DI=r404:DI<<0x3 REG_DEAD r404:DI 529: r450:DI=r453:DI REG_EQUAL sfp:DI+0x20 306: r406:DI=r450:DI+r405:DI REG_DEAD r450:DI REG_DEAD r405:DI 307: r407:DI=r194:DI-0x2 308: r408:DI=r407:DI-r273:DI REG_DEAD r407:DI 309: [r406:DI]=r408:DI REG_DEAD r408:DI REG_DEAD r406:DI 310: r409:SI=r396:SI<<0x8 312: r410:SI=zero_extend([r194:DI+0x1]) 313: r411:SI=r410:SI|r409:SI 314: r229:DI=zero_extend(r411:SI) REG_DEAD r411:SI 315: r412:SI=r411:SI 0>>0x8 318: r415:SI=r411:SI<<0x2 320: r416:SI=r415:SI+r411:SI REG_EQUAL r411:SI*0x5 321: r417:SI=r412:SI-r416:SI REG_DEAD r416:SI REG_DEAD r412:SI 322: r418:SI=r417:SI&0xffff REG_DEAD r417:SI 323: r419:DI=zero_extend(r418:SI) REG_DEAD r418:SI 324: r420:DI=r419:DI<<0x3 REG_DEAD r419:DI 325: r421:DI=r450:DI+r420:DI REG_DEAD r451:DI REG_DEAD r420:DI 326: r422:DI=r262:DI+r268:DI REG_DEAD r268:DI REG_DEAD r262:DI 327: r423:DI=r422:DI-r273:DI REG_DEAD r422:DI 328: 
[r421:DI]=r423:DI REG_DEAD r423:DI REG_DEAD r421:DI 14: r201:DI=0 ; pc falls through to BB 48 331: L331: 332: NOTE_INSN_BASIC_BLOCK 43 333: r424:CCUNS=cmp(r249:DI,r226:DI) 334: pc={(leu(r424:CCUNS,0))?L58:pc} REG_DEAD r424:CCUNS REG_BR_PROB 29527908 335: NOTE_INSN_BASIC_BLOCK 44 336: r425:SI=r201:DI#0+0x1 REG_DEAD r201:DI 337: r201:DI=sign_extend(r425:SI) REG_DEAD r425:SI 338: r194:DI=r262:DI+0x1 339: r227:DI=zero_extend([r262:DI]) REG_DEAD r262:DI 340: [r226:DI]=r227:DI#0 REG_DEAD r227:DI 341: r426:CC=cmp(r425:SI,0x20) 342: pc={(r426:CC==0)?L347:pc} REG_DEAD r426:CC REG_BR_PROB 365072228 343: NOTE_INSN_BASIC_BLOCK 45 344: r226:DI=r226:DI+0x1 ; pc falls through to BB 47 347: L347: 348: NOTE_INSN_BASIC_BLOCK 46 349: r427:QI=0x1f 350: [r226:DI-0x20]=r427:QI REG_DEAD r427:QI 351: r226:DI=r226:DI+0x2 13: r201:DI=0 352: L352: 353: NOTE_INSN_BASIC_BLOCK 47 354: r428:CCUNS=cmp(r194:DI,r264:DI) 355: pc={(geu(r428:CCUNS,0))?L361:pc} REG_DEAD r428:CCUNS REG_BR_PROB 30394580 356: L356: 357: NOTE_INSN_BASIC_BLOCK 48 7: r262:DI=r194:DI REG_DEAD r194:DI ; pc falls through to BB 7 439: L439: 438: NOTE_INSN_BASIC_BLOCK 49 16: r194:DI=r273:DI REG_DEAD r273:DI 17: r201:DI=0 ; pc falls through to BB 51 512: L512: 511: NOTE_INSN_BASIC_BLOCK 50 15: r201:DI=0 361: L361: 362: NOTE_INSN_BASIC_BLOCK 51 363: r429:DI=r226:DI+0x3 364: r430:CCUNS=cmp(r249:DI,r429:DI) REG_DEAD r429:DI REG_DEAD r249:DI 365: pc={(ltu(r430:CCUNS,0))?L58:pc} REG_DEAD r430:CCUNS REG_BR_PROB 365072228 366: NOTE_INSN_BASIC_BLOCK 52 367: r431:CCUNS=cmp(r194:DI,r248:DI) 368: pc={(geu(r431:CCUNS,0))?L401:pc} REG_DEAD r431:CCUNS REG_BR_PROB 118111604 369: NOTE_INSN_BASIC_BLOCK 53 370: r190:DI=r248:DI-r194:DI REG_DEAD r248:DI 371: r121:DI=r194:DI-0x1 REG_DEAD r194:DI 398: L398: 372: NOTE_INSN_BASIC_BLOCK 54 373: r432:SI=r201:DI#0+0x1 REG_DEAD r201:DI 374: r201:DI=sign_extend(r432:SI) REG_DEAD r432:SI 375: r121:DI=r121:DI+0x1 376: r230:DI=zero_extend([r121:DI]) 377: [r226:DI]=r230:DI#0 REG_DEAD r230:DI 378: 
r433:CC=cmp(r432:SI,0x20) 379: pc={(r433:CC==0)?L388:pc} REG_DEAD r433:CC REG_BR_PROB 365072228 380: NOTE_INSN_BASIC_BLOCK 55 381: r226:DI=r226:DI+0x1 382: r190:DI=r190:DI-0x1 383: r434:CC=cmp(r190:DI,0) 384: pc={(r434:CC!=0)?L396:pc} REG_DEAD r434:CC REG_BR_PROB 955630228 ; pc falls through to BB 60 388: L388: 389: NOTE_INSN_BASIC_BLOCK 56 390: r435:QI=0x1f 391: [r226:DI-0x20]=r435:QI REG_DEAD r435:QI 392: r226:DI=r226:DI+0x2 393: r190:DI=r190:DI-0x1 394: r436:CC=cmp(r190:DI,0) 395: pc={(r436:CC==0)?L515:pc} REG_DEAD r436:CC REG_BR_PROB 118111604 513: NOTE_INSN_BASIC_BLOCK 57 18: r201:DI=0 396: L396: 397: NOTE_INSN_BASIC_BLOCK 58 ; pc falls through to BB 54 515: L515: 514: NOTE_INSN_BASIC_BLOCK 59 19: r201:DI=0 401: L401: 402: NOTE_INSN_BASIC_BLOCK 60 403: r437:SI=-r201:DI#0 404: r438:DI=sign_extend(r437:SI) REG_DEAD r437:SI 405: r439:DI=r226:DI+r438:DI REG_DEAD r438:DI 407: r441:SI=r201:DI#0-0x1 408: [r439:DI-0x1]=r441:SI#0 REG_DEAD r441:SI REG_DEAD r439:DI 409: {r443:SI=r201:DI#0==0;clobber scratch;clobber scratch;} REG_DEAD r201:DI 410: r442:DI=zero_extend(r443:SI) REG_DEAD r443:SI 411: r444:DI=r226:DI-r442:DI REG_DEAD r442:DI REG_DEAD r226:DI 412: r445:DI=r444:DI-r275:DI REG_DEAD r444:DI REG_DEAD r275:DI 413: r272:DI=sign_extend(r445:DI#0) REG_DEAD r445:DI ; pc falls through to BB 78 468: L468: 467: NOTE_INSN_BASIC_BLOCK 61 30: r265:DI=0x9 ; pc falls through to BB 71 472: L472: 471: NOTE_INSN_BASIC_BLOCK 62 29: r265:DI=0xa ; pc falls through to BB 71 476: L476: 475: NOTE_INSN_BASIC_BLOCK 63 28: r265:DI=0xb ; pc falls through to BB 71 480: L480: 479: NOTE_INSN_BASIC_BLOCK 64 27: r265:DI=0xc ; pc falls through to BB 71 484: L484: 483: NOTE_INSN_BASIC_BLOCK 65 26: r265:DI=0xd ; pc falls through to BB 71 488: L488: 487: NOTE_INSN_BASIC_BLOCK 66 25: r265:DI=0xe ; pc falls through to BB 71 492: L492: 491: NOTE_INSN_BASIC_BLOCK 67 24: r265:DI=0xf ; pc falls through to BB 71 496: L496: 495: NOTE_INSN_BASIC_BLOCK 68 23: r265:DI=0x10 ; pc falls through to BB 71 500: 
L500: 499: NOTE_INSN_BASIC_BLOCK 69 22: r265:DI=0x11 ; pc falls through to BB 71 504: L504: 503: NOTE_INSN_BASIC_BLOCK 70 21: r265:DI=0x12 419: L419: 420: NOTE_INSN_BASIC_BLOCK 71 421: r447:SI=r265:DI#0-0x2 422: r256:DI=zero_extend(r447:SI) REG_DEAD r447:SI 423: r448:SI=r265:DI#0-0x1 424: r243:DI=zero_extend(r448:SI) REG_DEAD r448:SI 425: r194:DI=r262:DI+r265:DI REG_DEAD r265:DI ; pc falls through to BB 40 446: L446: 445: NOTE_INSN_BASIC_BLOCK 72 46: r139:DI=0x3 47: r268:DI=0x2 48: r267:DI=0x20 532: r452:DI=r262:DI+r139:DI ; pc falls through to BB 39 450: L450: 449: NOTE_INSN_BASIC_BLOCK 73 43: r139:DI=0x4 44: r268:DI=0x3 45: r267:DI=0x40 533: r452:DI=r262:DI+r139:DI ; pc falls through to BB 39 454: L454: 453: NOTE_INSN_BASIC_BLOCK 74 40: r139:DI=0x5 41: r268:DI=0x4 42: r267:DI=0x60 534: r452:DI=r262:DI+r139:DI ; pc falls through to BB 39 458: L458: 457: NOTE_INSN_BASIC_BLOCK 75 37: r139:DI=0x6 38: r268:DI=0x5 39: r267:DI=0x80 535: r452:DI=r262:DI+r139:DI ; pc falls through to BB 39 462: L462: 461: NOTE_INSN_BASIC_BLOCK 76 34: r139:DI=0x7 35: r268:DI=0x6 36: r267:DI=0xa0 536: r452:DI=r262:DI+r139:DI ; pc falls through to BB 39 466: L466: 465: NOTE_INSN_BASIC_BLOCK 77 31: r139:DI=0x8 32: r268:DI=0x7 33: r267:DI=0xc0 537: r452:DI=r262:DI+r139:DI ; pc falls through to BB 39 436: NOTE_INSN_BASIC_BLOCK 78 434: %3:DI=r272:DI REG_DEAD r272:DI 435: use %3:DI ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC] Run pass_sink_code once more after ivopts/fre 2021-04-20 9:23 ` Xionghu Luo @ 2021-04-21 11:54 ` Richard Biener 2021-05-14 7:10 ` Xionghu Luo 0 siblings, 1 reply; 20+ messages in thread From: Richard Biener @ 2021-04-21 11:54 UTC (permalink / raw) To: Xionghu Luo; +Cc: segher, wschmidt, linkw, gcc-patches, dje.gcc On Tue, 20 Apr 2021, Xionghu Luo wrote: > > > On 2021/4/15 19:34, Richard Biener wrote: > > On Thu, 15 Apr 2021, Xionghu Luo wrote: > > > >> Thanks, > >> > >> On 2021/4/14 14:41, Richard Biener wrote: > >>>> "#538,#235,#234,#233" will all be sunk from bb 35 to bb 37 by rtl-sink, > >>>> but it moves #538 first, then #235, there is strong dependency here. It > >>>> seemsdoesn't like the LCM framework that could solve all and do the > >>>> delete-insert in one iteration. > >>> So my question was whether we want to do both within the LCM store > >>> sinking framework. The LCM dataflow is also used by RTL PRE which > >>> handles both loads and non-loads so in principle it should be able > >>> to handle stores and non-stores for the sinking case (PRE on the > >>> reverse CFG). > >>> > >>> A global dataflow is more powerful than any local ad-hoc method. > >> > >> My biggest concern is whether the LCM DF framework could support sinking > >> *multiple* reverse-dependent non-store instructions together by *one* > >> calling of LCM DF. If this is not supported, we need run multiple LCM > >> until no new changes, it would be time consuming obviously (unless > >> compiling time is not important here). > > > > As said it is used for PRE and there it most definitely can do that. > > I did some investigation about PRE and attached a case to show how it > works, it is quite like store-motion, and actually there is a rtl-hoist > pass in gcse.c which only works for code size. All of them are > leveraging the LCM framework to move instructions upward or downward. 
> > PRE and rtl-hoist move instructions upward, they analyze/hash the SOURCE > exprs and call pre_edge_lcm, store-motion and rtl-sink move instructions > downward, so they analyze/hash the DEST exprs and call pre_edge_rev_lcm. > The four problems are all converted to the LCM DF problem with > n_basic_blocks * m_exprs of 4 matrix (antic, transp, avail, kill) as input > and two outputs of where to insert/delete. > > PRE scan each instruction and hash the SRC to table without *checking the > relationship between instructions*, for the case attached, BB 37, BB 38 > and BB 41 both contains SOURCE expr "r262:DI+r139:DI", but BB 37 and BB 41 > save it to index 106, BB 38 save it to index 110. After finishing this pass, > "r262:DI+r139:DI" BB41 is replaced with "r194:DI=r452:DI", then insert > expr to BB 75~BB 80 to create full redundancies from partial redundancies, > finally update instruction in BB 37. I'm not familiar with the actual PRE code but reading the toplevel comment it seems that indeed it can only handle expressions contained in a single insn unless a REG_EQUAL note provides a short-hand for the larger one. That of course means it would need to mark things as not transparent for correctness where they'd be if moved together. Now, nothing prevents you changing the granularity of what you feed LCM. So originally we arrived at looking into LCM because there's already a (store) sinking pass on RTL (using LCM) so adding another (loop-special) one didn't look like the best obvious solution. That said, LCM would work for single-instruction expressions. Alternatively a greedy algorithm like you prototyped could be used. Another pass to look at would be RTL invariant motion which seems to compute some kind of dependency graph - not sure if that would be adaptable for the reverse CFG problem. 
> Issues witnessed in the PRE run: > 1) "r262:DI+r139:DI" in BLOCK 38 is not replaced; > 2) PRE also use pseudo register to store the result like store-motion and > even rtl-hoist. Actually we need real "move" instead of "replace" for > rtl-sink to improve performance though also potential register pressure issue > like rtl-hoist? > 3) SRC instruction is scanned WITHOUT back chain check cross instructions > in hash_scan_set, it couldn't handle below case though a+c is same with b+d. > So I am afraid single run couldn't solve the instruction dependency issue > to sink multiple instructions out as before for rtl-sink. > > BB1: > a = b; > c = d; > s = a + c; > BB2: > s = b + d; > > > gcse.c: > changed = one_pre_gcse_pass () > alloc_hash_table (&expr_hash_table); > compute_hash_table (&expr_hash_table); > compute_hash_table_work (table); > FOR_EACH_BB_FN (current_bb, cfun) > FOR_BB_INSNS (current_bb, insn) > hash_scan_insn (insn, table); > hash_scan_set (pat, insn, table); > if REG_P (dest) > insert_expr_in_table (src, GET_MODE (dest), insn, antic_p, avail_p, > max_distance, table); > hash = hash_expr (x, mode, &do_not_record_p, table->size); > dump_hash_table (dump_file, "Expression", &expr_hash_table); > edge_list = compute_pre_data (); > compute_local_properties (transp, comp, antloc, &expr_hash_table); > edge_list = pre_edge_lcm (expr_hash_table.n_elems, transp, comp, antloc, > ae_kill, &pre_insert_map, &pre_delete_map); > changed |= pre_gcse (edge_list); > changed = pre_delete (); /* Create a pseudo-reg to store the result of > reaching expressions into. */ > did_insert = pre_edge_insert (edge_list, index_map); > > > > >> > >>> > >>> Richard. > >>> > >>>> However, there are still some common methods could be shared, like the > >>>> def-use check(though store-motion is per bb, rtl-sink is per loop), > >>>> insert_store, commit_edge_insertions etc. 
> >>>> > >>>> > >>>> 508: L508: > >>>> 507: NOTE_INSN_BASIC_BLOCK 34 > >>>> 12: r139:DI=r140:DI > >>>> REG_DEAD r140:DI > >>>> 240: L240: > >>>> 231: NOTE_INSN_BASIC_BLOCK 35 > >>>> 232: r142:DI=zero_extend(r139:DI#0) > >>>> 233: r371:SI=r142:DI#0-0x1 > >>>> 234: r243:DI=zero_extend(r371:SI) > >>>> REG_DEAD r371:SI > >>>> 235: r452:DI=r262:DI+r139:DI > >>>> 538: r194:DI=r452:DI > >>>> 236: r372:CCUNS=cmp(r142:DI#0,r254:DI#0) > >> > >> > >> Like here, Each instruction's dest reg is calculated in the input vector > >> bitmap, after solving the equations by calling pre_edge_rev_lcm, > >> move #538 out of loop for the first call, then move #235 out of loop > >> after a second call... 4 repeat calls needed in total here, is the LCM > >> framework smart enough to move the all 4 instruction within one iteration? > >> I am worried that the input vector bitmap couldn't solve the dependency > >> problem for two back chained instructions. > >> > >> > >> > > > > -- Richard Biener <rguenther@suse.de> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg) ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC] Run pass_sink_code once more after ivopts/fre 2021-04-21 11:54 ` Richard Biener @ 2021-05-14 7:10 ` Xionghu Luo 2021-05-17 8:11 ` Richard Biener 0 siblings, 1 reply; 20+ messages in thread From: Xionghu Luo @ 2021-05-14 7:10 UTC (permalink / raw) To: Richard Biener; +Cc: segher, wschmidt, linkw, gcc-patches, dje.gcc Hi Richi, On 2021/4/21 19:54, Richard Biener wrote: > On Tue, 20 Apr 2021, Xionghu Luo wrote: > >> >> >> On 2021/4/15 19:34, Richard Biener wrote: >>> On Thu, 15 Apr 2021, Xionghu Luo wrote: >>> >>>> Thanks, >>>> >>>> On 2021/4/14 14:41, Richard Biener wrote: >>>>>> "#538,#235,#234,#233" will all be sunk from bb 35 to bb 37 by rtl-sink, >>>>>> but it moves #538 first, then #235, there is strong dependency here. It >>>>>> seemsdoesn't like the LCM framework that could solve all and do the >>>>>> delete-insert in one iteration. >>>>> So my question was whether we want to do both within the LCM store >>>>> sinking framework. The LCM dataflow is also used by RTL PRE which >>>>> handles both loads and non-loads so in principle it should be able >>>>> to handle stores and non-stores for the sinking case (PRE on the >>>>> reverse CFG). >>>>> >>>>> A global dataflow is more powerful than any local ad-hoc method. >>>> >>>> My biggest concern is whether the LCM DF framework could support sinking >>>> *multiple* reverse-dependent non-store instructions together by *one* >>>> calling of LCM DF. If this is not supported, we need run multiple LCM >>>> until no new changes, it would be time consuming obviously (unless >>>> compiling time is not important here). >>> >>> As said it is used for PRE and there it most definitely can do that. >> >> I did some investigation about PRE and attached a case to show how it >> works, it is quite like store-motion, and actually there is a rtl-hoist >> pass in gcse.c which only works for code size. All of them are >> leveraging the LCM framework to move instructions upward or downward. 
>> >> PRE and rtl-hoist move instructions upward, they analyze/hash the SOURCE >> exprs and call pre_edge_lcm, store-motion and rtl-sink move instructions >> downward, so they analyze/hash the DEST exprs and call pre_edge_rev_lcm. >> The four problems are all converted to the LCM DF problem with >> n_basic_blocks * m_exprs of 4 matrix (antic, transp, avail, kill) as input >> and two outputs of where to insert/delete. >> >> PRE scan each instruction and hash the SRC to table without *checking the >> relationship between instructions*, for the case attached, BB 37, BB 38 >> and BB 41 both contains SOURCE expr "r262:DI+r139:DI", but BB 37 and BB 41 >> save it to index 106, BB 38 save it to index 110. After finishing this pass, >> "r262:DI+r139:DI" BB41 is replaced with "r194:DI=r452:DI", then insert >> expr to BB 75~BB 80 to create full redundancies from partial redundancies, >> finally update instruction in BB 37. > > I'm not familiar with the actual PRE code but reading the toplevel comment > it seems that indeed it can only handle expressions contained in a single > insn unless a REG_EQUAL note provides a short-hand for the larger one. > > That of course means it would need to mark things as not transparent > for correctness where they'd be if moved together. Now, nothing > prevents you changing the granularity of what you feed LCM. > > So originally we arrived at looking into LCM because there's already > a (store) sinking pass on RTL (using LCM) so adding another (loop-special) > one didn't look like the best obvious solution. > > That said, LCM would work for single-instruction expressions. > Alternatively a greedy algorithm like you prototyped could be used. > Another pass to look at would be RTL invariant motion which seems to > compute some kind of dependency graph - not sure if that would be > adaptable for the reverse CFG problem. 
> Actually my RTL sinking pass patch is borrowed from RTL loop invariant motion; it is quite limited since it only moves instructions from the loop header to the loop exits, though it could be refined with various algorithms. Compared to the initial method of running the gimple sink pass once more, it seems much more complicated and limited without gaining an obvious performance benefit. Shall we go back to considering a gimple sink2 pass as originally proposed, since we are in stage1 now? Thanks, Xionghu ^ permalink raw reply [flat|nested] 20+ messages in thread
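To make the chained-dependency point in the thread concrete, here is a stand-alone toy model (hypothetical representation and names, not GCC code) of a greedy per-sweep sinker like the prototyped RTL pass: with insns visited top-down, a def-use chain such as #235 -> #538 from the dump (r452 = r262 + r139; r194 = r452) needs one sweep per link of the chain before everything leaves the loop.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model of loop-sinking with def-use dependencies (hypothetical,
// not GCC's IR): each insn defines one pseudo and reads at most one.
struct insn
{
  std::string def;   // pseudo defined by this insn
  std::string use;   // pseudo read by this insn ("" if none)
  bool in_loop;      // still inside the loop body?
};

// Does any insn remaining inside the loop still read DEF?
static bool used_in_loop (const std::vector<insn> &insns,
                          const std::string &def)
{
  for (const insn &i : insns)
    if (i.in_loop && i.use == def)
      return true;
  return false;
}

// One greedy top-down sweep: sink every in-loop insn whose result is
// no longer needed inside the loop.  Returns how many insns moved.
static int sink_once (std::vector<insn> &insns)
{
  int moved = 0;
  for (insn &i : insns)
    if (i.in_loop && !used_in_loop (insns, i.def))
      {
        i.in_loop = false;
        ++moved;
      }
  return moved;
}

// Iterate sweeps to a fixed point; returns the number of sweeps taken.
int sink_to_fixpoint (std::vector<insn> &insns)
{
  int sweeps = 0;
  while (sink_once (insns) > 0)
    ++sweeps;
  return sweeps;
}
```

Visiting the insns in reverse order within a sweep, or using a global dataflow such as LCM, would handle the whole chain in one pass; that iteration-count trade-off is exactly what the thread is weighing.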
* Re: [RFC] Run pass_sink_code once more after ivopts/fre 2021-05-14 7:10 ` Xionghu Luo @ 2021-05-17 8:11 ` Richard Biener 2021-05-18 5:20 ` [PATCH] Run pass_sink_code once more before store_mergin Xionghu Luo 0 siblings, 1 reply; 20+ messages in thread From: Richard Biener @ 2021-05-17 8:11 UTC (permalink / raw) To: Xionghu Luo; +Cc: segher, wschmidt, linkw, gcc-patches, dje.gcc On Fri, 14 May 2021, Xionghu Luo wrote: > Hi Richi, > > On 2021/4/21 19:54, Richard Biener wrote: > > On Tue, 20 Apr 2021, Xionghu Luo wrote: > > > >> > >> > >> On 2021/4/15 19:34, Richard Biener wrote: > >>> On Thu, 15 Apr 2021, Xionghu Luo wrote: > >>> > >>>> Thanks, > >>>> > >>>> On 2021/4/14 14:41, Richard Biener wrote: > >>>>>> "#538,#235,#234,#233" will all be sunk from bb 35 to bb 37 by rtl-sink, > >>>>>> but it moves #538 first, then #235, there is strong dependency here. It > >>>>>> seemsdoesn't like the LCM framework that could solve all and do the > >>>>>> delete-insert in one iteration. > >>>>> So my question was whether we want to do both within the LCM store > >>>>> sinking framework. The LCM dataflow is also used by RTL PRE which > >>>>> handles both loads and non-loads so in principle it should be able > >>>>> to handle stores and non-stores for the sinking case (PRE on the > >>>>> reverse CFG). > >>>>> > >>>>> A global dataflow is more powerful than any local ad-hoc method. > >>>> > >>>> My biggest concern is whether the LCM DF framework could support sinking > >>>> *multiple* reverse-dependent non-store instructions together by *one* > >>>> calling of LCM DF. If this is not supported, we need run multiple LCM > >>>> until no new changes, it would be time consuming obviously (unless > >>>> compiling time is not important here). > >>> > >>> As said it is used for PRE and there it most definitely can do that. 
> >> > >> I did some investigation about PRE and attached a case to show how it > >> works, it is quite like store-motion, and actually there is a rtl-hoist > >> pass in gcse.c which only works for code size. All of them are > >> leveraging the LCM framework to move instructions upward or downward. > >> > >> PRE and rtl-hoist move instructions upward, they analyze/hash the SOURCE > >> exprs and call pre_edge_lcm, store-motion and rtl-sink move instructions > >> downward, so they analyze/hash the DEST exprs and call pre_edge_rev_lcm. > >> The four problems are all converted to the LCM DF problem with > >> n_basic_blocks * m_exprs of 4 matrix (antic, transp, avail, kill) as input > >> and two outputs of where to insert/delete. > >> > >> PRE scan each instruction and hash the SRC to table without *checking the > >> relationship between instructions*, for the case attached, BB 37, BB 38 > >> and BB 41 both contains SOURCE expr "r262:DI+r139:DI", but BB 37 and BB 41 > >> save it to index 106, BB 38 save it to index 110. After finishing this pass, > >> "r262:DI+r139:DI" BB41 is replaced with "r194:DI=r452:DI", then insert > >> expr to BB 75~BB 80 to create full redundancies from partial redundancies, > >> finally update instruction in BB 37. > > > > I'm not familiar with the actual PRE code but reading the toplevel comment > > it seems that indeed it can only handle expressions contained in a single > > insn unless a REG_EQUAL note provides a short-hand for the larger one. > > > > That of course means it would need to mark things as not transparent > > for correctness where they'd be if moved together. Now, nothing > > prevents you changing the granularity of what you feed LCM. > > > > So originally we arrived at looking into LCM because there's already > > a (store) sinking pass on RTL (using LCM) so adding another (loop-special) > > one didn't look like the best obvious solution. > > > > That said, LCM would work for single-instruction expressions. 
> > Alternatively a greedy algorithm like you prototyped could be used. > > Another pass to look at would be RTL invariant motion which seems to > > compute some kind of dependency graph - not sure if that would be > > adaptable for the reverse CFG problem. > > > > Actually my RTL sinking pass patch is borrowed from RTL loop invariant > motion, it is quite limited since only moves instructions from loop header > to loop exits, though it could be refined with various of algorithms. > Compared to the initial method of running gimple sink pass once more, > it seems much more complicated and limited without gaining obvious performance > benefit, shall we turn back to consider gimple sink2 pass from original since > we are in stage1 now? OK, so while there might be new sinking opportunities exposed during RTL expansion and early RTL opts we can consider adding another sink pass on GIMPLE. Since it's basically a scheduling optimization placement shouldn't matter much but I suppose we should run it before store merging, so anywhere between cd_dce and that. Richard. ^ permalink raw reply [flat|nested] 20+ messages in thread
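Mechanically, listing the same pass a second time in passes.def requires the pass to implement the clone () hook; a pass whose default clone () is used in that situation makes GCC report an internal error. The sketch below is a simplified stand-alone mock of that protocol: the class names mirror gcc/tree-pass.h, but these definitions are stand-ins for illustration, not the real GCC classes.

```cpp
#include <cassert>

// Simplified stand-ins for GCC's pass infrastructure (cf. gcc/tree-pass.h).
namespace gcc { struct context {}; }

struct opt_pass
{
  gcc::context *m_ctxt;
  explicit opt_pass (gcc::context *ctxt) : m_ctxt (ctxt) {}
  virtual ~opt_pass () {}
  // Default: the pass does not support cloning, so it may appear only
  // once in the pipeline (real GCC raises an internal error here).
  virtual opt_pass *clone () { return nullptr; }
};

struct pass_sink_code : opt_pass
{
  explicit pass_sink_code (gcc::context *ctxt) : opt_pass (ctxt) {}
  // The kind of one-line addition needed in tree-ssa-sink.c: allow a
  // second instance of the pass, sharing the same compiler context.
  opt_pass *clone () override { return new pass_sink_code (m_ctxt); }
};
```

With this hook in place, adding `NEXT_PASS (pass_sink_code)` a second time in passes.def simply instantiates another copy of the pass.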
* Re: [PATCH] Run pass_sink_code once more before store_mergin 2021-05-17 8:11 ` Richard Biener @ 2021-05-18 5:20 ` Xionghu Luo 2021-05-18 7:02 ` Richard Biener 2021-09-02 15:45 ` Martin Jambor 0 siblings, 2 replies; 20+ messages in thread From: Xionghu Luo @ 2021-05-18 5:20 UTC (permalink / raw) To: Richard Biener; +Cc: segher, wschmidt, linkw, gcc-patches, dje.gcc [-- Attachment #1: Type: text/plain, Size: 4731 bytes --] Hi, On 2021/5/17 16:11, Richard Biener wrote: > On Fri, 14 May 2021, Xionghu Luo wrote: > >> Hi Richi, >> >> On 2021/4/21 19:54, Richard Biener wrote: >>> On Tue, 20 Apr 2021, Xionghu Luo wrote: >>> >>>> >>>> >>>> On 2021/4/15 19:34, Richard Biener wrote: >>>>> On Thu, 15 Apr 2021, Xionghu Luo wrote: >>>>> >>>>>> Thanks, >>>>>> >>>>>> On 2021/4/14 14:41, Richard Biener wrote: >>>>>>>> "#538,#235,#234,#233" will all be sunk from bb 35 to bb 37 by rtl-sink, >>>>>>>> but it moves #538 first, then #235, there is strong dependency here. It >>>>>>>> seemsdoesn't like the LCM framework that could solve all and do the >>>>>>>> delete-insert in one iteration. >>>>>>> So my question was whether we want to do both within the LCM store >>>>>>> sinking framework. The LCM dataflow is also used by RTL PRE which >>>>>>> handles both loads and non-loads so in principle it should be able >>>>>>> to handle stores and non-stores for the sinking case (PRE on the >>>>>>> reverse CFG). >>>>>>> >>>>>>> A global dataflow is more powerful than any local ad-hoc method. >>>>>> >>>>>> My biggest concern is whether the LCM DF framework could support sinking >>>>>> *multiple* reverse-dependent non-store instructions together by *one* >>>>>> calling of LCM DF. If this is not supported, we need run multiple LCM >>>>>> until no new changes, it would be time consuming obviously (unless >>>>>> compiling time is not important here). >>>>> >>>>> As said it is used for PRE and there it most definitely can do that. 
>>>> >>>> I did some investigation about PRE and attached a case to show how it >>>> works, it is quite like store-motion, and actually there is a rtl-hoist >>>> pass in gcse.c which only works for code size. All of them are >>>> leveraging the LCM framework to move instructions upward or downward. >>>> >>>> PRE and rtl-hoist move instructions upward, they analyze/hash the SOURCE >>>> exprs and call pre_edge_lcm, store-motion and rtl-sink move instructions >>>> downward, so they analyze/hash the DEST exprs and call pre_edge_rev_lcm. >>>> The four problems are all converted to the LCM DF problem with >>>> n_basic_blocks * m_exprs of 4 matrix (antic, transp, avail, kill) as input >>>> and two outputs of where to insert/delete. >>>> >>>> PRE scan each instruction and hash the SRC to table without *checking the >>>> relationship between instructions*, for the case attached, BB 37, BB 38 >>>> and BB 41 both contains SOURCE expr "r262:DI+r139:DI", but BB 37 and BB 41 >>>> save it to index 106, BB 38 save it to index 110. After finishing this pass, >>>> "r262:DI+r139:DI" BB41 is replaced with "r194:DI=r452:DI", then insert >>>> expr to BB 75~BB 80 to create full redundancies from partial redundancies, >>>> finally update instruction in BB 37. >>> >>> I'm not familiar with the actual PRE code but reading the toplevel comment >>> it seems that indeed it can only handle expressions contained in a single >>> insn unless a REG_EQUAL note provides a short-hand for the larger one. >>> >>> That of course means it would need to mark things as not transparent >>> for correctness where they'd be if moved together. Now, nothing >>> prevents you changing the granularity of what you feed LCM. >>> >>> So originally we arrived at looking into LCM because there's already >>> a (store) sinking pass on RTL (using LCM) so adding another (loop-special) >>> one didn't look like the best obvious solution. >>> >>> That said, LCM would work for single-instruction expressions. 
>>> Alternatively a greedy algorithm like you prototyped could be used. >>> Another pass to look at would be RTL invariant motion which seems to >>> compute some kind of dependency graph - not sure if that would be >>> adaptable for the reverse CFG problem. >>> >> >> Actually my RTL sinking pass patch is borrowed from RTL loop invariant >> motion, it is quite limited since only moves instructions from loop header >> to loop exits, though it could be refined with various of algorithms. >> Compared to the initial method of running gimple sink pass once more, >> it seems much more complicated and limited without gaining obvious performance >> benefit, shall we turn back to consider gimple sink2 pass from original since >> we are in stage1 now? > > OK, so while there might be new sinking opportunities exposed during > RTL expansion and early RTL opts we can consider adding another sink pass > on GIMPLE. Since it's basically a scheduling optimization placement > shouldn't matter much but I suppose we should run it before store > merging, so anywhere between cd_dce and that. > > Richard. > Attached is the patch as discussed; putting it before store_merging is fine. Regression testing passed on P8LE. OK for trunk? :) Thanks, Xionghu [-- Attachment #2: 0001-Run-pass_sink_code-once-more-before-store_merging.patch --] [-- Type: text/plain, Size: 14563 bytes --] From 7fcc6ca9ef3b6acbfbcbd3da4be1d1c0eef4be80 Mon Sep 17 00:00:00 2001 From: Xiong Hu Luo <luoxhu@linux.ibm.com> Date: Mon, 17 May 2021 20:46:15 -0500 Subject: [PATCH] Run pass_sink_code once more before store_merging The gimple sink code pass runs quite early, and some new opportunities may be exposed by later gimple optimization passes, so this patch runs the sink code pass once more before store_merging.
For detailed discussion, please refer to: https://gcc.gnu.org/pipermail/gcc-patches/2020-December/562352.html Tested the SPEC2017 performance on P8LE: 544.nab_r is improved by 2.43%, with no big changes to the other cases; GEOMEAN is improved only slightly, by 0.25%. gcc/ChangeLog: * passes.def: Add sink_code before store_merging. * tree-ssa-sink.c (pass_sink_code::clone): New. gcc/testsuite/ChangeLog: * gcc.dg/tree-ssa/ssa-sink-1.c: Adjust. * gcc.dg/tree-ssa/ssa-sink-2.c: Ditto. * gcc.dg/tree-ssa/ssa-sink-3.c: Ditto. * gcc.dg/tree-ssa/ssa-sink-4.c: Ditto. * gcc.dg/tree-ssa/ssa-sink-5.c: Ditto. * gcc.dg/tree-ssa/ssa-sink-6.c: Ditto. * gcc.dg/tree-ssa/ssa-sink-7.c: Ditto. * gcc.dg/tree-ssa/ssa-sink-8.c: Ditto. * gcc.dg/tree-ssa/ssa-sink-9.c: Ditto. * gcc.dg/tree-ssa/ssa-sink-10.c: Ditto. * gcc.dg/tree-ssa/ssa-sink-13.c: Ditto. * gcc.dg/tree-ssa/ssa-sink-14.c: Ditto. * gcc.dg/tree-ssa/ssa-sink-16.c: Ditto. * gcc.dg/tree-ssa/ssa-sink-17.c: Ditto. * gcc.dg/tree-ssa/ssa-sink-18.c: New. --- gcc/passes.def | 1 + gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-1.c | 2 +- gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-10.c | 2 +- gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-13.c | 4 +- gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-14.c | 4 +- gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-16.c | 2 +- gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-17.c | 2 +- gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-18.c | 196 ++++++++++++++++++++ gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-2.c | 2 +- gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-3.c | 2 +- gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-4.c | 2 +- gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-5.c | 2 +- gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-6.c | 2 +- gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-7.c | 2 +- gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-8.c | 2 +- gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-9.c | 2 +- gcc/tree-ssa-sink.c | 1 + 17 files changed, 214 insertions(+), 16 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-18.c diff --git a/gcc/passes.def index
de39fa48b3d..945d2bc797c 100644 --- a/gcc/passes.def +++ b/gcc/passes.def @@ -348,6 +348,7 @@ along with GCC; see the file COPYING3. If not see NEXT_PASS (pass_phiopt, false /* early_p */); NEXT_PASS (pass_fold_builtins); NEXT_PASS (pass_optimize_widening_mul); + NEXT_PASS (pass_sink_code); NEXT_PASS (pass_store_merging); NEXT_PASS (pass_tail_calls); /* If DCE is not run before checking for uninitialized uses, diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-1.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-1.c index 411585a6dc4..57b501681f3 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-1.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-1.c @@ -7,4 +7,4 @@ foo (int a, int b, int c) return c ? x : a; } /* We should sink the x = a * b calculation into the branch that returns x. */ -/* { dg-final { scan-tree-dump-times "Sunk statements: 1" 1 "sink" } } */ +/* { dg-final { scan-tree-dump-times "Sunk statements: 1" 1 "sink1" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-10.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-10.c index 37e4d2fe687..535cb3208f5 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-10.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-10.c @@ -16,4 +16,4 @@ void foo (void) } } -/* { dg-final { scan-tree-dump-times "Sinking # VUSE" 4 "sink" } } */ +/* { dg-final { scan-tree-dump-times "Sinking # VUSE" 4 "sink1" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-13.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-13.c index a65ba35d4ba..584fd91f43a 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-13.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-13.c @@ -21,5 +21,5 @@ void test () /* We should sink/merge all stores and end up with a single BB. 
*/ -/* { dg-final { scan-tree-dump-times "MEM\[^\n\r\]* = 0;" 3 "sink" } } */ -/* { dg-final { scan-tree-dump-times "<bb " 1 "sink" } } */ +/* { dg-final { scan-tree-dump-times "MEM\[^\n\r\]* = 0;" 3 "sink1" } } */ +/* { dg-final { scan-tree-dump-times "<bb " 1 "sink1" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-14.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-14.c index 771cd4420c4..f5418b06deb 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-14.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-14.c @@ -13,5 +13,5 @@ void foo (int b) /* We should have sunk the store and inserted a PHI to merge the stored values. */ -/* { dg-final { scan-tree-dump-times " = PHI" 1 "sink" } } */ -/* { dg-final { scan-tree-dump-times "x = " 1 "sink" } } */ +/* { dg-final { scan-tree-dump-times " = PHI" 1 "sink1" } } */ +/* { dg-final { scan-tree-dump-times "x = " 1 "sink1" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-16.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-16.c index 610c8d60ebe..012b165fbab 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-16.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-16.c @@ -10,5 +10,5 @@ int f(int n) return j; } -/* { dg-final { scan-tree-dump "Sinking j_. = __builtin_ffs" "sink" } } */ +/* { dg-final { scan-tree-dump "Sinking j_. = __builtin_ffs" "sink1" } } */ /* { dg-final { scan-tree-dump "return 2;" "optimized" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-17.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-17.c index cf2e2a0f766..d0aeeb312cc 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-17.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-17.c @@ -12,4 +12,4 @@ int my_f(int a, int b) } /* We should sink the call to pure_f to the if block. 
*/ -/* { dg-final { scan-tree-dump "Sinking # VUSE" "sink" } } */ +/* { dg-final { scan-tree-dump "Sinking # VUSE" "sink1" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-18.c new file mode 100644 index 00000000000..b2f5ca9c847 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-18.c @@ -0,0 +1,196 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-sink-stats" } */ + +#include <stdint.h> + +#define HLOG 16 +#define MAX_LIT (1 << 5) +typedef const uint8_t *LZF_HSLOT; +typedef LZF_HSLOT LZF_STATE[1 << (HLOG)]; + +int +compute_on_bytes (uint8_t *in_data, int in_len, uint8_t *out_data, int out_len) +{ + LZF_STATE htab; + + uint8_t *ip = in_data; + uint8_t *op = out_data; + uint8_t *in_end = ip + in_len; + uint8_t *out_end = op + out_len; + uint8_t *ref; + + unsigned long off; + unsigned int hval; + int lit; + + if (!in_len || !out_len) + return 0; + + lit = 0; + op++; + hval = (((ip[0]) << 8) | ip[1]); + + while (ip < in_end - 2) + { + uint8_t *hslot; + + hval = (((hval) << 8) | ip[2]); + hslot = (uint8_t*)(htab + (((hval >> (3 * 8 - 16)) - hval * 5) & ((1 << (16)) - 1))); + + ref = *hslot + in_data; + *hslot = ip - in_data; + + if (1 && (off = ip - ref - 1) < (1 << 13) && ref > in_data + && ref[2] == ip[2] + && ((ref[1] << 8) | ref[0]) == ((ip[1] << 8) | ip[0])) + { + unsigned int len = 2; + unsigned int maxlen = in_end - ip - len; + maxlen + = maxlen > ((1 << 8) + (1 << 3)) ? 
((1 << 8) + (1 << 3)) : maxlen; + + if ((op + 3 + 1 >= out_end) != 0) + if (op - !lit + 3 + 1 >= out_end) + return 0; + + op[-lit - 1] = lit - 1; + op -= !lit; + + for (;;) + { + if (maxlen > 16) + { + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + len++; + if (ref[len] != ip[len]) + break; + } + + do + { + len++; + } + while (len < maxlen && ip[len] == ref[len]); + + break; + } + + len -= 2; + ip++; + + if (len < 7) + { + *op++ = (off >> 8) + (len << 5); + } + else + { + *op++ = (off >> 8) + (7 << 5); + *op++ = len - 7; + } + *op++ = off; + lit = 0; + op++; + ip += len + 1; + + if (ip >= in_end - 2) + break; + + --ip; + --ip; + + hval = (((ip[0]) << 8) | ip[1]); + hval = (((hval) << 8) | ip[2]); + htab[(((hval >> (3 * 8 - 16)) - hval * 5) & ((1 << (16)) - 1))] + = (LZF_HSLOT)(ip - in_data); + ip++; + + hval = (((hval) << 8) | ip[2]); + htab[(((hval >> (3 * 8 - 16)) - hval * 5) & ((1 << (16)) - 1))] + = (LZF_HSLOT)(ip - in_data); + ip++; + } + else + { + if (op >= out_end) + return 0; + + lit++; + *op++ = *ip++; + + if (lit == (1 << 5)) + { + op[-lit - 1] = lit - 1; + lit = 0; + op++; + } + } + } + if (op + 3 > out_end) /* at most 3 bytes can be missing here */ + return 0; + + while (ip < in_end) + { + lit++; + *op++ = *ip++; + if (lit == MAX_LIT) + { + op[-lit - 1] = lit - 1; /* stop run */ + lit = 0; + op++; /* start run */ + } + } + + 
op[-lit - 1] = lit - 1; /* end run */ + op -= !lit; /* undo run if length is zero */ + + return op - out_data; + } + +/* { dg-final { scan-tree-dump-times "Sunk statements: 4" 1 "sink2" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-2.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-2.c index 6aa5a182a3a..a0b4734b1e0 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-2.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-2.c @@ -9,4 +9,4 @@ bar (int a, int b, int c) return y; } /* We should sink the x = a * b calculation into the else branch */ -/* { dg-final { scan-tree-dump-times "Sunk statements: 1" 1 "sink" } } */ +/* { dg-final { scan-tree-dump-times "Sunk statements: 1" 1 "sink1" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-3.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-3.c index 599997e0e6b..ad88ccc4f5b 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-3.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-3.c @@ -12,4 +12,4 @@ main (int argc) } } /* We should sink the a = argc + 1 calculation into the if branch */ -/* { dg-final { scan-tree-dump-times "Sunk statements: 1" 1 "sink" } } */ +/* { dg-final { scan-tree-dump-times "Sunk statements: 1" 1 "sink1" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-4.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-4.c index 784edd2fc87..1e3cfa93fa8 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-4.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-4.c @@ -17,4 +17,4 @@ main (int argc) foo2 (a); } /* We should sink the first a = b + c calculation into the else branch */ -/* { dg-final { scan-tree-dump-times "Sunk statements: 1" 1 "sink" } } */ +/* { dg-final { scan-tree-dump-times "Sunk statements: 1" 1 "sink1" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-5.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-5.c index dbdde39add6..f04da5da9b0 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-5.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-5.c @@ -44,4 +44,4 @@ void foo(int16_t runs[], uint8_t 
alpha[], int x, int count) } /* We should not sink the next_runs = runs + x calculation after the loop. */ -/* { dg-final { scan-tree-dump-times "Sunk statements:" 0 "sink" } } */ +/* { dg-final { scan-tree-dump-times "Sunk statements:" 0 "sink1" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-6.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-6.c index 1abae9f7943..31f5af330f9 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-6.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-6.c @@ -14,4 +14,4 @@ int foo(int *a, int r) /* *a = 1 should be sunk to the else block. */ -/* { dg-final { scan-tree-dump-times "Sinking" 1 "sink" } } */ +/* { dg-final { scan-tree-dump-times "Sinking" 1 "sink1" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-7.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-7.c index ec3288f4e69..bd748442edc 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-7.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-7.c @@ -15,4 +15,4 @@ int foo(int *a, int r, short *b) /* *a = 1 should be sunk to the else block. */ -/* { dg-final { scan-tree-dump-times "Sinking" 1 "sink" } } */ +/* { dg-final { scan-tree-dump-times "Sinking" 1 "sink1" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-8.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-8.c index 48af4218fc0..4b23b567fd0 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-8.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-8.c @@ -24,4 +24,4 @@ int foo(int *a, int r, short *b) /* *a = 1 should be sunk into the default case. */ -/* { dg-final { scan-tree-dump-times "Sinking" 1 "sink" } } */ +/* { dg-final { scan-tree-dump-times "Sinking" 1 "sink1" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-9.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-9.c index 509a76330a4..32bfc81741a 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-9.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-9.c @@ -15,4 +15,4 @@ int foo(int *a, int r, int *b) /* *a = 1 should be sunk to the else block. 
*/ -/* { dg-final { scan-tree-dump-times "Sinking" 1 "sink" } } */ +/* { dg-final { scan-tree-dump-times "Sinking" 1 "sink1" } } */ diff --git a/gcc/tree-ssa-sink.c b/gcc/tree-ssa-sink.c index d33e56ef915..d252cbb5c51 100644 --- a/gcc/tree-ssa-sink.c +++ b/gcc/tree-ssa-sink.c @@ -819,6 +819,7 @@ public: /* opt_pass methods: */ virtual bool gate (function *) { return flag_tree_sink != 0; } virtual unsigned int execute (function *); + opt_pass *clone (void) { return new pass_sink_code (m_ctxt); } }; // class pass_sink_code -- 2.27.0.90.geebb51ba8c ^ permalink raw reply [flat|nested] 20+ messages in thread
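For readers reconstructing what the new gcc.dg/tree-ssa/ssa-sink-18.c testcase exercises: the hot inner loop computes values on every iteration whose results are only consumed once the loop exits, and those are the statements sink2 moves to the exit blocks. A reduced sketch of that pattern (the function name and shape are illustrative, not taken from the patch):

```c
#include <stdint.h>
#include <stddef.h>

/* Reduced sketch of the ssa-sink-18.c pattern: 'end' is recomputed on
   every iteration, but only its final value is live after the loop, so
   a sinking pass may legally move that address computation to the loop
   exit, like the "Sinking _419 = (uint8_t *) _420;" lines quoted in
   the dumps above.  */
size_t match_len (const uint8_t *ip, const uint8_t *ref, size_t maxlen)
{
  size_t len = 0;
  const uint8_t *end = ip;       /* value only used after the loop */
  while (len < maxlen && ip[len] == ref[len])
    {
      len++;
      end = ip + len;            /* sinkable: no use inside the loop */
    }
  return (size_t) (end - ip);
}
```

Compiling something like this with -O2 and inspecting the sink dump, one would expect the 'end = ip + len' computation to be reported as sunk out of the loop body.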
* Re: [PATCH] Run pass_sink_code once more before store_mergin 2021-05-18 5:20 ` [PATCH] Run pass_sink_code once more before store_mergin Xionghu Luo @ 2021-05-18 7:02 ` Richard Biener 2021-05-18 9:00 ` Xionghu Luo 2021-09-02 15:45 ` Martin Jambor 1 sibling, 1 reply; 20+ messages in thread From: Richard Biener @ 2021-05-18 7:02 UTC (permalink / raw) To: Xionghu Luo; +Cc: segher, wschmidt, linkw, gcc-patches, dje.gcc On Tue, 18 May 2021, Xionghu Luo wrote: > Hi, > > On 2021/5/17 16:11, Richard Biener wrote: > > On Fri, 14 May 2021, Xionghu Luo wrote: > > > >> Hi Richi, > >> > >> On 2021/4/21 19:54, Richard Biener wrote: > >>> On Tue, 20 Apr 2021, Xionghu Luo wrote: > >>> > >>>> > >>>> > >>>> On 2021/4/15 19:34, Richard Biener wrote: > >>>>> On Thu, 15 Apr 2021, Xionghu Luo wrote: > >>>>> > >>>>>> Thanks, > >>>>>> > >>>>>> On 2021/4/14 14:41, Richard Biener wrote: > >>>>>>>> "#538,#235,#234,#233" will all be sunk from bb 35 to bb 37 by > >>>>>>>> rtl-sink, > >>>>>>>> but it moves #538 first, then #235, there is strong dependency here. > >>>>>>>> It > >>>>>>>> seemsdoesn't like the LCM framework that could solve all and do the > >>>>>>>> delete-insert in one iteration. > >>>>>>> So my question was whether we want to do both within the LCM store > >>>>>>> sinking framework. The LCM dataflow is also used by RTL PRE which > >>>>>>> handles both loads and non-loads so in principle it should be able > >>>>>>> to handle stores and non-stores for the sinking case (PRE on the > >>>>>>> reverse CFG). > >>>>>>> > >>>>>>> A global dataflow is more powerful than any local ad-hoc method. > >>>>>> > >>>>>> My biggest concern is whether the LCM DF framework could support > >>>>>> sinking > >>>>>> *multiple* reverse-dependent non-store instructions together by *one* > >>>>>> calling of LCM DF. If this is not supported, we need run multiple LCM > >>>>>> until no new changes, it would be time consuming obviously (unless > >>>>>> compiling time is not important here). 
> >>>>> > >>>>> As said it is used for PRE and there it most definitely can do that. > >>>> > >>>> I did some investigation about PRE and attached a case to show how it > >>>> works, it is quite like store-motion, and actually there is a rtl-hoist > >>>> pass in gcse.c which only works for code size. All of them are > >>>> leveraging the LCM framework to move instructions upward or downward. > >>>> > >>>> PRE and rtl-hoist move instructions upward, they analyze/hash the SOURCE > >>>> exprs and call pre_edge_lcm, store-motion and rtl-sink move instructions > >>>> downward, so they analyze/hash the DEST exprs and call pre_edge_rev_lcm. > >>>> The four problems are all converted to the LCM DF problem with > >>>> n_basic_blocks * m_exprs of 4 matrix (antic, transp, avail, kill) as > >>>> input > >>>> and two outputs of where to insert/delete. > >>>> > >>>> PRE scan each instruction and hash the SRC to table without *checking the > >>>> relationship between instructions*, for the case attached, BB 37, BB 38 > >>>> and BB 41 both contains SOURCE expr "r262:DI+r139:DI", but BB 37 and BB > >>>> 41 > >>>> save it to index 106, BB 38 save it to index 110. After finishing this > >>>> pass, > >>>> "r262:DI+r139:DI" BB41 is replaced with "r194:DI=r452:DI", then insert > >>>> expr to BB 75~BB 80 to create full redundancies from partial > >>>> redundancies, > >>>> finally update instruction in BB 37. > >>> > >>> I'm not familiar with the actual PRE code but reading the toplevel comment > >>> it seems that indeed it can only handle expressions contained in a single > >>> insn unless a REG_EQUAL note provides a short-hand for the larger one. > >>> > >>> That of course means it would need to mark things as not transparent > >>> for correctness where they'd be if moved together. Now, nothing > >>> prevents you changing the granularity of what you feed LCM. 
> >>> > >>> So originally we arrived at looking into LCM because there's already > >>> a (store) sinking pass on RTL (using LCM) so adding another (loop-special) > >>> one didn't look like the best obvious solution. > >>> > >>> That said, LCM would work for single-instruction expressions. > >>> Alternatively a greedy algorithm like you prototyped could be used. > >>> Another pass to look at would be RTL invariant motion which seems to > >>> compute some kind of dependency graph - not sure if that would be > >>> adaptable for the reverse CFG problem. > >>> > >> > >> Actually my RTL sinking pass patch is borrowed from RTL loop invariant > >> motion, it is quite limited since only moves instructions from loop header > >> to loop exits, though it could be refined with various of algorithms. > >> Compared to the initial method of running gimple sink pass once more, > >> it seems much more complicated and limited without gaining obvious > >> performance > >> benefit, shall we turn back to consider gimple sink2 pass from original > >> since > >> we are in stage1 now? > > > > OK, so while there might be new sinking opportunities exposed during > > RTL expansion and early RTL opts we can consider adding another sink pass > > on GIMPLE. Since it's basically a scheduling optimization placement > > shouldn't matter much but I suppose we should run it before store > > merging, so anywhere between cd_dce and that. > > > > Richard. > > > > Attached the patch as discussed, put it before store_merging is fine. > Regression tested pass on P8LE, OK for trunk? :) Can you, for the new gcc.dg/tree-ssa/ssa-sink-18.c testcase, add a comment explaining what operations we expect to sink? The testcase is likely somewhat fragile in the exact number of sinkings (can you check on some other target and maybe P8BE with -m32 for example?), so for future adjustments it would be nice to easiliy figure out what we expect to happen. OK with that change. Richard. 
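To make the placement rationale above concrete: one reason the cloned pass sits just before store_merging is that sinking stores out of conditional arms can create the adjacent same-block stores that store_merging then combines (the pattern the updated ssa-sink-13.c checks for, "sink/merge all stores and end up with a single BB"). A minimal, hypothetical example of that interaction:

```c
struct bytes { char a, b, c, d; };

/* Both arms perform the same two stores, so the sink pass can move
   them down to the join point; after that, store_merging sees four
   adjacent byte stores in one block and can emit a single wider
   store.  The type and function here are illustrative only.  */
void clear_all (struct bytes *p, int cond)
{
  if (cond)
    {
      p->a = 0;
      p->b = 0;
    }
  else
    {
      p->a = 0;
      p->b = 0;
    }
  p->c = 0;
  p->d = 0;
}
```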
* Re: [PATCH] Run pass_sink_code once more before store_mergin 2021-05-18 7:02 ` Richard Biener @ 2021-05-18 9:00 ` Xionghu Luo 2021-05-18 10:34 ` Richard Biener 0 siblings, 1 reply; 20+ messages in thread From: Xionghu Luo @ 2021-05-18 9:00 UTC (permalink / raw) To: Richard Biener; +Cc: segher, wschmidt, linkw, gcc-patches, dje.gcc Hi, On 2021/5/18 15:02, Richard Biener wrote: > Can you, for the new gcc.dg/tree-ssa/ssa-sink-18.c testcase, add > a comment explaining what operations we expect to sink? The testcase > is likely somewhat fragile in the exact number of sinkings > (can you check on some other target and maybe P8BE with -m32 for > example?), so for future adjustments it would be nice to easiliy > figure out what we expect to happen. > > OK with that change. Thanks a lot for the reminder! ssa-sink-18.c generates different code for m32 and m64 exactly due to different type size conversion and ivopts selection, since -m32 and -m64 couldn't co-exist in one test file, shall I restrict it to -m64 only or check target lp64/target ilp32? I've verified this case shows same behavior on X86, AArch64 and Power for both m32 and m64. -m32: <bb 29> [local count: 75120046]: # len_155 = PHI <len_127(28), len_182(55)> len_182 = len_155 + 1; _35 = (unsigned int) ip_249; _36 = _35 + len_182; _380 = (uint8_t *) _36; if (maxlen_179 > len_182) goto <bb 30>; [94.50%] else goto <bb 31>; [5.50%] ... Sinking _329 = (uint8_t *) _36; from bb 29 to bb 86 Sinking _36 = _35 + len_182; from bb 29 to bb 86 Sinking _35 = (unsigned int) ip_249; from bb 29 to bb 86 Pass statistics of "sink": ---------------- Sunk statements: 3 -m64: <bb 29> [local count: 75120046]: # ivtmp.23_34 = PHI <ivtmp.23_36(28), ivtmp.23_35(55)> _38 = (unsigned int) ivtmp.23_34; len_161 = _38 + 4294967295; _434 = (unsigned long) ip_254; _433 = ivtmp.23_34 + _434; _438 = (uint8_t *) _433; if (_38 < maxlen_187) goto <bb 30>; [94.50%] else goto <bb 31>; [5.50%] ... 
Sinking _367 = (uint8_t *) _320; from bb 31 to bb 90 Sinking _320 = _321 + ivtmp.25_326; from bb 31 to bb 90 Sinking _321 = (unsigned long) ip_229; from bb 31 to bb 90 Sinking len_158 = _322 + 4294967295; from bb 31 to bb 33 Pass statistics of "sink": ---------------- Sunk statements: 4 Regarding to the comments part: diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-18.c index 52b9a74b65f..5147f7b85cd 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-18.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-18.c @@ -193,16 +193,17 @@ compute_on_bytes (uint8_t *in_data, int in_len, uint8_t *out_data, int out_len) return op - out_data; } + /* For this case, pass sink2 sinks statements from hot loop header to loop + exits after gimple loop optimizations, which generates instructions executed + each iteration in loop, but the results are used outside of loop: + With -m64, + "Sinking _367 = (uint8_t *) _320; + from bb 31 to bb 90 + Sinking _320 = _321 + ivtmp.25_326; + from bb 31 to bb 90 + Sinking _321 = (unsigned long) ip_229; + from bb 31 to bb 90 + Sinking len_158 = _322 + 4294967295; + from bb 31 to bb 33" */ - /* { dg-final { scan-tree-dump-times "Sunk statements: 4" 1 "sink2" } } */ + /* { dg-final { scan-tree-dump-times "Sunk statements: 4" 1 "sink2" { target lp64 } } } */ + /* { dg-final { scan-tree-dump-times "Sunk statements: 3" 1 "sink2" { target ilp32 } } } */ -- Thanks, Xionghu ^ permalink raw reply [flat|nested] 20+ messages in thread
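The -m32/-m64 difference in the dumps above comes from ivopts choosing a pointer-width induction variable: on LP64 targets the counter is 'unsigned long', so the 32-bit 'len' needed after the loop is produced by an extra narrowing statement ('len = (unsigned int) ivtmp + 4294967295', i.e. len - 1 in modular arithmetic), giving one more sinkable statement than on ILP32, where the counter is already 32 bits wide. An illustrative reduction of the loop in question (not the testcase itself):

```c
#include <stdint.h>

/* The post-loop adjustment of 'len' is the statement that exists only
   on LP64 as a separate narrowing + add, and that sink2 moves past
   the exit test; names are illustrative.  */
unsigned int scan (const uint8_t *ip, const uint8_t *ref,
                   unsigned int maxlen)
{
  unsigned int len = 2;
  do
    len++;
  while (len < maxlen && ip[len] == ref[len]);
  return len - 1;   /* shows up as 'len + 4294967295' in the dumps */
}
```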
* Re: [PATCH] Run pass_sink_code once more before store_mergin 2021-05-18 9:00 ` Xionghu Luo @ 2021-05-18 10:34 ` Richard Biener 2021-05-19 8:03 ` Bernd Edlinger 0 siblings, 1 reply; 20+ messages in thread From: Richard Biener @ 2021-05-18 10:34 UTC (permalink / raw) To: Xionghu Luo; +Cc: segher, wschmidt, linkw, gcc-patches, dje.gcc On Tue, 18 May 2021, Xionghu Luo wrote: > Hi, > > On 2021/5/18 15:02, Richard Biener wrote: > > Can you, for the new gcc.dg/tree-ssa/ssa-sink-18.c testcase, add > > a comment explaining what operations we expect to sink? The testcase > > is likely somewhat fragile in the exact number of sinkings > > (can you check on some other target and maybe P8BE with -m32 for > > example?), so for future adjustments it would be nice to easiliy > > figure out what we expect to happen. > > > > OK with that change. > > Thanks a lot for the reminder! ssa-sink-18.c generates different code > for m32 and m64 exactly due to different type size conversion and ivopts > selection, since -m32 and -m64 couldn't co-exist in one test file, shall > I restrict it to -m64 only or check target lp64/target ilp32? > I've verified this case shows same behavior on X86, AArch64 and Power for > both m32 and m64. > > -m32: > <bb 29> [local count: 75120046]: > # len_155 = PHI <len_127(28), len_182(55)> > len_182 = len_155 + 1; > _35 = (unsigned int) ip_249; > _36 = _35 + len_182; > _380 = (uint8_t *) _36; > if (maxlen_179 > len_182) > goto <bb 30>; [94.50%] > else > goto <bb 31>; [5.50%] > ... 
> > Sinking _329 = (uint8_t *) _36; > from bb 29 to bb 86 > Sinking _36 = _35 + len_182; > from bb 29 to bb 86 > Sinking _35 = (unsigned int) ip_249; > from bb 29 to bb 86 > > Pass statistics of "sink": ---------------- > Sunk statements: 3 > > > -m64: > <bb 29> [local count: 75120046]: > # ivtmp.23_34 = PHI <ivtmp.23_36(28), ivtmp.23_35(55)> > _38 = (unsigned int) ivtmp.23_34; > len_161 = _38 + 4294967295; > _434 = (unsigned long) ip_254; > _433 = ivtmp.23_34 + _434; > _438 = (uint8_t *) _433; > if (_38 < maxlen_187) > goto <bb 30>; [94.50%] > else > goto <bb 31>; [5.50%] > ... > > Sinking _367 = (uint8_t *) _320; > from bb 31 to bb 90 > Sinking _320 = _321 + ivtmp.25_326; > from bb 31 to bb 90 > Sinking _321 = (unsigned long) ip_229; > from bb 31 to bb 90 > Sinking len_158 = _322 + 4294967295; > from bb 31 to bb 33 > > Pass statistics of "sink": ---------------- > Sunk statements: 4 > > > Regarding to the comments part: > > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-18.c > index 52b9a74b65f..5147f7b85cd 100644 > --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-18.c > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-18.c > @@ -193,16 +193,17 @@ compute_on_bytes (uint8_t *in_data, int in_len, uint8_t *out_data, int out_len) > return op - out_data; > } > + /* For this case, pass sink2 sinks statements from hot loop header to loop > + exits after gimple loop optimizations, which generates instructions executed > + each iteration in loop, but the results are used outside of loop: > + With -m64, > + "Sinking _367 = (uint8_t *) _320; > + from bb 31 to bb 90 > + Sinking _320 = _321 + ivtmp.25_326; > + from bb 31 to bb 90 > + Sinking _321 = (unsigned long) ip_229; > + from bb 31 to bb 90 > + Sinking len_158 = _322 + 4294967295; > + from bb 31 to bb 33" */ > > - /* { dg-final { scan-tree-dump-times "Sunk statements: 4" 1 "sink2" } } */ > + /* { dg-final { scan-tree-dump-times "Sunk statements: 4" 1 "sink2" { target lp64 } } } 
*/ > + /* { dg-final { scan-tree-dump-times "Sunk statements: 3" 1 "sink2" { target ilp32 } } } */ Yes, that looks good. Thanks, Richard. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Run pass_sink_code once more before store_mergin 2021-05-18 10:34 ` Richard Biener @ 2021-05-19 8:03 ` Bernd Edlinger 0 siblings, 0 replies; 20+ messages in thread From: Bernd Edlinger @ 2021-05-19 8:03 UTC (permalink / raw) To: Richard Biener, Xionghu Luo; +Cc: gcc-patches, dje.gcc, wschmidt, linkw, segher [-- Attachment #1: Type: text/plain, Size: 3875 bytes --] On 5/18/21 12:34 PM, Richard Biener wrote: > On Tue, 18 May 2021, Xionghu Luo wrote: > >> Hi, >> >> On 2021/5/18 15:02, Richard Biener wrote: >>> Can you, for the new gcc.dg/tree-ssa/ssa-sink-18.c testcase, add >>> a comment explaining what operations we expect to sink? The testcase >>> is likely somewhat fragile in the exact number of sinkings >>> (can you check on some other target and maybe P8BE with -m32 for >>> example?), so for future adjustments it would be nice to easiliy >>> figure out what we expect to happen. >>> >>> OK with that change. >> >> Thanks a lot for the reminder! ssa-sink-18.c generates different code >> for m32 and m64 exactly due to different type size conversion and ivopts >> selection, since -m32 and -m64 couldn't co-exist in one test file, shall >> I restrict it to -m64 only or check target lp64/target ilp32? >> I've verified this case shows same behavior on X86, AArch64 and Power for >> both m32 and m64. >> >> -m32: >> <bb 29> [local count: 75120046]: >> # len_155 = PHI <len_127(28), len_182(55)> >> len_182 = len_155 + 1; >> _35 = (unsigned int) ip_249; >> _36 = _35 + len_182; >> _380 = (uint8_t *) _36; >> if (maxlen_179 > len_182) >> goto <bb 30>; [94.50%] >> else >> goto <bb 31>; [5.50%] >> ... 
>> >> Sinking _329 = (uint8_t *) _36; >> from bb 29 to bb 86 >> Sinking _36 = _35 + len_182; >> from bb 29 to bb 86 >> Sinking _35 = (unsigned int) ip_249; >> from bb 29 to bb 86 >> >> Pass statistics of "sink": ---------------- >> Sunk statements: 3 >> >> >> -m64: >> <bb 29> [local count: 75120046]: >> # ivtmp.23_34 = PHI <ivtmp.23_36(28), ivtmp.23_35(55)> >> _38 = (unsigned int) ivtmp.23_34; >> len_161 = _38 + 4294967295; >> _434 = (unsigned long) ip_254; >> _433 = ivtmp.23_34 + _434; >> _438 = (uint8_t *) _433; >> if (_38 < maxlen_187) >> goto <bb 30>; [94.50%] >> else >> goto <bb 31>; [5.50%] >> ... >> >> Sinking _367 = (uint8_t *) _320; >> from bb 31 to bb 90 >> Sinking _320 = _321 + ivtmp.25_326; >> from bb 31 to bb 90 >> Sinking _321 = (unsigned long) ip_229; >> from bb 31 to bb 90 >> Sinking len_158 = _322 + 4294967295; >> from bb 31 to bb 33 >> >> Pass statistics of "sink": ---------------- >> Sunk statements: 4 >> >> >> Regarding to the comments part: >> >> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-18.c >> index 52b9a74b65f..5147f7b85cd 100644 >> --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-18.c >> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-18.c >> @@ -193,16 +193,17 @@ compute_on_bytes (uint8_t *in_data, int in_len, uint8_t *out_data, int out_len) >> return op - out_data; >> } >> + /* For this case, pass sink2 sinks statements from hot loop header to loop >> + exits after gimple loop optimizations, which generates instructions executed >> + each iteration in loop, but the results are used outside of loop: >> + With -m64, >> + "Sinking _367 = (uint8_t *) _320; >> + from bb 31 to bb 90 >> + Sinking _320 = _321 + ivtmp.25_326; >> + from bb 31 to bb 90 >> + Sinking _321 = (unsigned long) ip_229; >> + from bb 31 to bb 90 >> + Sinking len_158 = _322 + 4294967295; >> + from bb 31 to bb 33" */ >> >> - /* { dg-final { scan-tree-dump-times "Sunk statements: 4" 1 "sink2" } } */ >> + /* { dg-final { 
scan-tree-dump-times "Sunk statements: 4" 1 "sink2" { target lp64 } } } */ >> + /* { dg-final { scan-tree-dump-times "Sunk statements: 3" 1 "sink2" { target ilp32 } } } */ > > Yes, that looks good. > > Thanks, > Richard. > Hi, I've noticed the test case gcc.dg/tree-ssa/ssa-sink-3.c was accidentally committed as empty file, and therefore fails: FAIL: gcc.dg/tree-ssa/ssa-sink-3.c (test for excess errors) I've commited as obvious the following fix (which restores the test case and Xionghu Luo's intended change. Thanks Bernd. [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: 0001-Fix-commit-mistake-in-testcase-gcc.dg-tree-ssa-ssa-s.patch --] [-- Type: text/x-patch; name="0001-Fix-commit-mistake-in-testcase-gcc.dg-tree-ssa-ssa-s.patch", Size: 1107 bytes --] From 51cfa55431c38f3c29c7b72833337ad8a2da5c06 Mon Sep 17 00:00:00 2001 From: Bernd Edlinger <bernd.edlinger@hotmail.de> Date: Wed, 19 May 2021 09:51:44 +0200 Subject: [PATCH] Fix commit mistake in testcase gcc.dg/tree-ssa/ssa-sink-3.c the test case was accidenally changed to empty file. 2021-05-19 Bernd Edlinger <bernd.edlinger@hotmail.de> * gcc.dg/tree-ssa/ssa-sink-3.c: Fix test case. --- gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-3.c | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-3.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-3.c index e69de29..ad88ccc 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-3.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-3.c @@ -0,0 +1,15 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-sink-stats" } */ +extern void foo(int a); +int +main (int argc) +{ + int a; + a = argc + 1; + if (argc + 3) + { + foo (a); + } +} +/* We should sink the a = argc + 1 calculation into the if branch */ +/* { dg-final { scan-tree-dump-times "Sunk statements: 1" 1 "sink1" } } */ -- 1.9.1 ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Run pass_sink_code once more before store_mergin 2021-05-18 5:20 ` [PATCH] Run pass_sink_code once more before store_mergin Xionghu Luo 2021-05-18 7:02 ` Richard Biener @ 2021-09-02 15:45 ` Martin Jambor 1 sibling, 0 replies; 20+ messages in thread From: Martin Jambor @ 2021-09-02 15:45 UTC (permalink / raw) To: Xionghu Luo; +Cc: gcc-patches Hi, On Tue, May 18 2021, Xionghu Luo via Gcc-patches wrote: > [...] > From 7fcc6ca9ef3b6acbfbcbd3da4be1d1c0eef4be80 Mon Sep 17 00:00:00 2001 > From: Xiong Hu Luo <luoxhu@linux.ibm.com> > Date: Mon, 17 May 2021 20:46:15 -0500 > Subject: [PATCH] Run pass_sink_code once more before store_merging > > Gimple sink code pass runs quite early, there may be some new > oppertunities exposed by later gimple optmization passes, this patch > runs the sink code pass once more before store_merging. For detailed > discussion, please refer to: > https://gcc.gnu.org/pipermail/gcc-patches/2020-December/562352.html > > Tested the SPEC2017 performance on P8LE, 544.nab_r is improved > by 2.43%, but no big changes to other cases, GEOMEAN is improved quite > small with 0.25%. > > gcc/ChangeLog: > > * passes.def: Add sink_code before store_merging. > * tree-ssa-sink.c (pass_sink_code:clone): New. Unfortunately, this seems to have caused PR 102178 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178) Sorry about the bad news, Martin ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC] Run pass_sink_code once more after ivopts/fre 2021-04-14 6:41 ` Richard Biener 2021-04-15 6:20 ` Xionghu Luo @ 2021-04-24 3:44 ` Jeff Law 1 sibling, 0 replies; 20+ messages in thread From: Jeff Law @ 2021-04-24 3:44 UTC (permalink / raw) To: Richard Biener, Xionghu Luo; +Cc: gcc-patches, dje.gcc, wschmidt, linkw, segher On 4/14/2021 12:41 AM, Richard Biener wrote: > On Wed, 14 Apr 2021, Xionghu Luo wrote: > >> Hi, >> >> On 2021/3/26 15:35, Xionghu Luo via Gcc-patches wrote: >>>> Also we already have a sinking pass on RTL which even computes >>>> a proper PRE on the reverse graph - -fgcse-sm aka store-motion.c. >>>> I'm not sure whether this deals with non-stores but the >>>> LCM machinery definitely can handle arbitrary expressions. I wonder >>>> if it makes more sense to extend this rather than inventing a new >>>> ad-hoc sinking pass? >>> From the literal, my pass doesn't handle or process store instructions >>> like store-motion.. Thanks, will check it. >> Store motion only processes store instructions with data flow equations, >> generating 4 inputs(st_kill, st_avloc, st_antloc, st_transp) and solve it >> by Lazy Code Motion API(5 DF compute call) with 2 outputs (st_delete_map, >> st_insert_map) globally, each store place is independently represented in >> the input bitmap vectors. Output is which should be delete and where to >> insert, current code does what you said "emit copies to a new pseudo at >> the original insn location and use it in followed bb", actually it is >> "store replacement" instead of "store move", why not save one pseudo by >> moving the store instruction to target edge directly? > It probably simply saves the pass from doing analysis whether the > stored value is clobbered on the sinking path, enabling more store > sinking. For stores that might be even beneficial, for non-stores > it becomes more of a cost issue, yes. > >> There are many differences between the newly added rtl-sink pass and >> store-motion pass. >> 1. 
Store motion moves only store instructions, rtl-sink ignores store >> instructions. >> 2. Store motion is a global DF problem solving, rtl-sink only processes >> loop header reversely with dependency check in loop, take the below RTL >> as example, >> "#538,#235,#234,#233" will all be sunk from bb 35 to bb 37 by rtl-sink, >> but it moves #538 first, then #235, there is strong dependency here. It >> seemsdoesn't like the LCM framework that could solve all and do the >> delete-insert in one iteration. > So my question was whether we want to do both within the LCM store > sinking framework. The LCM dataflow is also used by RTL PRE which > handles both loads and non-loads so in principle it should be able > to handle stores and non-stores for the sinking case (PRE on the > reverse CFG). > > A global dataflow is more powerful than any local ad-hoc method. IIRC you can use LCM on stores like this, but you have to run it independently on each store to pick up the secondary effects. I believe the basic concepts are discussed in Morgan's book. That may turn out to be too expensive in practice -- I've never tried it though. jeff ^ permalink raw reply [flat|nested] 20+ messages in thread
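The "secondary effects" Jeff mentions can be seen even at the source level: once a dependent statement is sunk, its input may lose its last use in the original block and become sinkable too, so a local pass must iterate (or process statements in reverse dependency order, the "#538 before #235" ordering described above), whereas a global dataflow solve accounts for the whole chain at once. A hypothetical two-statement chain:

```c
/* 't2' is used only on the taken branch and can be sunk there; only
   after that does 't1' lose its last use in the original block and
   become sinkable as well.  Names and shape are illustrative.  */
int pick (int a, int b, int c)
{
  int t1 = a * b;     /* sinkable only after t2 has been sunk */
  int t2 = t1 + 1;    /* sole use is under the condition: sinkable */
  return c ? t2 : a;
}
```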
end of thread, other threads:[~2021-09-02 15:45 UTC | newest] Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-12-21 9:03 [RFC] Run pass_sink_code once more after ivopts/fre Xiong Hu Luo 2020-12-22 16:53 ` Richard Biener 2021-03-23 3:07 ` Xionghu Luo 2021-03-23 8:50 ` Richard Biener 2021-03-26 7:35 ` Xionghu Luo 2021-04-14 1:51 ` Xionghu Luo 2021-04-14 6:41 ` Richard Biener 2021-04-15 6:20 ` Xionghu Luo 2021-04-15 11:34 ` Richard Biener 2021-04-20 9:23 ` Xionghu Luo 2021-04-21 11:54 ` Richard Biener 2021-05-14 7:10 ` Xionghu Luo 2021-05-17 8:11 ` Richard Biener 2021-05-18 5:20 ` [PATCH] Run pass_sink_code once more before store_mergin Xionghu Luo 2021-05-18 7:02 ` Richard Biener 2021-05-18 9:00 ` Xionghu Luo 2021-05-18 10:34 ` Richard Biener 2021-05-19 8:03 ` Bernd Edlinger 2021-09-02 15:45 ` Martin Jambor 2021-04-24 3:44 ` [RFC] Run pass_sink_code once more after ivopts/fre Jeff Law