public inbox for gcc-bugs@sourceware.org help / color / mirror / Atom feed
* [Bug tree-optimization/94274] New: fold phi whose incoming args are defined from binary operations @ 2020-03-23 11:42 z.zhanghaijian at huawei dot com 2020-03-23 12:59 ` [Bug tree-optimization/94274] " glisse at gcc dot gnu.org ` (5 more replies) 0 siblings, 6 replies; 7+ messages in thread From: z.zhanghaijian at huawei dot com @ 2020-03-23 11:42 UTC (permalink / raw) To: gcc-bugs https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94274 Bug ID: 94274 Summary: fold phi whose incoming args are defined from binary operations Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: z.zhanghaijian at huawei dot com Target Milestone: --- For if/else structure, Example 1: int test(int cond, int a, int b, int c) { int result = 0; if (cond) result = a + b; else result = a + c; return result; } The expressions is binary operation and have a common subexpression "a", and the opcode is the same. E.g. on aarch64, gcc will do the binary operation first, and then do csel: cmp w0, 0 add w0, w1, w2 add w1, w1, w3 csel w0, w1, w0, eq In fact, it can be optimized to do csel first and then do binary operations: cmp w0, 0 csel w2, w2, w3, ne add w0, w2, w1 This can eliminate one instruction. This scenario is very common, and the switch/case structure is the same. Example 2: int test(int cond, int a, int b, int c, int d) { int result = 0; switch (cond) { case 1: result = a + b; break; case 8: result = a + c; break; default: result = a + d; break; } return result; } gcc will do the binary operation first, and then do csel : mov w5, w0 add w0, w1, w2 cmp w5, 1 beq .L1 add w4, w1, w4 cmp w5, 8 add w1, w1, w3 csel w0, w1, w4, eq .L1: ret Which can further optimized into : cmp w0, 1 beq .L3 cmp w0, 8 csel w4, w4, w3, ne add w0, w1, w4 ret .L3: mov w4, w2 add w0, w1, w4 ret My proposal: fold the merging phi node in tree_ssa_phiopt_worker (ssa-phiopt) : For example 1: replaces bb0: if (cond) goto bb1; else goto bb2; bb1: x1 = a + b; goto <bb3> bb2: x2 = a + c; bb3: x = PHI <x1 (bb1), x2 (bb2), ...>; with bb0: if (cond) goto bb1; else goto bb2; bb1: bb2: bb3: x3 = PHI <b (bb1), c (bb2), ...>; x = a + x3; For example 2: replaces bb0: if (cond == 1) goto bb2; else goto bb1; bb1: if (cond == 8) goto bb3; else goto bb4; bb2: x2 = a + b; goto <bb5> bb3: x3 = a + c; goto <bb5> bb4: x4 = a + d; bb5: x5 = PHI <x2 (bb2), x3 (bb3), x4 (bb4), ...>; with bb0: if (cond == 1) goto bb2; else goto bb1; bb1: if (cond == 8) goto bb3; else goto bb4; bb2: bb3: bb4: bb5: x5 = PHI <b (bb2), c (bb3), c (bb4), ...>; x = a + x5; I have an initial implementation that is under testing. In part, it based on the LLVM InstCombinePass(InstCombinePHI.cpp). Any suggestions? ^ permalink raw reply [flat|nested] 7+ messages in thread
* [Bug tree-optimization/94274] fold phi whose incoming args are defined from binary operations 2020-03-23 11:42 [Bug tree-optimization/94274] New: fold phi whose incoming args are defined from binary operations z.zhanghaijian at huawei dot com @ 2020-03-23 12:59 ` glisse at gcc dot gnu.org 2020-03-23 14:04 ` rguenth at gcc dot gnu.org ` (4 subsequent siblings) 5 siblings, 0 replies; 7+ messages in thread From: glisse at gcc dot gnu.org @ 2020-03-23 12:59 UTC (permalink / raw) To: gcc-bugs https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94274 --- Comment #1 from Marc Glisse <glisse at gcc dot gnu.org> --- Detecting common beginnings / endings in branches is something gcc does very seldom. Even at -Os, for if(cond)f(b);else f(c); we need to wait until rtl-optimizations to get a single call to f. (of course the reverse transformation of duplicating a statement that was after the branches into them, if it simplifies, is nice as well, and they can conflict) I don't know if handling one such very specific case (binary operations with a common argument) separately is a good idea when we don't even handle unary operations. ^ permalink raw reply [flat|nested] 7+ messages in thread
* [Bug tree-optimization/94274] fold phi whose incoming args are defined from binary operations 2020-03-23 11:42 [Bug tree-optimization/94274] New: fold phi whose incoming args are defined from binary operations z.zhanghaijian at huawei dot com 2020-03-23 12:59 ` [Bug tree-optimization/94274] " glisse at gcc dot gnu.org @ 2020-03-23 14:04 ` rguenth at gcc dot gnu.org 2020-03-24 13:42 ` z.zhanghaijian at huawei dot com ` (3 subsequent siblings) 5 siblings, 0 replies; 7+ messages in thread From: rguenth at gcc dot gnu.org @ 2020-03-23 14:04 UTC (permalink / raw) To: gcc-bugs https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94274 Richard Biener <rguenth at gcc dot gnu.org> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|UNCONFIRMED |NEW Ever confirmed|0 |1 Keywords| |missed-optimization Last reconfirmed| |2020-03-23 --- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> --- Note that with binary operations you are eventually increasing register pressure up to a point where we need to spill so IMHO this should be only done if both blocks become empty after the transform. ^ permalink raw reply [flat|nested] 7+ messages in thread
* [Bug tree-optimization/94274] fold phi whose incoming args are defined from binary operations 2020-03-23 11:42 [Bug tree-optimization/94274] New: fold phi whose incoming args are defined from binary operations z.zhanghaijian at huawei dot com 2020-03-23 12:59 ` [Bug tree-optimization/94274] " glisse at gcc dot gnu.org 2020-03-23 14:04 ` rguenth at gcc dot gnu.org @ 2020-03-24 13:42 ` z.zhanghaijian at huawei dot com 2020-03-25 9:40 ` z.zhanghaijian at huawei dot com ` (2 subsequent siblings) 5 siblings, 0 replies; 7+ messages in thread From: z.zhanghaijian at huawei dot com @ 2020-03-24 13:42 UTC (permalink / raw) To: gcc-bugs https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94274 --- Comment #3 from z.zhanghaijian at huawei dot com <z.zhanghaijian at huawei dot com> --- (In reply to Marc Glisse from comment #1) > Detecting common beginnings / endings in branches is something gcc does very > seldom. Even at -Os, for if(cond)f(b);else f(c); we need to wait until > rtl-optimizations to get a single call to f. (of course the reverse > transformation of duplicating a statement that was after the branches into > them, if it simplifies, is nice as well, and they can conflict) > I don't know if handling one such very specific case (binary operations with > a common argument) separately is a good idea when we don't even handle unary > operations. I tried to test this fold on specint2017 and found some performance gains on 500.perlbench_r. Then compared the assemble and found some improvements. For example: S_invlist_max, which is inlined by many functions, such as S__append_range_to_invlist, S_ssc_anything, Perl__invlist_invert ... invlist_inline.h: #define FROM_INTERNAL_SIZE(x) ((x)/ sizeof(UV)) S_invlist_max(inlined by S__append_range_to_invlist, S_ssc_anything, Perl__invlist_invert, ....): return SvLEN(invlist) == 0 /* This happens under _new_invlist_C_array */ ? FROM_INTERNAL_SIZE(SvCUR(invlist)) - 1 : FROM_INTERNAL_SIZE(SvLEN(invlist)) - 1; Dump tree phiopt: <bb 3> [local count: 536870911]: _46 = pretmp_112 >> 3; iftmp.1123_47 = _46 + 18446744073709551615; goto <bb 5>; [100.00%] <bb 4> [local count: 536870911]: _48 = _44 >> 3; iftmp.1123_49 = _48 + 18446744073709551615; <bb 5> [local count: 1073741823]: # iftmp.1123_50 = PHI <iftmp.1123_47(3), iftmp.1123_49(4)> Which can replaces with: <bb 3> [local count: 536870912]: <bb 4> [local count: 1073741823]: # _48 = PHI <_44(2), pretmp_112(3)> _49 = _48 >> 3; iftmp.1123_50 = _49 + 18446744073709551615; Assemble: lsr x5, x6, #3 lsr x3, x3, #3 sub x20, x5, #0x1 sub x3, x3, #0x1 csel x20, x3, x20, ne Replaces with: csel x3, x3, x4, ne lsr x3, x3, #3 sub x20, x3, #0x1 This can eliminate two instruction. ^ permalink raw reply [flat|nested] 7+ messages in thread
* [Bug tree-optimization/94274] fold phi whose incoming args are defined from binary operations 2020-03-23 11:42 [Bug tree-optimization/94274] New: fold phi whose incoming args are defined from binary operations z.zhanghaijian at huawei dot com ` (2 preceding siblings ...) 2020-03-24 13:42 ` z.zhanghaijian at huawei dot com @ 2020-03-25 9:40 ` z.zhanghaijian at huawei dot com 2020-06-02 7:43 ` z.zhanghaijian at huawei dot com 2021-08-03 23:33 ` pinskia at gcc dot gnu.org 5 siblings, 0 replies; 7+ messages in thread From: z.zhanghaijian at huawei dot com @ 2020-03-25 9:40 UTC (permalink / raw) To: gcc-bugs https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94274 --- Comment #4 from z.zhanghaijian at huawei dot com <z.zhanghaijian at huawei dot com> --- (In reply to Richard Biener from comment #2) > Note that with binary operations you are eventually increasing register > pressure up to a point where we need to spill so IMHO this should be only > done if both > blocks become empty after the transform. I can try to add some constraints to only do the fold when both blocks become empty after the transform. Even with this constraints, the improvements of spec2017 mentioned above can also be done, but do you think need to add a option to control this constraint to let users choose? Because the register pressure problem we worry about does not exist in most cases. ^ permalink raw reply [flat|nested] 7+ messages in thread
* [Bug tree-optimization/94274] fold phi whose incoming args are defined from binary operations 2020-03-23 11:42 [Bug tree-optimization/94274] New: fold phi whose incoming args are defined from binary operations z.zhanghaijian at huawei dot com ` (3 preceding siblings ...) 2020-03-25 9:40 ` z.zhanghaijian at huawei dot com @ 2020-06-02 7:43 ` z.zhanghaijian at huawei dot com 2021-08-03 23:33 ` pinskia at gcc dot gnu.org 5 siblings, 0 replies; 7+ messages in thread From: z.zhanghaijian at huawei dot com @ 2020-06-02 7:43 UTC (permalink / raw) To: gcc-bugs https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94274 --- Comment #5 from z.zhanghaijian at huawei dot com <z.zhanghaijian at huawei dot com> --- Created attachment 48659 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48659&action=edit Fold phi whose incoming args are defined from binary operations I tried to make a patch to do this optimization(in attachment): replaces bb0: if (cond) goto bb1; else goto bb2; bb1: x1 = a + b; goto <bb3> bb2: x2 = a + c; bb3: x = PHI <x1 (bb1), x2 (bb2), ...>; with bb0: if (cond) goto bb1; else goto bb2; bb1: bb2: bb3: x3 = PHI <b (bb1), c (bb2), ...>; x = a + x3; This patch will check all the phi nodes in bb3. We do the optimization only if these phis can all be converted, This can avoid most scenes that the blocks is not empty after the optimization, but there are still some scenes that the block is not empty. For example 1: int f1(int cond, int a, int b, int c, int d, int e, int f, int x, int y, int z, int w, int m, int n) { if (cond) { x = e + f; b = x >> w; c = m + 12; a = b + z; } else { d = y >> w; c = n + 12; a = d + z; } a = a + 18; return c + a; } Tree dump before optimization: <bb 3> [local count: 536870913]: x_13 = e_11(D) + f_12(D); b_14 = x_13 >> w_5(D); c_16 = m_15(D) + 12; a_17 = z_9(D) + b_14; goto <bb 5>; [100.00%] <bb 4> [local count: 536870913]: d_6 = y_4(D) >> w_5(D); c_8 = n_7(D) + 12; a_10 = d_6 + z_9(D); <bb 5> [local count: 1073741824]: # a_1 = PHI <a_17(3), a_10(4)> # c_2 = PHI <c_16(3), c_8(4)> a_18 = a_1 + 18; _19 = c_2 + a_18; return _19; Tree dump after optimization: <bb 3> [local count: 536870913]: x_13 = e_11(D) + f_12(D); goto <bb 5>; [100.00%] <bb 4> [local count: 536870913]: <bb 5> [local count: 1073741824]: # _21 = PHI <x_13(3), y_4(D)(4)> # _23 = PHI <m_15(D)(3), n_7(D)(4)> c_2 = _23 + 12; _22 = _21 >> w_5(D); a_1 = z_9(D) + _22; a_18 = a_1 + 18; _19 = c_2 + a_18; return _19; Assemble before optimization: .LFB0: .cfi_startproc ldr w1, [sp, 8] ldr w2, [sp, 16] cbz w0, .L2 add w5, w5, w6 ldr w0, [sp, 24] asr w5, w5, w2 add w1, w5, w1 add w1, w1, 18 add w0, w0, 12 add w0, w1, w0 ret .p2align 2,,3 .L2: ldr w0, [sp] asr w2, w0, w2 ldr w0, [sp, 32] add w1, w2, w1 add w1, w1, 18 add w0, w0, 12 add w0, w1, w0 ret .cfi_endproc Assemble after optimization: .LFB0: .cfi_startproc ldr w1, [sp] ldr w3, [sp, 8] ldr w4, [sp, 16] ldr w2, [sp, 32] cbz w0, .L2 ldr w2, [sp, 24] add w1, w5, w6 .L2: asr w0, w1, w4 add w0, w0, w3 add w0, w0, w2 add w0, w0, 30 ret Because the statement x_13 = e_11 (D) + f_12 (D) in bb3 does not have a phi node in bb5, so bb3 cannot be emptied. I have not found a good way to solve it. Any suggestions? Without considering the register pressure, the performance of example 1 is profitable. And this patch is effective for 500.perlbench_r in comment 3. ^ permalink raw reply [flat|nested] 7+ messages in thread
* [Bug tree-optimization/94274] fold phi whose incoming args are defined from binary operations 2020-03-23 11:42 [Bug tree-optimization/94274] New: fold phi whose incoming args are defined from binary operations z.zhanghaijian at huawei dot com ` (4 preceding siblings ...) 2020-06-02 7:43 ` z.zhanghaijian at huawei dot com @ 2021-08-03 23:33 ` pinskia at gcc dot gnu.org 5 siblings, 0 replies; 7+ messages in thread From: pinskia at gcc dot gnu.org @ 2021-08-03 23:33 UTC (permalink / raw) To: gcc-bugs https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94274 Andrew Pinski <pinskia at gcc dot gnu.org> changed: What |Removed |Added ---------------------------------------------------------------------------- Assignee|unassigned at gcc dot gnu.org |pinskia at gcc dot gnu.org Severity|normal |enhancement Status|NEW |ASSIGNED --- Comment #6 from Andrew Pinski <pinskia at gcc dot gnu.org> --- Mine. There is another bug which I assigned to myself which is very similar. ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2021-08-03 23:33 UTC | newest] Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-03-23 11:42 [Bug tree-optimization/94274] New: fold phi whose incoming args are defined from binary operations z.zhanghaijian at huawei dot com 2020-03-23 12:59 ` [Bug tree-optimization/94274] " glisse at gcc dot gnu.org 2020-03-23 14:04 ` rguenth at gcc dot gnu.org 2020-03-24 13:42 ` z.zhanghaijian at huawei dot com 2020-03-25 9:40 ` z.zhanghaijian at huawei dot com 2020-06-02 7:43 ` z.zhanghaijian at huawei dot com 2021-08-03 23:33 ` pinskia at gcc dot gnu.org
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).