public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug tree-optimization/94274] New: fold phi whose incoming args are defined from binary operations
@ 2020-03-23 11:42 z.zhanghaijian at huawei dot com
2020-03-23 12:59 ` [Bug tree-optimization/94274] " glisse at gcc dot gnu.org
` (5 more replies)
0 siblings, 6 replies; 7+ messages in thread
From: z.zhanghaijian at huawei dot com @ 2020-03-23 11:42 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94274
Bug ID: 94274
Summary: fold phi whose incoming args are defined from binary
operations
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: z.zhanghaijian at huawei dot com
Target Milestone: ---
For if/else structure,
Example 1:
int test(int cond, int a, int b, int c)
{
int result = 0;
if (cond)
result = a + b;
else
result = a + c;
return result;
}
The expressions is binary operation and have a common subexpression "a", and
the opcode is the same.
E.g. on aarch64, gcc will do the binary operation first, and then do csel:
cmp w0, 0
add w0, w1, w2
add w1, w1, w3
csel w0, w1, w0, eq
In fact, it can be optimized to do csel first and then do binary operations:
cmp w0, 0
csel w2, w2, w3, ne
add w0, w2, w1
This can eliminate one instruction. This scenario is very common, and the
switch/case structure is the same.
Example 2:
int test(int cond, int a, int b, int c, int d)
{
int result = 0;
switch (cond) {
case 1:
result = a + b;
break;
case 8:
result = a + c;
break;
default:
result = a + d;
break;
}
return result;
}
gcc will do the binary operation first, and then do csel :
mov w5, w0
add w0, w1, w2
cmp w5, 1
beq .L1
add w4, w1, w4
cmp w5, 8
add w1, w1, w3
csel w0, w1, w4, eq
.L1:
ret
Which can further optimized into :
cmp w0, 1
beq .L3
cmp w0, 8
csel w4, w4, w3, ne
add w0, w1, w4
ret
.L3:
mov w4, w2
add w0, w1, w4
ret
My proposal: fold the merging phi node in tree_ssa_phiopt_worker (ssa-phiopt) :
For example 1:
replaces
bb0:
if (cond) goto bb1; else goto bb2;
bb1:
x1 = a + b;
goto <bb3>
bb2:
x2 = a + c;
bb3:
x = PHI <x1 (bb1), x2 (bb2), ...>;
with
bb0:
if (cond) goto bb1; else goto bb2;
bb1:
bb2:
bb3:
x3 = PHI <b (bb1), c (bb2), ...>;
x = a + x3;
For example 2:
replaces
bb0:
if (cond == 1) goto bb2; else goto bb1;
bb1:
if (cond == 8) goto bb3; else goto bb4;
bb2:
x2 = a + b;
goto <bb5>
bb3:
x3 = a + c;
goto <bb5>
bb4:
x4 = a + d;
bb5:
x5 = PHI <x2 (bb2), x3 (bb3), x4 (bb4), ...>;
with
bb0:
if (cond == 1) goto bb2; else goto bb1;
bb1:
if (cond == 8) goto bb3; else goto bb4;
bb2:
bb3:
bb4:
bb5:
x5 = PHI <b (bb2), c (bb3), c (bb4), ...>;
x = a + x5;
I have an initial implementation that is under testing. In part, it based on
the LLVM InstCombinePass(InstCombinePHI.cpp).
Any suggestions?
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Bug tree-optimization/94274] fold phi whose incoming args are defined from binary operations
2020-03-23 11:42 [Bug tree-optimization/94274] New: fold phi whose incoming args are defined from binary operations z.zhanghaijian at huawei dot com
@ 2020-03-23 12:59 ` glisse at gcc dot gnu.org
2020-03-23 14:04 ` rguenth at gcc dot gnu.org
` (4 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: glisse at gcc dot gnu.org @ 2020-03-23 12:59 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94274
--- Comment #1 from Marc Glisse <glisse at gcc dot gnu.org> ---
Detecting common beginnings / endings in branches is something gcc does very
seldom. Even at -Os, for if(cond)f(b);else f(c); we need to wait until
rtl-optimizations to get a single call to f. (of course the reverse
transformation of duplicating a statement that was after the branches into
them, if it simplifies, is nice as well, and they can conflict)
I don't know if handling one such very specific case (binary operations with a
common argument) separately is a good idea when we don't even handle unary
operations.
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Bug tree-optimization/94274] fold phi whose incoming args are defined from binary operations
2020-03-23 11:42 [Bug tree-optimization/94274] New: fold phi whose incoming args are defined from binary operations z.zhanghaijian at huawei dot com
2020-03-23 12:59 ` [Bug tree-optimization/94274] " glisse at gcc dot gnu.org
@ 2020-03-23 14:04 ` rguenth at gcc dot gnu.org
2020-03-24 13:42 ` z.zhanghaijian at huawei dot com
` (3 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-03-23 14:04 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94274
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |NEW
Ever confirmed|0 |1
Keywords| |missed-optimization
Last reconfirmed| |2020-03-23
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note that with binary operations you are eventually increasing register
pressure up to a point where we need to spill so IMHO this should be only done
if both
blocks become empty after the transform.
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Bug tree-optimization/94274] fold phi whose incoming args are defined from binary operations
2020-03-23 11:42 [Bug tree-optimization/94274] New: fold phi whose incoming args are defined from binary operations z.zhanghaijian at huawei dot com
2020-03-23 12:59 ` [Bug tree-optimization/94274] " glisse at gcc dot gnu.org
2020-03-23 14:04 ` rguenth at gcc dot gnu.org
@ 2020-03-24 13:42 ` z.zhanghaijian at huawei dot com
2020-03-25 9:40 ` z.zhanghaijian at huawei dot com
` (2 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: z.zhanghaijian at huawei dot com @ 2020-03-24 13:42 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94274
--- Comment #3 from z.zhanghaijian at huawei dot com <z.zhanghaijian at huawei dot com> ---
(In reply to Marc Glisse from comment #1)
> Detecting common beginnings / endings in branches is something gcc does very
> seldom. Even at -Os, for if(cond)f(b);else f(c); we need to wait until
> rtl-optimizations to get a single call to f. (of course the reverse
> transformation of duplicating a statement that was after the branches into
> them, if it simplifies, is nice as well, and they can conflict)
> I don't know if handling one such very specific case (binary operations with
> a common argument) separately is a good idea when we don't even handle unary
> operations.
I tried to test this fold on specint2017 and found some performance gains on
500.perlbench_r. Then compared the assemble and found some improvements.
For example:
S_invlist_max, which is inlined by many functions, such as
S__append_range_to_invlist, S_ssc_anything, Perl__invlist_invert ...
invlist_inline.h:
#define FROM_INTERNAL_SIZE(x) ((x)/ sizeof(UV))
S_invlist_max(inlined by S__append_range_to_invlist, S_ssc_anything,
Perl__invlist_invert, ....):
return SvLEN(invlist) == 0 /* This happens under _new_invlist_C_array */
? FROM_INTERNAL_SIZE(SvCUR(invlist)) - 1
: FROM_INTERNAL_SIZE(SvLEN(invlist)) - 1;
Dump tree phiopt:
<bb 3> [local count: 536870911]:
_46 = pretmp_112 >> 3;
iftmp.1123_47 = _46 + 18446744073709551615;
goto <bb 5>; [100.00%]
<bb 4> [local count: 536870911]:
_48 = _44 >> 3;
iftmp.1123_49 = _48 + 18446744073709551615;
<bb 5> [local count: 1073741823]:
# iftmp.1123_50 = PHI <iftmp.1123_47(3), iftmp.1123_49(4)>
Which can replaces with:
<bb 3> [local count: 536870912]:
<bb 4> [local count: 1073741823]:
# _48 = PHI <_44(2), pretmp_112(3)>
_49 = _48 >> 3;
iftmp.1123_50 = _49 + 18446744073709551615;
Assemble:
lsr x5, x6, #3
lsr x3, x3, #3
sub x20, x5, #0x1
sub x3, x3, #0x1
csel x20, x3, x20, ne
Replaces with:
csel x3, x3, x4, ne
lsr x3, x3, #3
sub x20, x3, #0x1
This can eliminate two instruction.
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Bug tree-optimization/94274] fold phi whose incoming args are defined from binary operations
2020-03-23 11:42 [Bug tree-optimization/94274] New: fold phi whose incoming args are defined from binary operations z.zhanghaijian at huawei dot com
` (2 preceding siblings ...)
2020-03-24 13:42 ` z.zhanghaijian at huawei dot com
@ 2020-03-25 9:40 ` z.zhanghaijian at huawei dot com
2020-06-02 7:43 ` z.zhanghaijian at huawei dot com
2021-08-03 23:33 ` pinskia at gcc dot gnu.org
5 siblings, 0 replies; 7+ messages in thread
From: z.zhanghaijian at huawei dot com @ 2020-03-25 9:40 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94274
--- Comment #4 from z.zhanghaijian at huawei dot com <z.zhanghaijian at huawei dot com> ---
(In reply to Richard Biener from comment #2)
> Note that with binary operations you are eventually increasing register
> pressure up to a point where we need to spill so IMHO this should be only
> done if both
> blocks become empty after the transform.
I can try to add some constraints to only do the fold when both
blocks become empty after the transform. Even with this constraints, the
improvements of spec2017 mentioned above can also be done, but do you think
need to add a option to control this constraint to let users choose? Because
the register pressure problem we worry about does not exist in most cases.
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Bug tree-optimization/94274] fold phi whose incoming args are defined from binary operations
2020-03-23 11:42 [Bug tree-optimization/94274] New: fold phi whose incoming args are defined from binary operations z.zhanghaijian at huawei dot com
` (3 preceding siblings ...)
2020-03-25 9:40 ` z.zhanghaijian at huawei dot com
@ 2020-06-02 7:43 ` z.zhanghaijian at huawei dot com
2021-08-03 23:33 ` pinskia at gcc dot gnu.org
5 siblings, 0 replies; 7+ messages in thread
From: z.zhanghaijian at huawei dot com @ 2020-06-02 7:43 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94274
--- Comment #5 from z.zhanghaijian at huawei dot com <z.zhanghaijian at huawei dot com> ---
Created attachment 48659
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48659&action=edit
Fold phi whose incoming args are defined from binary operations
I tried to make a patch to do this optimization(in attachment):
replaces
bb0:
if (cond) goto bb1; else goto bb2;
bb1:
x1 = a + b;
goto <bb3>
bb2:
x2 = a + c;
bb3:
x = PHI <x1 (bb1), x2 (bb2), ...>;
with
bb0:
if (cond) goto bb1; else goto bb2;
bb1:
bb2:
bb3:
x3 = PHI <b (bb1), c (bb2), ...>;
x = a + x3;
This patch will check all the phi nodes in bb3. We do the optimization only if
these phis can all be converted, This can avoid most scenes that the blocks is
not empty after the optimization, but there are still some scenes that the
block is not empty.
For example 1:
int f1(int cond, int a, int b, int c, int d, int e, int f, int x, int y, int z,
int w, int m, int n)
{
if (cond) {
x = e + f;
b = x >> w;
c = m + 12;
a = b + z;
}
else {
d = y >> w;
c = n + 12;
a = d + z;
}
a = a + 18;
return c + a;
}
Tree dump before optimization:
<bb 3> [local count: 536870913]:
x_13 = e_11(D) + f_12(D);
b_14 = x_13 >> w_5(D);
c_16 = m_15(D) + 12;
a_17 = z_9(D) + b_14;
goto <bb 5>; [100.00%]
<bb 4> [local count: 536870913]:
d_6 = y_4(D) >> w_5(D);
c_8 = n_7(D) + 12;
a_10 = d_6 + z_9(D);
<bb 5> [local count: 1073741824]:
# a_1 = PHI <a_17(3), a_10(4)>
# c_2 = PHI <c_16(3), c_8(4)>
a_18 = a_1 + 18;
_19 = c_2 + a_18;
return _19;
Tree dump after optimization:
<bb 3> [local count: 536870913]:
x_13 = e_11(D) + f_12(D);
goto <bb 5>; [100.00%]
<bb 4> [local count: 536870913]:
<bb 5> [local count: 1073741824]:
# _21 = PHI <x_13(3), y_4(D)(4)>
# _23 = PHI <m_15(D)(3), n_7(D)(4)>
c_2 = _23 + 12;
_22 = _21 >> w_5(D);
a_1 = z_9(D) + _22;
a_18 = a_1 + 18;
_19 = c_2 + a_18;
return _19;
Assemble before optimization:
.LFB0:
.cfi_startproc
ldr w1, [sp, 8]
ldr w2, [sp, 16]
cbz w0, .L2
add w5, w5, w6
ldr w0, [sp, 24]
asr w5, w5, w2
add w1, w5, w1
add w1, w1, 18
add w0, w0, 12
add w0, w1, w0
ret
.p2align 2,,3
.L2:
ldr w0, [sp]
asr w2, w0, w2
ldr w0, [sp, 32]
add w1, w2, w1
add w1, w1, 18
add w0, w0, 12
add w0, w1, w0
ret
.cfi_endproc
Assemble after optimization:
.LFB0:
.cfi_startproc
ldr w1, [sp]
ldr w3, [sp, 8]
ldr w4, [sp, 16]
ldr w2, [sp, 32]
cbz w0, .L2
ldr w2, [sp, 24]
add w1, w5, w6
.L2:
asr w0, w1, w4
add w0, w0, w3
add w0, w0, w2
add w0, w0, 30
ret
Because the statement x_13 = e_11 (D) + f_12 (D) in bb3 does not have a phi
node in bb5, so bb3 cannot be emptied. I have not found a good way to solve it.
Any suggestions?
Without considering the register pressure, the performance of example 1 is
profitable. And this patch is effective for 500.perlbench_r in comment 3.
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Bug tree-optimization/94274] fold phi whose incoming args are defined from binary operations
2020-03-23 11:42 [Bug tree-optimization/94274] New: fold phi whose incoming args are defined from binary operations z.zhanghaijian at huawei dot com
` (4 preceding siblings ...)
2020-06-02 7:43 ` z.zhanghaijian at huawei dot com
@ 2021-08-03 23:33 ` pinskia at gcc dot gnu.org
5 siblings, 0 replies; 7+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-08-03 23:33 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94274
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Assignee|unassigned at gcc dot gnu.org |pinskia at gcc dot gnu.org
Severity|normal |enhancement
Status|NEW |ASSIGNED
--- Comment #6 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Mine. There is another bug which I assigned to myself which is very similar.
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2021-08-03 23:33 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-23 11:42 [Bug tree-optimization/94274] New: fold phi whose incoming args are defined from binary operations z.zhanghaijian at huawei dot com
2020-03-23 12:59 ` [Bug tree-optimization/94274] " glisse at gcc dot gnu.org
2020-03-23 14:04 ` rguenth at gcc dot gnu.org
2020-03-24 13:42 ` z.zhanghaijian at huawei dot com
2020-03-25 9:40 ` z.zhanghaijian at huawei dot com
2020-06-02 7:43 ` z.zhanghaijian at huawei dot com
2021-08-03 23:33 ` pinskia at gcc dot gnu.org
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).