public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/94274] New: fold phi whose incoming args are defined from binary operations
@ 2020-03-23 11:42 z.zhanghaijian at huawei dot com
  2020-03-23 12:59 ` [Bug tree-optimization/94274] " glisse at gcc dot gnu.org
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: z.zhanghaijian at huawei dot com @ 2020-03-23 11:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94274

            Bug ID: 94274
           Summary: fold phi whose incoming args are defined from binary
                    operations
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: z.zhanghaijian at huawei dot com
  Target Milestone: ---

For an if/else structure:

Example 1:

int test(int cond, int a, int b, int c)
{
  int result = 0;

  if (cond)
    result = a + b;
  else
    result = a + c;
  return result;
}

Both expressions are binary operations with the same opcode and a common
operand "a".

For example, on aarch64, gcc currently performs both additions first and
then selects the result with csel:

cmp     w0, 0
add     w0, w1, w2
add     w1, w1, w3
csel    w0, w1, w0, eq

This could instead be optimized to do the csel first and then a single
binary operation:

cmp     w0, 0
csel    w2, w2, w3, ne
add     w0, w2, w1

This eliminates one instruction. The scenario is very common, and the same
applies to the switch/case structure.
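
At the source level, the proposed transform corresponds to the following
hand-written variant (a sketch for illustration only; test_folded is a
hypothetical name, not part of the original test case):

/* Select between the differing operands first, then add the common
   operand "a" once.  */
int test_folded (int cond, int a, int b, int c)
{
  return a + (cond ? b : c);
}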

Example 2:

int test(int cond, int a, int b, int c, int d)
{
  int result = 0;

  switch (cond) {
    case 1:
      result = a + b;
      break;
    case 8:
      result = a + c;
      break;
    default:
      result = a + d;
      break;
  }
  return result;
}

gcc again performs the binary operations first and then does the csel:

        mov     w5, w0
        add     w0, w1, w2
        cmp     w5, 1
        beq     .L1
        add     w4, w1, w4
        cmp     w5, 8
        add     w1, w1, w3
        csel    w0, w1, w4, eq
.L1:
        ret

This can be further optimized into:

        cmp     w0, 1
        beq     .L3
        cmp     w0, 8
        csel    w4, w4, w3, ne
        add     w0, w1, w4
        ret
.L3:
        mov     w4, w2
        add     w0, w1, w4
        ret
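
The switch/case version likewise corresponds to the following hand-written
source-level variant (again a sketch; test_folded2 is a hypothetical name):

/* Select the non-common operand according to cond, then add "a" once.  */
int test_folded2 (int cond, int a, int b, int c, int d)
{
  int t = d;
  if (cond == 1)
    t = b;
  else if (cond == 8)
    t = c;
  return a + t;
}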

My proposal: fold the merging phi node in tree_ssa_phiopt_worker (ssa-phiopt):

For example 1:

Replace

bb0:
  if (cond) goto bb1; else goto bb2;
bb1:
  x1 = a + b;
  goto <bb3>
bb2:
  x2 = a + c;
bb3:
  x = PHI <x1 (bb1), x2 (bb2), ...>;

with

bb0:
  if (cond) goto bb1; else goto bb2;
bb1:
bb2:
bb3:
  x3 = PHI <b (bb1), c (bb2), ...>;
  x = a + x3;


For example 2:

Replace

bb0:
  if (cond == 1) goto bb2; else goto bb1;
bb1:
  if (cond == 8) goto bb3; else goto bb4;
bb2:
  x2 = a + b;
  goto <bb5>
bb3:
  x3 = a + c;
  goto <bb5>
bb4:
  x4 = a + d;
bb5:
  x5 = PHI <x2 (bb2), x3 (bb3), x4 (bb4), ...>;

with

bb0:
  if (cond == 1) goto bb2; else goto bb1;
bb1:
  if (cond == 8) goto bb3; else goto bb4;
bb2:
bb3:
bb4:
bb5:
  x6 = PHI <b (bb2), c (bb3), d (bb4), ...>;
  x5 = a + x6;

I have an initial implementation that is under testing. It is based in part
on LLVM's InstCombine pass (InstCombinePHI.cpp).
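
Roughly, the structural check involved amounts to something like the
following sketch (a hypothetical helper, not the actual implementation; it
assumes the usual GIMPLE accessors from the GCC internals and only matches
operands in the same position):

/* Return true if the two defining statements of a two-argument PHI are
   binary operations with the same tree code and a common operand, i.e.
   the shape proposed above.  */
static bool
phi_args_foldable_binop_p (gimple *def1, gimple *def2)
{
  if (!is_gimple_assign (def1) || !is_gimple_assign (def2))
    return false;

  enum tree_code code = gimple_assign_rhs_code (def1);
  if (code != gimple_assign_rhs_code (def2)
      || TREE_CODE_CLASS (code) != tcc_binary)
    return false;

  /* Require a shared operand in the same position; a full implementation
     would also check the swapped operands for commutative codes.  */
  return operand_equal_p (gimple_assign_rhs1 (def1),
                          gimple_assign_rhs1 (def2), 0)
         || operand_equal_p (gimple_assign_rhs2 (def1),
                             gimple_assign_rhs2 (def2), 0);
}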

Any suggestions?


* [Bug tree-optimization/94274] fold phi whose incoming args are defined from binary operations
  2020-03-23 11:42 [Bug tree-optimization/94274] New: fold phi whose incoming args are defined from binary operations z.zhanghaijian at huawei dot com
@ 2020-03-23 12:59 ` glisse at gcc dot gnu.org
  2020-03-23 14:04 ` rguenth at gcc dot gnu.org
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: glisse at gcc dot gnu.org @ 2020-03-23 12:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94274

--- Comment #1 from Marc Glisse <glisse at gcc dot gnu.org> ---
Detecting common beginnings / endings in branches is something gcc does very
seldom. Even at -Os, for if(cond)f(b);else f(c); we need to wait until
rtl-optimizations to get a single call to f. (of course the reverse
transformation of duplicating a statement that was after the branches into
them, if it simplifies, is nice as well, and they can conflict)
I don't know if handling one such very specific case (binary operations with a
common argument) separately is a good idea when we don't even handle unary
operations.


* [Bug tree-optimization/94274] fold phi whose incoming args are defined from binary operations
  2020-03-23 11:42 [Bug tree-optimization/94274] New: fold phi whose incoming args are defined from binary operations z.zhanghaijian at huawei dot com
  2020-03-23 12:59 ` [Bug tree-optimization/94274] " glisse at gcc dot gnu.org
@ 2020-03-23 14:04 ` rguenth at gcc dot gnu.org
  2020-03-24 13:42 ` z.zhanghaijian at huawei dot com
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-03-23 14:04 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94274

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
           Keywords|                            |missed-optimization
   Last reconfirmed|                            |2020-03-23

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note that with binary operations you are eventually increasing register
pressure up to a point where we need to spill so IMHO this should be only done
if both
blocks become empty after the transform.


* [Bug tree-optimization/94274] fold phi whose incoming args are defined from binary operations
  2020-03-23 11:42 [Bug tree-optimization/94274] New: fold phi whose incoming args are defined from binary operations z.zhanghaijian at huawei dot com
  2020-03-23 12:59 ` [Bug tree-optimization/94274] " glisse at gcc dot gnu.org
  2020-03-23 14:04 ` rguenth at gcc dot gnu.org
@ 2020-03-24 13:42 ` z.zhanghaijian at huawei dot com
  2020-03-25  9:40 ` z.zhanghaijian at huawei dot com
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: z.zhanghaijian at huawei dot com @ 2020-03-24 13:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94274

--- Comment #3 from z.zhanghaijian at huawei dot com <z.zhanghaijian at huawei dot com> ---
(In reply to Marc Glisse from comment #1)
> Detecting common beginnings / endings in branches is something gcc does very
> seldom. Even at -Os, for if(cond)f(b);else f(c); we need to wait until
> rtl-optimizations to get a single call to f. (of course the reverse
> transformation of duplicating a statement that was after the branches into
> them, if it simplifies, is nice as well, and they can conflict)
> I don't know if handling one such very specific case (binary operations with
> a common argument) separately is a good idea when we don't even handle unary
> operations.

I tested this fold on SPECint2017 and found some performance gains on
500.perlbench_r. I then compared the assembly and found some improvements.

For example:

S_invlist_max is inlined into many functions, such as
S__append_range_to_invlist, S_ssc_anything, Perl__invlist_invert, ...

invlist_inline.h:
#define FROM_INTERNAL_SIZE(x) ((x)/ sizeof(UV))

S_invlist_max(inlined by S__append_range_to_invlist, S_ssc_anything,
Perl__invlist_invert, ....):
    return SvLEN(invlist) == 0  /* This happens under _new_invlist_C_array */
           ? FROM_INTERNAL_SIZE(SvCUR(invlist)) - 1
           : FROM_INTERNAL_SIZE(SvLEN(invlist)) - 1;

Tree dump at the phiopt pass:

  <bb 3> [local count: 536870911]:
  _46 = pretmp_112 >> 3;
  iftmp.1123_47 = _46 + 18446744073709551615;
  goto <bb 5>; [100.00%]

  <bb 4> [local count: 536870911]:
  _48 = _44 >> 3;
  iftmp.1123_49 = _48 + 18446744073709551615;

  <bb 5> [local count: 1073741823]:
  # iftmp.1123_50 = PHI <iftmp.1123_47(3), iftmp.1123_49(4)>

This can be replaced with:

  <bb 3> [local count: 536870912]:

  <bb 4> [local count: 1073741823]:
  # _48 = PHI <_44(2), pretmp_112(3)>
  _49 = _48 >> 3;
  iftmp.1123_50 = _49 + 18446744073709551615;

Assembly:

lsr     x5, x6, #3
lsr     x3, x3, #3
sub     x20, x5, #0x1
sub     x3, x3, #0x1
csel    x20, x3, x20, ne

This is replaced with:

csel    x3, x3, x4, ne
lsr     x3, x3, #3
sub     x20, x3, #0x1

This eliminates two instructions.
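
A reduced stand-alone version of the pattern (a hypothetical reduction for
illustration; the function and parameter names are not from perlbench, and
UV is approximated by unsigned long):

/* Both arms divide by sizeof (unsigned long) (a shift by 3 on aarch64)
   and subtract 1; only the divided operand differs, so the PHI could be
   folded before the shift as proposed.  */
unsigned long
invlist_max_reduced (unsigned long cur, unsigned long len)
{
  return len == 0 ? cur / sizeof (unsigned long) - 1
                  : len / sizeof (unsigned long) - 1;
}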


* [Bug tree-optimization/94274] fold phi whose incoming args are defined from binary operations
  2020-03-23 11:42 [Bug tree-optimization/94274] New: fold phi whose incoming args are defined from binary operations z.zhanghaijian at huawei dot com
                   ` (2 preceding siblings ...)
  2020-03-24 13:42 ` z.zhanghaijian at huawei dot com
@ 2020-03-25  9:40 ` z.zhanghaijian at huawei dot com
  2020-06-02  7:43 ` z.zhanghaijian at huawei dot com
  2021-08-03 23:33 ` pinskia at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: z.zhanghaijian at huawei dot com @ 2020-03-25  9:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94274

--- Comment #4 from z.zhanghaijian at huawei dot com <z.zhanghaijian at huawei dot com> ---
(In reply to Richard Biener from comment #2)
> Note that with binary operations you are eventually increasing register
> pressure up to a point where we need to spill so IMHO this should be only
> done if both
> blocks become empty after the transform.

I can add a constraint so that the fold is only done when both blocks become
empty after the transform. Even with this constraint, the improvement on
SPEC 2017 mentioned above is still obtained. Do you think we need to add an
option to control this constraint so users can choose? The register pressure
problem we are worried about does not exist in most cases.


* [Bug tree-optimization/94274] fold phi whose incoming args are defined from binary operations
  2020-03-23 11:42 [Bug tree-optimization/94274] New: fold phi whose incoming args are defined from binary operations z.zhanghaijian at huawei dot com
                   ` (3 preceding siblings ...)
  2020-03-25  9:40 ` z.zhanghaijian at huawei dot com
@ 2020-06-02  7:43 ` z.zhanghaijian at huawei dot com
  2021-08-03 23:33 ` pinskia at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: z.zhanghaijian at huawei dot com @ 2020-06-02  7:43 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94274

--- Comment #5 from z.zhanghaijian at huawei dot com <z.zhanghaijian at huawei dot com> ---
Created attachment 48659
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48659&action=edit
Fold phi whose incoming args are defined from binary operations

I have made a patch (in the attachment) that does this optimization:

It replaces

bb0:
  if (cond) goto bb1; else goto bb2;
bb1:
  x1 = a + b;
  goto <bb3>
bb2:
  x2 = a + c;
bb3:
  x = PHI <x1 (bb1), x2 (bb2), ...>;

with

bb0:
  if (cond) goto bb1; else goto bb2;
bb1:
bb2:
bb3:
  x3 = PHI <b (bb1), c (bb2), ...>;
  x = a + x3;

The patch checks all the phi nodes in bb3 and performs the optimization only
if all of them can be converted. This avoids most cases where the blocks are
not empty after the optimization, but there are still some cases where a
block is not empty.

For example 1:

int f1(int cond, int a, int b, int c, int d, int e, int f, int x, int y, int z,
int w, int m, int n)
{

  if (cond) {
    x = e + f;
    b = x >> w;
    c = m + 12;
    a = b + z;
  }
  else {
    d = y >> w;
    c = n + 12;
    a = d + z;
  }
  a = a + 18;
  return c + a;

}

Tree dump before optimization:

  <bb 3> [local count: 536870913]:
  x_13 = e_11(D) + f_12(D);
  b_14 = x_13 >> w_5(D);
  c_16 = m_15(D) + 12;
  a_17 = z_9(D) + b_14;
  goto <bb 5>; [100.00%]

  <bb 4> [local count: 536870913]:
  d_6 = y_4(D) >> w_5(D);
  c_8 = n_7(D) + 12;
  a_10 = d_6 + z_9(D);

  <bb 5> [local count: 1073741824]:
  # a_1 = PHI <a_17(3), a_10(4)>
  # c_2 = PHI <c_16(3), c_8(4)>
  a_18 = a_1 + 18;
  _19 = c_2 + a_18;
  return _19;

Tree dump after optimization:

  <bb 3> [local count: 536870913]:
  x_13 = e_11(D) + f_12(D);
  goto <bb 5>; [100.00%]

  <bb 4> [local count: 536870913]:

  <bb 5> [local count: 1073741824]:
  # _21 = PHI <x_13(3), y_4(D)(4)>
  # _23 = PHI <m_15(D)(3), n_7(D)(4)>
  c_2 = _23 + 12;
  _22 = _21 >> w_5(D);
  a_1 = z_9(D) + _22;
  a_18 = a_1 + 18;
  _19 = c_2 + a_18;
  return _19;

Assembly before optimization:

.LFB0:
        .cfi_startproc
        ldr     w1, [sp, 8]
        ldr     w2, [sp, 16]
        cbz     w0, .L2
        add     w5, w5, w6
        ldr     w0, [sp, 24]
        asr     w5, w5, w2
        add     w1, w5, w1
        add     w1, w1, 18
        add     w0, w0, 12
        add     w0, w1, w0
        ret
        .p2align 2,,3
.L2:
        ldr     w0, [sp]
        asr     w2, w0, w2
        ldr     w0, [sp, 32]
        add     w1, w2, w1
        add     w1, w1, 18
        add     w0, w0, 12
        add     w0, w1, w0
        ret
        .cfi_endproc

Assembly after optimization:

.LFB0:
        .cfi_startproc
        ldr     w1, [sp]
        ldr     w3, [sp, 8]
        ldr     w4, [sp, 16]
        ldr     w2, [sp, 32]
        cbz     w0, .L2
        ldr     w2, [sp, 24]
        add     w1, w5, w6
.L2:
        asr     w0, w1, w4
        add     w0, w0, w3
        add     w0, w0, w2
        add     w0, w0, 30
        ret
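
For reference, the folded form corresponds to the following hand-written
source (a hypothetical rendering; f1_folded is not part of the test case).
It also makes the limitation below visible: e + f still has to be computed
in the taken branch.

int f1_folded (int cond, int a, int b, int c, int d, int e, int f, int x,
               int y, int z, int w, int m, int n)
{
  /* Select the differing operands first, then do each operation once.  */
  int sel = y, k = n;
  if (cond)
    {
      sel = e + f;
      k = m;
    }
  return (k + 12) + (((sel >> w) + z) + 18);
}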

Because the statement x_13 = e_11(D) + f_12(D) in bb3 does not feed a phi
node in bb5, bb3 cannot be emptied. I have not found a good way to solve
this. Any suggestions?

Without considering register pressure, the transform is profitable on
example 1. The patch is also effective for the 500.perlbench_r case in
comment 3.


* [Bug tree-optimization/94274] fold phi whose incoming args are defined from binary operations
  2020-03-23 11:42 [Bug tree-optimization/94274] New: fold phi whose incoming args are defined from binary operations z.zhanghaijian at huawei dot com
                   ` (4 preceding siblings ...)
  2020-06-02  7:43 ` z.zhanghaijian at huawei dot com
@ 2021-08-03 23:33 ` pinskia at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-08-03 23:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94274

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org      |pinskia at gcc dot gnu.org
           Severity|normal                      |enhancement
             Status|NEW                         |ASSIGNED

--- Comment #6 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Mine. There is another bug which I assigned to myself which is very similar.
