public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug tree-optimization/98673] New: pass fre4 inhibit pass dom3 to create much more optimized code
@ 2021-01-14  9:32 rjiejie at me dot com
  2021-01-14 10:45 ` [Bug tree-optimization/98673] " rguenth at gcc dot gnu.org
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: rjiejie at me dot com @ 2021-01-14  9:32 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98673

            Bug ID: 98673
           Summary: pass fre4 inhibit pass dom3 to create much more
                    optimized code
           Product: gcc
           Version: 10.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rjiejie at me dot com
  Target Milestone: ---

Created attachment 49962
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49962&action=edit
bug test file

a, compiler option:

cc1 -mabi=lp64d -march=rv64gc -O2 -S

b, hot code in function t_run_test:

        j       .L30
.L39:
        mv      a4,a3
.L30:
        ld      a2,8(a5)
        addi    a3,a4,1
        slli    t3,a4,3
        ble     a2,a1,.L28
        ld      t5,0(a5)
        bge     a1,t5,.L50
.L28:
        addi    a5,a5,8
        bne     a3,a0,.L39    : hot code loop to .L39

better code in version 8.4 with same compiler option:
=====================================================

.L30:
        ld      t1,8(a4)
        slli    a7,a5,3
        ble     t1,a3,.L28
        ld      t4,0(a4)
        bge     a3,t4,.L50
.L28:
        addi    a5,a5,1
        addi    a4,a4,8
        bne     a5,t3,.L30    : hot code loop to .L30

v10.2.0 gcc has more one instruction than v8.4.0.

analize gcc pass of source code in v10.2.0:
===========================================

before pass fr4:
----------------

<bb 8> [local count: 82176881]:
  engLoad.11_20 = engLoad;
  loadValue.13_26 = loadValue;
  _410 = (unsigned long) numXEntries.17_218;
  _409 = _410 + 18446744073709551615;
  _408 = (long int) _409;

... ...

<bb 12> [local count: 986782143]:
  i1_174 = i1_6 + 1;
  if (i1_174 != _408)
    goto <bb 9>; [94.50%]
  else
    goto <bb 13>; [5.50%]

  <bb 13> [local count: 54273018]:
  # i1_420 = PHI <i1_174(12)>
  _433 = (long unsigned int) i1_420;
  _434 = _433 + 1;
  _435 = _434 * 8;
  _436 = i1_420 + 1;
  _440 = _435 - 8;
  _442 = engLoad.11_20 + _440;
  goto <bb 15>; [100.00%]

after pass fr4:
---------------

<bb 8> [local count: 82176881]:
engLoad.11_20 = engLoad;
loadValue.13_26 = loadValue;
_410 = (unsigned long) numXEntries.17_218;
_409 = _410 + 18446744073709551615;

... ...

<bb 12> [local count: 986782143]:
  i1_174 = i1_6 + 1;
  if (i1_174 != _213)
    goto <bb 9>; [94.50%]
  else
    goto <bb 13>; [5.50%]

<bb 13> [local count: 54273018]:
_433 = (long unsigned int) i1_174;
_434 = _433 + 1;
_435 = _434 * 8;
_436 = i1_174 + 1;
_440 = _435 - 8;
_442 = engLoad.11_20 + _440;
goto <bb 15>; [100.00%]

pass fr4 remove 'Removing dead stmt _408 = (long int) _409;',
pass dom3 can't optimize this <bb 13> about '_433 = (long unsigned int)
i1_174;'

if <bb 13> use i1_174 node same as <bb 12>, so that conflict will be happened
in pass expand on processing coalesced ssa/phi nodes, and then will split edge.


need help ....:)

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug tree-optimization/98673] pass fre4 inhibit pass dom3 to create much more optimized code
  2021-01-14  9:32 [Bug tree-optimization/98673] New: pass fre4 inhibit pass dom3 to create much more optimized code rjiejie at me dot com
@ 2021-01-14 10:45 ` rguenth at gcc dot gnu.org
  2021-01-15  1:35 ` rjiejie at me dot com
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-01-14 10:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98673

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org
             Target|                            |riscv
           Keywords|                            |missed-optimization

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
The analysis sounds a bit confused.  What is the transform that DOM cannot do
after the transform that FRE does?  There's some older bug about out-of-SSA
coalescing issues with loops and liveness of induction variables but it's
not clear if this is related (the assembly doesn't show the loop exit block).

Can you name the loop in the source that is problematic?

See PR86270 and PR70359

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug tree-optimization/98673] pass fre4 inhibit pass dom3 to create much more optimized code
  2021-01-14  9:32 [Bug tree-optimization/98673] New: pass fre4 inhibit pass dom3 to create much more optimized code rjiejie at me dot com
  2021-01-14 10:45 ` [Bug tree-optimization/98673] " rguenth at gcc dot gnu.org
@ 2021-01-15  1:35 ` rjiejie at me dot com
  2021-01-18  1:31 ` rjiejie at me dot com
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: rjiejie at me dot com @ 2021-01-15  1:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98673

--- Comment #2 from jojo <rjiejie at me dot com> ---
(In reply to Richard Biener from comment #1)
> The analysis sounds a bit confused.  What is the transform that DOM cannot
> do after the transform that FRE does?  There's some older bug about
> out-of-SSA
> coalescing issues with loops and liveness of induction variables but it's
> not clear if this is related (the assembly doesn't show the loop exit block).
> 
> Can you name the loop in the source that is problematic?
> 

see this loop:

      for( i1 = 0 ; i1 < ( numXEntries - 1 ) ; i1++ )
        {
            if( ( loadValue < engLoad[i1+1] ) && ( loadValue >= engLoad[i1] ) )
            {
                break ;
            }
        }

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug tree-optimization/98673] pass fre4 inhibit pass dom3 to create much more optimized code
  2021-01-14  9:32 [Bug tree-optimization/98673] New: pass fre4 inhibit pass dom3 to create much more optimized code rjiejie at me dot com
  2021-01-14 10:45 ` [Bug tree-optimization/98673] " rguenth at gcc dot gnu.org
  2021-01-15  1:35 ` rjiejie at me dot com
@ 2021-01-18  1:31 ` rjiejie at me dot com
  2021-01-18 10:05 ` rguenth at gcc dot gnu.org
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: rjiejie at me dot com @ 2021-01-18  1:31 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98673

--- Comment #3 from jojo <rjiejie at me dot com> ---
(In reply to jojo from comment #2)
> (In reply to Richard Biener from comment #1)
> > The analysis sounds a bit confused.  What is the transform that DOM cannot
> > do after the transform that FRE does?  There's some older bug about
> > out-of-SSA
> > coalescing issues with loops and liveness of induction variables but it's
> > not clear if this is related (the assembly doesn't show the loop exit block).
> > 
> > Can you name the loop in the source that is problematic?
> > 
> 
> see this loop:
> 
>       for( i1 = 0 ; i1 < ( numXEntries - 1 ) ; i1++ )
>         {
>             if( ( loadValue < engLoad[i1+1] ) && ( loadValue >= engLoad[i1]
> ) )
>             {
>                 break ;
>             }
>         }

Richi: Can you please update Known to work?

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug tree-optimization/98673] pass fre4 inhibit pass dom3 to create much more optimized code
  2021-01-14  9:32 [Bug tree-optimization/98673] New: pass fre4 inhibit pass dom3 to create much more optimized code rjiejie at me dot com
                   ` (2 preceding siblings ...)
  2021-01-18  1:31 ` rjiejie at me dot com
@ 2021-01-18 10:05 ` rguenth at gcc dot gnu.org
  2021-02-03  9:58 ` rjiejie at me dot com
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-01-18 10:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98673

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
      Known to work|                            |8.4.0

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
Please try simplifying the testcase, I've tried

long loadValue;
const long *engLoad;
float engLoadDelta1;
void foo()
{
  long i1, numXEntries = 50;
  for( i1 = 0 ; i1 < ( numXEntries - 1 ) ; i1++ )
    {
      if( ( loadValue < engLoad[i1+1] ) && ( loadValue >= engLoad[i1] ) )
        {
          break ;
        }
    }

  if( i1 == ( numXEntries - 1 ) )
    {
      loadValue = engLoad[i1] ;
    }

  engLoadDelta1 = (float)( loadValue - engLoad[i1] ) /
      (float)( engLoad[i1 + 1] - engLoad[i1] ) ;
}

which on x86 doesn't exhibit the issue (same code with GCC 8 and GCC 10).

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug tree-optimization/98673] pass fre4 inhibit pass dom3 to create much more optimized code
  2021-01-14  9:32 [Bug tree-optimization/98673] New: pass fre4 inhibit pass dom3 to create much more optimized code rjiejie at me dot com
                   ` (3 preceding siblings ...)
  2021-01-18 10:05 ` rguenth at gcc dot gnu.org
@ 2021-02-03  9:58 ` rjiejie at me dot com
  2021-02-03 10:19 ` rguenth at gcc dot gnu.org
  2023-12-06 10:39 ` [Bug middle-end/98673] " rguenth at gcc dot gnu.org
  6 siblings, 0 replies; 8+ messages in thread
From: rjiejie at me dot com @ 2021-02-03  9:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98673

--- Comment #5 from jojo <rjiejie at me dot com> ---
Sorry for late :)

Please test with following c case:

long
YTableLookup (long xValue, long xEntries, const long *xAxis,
              const long *yTable )
{
  int i ;
  long xDelta ;
  long outValue ;
  for (i=0; i<(xEntries - 1); i++)
    {
      if ((xValue < xAxis[i + 1]) && (xValue >= xAxis[i]))
        break ;
    }

  if (i == (xEntries - 1))
      xValue = xAxis[i] ;

  xDelta = (long) ((xValue - xAxis[i]) * 1000) / (xAxis[i + 1] - xAxis[i]);
  outValue = (long) ((((1000 - xDelta) * (long) yTable[i]) / 1000) +
             ((xDelta * (long) yTable[i+1]) / 1000)) ;
  return outValue ;
}

risc-v cc1 option: -O2 -march=rv32gc -mabi=ilp32d
=================================================

YTableLookup:
        addi    a1,a1,-1
        ble     a1,zero,.L2
        li      a7,4
        li      a4,0
        sub     a7,a7,a2
        j       .L5
.L6:
        mv      a4,a5
.L5:
        lw      a6,4(a2)
        addi    a5,a4,1
        add     t1,a7,a2
        ble     a6,a0,.L3
        lw      t3,0(a2)
        slli    t4,a4,2
        ble     t3,a0,.L9
.L3:
        addi    a2,a2,4
        bne     a1,a5,.L6
        addi    a4,a4,2
        slli    t1,a4,2
        addi    t4,t1,-4
        add     t4,a3,t4
        li      a1,0
        li      a5,1000
.L4:


or x86 cc1 option: -O2 -march=i386
==================================


YTableLookup:
.LFB0:
        pushl   %ebp
.LCFI0:
        pushl   %edi
.LCFI1:
        pushl   %esi
.LCFI2:
        pushl   %ebx
.LCFI3:
        pushl   %ecx
.LCFI4:
        movl    24(%esp), %esi
        movl    32(%esp), %edi
        movl    28(%esp), %eax
        decl    %eax
        testl   %eax, %eax
        jle     .L2
        movl    $4, %ecx
        xorl    %edx, %edx
        jmp     .L5
        .align 4
.L6:
        movl    %ebx, %edx
.L5:
        movl    (%edi,%ecx), %ebx
        cmpl    %esi, %ebx
        jle     .L3
        leal    -4(%ecx), %ebp
        movl    %ebp, (%esp)
        movl    (%edi,%edx,4), %ebp
        cmpl    %esi, %ebp
        jle     .L10
.L3:
        leal    1(%edx), %ebx
        addl    $4, %ecx
        cmpl    %ebx, %eax
        jne     .L6
        leal    8(,%edx,4), %ecx
        movl    36(%esp), %eax
        leal    -4(%eax,%ecx), %esi
        xorl    %eax, %eax
        movl    $1000, %ebx
.L4:


Please check the redundancy instruction 'mov' at .L6:

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug tree-optimization/98673] pass fre4 inhibit pass dom3 to create much more optimized code
  2021-01-14  9:32 [Bug tree-optimization/98673] New: pass fre4 inhibit pass dom3 to create much more optimized code rjiejie at me dot com
                   ` (4 preceding siblings ...)
  2021-02-03  9:58 ` rjiejie at me dot com
@ 2021-02-03 10:19 ` rguenth at gcc dot gnu.org
  2023-12-06 10:39 ` [Bug middle-end/98673] " rguenth at gcc dot gnu.org
  6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-02-03 10:19 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98673

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
      Known to fail|                            |10.2.1
           See Also|                            |https://gcc.gnu.org/bugzill
                   |                            |a/show_bug.cgi?id=70359
      Known to work|                            |11.0

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
So with your testcase on trunk I see for RISCV

        ble     a1,zero,.L2
        li      a6,4
        li      a5,0
        sub     a6,a6,a2
.L5:
        lw      a4,4(a2)
        slli    a7,a5,2
        add     t1,a6,a2
        addi    a5,a5,1
        ble     a4,a0,.L3
        lw      t3,0(a2)
        ble     t3,a0,.L9
.L3:
        addi    a2,a2,4
        bne     a1,a5,.L5

which is fine, same for x86.  This is usually a SSA coalescing issue where
a failed coalesce ends up splitting the backedge and emitting a move there.

I can see the issue on the branch where the problematic one is

;;   basic block 4, loop depth 1
;;    pred:       3
;;                7
  # i_57 = PHI <0(3), i_41(7)>
...

;;   basic block 7, loop depth 1
;;    pred:       4
;;                5
  i_41 = i_57 + 1;
  ivtmp.14_90 = ivtmp.14_91 + 4;
  if (_6 != i_41)
    goto <bb 4>; [94.50%]
  else
    goto <bb 8>; [5.50%]
;;    succ:       4
;;                8

;;   basic block 8, loop depth 0
;;    pred:       7
  _87 = (sizetype) i_57;
  _146 = _87 + 2;

which is a use of the pre-increment i_57 on the loop exit edge.  This
inhibits coalescing of i_57 and i_41 causing the copy.

That's exactly the issue noted in the cited PRs.  There have been patches
floating around re-materializing i_41 + 1 at the point of i_57 to make
the coalescing possible but I think nobody developed them in full.
See the thread starting at
https://gcc.gnu.org/pipermail/gcc-patches/2018-March/495843.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug middle-end/98673] pass fre4 inhibit pass dom3 to create much more optimized code
  2021-01-14  9:32 [Bug tree-optimization/98673] New: pass fre4 inhibit pass dom3 to create much more optimized code rjiejie at me dot com
                   ` (5 preceding siblings ...)
  2021-02-03 10:19 ` rguenth at gcc dot gnu.org
@ 2023-12-06 10:39 ` rguenth at gcc dot gnu.org
  6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-12-06 10:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98673

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |WORKSFORME

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
No updates, verified the code is still the same, assuming not an issue.

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2023-12-06 10:39 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-14  9:32 [Bug tree-optimization/98673] New: pass fre4 inhibit pass dom3 to create much more optimized code rjiejie at me dot com
2021-01-14 10:45 ` [Bug tree-optimization/98673] " rguenth at gcc dot gnu.org
2021-01-15  1:35 ` rjiejie at me dot com
2021-01-18  1:31 ` rjiejie at me dot com
2021-01-18 10:05 ` rguenth at gcc dot gnu.org
2021-02-03  9:58 ` rjiejie at me dot com
2021-02-03 10:19 ` rguenth at gcc dot gnu.org
2023-12-06 10:39 ` [Bug middle-end/98673] " rguenth at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).