[Bug tree-optimization/95760] New: ivopts with loop variables

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug tree-optimization/95760] New: ivopts with loop variables
@ 2020-06-19  6:17 hailey.chiu at sifive dot com
  2020-06-21  6:12 ` [Bug tree-optimization/95760] " wilson at gcc dot gnu.org
  2020-06-22 23:40 ` wilson at gcc dot gnu.org
  0 siblings, 2 replies; 3+ messages in thread
From: hailey.chiu at sifive dot com @ 2020-06-19  6:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95760

            Bug ID: 95760
           Summary: ivopts with loop variables
           Product: gcc
           Version: tree-ssa
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hailey.chiu at sifive dot com
  Target Milestone: ---

C Source:

int **matrix;
int n;
void foo()
{
    static int row;
    static int col;
    static int sum = 0;

    for( row = 0 ; row < n ; row++ )
    {
        for( col = 0 ; col < n ; col++ )
        {
            sum += matrix[row][col];
        }
    }
}

Compiling:
$./RISCV-GCC-10.1/bin/riscv64-unknown-elf-gcc foo.c -Os -S -o foo.s
-march=rv32imac -mabi=ilp32

Asm:

foo:
...skip load/store variables...
.L5:
    slli    a7,a1,2 #row*4 
    add a7,t3,a7    #matrix + (row*4) 
    li  a0,0
.L3:
    bgt a5,a0,.L4
    addi    a1,a1,1 #row++
    mv  a0,a5
    li  a7,1
    j   .L2
.L4:
    lw  t1,0(a7)
    slli    t4,a0,2 #col*4
    addi    a0,a0,1 #col++
    add t1,t1,t4    #*matrix + (col*4)
    lw  t1,0(t1)
    add a6,a6,t1
    li  t1,1
    j   .L3

The calculation of matrix offset is not increasing by 4 after each iteration. I
also check that with RISCV-GCC-8.3, it can be emitted code like "add a7, a7, 4"
after each iteration. GCC-10.1 takes two instructions to do this, while GCC-8.3
takes one. I think it might affect code size / performance slightly. 

I am also wondering if "col" can be optimized to the add by 4 operation,
because gcc-8.3 doesn't optimize it too. 

Thanks.

^ permalink raw reply	[flat|nested] 3+ messages in thread

* [Bug tree-optimization/95760] ivopts with loop variables
  2020-06-19  6:17 [Bug tree-optimization/95760] New: ivopts with loop variables hailey.chiu at sifive dot com
@ 2020-06-21  6:12 ` wilson at gcc dot gnu.org
  2020-06-22 23:40 ` wilson at gcc dot gnu.org
  1 sibling, 0 replies; 3+ messages in thread
From: wilson at gcc dot gnu.org @ 2020-06-21  6:12 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95760

Jim Wilson <wilson at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wilson at gcc dot gnu.org

--- Comment #1 from Jim Wilson <wilson at gcc dot gnu.org> ---
You are compiling with -Os.  I get the expected result if I compile with -O2.

Looking at tree dumps, I see the first difference between -O2 and -Os dumps is
in the ch2 (copy loop header 2) pass, which explicitly disables loop header
copying when -Os is used.  Note the optimize_loop_for_size_p check in
should_duplicate_loop_header_p in tree-ssa-loop-ch.c.  You can see the
difference if you add -ftree-dump-ch2-all.  In the -O2 ch2 dump file, I see

Loop 1 is not do-while loop: latch is not empty.
    Will duplicate bb 7
  Not duplicating bb 3: it is single succ.
Duplicating header of the loop 1 up to edge 7->3, 4 insns.
Loop 1 is do-while loop
Loop 1 is now do-while loop.

and in the -Os ch2 dump file, I see

Loop 1 is not do-while loop: latch is not empty.
  Not duplicating bb 7: optimizing for size.

The difference in loop optimization here then affects the later ivopt pass. 
Normally, duplicating basic blocks will make code bigger.  But in this case the
duplicated blocks enable better loop optimization which results in smaller code
at the end.  This kind of thing is hard to handle with the heuristics.  We
would have to optimize both ways and check to see which one is smaller at the
end to get this right every time, and the compiler doesn't work that way
currently.

I haven't checked older sources to see if/when a heuristic changed.

This isn't risc-v specific.  I see the same issue with x86_64.

^ permalink raw reply	[flat|nested] 3+ messages in thread

* [Bug tree-optimization/95760] ivopts with loop variables
  2020-06-19  6:17 [Bug tree-optimization/95760] New: ivopts with loop variables hailey.chiu at sifive dot com
  2020-06-21  6:12 ` [Bug tree-optimization/95760] " wilson at gcc dot gnu.org
@ 2020-06-22 23:40 ` wilson at gcc dot gnu.org
  1 sibling, 0 replies; 3+ messages in thread
From: wilson at gcc dot gnu.org @ 2020-06-22 23:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95760

--- Comment #2 from Jim Wilson <wilson at gcc dot gnu.org> ---
I took another look, and it turns out that the should_duplicate_loop_header_p
for size/speed is not the only issue.  There is also an issue in
tree-ssa-loop-ivopts.c when computing iv costs.  With speed, the +4 iv is
computed as cheaper than the +1 iv.  With size, the +4 iv and +1 iv have the
exact same cost, and since the +1 iv was looked at first that one was chosen. 
If I hack adjust_setup_cost to use to always use the speed cost calculation,
and retain the should_duplicate_loop_header_p hack, then both the inner and
outer loops get the +4 iv with -Os.

Looking at gcc-8.3, I see that the outer loop has the +4 iv and the inner loop
as the +1 iv.  This looks similar to the result I get with the
adjust_setup_cost hack but not the should_duplicate_loop_header_p hack.  So I
think the regression is solely due to some change in the cost calculation.

There is a lot of code involved in cost calculations.  This could have even
been a riscv backend change.  I would suggest doing a bisect over the gcc git
tree if you want to see exactly where and how the cost calculation changed.

The -Os and -O2 optimization diverges in try_improve_iv_set where it does "if
(acost < best_cost)".  Maybe this could be improved to handle the case where
acost == best_cost, and use some other criteria to choose which one is better,
e.g. maybe a giv is better than a biv if they have the same cost.  I haven't
tried looking into this.

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2020-06-22 23:40 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-19  6:17 [Bug tree-optimization/95760] New: ivopts with loop variables hailey.chiu at sifive dot com
2020-06-21  6:12 ` [Bug tree-optimization/95760] " wilson at gcc dot gnu.org
2020-06-22 23:40 ` wilson at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).