public inbox for gcc-bugs@sourceware.org
* [Bug target/62173] New: [AArch64] Performance regression due to r213488
@ 2014-08-18 16:12 spop at gcc dot gnu.org
  2014-08-18 16:39 ` [Bug target/62173] " pinskia at gcc dot gnu.org
                   ` (38 more replies)
  0 siblings, 39 replies; 40+ messages in thread
From: spop at gcc dot gnu.org @ 2014-08-18 16:12 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

            Bug ID: 62173
           Summary: [AArch64] Performance regression due to r213488
           Product: gcc
           Version: 5.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: spop at gcc dot gnu.org

void bar(int i) {
  char A[10];
  int d = 0;
  while (i > 0)
    A[d++] = i--;

  while (d > 0)
    foo(A[d--]);
}

Compile this function at -O3 with the last good revision r213487 and with the
first bad revision r213488 "[AArch64] Improve TARGET_LEGITIMIZE_ADDRESS_P
hook", and diff the generated assembly:

--- good.s    2014-08-18 10:56:42.542867000 -0500
+++ bad.s    2014-08-18 10:56:42.504090000 -0500
@@ -8,34 +8,33 @@
     stp    x29, x30, [sp, -48]!
     cmp    w0, wzr
     add    x29, sp, 0
-    stp    x19, x20, [sp, 16]
+    str    x19, [sp, 16]
     ble    .L1
     strb    w0, [x29, 32]
     subs    w19, w0, #1
-    add    x20, x29, 32
     beq    .L4
     strb    w19, [x29, 33]
     subs    w1, w0, #2
     beq    .L4
-    strb    w1, [x20, 2]
+    strb    w1, [x29, 34]
     subs    w1, w0, #3
     beq    .L4
-    strb    w1, [x20, 3]
+    strb    w1, [x29, 35]
     subs    w1, w0, #4
     beq    .L4
-    strb    w1, [x20, 4]
+    strb    w1, [x29, 36]
     subs    w1, w0, #5
     beq    .L4
-    strb    w1, [x20, 5]
+    strb    w1, [x29, 37]
     subs    w1, w0, #6
     beq    .L4
-    strb    w1, [x20, 6]
+    strb    w1, [x29, 38]
     subs    w1, w0, #7
     beq    .L4
-    strb    w1, [x20, 7]
+    strb    w1, [x29, 39]
     subs    w1, w0, #8
     beq    .L4
-    strb    w1, [x20, 8]
+    strb    w1, [x29, 40]
     subs    w1, w0, #9
     beq    .L4
     strb    w1, [x29, 41]
@@ -43,14 +42,16 @@
 .L35:
     sub    w19, w19, #1
 .L4:
-    ldrb    w0, [x20, w0, sxtw]
+    add    x1, x29, 48
+    add    x0, x1, x0, sxtw
+    ldrb    w0, [x0, -16]
     bl    foo
     mov    w0, w19
     cbnz    w19, .L35
 .L1:
-    ldp    x19, x20, [sp, 16]
+    ldr    x19, [sp, 16]
     ldp    x29, x30, [sp], 48
     ret
     .size    bar, .-bar


The problem is that gcc now generates an addressing mode that requires two more
add instructions in the second loop:

-    ldrb    w0, [x20, w0, sxtw]
+    add    x1, x29, 48
+    add    x0, x1, x0, sxtw
+    ldrb    w0, [x0, -16]


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug target/62173] [AArch64] Performance regression due to r213488
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
@ 2014-08-18 16:39 ` pinskia at gcc dot gnu.org
  2014-08-18 19:13 ` spop at gcc dot gnu.org
                   ` (37 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: pinskia at gcc dot gnu.org @ 2014-08-18 16:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2014-08-18
     Ever confirmed|0                           |1

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Confirmed.  Here is a much shorter testcase, not related to loops either:
void bar(int i)
{
  char A[10];
  g(A);
  f(A[i]);
}
--- CUT ---
Before:
bar:
    stp    x29, x30, [sp, -64]!
    add    x29, sp, 0
    stp    x19, x20, [sp, 16]
    add    x19, x29, 48
    mov    w20, w0
    mov    x0, x19
    bl    g
    ldrb    w0, [x19, w20, sxtw]
    bl    f
    ldp    x19, x20, [sp, 16]
    ldp    x29, x30, [sp], 64
    ret

After:
bar:
    stp    x29, x30, [sp, -48]!
    add    x29, sp, 0
    str    x19, [sp, 16]
    mov    w19, w0
    add    x0, x29, 32
    bl    g
    add    x0, x29, 48
    add    x19, x0, x19, sxtw
    ldrb    w0, [x19, -16]
    bl    f
    ldr    x19, [sp, 16]
    ldp    x29, x30, [sp], 48
    ret


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug target/62173] [AArch64] Performance regression due to r213488
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
  2014-08-18 16:39 ` [Bug target/62173] " pinskia at gcc dot gnu.org
@ 2014-08-18 19:13 ` spop at gcc dot gnu.org
  2014-08-19  1:37 ` amker at gcc dot gnu.org
                   ` (36 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: spop at gcc dot gnu.org @ 2014-08-18 19:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #2 from Sebastian Pop <spop at gcc dot gnu.org> ---
I have seen a 15% perf regression on a version of bzip2 compress compiled at
-O3.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug target/62173] [AArch64] Performance regression due to r213488
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
  2014-08-18 16:39 ` [Bug target/62173] " pinskia at gcc dot gnu.org
  2014-08-18 19:13 ` spop at gcc dot gnu.org
@ 2014-08-19  1:37 ` amker at gcc dot gnu.org
  2014-10-28 11:28 ` [Bug target/62173] [5.0 regression] " jiwang at gcc dot gnu.org
                   ` (35 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: amker at gcc dot gnu.org @ 2014-08-19  1:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

amker at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amker at gcc dot gnu.org

--- Comment #3 from amker at gcc dot gnu.org ---
I have seen a potential improvement for bzip/gzip on 32-bit ARM.  It relates
to the addressing mode, which affects loop invariant hoisting in the kernel
loops of these two benchmarks.  I once had a patch but didn't follow up on it.
I think this is worth a methodical investigation rather than case-by-case
changes.

Thanks,
bin


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug target/62173] [5.0 regression] [AArch64] Performance regression due to r213488
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2014-08-19  1:37 ` amker at gcc dot gnu.org
@ 2014-10-28 11:28 ` jiwang at gcc dot gnu.org
  2014-11-14  9:37 ` jiwang at gcc dot gnu.org
                   ` (34 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: jiwang at gcc dot gnu.org @ 2014-10-28 11:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #4 from Jiong Wang <jiwang at gcc dot gnu.org> ---
Interesting.

r213488 was doing something right, and I guess it exposed some other hidden
issues which need our investigation.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug target/62173] [5.0 regression] [AArch64] Performance regression due to r213488
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2014-10-28 11:28 ` [Bug target/62173] [5.0 regression] " jiwang at gcc dot gnu.org
@ 2014-11-14  9:37 ` jiwang at gcc dot gnu.org
  2014-11-17  2:14 ` amker.cheng at gmail dot com
                   ` (33 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: jiwang at gcc dot gnu.org @ 2014-11-14  9:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #5 from Jiong Wang <jiwang at gcc dot gnu.org> ---
The root cause is that AArch64's TARGET_LEGITIMIZE_ADDRESS does not do well
when the input is ((reg + reg) + offset).

It looks like we currently only handle reg + non_normal_offset, trying to
split non_normal_offset into base + offset.

For ((reg1 + reg2) + offset), if reg + offset is valid, then we should
legitimize it to

reg3 = reg1 + offset
return reg3 + reg2

Will send out the patch after it passes tests and benchmarking.
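
A minimal sketch of how such a re-association could look in a backend
legitimize-address hook, written against GCC's internal RTL helpers (XEXP,
CONST_INT_P, force_reg, gen_rtx_PLUS); this is only an illustration of the
idea above, not the patch that was actually posted, and a real version would
also have to verify that reg1/reg2 are registers and that the resulting
address is valid for the mode:

/* Re-associate ((reg1 + reg2) + const) into (reg1 + const) + reg2 so the
   constant is folded into the base register early and the [base, index]
   addressing mode can be used for the memory access.  */
static rtx
reassociate_plus_const (rtx x)
{
  if (GET_CODE (x) == PLUS
      && CONST_INT_P (XEXP (x, 1))
      && GET_CODE (XEXP (x, 0)) == PLUS)
    {
      rtx reg1 = XEXP (XEXP (x, 0), 0);
      rtx reg2 = XEXP (XEXP (x, 0), 1);
      rtx off = XEXP (x, 1);

      /* reg3 = reg1 + offset  */
      rtx reg3 = force_reg (Pmode, gen_rtx_PLUS (Pmode, reg1, off));

      /* return reg3 + reg2  */
      return gen_rtx_PLUS (Pmode, reg3, reg2);
    }
  return x;
}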


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug target/62173] [5.0 regression] [AArch64] Performance regression due to r213488
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2014-11-14  9:37 ` jiwang at gcc dot gnu.org
@ 2014-11-17  2:14 ` amker.cheng at gmail dot com
  2014-11-24 12:15 ` jiwang at gcc dot gnu.org
                   ` (32 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: amker.cheng at gmail dot com @ 2014-11-17  2:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

bin.cheng <amker.cheng at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amker.cheng at gmail dot com

--- Comment #6 from bin.cheng <amker.cheng at gmail dot com> ---
Hmm, is the offset valid for the [reg+offset] addressing mode?  If it is, why
don't we transform "reg+reg+offset" into "regX <- reg + reg; [regX + offset];"?


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug target/62173] [5.0 regression] [AArch64] Performance regression due to r213488
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2014-11-17  2:14 ` amker.cheng at gmail dot com
@ 2014-11-24 12:15 ` jiwang at gcc dot gnu.org
  2014-11-24 12:38 ` jiwang at gcc dot gnu.org
                   ` (31 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: jiwang at gcc dot gnu.org @ 2014-11-24 12:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #7 from Jiong Wang <jiwang at gcc dot gnu.org> ---
(In reply to bin.cheng from comment #6)
> Em, is offset valid for [reg+offset] addressing mode? if it is, why don't we
> transform "reg+reg+offset" into "regX <- reg + reg; [regX + offset];"?

That's because, for a local char array A, if we want to address its element,
like A[I],

first we get the base address of array A, which is

  (plus virtual_stack_vars_rtx, offset),

then we add the index I, which is in register B:

  (plus (plus virtual_stack_vars_rtx, offset), B)

From my experiments, the above will be canonicalized into:

(plus (plus virtual_stack_vars_rtx, B), offset)


For any target that defines FRAME_GROWS_DOWNWARD to 1, virtual_stack_vars_rtx
will be eliminated into (plus frame_pointer, offset1), instead of (plus
frame_pointer, const_0), which only happens when FRAME_GROWS_DOWNWARD is 0.

So transforming "reg+reg+offset" into "regX <- reg + reg; [regX + offset];"
causes trouble for GCC's RTL optimizations, because it is finally split into:

regA <- frame - offset0

regA <- regA + regB

regA <- regA + offset1

and the later RTL optimizations can't fold offset0 and offset1, because
virtual_stack_vars_rtx elimination happens at a quite late stage, in LRA.

So if we find "virtual_stack_vars_rtx + reg + offset", it's better to
associate the constant offset with virtual_stack_vars_rtx, which means
transforming it into

regA <- virtual_stack_vars_rtx + offset
regA <- regA + regB

so that the elimination offset is merged into the array offset automatically
in LRA.
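
As a worked instance with the numbers from the assembly diff in the original
report (there the eliminated frame base ends up as x29 + 48 and A's offset
from it is -16, so the folded form is x29 + 32; the RTL spelling below is
simplified):

  current canonical form            after LRA eliminates vsv -> x29 + 48
    rT <- vsv + rI                    rT <- (x29 + 48) + rI
    rD <- MEM (rT + -16)              rD <- MEM (rT + -16)   ; 48 and -16 never folded

  proposed association              after LRA eliminates vsv -> x29 + 48
    rA <- vsv + -16                   rA <- x29 + 32         ; folded in one insn
    rD <- MEM (rA + rI)               rD <- MEM (rA + rI)    ; matches [base, index]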

I verified that if we add such a transform to AArch64's LEGITIMIZE_ADDRESS
hook, we do generate optimized code for Pinski's sample code:

bar:
        stp     x29, x30, [sp, -48]!
        add     x29, sp, 0
        stp     x19, x20, [sp, 16]
        add     x19, x29, 32
        mov     w20, w0
        mov     x0, x19
        bl      g
        ldrb    w0, [x19, w20, sxtw]
        bl      f
        ldp     x19, x20, [sp, 16]
        ldp     x29, x30, [sp], 48 
        ret


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug target/62173] [5.0 regression] [AArch64] Performance regression due to r213488
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2014-11-24 12:15 ` jiwang at gcc dot gnu.org
@ 2014-11-24 12:38 ` jiwang at gcc dot gnu.org
  2014-11-24 13:06 ` rguenth at gcc dot gnu.org
                   ` (30 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: jiwang at gcc dot gnu.org @ 2014-11-24 12:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #9 from Jiong Wang <jiwang at gcc dot gnu.org> ---
To summarize, given the following testcases:

case A.c
===
void bar(int i)
{
  char A[10];
  g(A);
  f(A[i]);
}

case B.c
===
void bar(int i)              
{                        
  char A[10];
  char B[10];     
  char C[10];       
  g(A);                        
  g(B);
  g(C);                              
  f(A[i]);                 
  f(B[i]);                        
  f(C[i]);                   
  return;               
} 

The current code base:

  * generates sub-optimal code for case A.
  * generates optimal code for case B, because the frame address is
    rematerialized.

I verified that *arm/mips also generate the same sub-optimal code layout for
case A*, and I believe the same holds for Sebastian's testcase.

r213488 brought AArch64 onto the correct road; we then run into a common issue
that exists on other targets as well.

For any target with FRAME_GROWS_DOWNWARD set to 1, the same sub-optimal code
layout will be generated, because the base address of the first stack variable
is eliminated into frame + some negative value in a late stage of LRA, which
means it can't be folded with other constants.

And after my experimental hack on LEGITIMIZE_ADDRESS to associate
virtual_stack_vars_rtx with the constant offset:

  * we generate optimal code for case A.
  * we generate sub-optimal code for case B, because the frame address is *not
    rematerialized*.

I will do further investigation on this, especially the frame address
rematerialization after my patch.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug target/62173] [5.0 regression] [AArch64] Performance regression due to r213488
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2014-11-24 12:38 ` jiwang at gcc dot gnu.org
@ 2014-11-24 13:06 ` rguenth at gcc dot gnu.org
  2014-11-24 23:01 ` jiwang at gcc dot gnu.org
                   ` (29 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: rguenth at gcc dot gnu.org @ 2014-11-24 13:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P1


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug target/62173] [5.0 regression] [AArch64] Performance regression due to r213488
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2014-11-24 13:06 ` rguenth at gcc dot gnu.org
@ 2014-11-24 23:01 ` jiwang at gcc dot gnu.org
  2014-11-26 10:54 ` [Bug target/62173] [5.0 regression] [AArch64] Can't ivopt array base address while ARM can jiwang at gcc dot gnu.org
                   ` (28 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: jiwang at gcc dot gnu.org @ 2014-11-24 23:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #10 from Jiong Wang <jiwang at gcc dot gnu.org> ---
I finished a further investigation; it looks like the simplest fix to generate
optimized code for case A is to add one more optimization case in
"eliminate_regs_in_insn".

Currently we only optimize "eliminate_reg + const_offset":

      if (plus_src
          && CONST_INT_P (XEXP (plus_src, 1)))

while for those architectures, like AArch64, MIPS, ARM, which support a
base + offset addressing mode, the following pattern, which typically computes
an array element address such as A[I]:

reg T <- eliminate_reg + reg I (which holds value I)
reg D <- MEM(reg T, offset)

should be eliminated into (folding the two constant offsets immediately):

reg T <- reg_after_eliminate + reg I
reg D <- MEM(reg T, offset + eliminate_offset)

instead of

reg S <- reg_after_eliminate + eliminate_offset
reg T <- reg S + reg I
reg D <- MEM(reg T, offset)

Because there is a strong dependence between D and T, we just need to check
NEXT_INSN when doing the elimination to detect whether such a pattern exists.

I'll try this approach, which is quite clean; hopefully AArch64, ARM and MIPS
can all be fixed by it.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug target/62173] [5.0 regression] [AArch64] Can't ivopt array base address while ARM can
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2014-11-24 23:01 ` jiwang at gcc dot gnu.org
@ 2014-11-26 10:54 ` jiwang at gcc dot gnu.org
  2014-11-27  9:35 ` jiwang at gcc dot gnu.org
                   ` (27 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: jiwang at gcc dot gnu.org @ 2014-11-26 10:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

Jiong Wang <jiwang at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|[5.0 regression] [AArch64]  |[5.0 regression] [AArch64]
                   |Performance regression due  |Can't ivopt array base
                   |to r213488                  |address while ARM can

--- Comment #11 from Jiong Wang <jiwang at gcc dot gnu.org> ---
(In reply to amker from comment #3)
> I have seen potential improvement of bzip/gzip on arm 32.  It relates to
> addressing mode which affecting loop invariant hoisting in kernel loop of
> these two benchmarks.  I once had a patch but didn't follow up that.  I
> think it's worthy of methodical investigation, rather than case by case
> changes.
> 
> Thanks,
> bin

Exactly.

The fix in LRA elimination only removes one unnecessary add instruction:

add    x1, x29, 48
add    x0, x1, x0, sxtw
ldrb    w0, [x0, -16]

is transformed into

add    x0, x29, x0, sxtw
ldrb    w0, [x0, 32]

Pinski's case is fixed by this.

But for Seb's case, the base address calculation is still not hoisted outside
the loop, which is critical.  If we re-associate ((virtual_fp + reg) + offset)
into ((virtual_fp + offset) + reg), then the later RTL GCSE/PRE pass will
identify virtual_fp + offset as a loop invariant and do the hoisting.  But the
re-association is not always good when there are multiple uses, etc.

For the ARM backend, although the LRA elimination issue is there, there is no
base address hoisting issue.  From the tree dumps, ARM and AArch64 do get
different results after the ivopt pass.

I will create a separate bugzilla for the LRA elimination issue.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug target/62173] [5.0 regression] [AArch64] Can't ivopt array base address while ARM can
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
                   ` (10 preceding siblings ...)
  2014-11-26 10:54 ` [Bug target/62173] [5.0 regression] [AArch64] Can't ivopt array base address while ARM can jiwang at gcc dot gnu.org
@ 2014-11-27  9:35 ` jiwang at gcc dot gnu.org
  2014-11-27 12:00 ` [Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can rguenther at suse dot de
                   ` (26 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: jiwang at gcc dot gnu.org @ 2014-11-27  9:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

Jiong Wang <jiwang at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org
         Depends on|                            |52563

--- Comment #12 from Jiong Wang <jiwang at gcc dot gnu.org> ---
The root cause of why it's not ivopted on AArch64 is this code in
convert_affine_scev:

  must_check_src_overflow = TYPE_PRECISION (ct) < TYPE_PRECISION (type);

The input type is sizetype, so DI for a 64-bit arch and SI for a 32-bit arch,
while ct is SI; thus must_check_src_overflow is set to true on 64-bit archs,
and the later scev_probably_wraps_p check then fails.
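
(As an illustration, using the original testcase from the bug report; the
GIMPLE spelling is approximate:

  _1 = (sizetype) d_26;     /* 32-bit int widened to 64-bit sizetype on AArch64 */
  _2 = &A + _1;             /* POINTER_PLUS_EXPR, offset required to be sizetype */

so TYPE_PRECISION (ct) == 32 < TYPE_PRECISION (type) == 64 and
must_check_src_overflow is true; on a 32-bit target both precisions are 32 and
the check does not trigger.)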

I found a similar issue reported back in 2012, in bug 52563.

I verified this bug exists on other 64-bit archs, like mips64, ppc64, x86-64.

Richard, I see you were working on 52563; do you have any thoughts on this?


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
                   ` (11 preceding siblings ...)
  2014-11-27  9:35 ` jiwang at gcc dot gnu.org
@ 2014-11-27 12:00 ` rguenther at suse dot de
  2014-11-27 12:16 ` rguenther at suse dot de
                   ` (25 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: rguenther at suse dot de @ 2014-11-27 12:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #13 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 27 Nov 2014, jiwang at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173
> 
> Jiong Wang <jiwang at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |rguenth at gcc dot gnu.org
>          Depends on|                            |52563
> 
> --- Comment #12 from Jiong Wang <jiwang at gcc dot gnu.org> ---
> the root cause why it's not ivopted on AArch64 is because
> 
>   must_check_src_overflow = TYPE_PRECISION (ct) < TYPE_PRECISION (type);
> 
> code above in convert_affine_scev, the input type is sizetype, so DI for 64bit
> arch, SI for 32bit arch, ct is SI, thus, must_check_src_overflow set to true
> for 64bit arch, then failed later scev_probably_wraps_p check.
> 
> And I found similar issue reported back in 2012, at bug 52563.
> 
> I verified this bug exist on other 64 archs, like mips64, ppc64, x86-64
> 
> Richard, on 52563, I see you was working on this, do you have any thoughts on
> this?

See comment #5 of that bug.  For 4.8(?) I started on work to relax
the type requirements of the offset parameter of POINTER_PLUS_EXPR
by abstracting stuff but I didn't get to continue on that work.

Basically, the fact that we force the offset to 'sizetype' has both
correctness issues (for targets where the sizetype precision doesn't match the
Pmode precision) and optimization issues, as we lose for example sign
information and overflow knowledge in the computation of the offset.
The latter is also because we have transforms in fold which push typecasts of
expressions down to operands - thus from (sizetype) (4 * i) we get
4 * (sizetype) i, which may now be an unsigned multiplication with wrapping
overflow.  Note that it is the frontends who start the conversion and apply
some "tricks" for code-gen (see pointer_int_sum in c-common.c).
It's also not clear whether, if you write p[i] with i of type int, the
multiplication by sizeof (*p) invokes undefined behavior when it wraps (that
is, the C standard does not define the type the multiplication is performed
in, but just defines things in terms of array elements).

Ideally we'd use a widening multiplication here but optimizers
have little knowledge of that so it probably would cause quite
some regressions.  We could also keep the multiplication signed
(but using ssizetype), but then fold will come along and
undo that trick IIRC.

That said, both the POINTER_PLUS_EXPR constraints on the offset
type _and_ the C language issue with int * sizeof (element)
overflowing for 64bit pointer sizes prevent us from optimizing this.

It's a very tricky area ;)
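
A small C illustration of that last point, as I read it (hypothetical snippet,
not taken from the PR):

/* For p[i] with a 32-bit int i on an LP64 target, the offset is computed as
   (sizetype) i * 4 (see pointer_int_sum), and fold may turn
   (sizetype) (4 * i) into 4 * (sizetype) i, so the multiplication ends up as
   an unsigned, possibly wrapping 64-bit operation and the sign/no-overflow
   knowledge about i is lost to later analyses such as SCEV.  */
int load (int *p, int i)
{
  return p[i];
}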


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
                   ` (12 preceding siblings ...)
  2014-11-27 12:00 ` [Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can rguenther at suse dot de
@ 2014-11-27 12:16 ` rguenther at suse dot de
  2014-11-27 13:34 ` jiwang at gcc dot gnu.org
                   ` (24 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: rguenther at suse dot de @ 2014-11-27 12:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #14 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 27 Nov 2014, rguenther at suse dot de wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173
> 
> --- Comment #13 from rguenther at suse dot de <rguenther at suse dot de> ---
> On Thu, 27 Nov 2014, jiwang at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173
> > 
> > Jiong Wang <jiwang at gcc dot gnu.org> changed:
> > 
> >            What    |Removed                     |Added
> > ----------------------------------------------------------------------------
> >                  CC|                            |rguenth at gcc dot gnu.org
> >          Depends on|                            |52563
> > 
> > --- Comment #12 from Jiong Wang <jiwang at gcc dot gnu.org> ---
> > the root cause why it's not ivopted on AArch64 is because
> > 
> >   must_check_src_overflow = TYPE_PRECISION (ct) < TYPE_PRECISION (type);
> > 
> > code above in convert_affine_scev, the input type is sizetype, so DI for 64bit
> > arch, SI for 32bit arch, ct is SI, thus, must_check_src_overflow set to true
> > for 64bit arch, then failed later scev_probably_wraps_p check.
> > 
> > And I found similar issue reported back in 2012, at bug 52563.
> > 
> > I verified this bug exist on other 64 archs, like mips64, ppc64, x86-64
> > 
> > Richard, on 52563, I see you was working on this, do you have any thoughts on
> > this?
> 
> See comment #5 of that bug.  For 4.8(?) I started on work to relax
> the type requirements of the offset parameter of POINTER_PLUS_EXPR
> by abstracting stuff but I didn't get to continue on that work.
> 
> Basically that we force the offset to 'sizetype' has both correctness
> issues (for targets where sizetype precision doesn't match Pmode
> precision) and optimization issues as we lose for example sign
> information and overflow knowledge in the computation of the offset.
> The last thing is also because we have transforms in fold which
> push typecasts of expressions down to operands - thus
> from (sizetype) (4 * i) we get 4 * (sizetype)i which may now be
> an unsigned multiplication with wrapping overflow.  Note that
> it is the frontends who start the conversion thing and apply some
> "tricks" for code-gen (see pointer_int_sum in c-common.c).
> It's also not clear whether if you write p[i] with i of type int
> the multiplication by sizeof (*p) invokes undefined behavior if
> it wraps (that is, the C standard does not define the type the
> multiplication is performed in but just defines things in terms
> of array elements).
> 
> Ideally we'd use a widening multiplication here but optimizers
> have little knowledge of that so it probably would cause quite
> some regressions.  We could also keep the multiplication signed
> (but using ssizetype), but then fold will come along and
> undo that trick IIRC.
> 
> That said, both the POINTER_PLUS_EXPR constraints on the offset
> type _and_ the C language issue with int * sizeof (element)
> overflowing for 64bit pointer sizes prevent us from optimizing this.
> 
> It's a very tricky area ;)

A part of the code related to pointer_int_sum is get_inner_reference.
If you do

Index: gcc/expr.c
===================================================================
--- gcc/expr.c  (revision 218121)
+++ gcc/expr.c  (working copy)
@@ -6852,9 +6852,10 @@ get_inner_reference (tree exp, HOST_WIDE
                                   index, low_bound);

            offset = size_binop (PLUS_EXPR, offset,
+                                fold_convert (sizetype,
                                 size_binop (MULT_EXPR,
-                                            fold_convert (sizetype, index),
-                                            unit_size));
+                                            fold_convert (ssizetype, index),
+                                            fold_convert (ssizetype, unit_size))));
          }
          break;


thus computing the index * size multiplication in a signed type and only
converting the result back to unsigned afterwards, then it for example
improves the testcase in PR52563 comment #4 in the following way:

 .L5:
-       movq    %rdx, %rcx
-       movq    %rax, %r8
-       movl    $100, (%rax)
-       addq    %rdi, %rdx
-       addq    %rsi, %rax
-       cmpq    $999, %rcx
+       movl    $100, a(,%rax,4)
+       addq    %rdi, %rax
+       movq    %rdx, %rsi
+       addq    %rcx, %rdx
+       cmpq    $999, %rax
        jle     .L5

not sure if that's really faster in the end though, but IVOPTs
does its job in that case.

Now you need to argue that doing that is safe - the change does
two things.

 1) we interpret 'index' as signed
 2) we say the multiplication does not overflow

that would break for example

 unsigned long i = 0x8000000000000000UL;
 int a[1];
 a[i] = 0;

as i * 4 overflows to 0 and thus the access is valid.

Of course accessing the element 0x8000000000000000UL of an array
of size 1 invokes undefined behavior.  Now you get to prove
that there is no case that breaks where it wouldn't be undefined
from the start ...

I'd be happy to approve such a change - maybe you can check if
it helps you and whether it passes testing and enough benchmarks?
(there is still the related code in pointer_int_sum which should
be adjusted accordingly)


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
                   ` (13 preceding siblings ...)
  2014-11-27 12:16 ` rguenther at suse dot de
@ 2014-11-27 13:34 ` jiwang at gcc dot gnu.org
  2015-01-23 17:33 ` jiwang at gcc dot gnu.org
                   ` (23 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: jiwang at gcc dot gnu.org @ 2014-11-27 13:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #15 from Jiong Wang <jiwang at gcc dot gnu.org> ---
(In reply to rguenther@suse.de from comment #14)
> I'd be happy to approve such a change - maybe you can check if
> it helps you and whether it passes testing and enough benchmarks?
> (there is still the related code in pointer_int_sum which should
> be adjusted accordingly)

Thanks for the explanation and information.  I need some time to digest it :)

Sure, I will test it and check the benchmarks.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
                   ` (14 preceding siblings ...)
  2014-11-27 13:34 ` jiwang at gcc dot gnu.org
@ 2015-01-23 17:33 ` jiwang at gcc dot gnu.org
  2015-01-26 10:30 ` rguenth at gcc dot gnu.org
                   ` (22 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: jiwang at gcc dot gnu.org @ 2015-01-23 17:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #16 from Jiong Wang <jiwang at gcc dot gnu.org> ---
After some work on this issue, things have gone beyond my expectations.  To
summarize my understanding of the issue and what I have so far:

Reason for the regression
======
  The testcase contains a loop which is hot code, and there is one critical
loop invariant, namely the base address of a local array; if it can't be
ivopted, there is a big performance regression on this testcase.

  Because of a GCC tree-level ivopt issue, this variable is not ivopted on
64-bit machines: mips, powerpc, aarch64, etc.

  This is mostly because of type precision issues in the GCC tree-level code:
GCC makes an overly conservative decision on overflow checking and thus aborts
the ivopt at an early stage.  For details, please see the discussion above.

  Then why is there a regression on AArch64 since

     r213488 "[AArch64] Improve TARGET_LEGITIMIZE_ADDRESS_P hook" ?

  It's because, besides the tree-level ivopt pass, GCC also does extra ivopt
at the RTL level via the RTL PRE pass.  Previously, we had an inefficient
implementation of the AArch64 TARGET_LEGITIMIZE_ADDRESS_P hook, and this
inefficient implementation happened to generate an instruction pattern that
the RTL-level ivopt could catch, so that critical variable was finally ivopted
by the RTL pass.

  r213488 fixed the inefficiency of TARGET_LEGITIMIZE_ADDRESS_P, and as a side
effect we no longer generate that specific instruction pattern, so no ivopt
happens in RTL PRE, and thus the regression appears.

  Before r213488, the ivopt happened on AArch64 only; after r213488, AArch64
acts the same as all other 64-bit targets: mips64, powerpc64, sparc64, etc.

To Fix
======
  * The ideal way is to fix this issue in the tree-ssa-loop-ivopts pass.
    We need to correct the type precision & overflow checking issue; then GCC
    could ivopt the critical variable as early as the tree level, for all
    64-bit targets.

  * GCC's RTL-level ivopt could also be improved.  We could re-shuffle the RTL
    patterns so that the RTL PRE pass catches more ivopt opportunities and
    acts as a supplement to tree ivopt.

What I have tried
=================
  I have tried to fix this issue at the RTL level.  We need two patches:

    * one patch to correct the aarch64 TARGET_LEGITIMIZE_ADDRESS hook to
      generate a more efficient RTL pattern for array indexed addresses.

    * one patch to re-shuffle RTL patterns to let the RTL PRE pass catch more
      ivopt opportunities.

  For targets other than AArch64, patch 2 alone is enough to let the ivopt
happen in the RTL PRE pass.

  For AArch64, we first need to correct TARGET_LEGITIMIZE_ADDRESS so that it
does not split a valid "offset" into "magic-base + offset", which makes the
RTL sequences unnecessarily complex and causes trouble for later RTL passes.

  But this fix to the AArch64 TARGET_LEGITIMIZE_ADDRESS hook then exposed
another issue in tree-ssa-loop-ivopts.  When running benchmarks, I found that
some post-index addressing opportunities are missed after we correct
TARGET_LEGITIMIZE_ADDRESS.  Correcting TARGET_LEGITIMIZE_ADDRESS also corrects
the address cost that tree-ssa-loop-ivopts queries, and it looks like there
are glitches in the tree-ssa-loop-ivopts cost model such that GCC ivopts some
simple post-index addressable variables and thus generates sub-optimal code.

  And the patch for re-shuffling RTL patterns alone,
https://gcc.gnu.org/ml/gcc-patches/2014-12/msg00355.html, has a critical
problem: it is unsafe to re-use the REG_NOTE info in those places.  So the
patch needs re-work, or we need to investigate a better solution at the RTL
level.

Future work
===========
  I hope someone could have a look at the tree-level fix, which is the ideal
fix.

  For the RTL-level fix, the dependency chain for my solution now becomes:

    fix-on-rtl-level
      \
       v
      re-shuffle insns for rtl pre
        \
         v
        fix AArch64 TARGET_LEGITIMIZE_ADDRESS (B)
          \
           v
          fix the tree-ssa-loop-ivopts cost model to
          make a better decision between ivopt
          and post-index. (A)

I think tasks A and B are important, as they are quite general improvements to
AArch64 and GCC generic code beyond this testcase issue.

I'll prepare a testcase and create a separate ticket for task A first.

And I don't think we can fix this issue in the 5.0 release, so should we move
the target milestone to the next release?

Thanks.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
                   ` (15 preceding siblings ...)
  2015-01-23 17:33 ` jiwang at gcc dot gnu.org
@ 2015-01-26 10:30 ` rguenth at gcc dot gnu.org
  2015-01-26 11:10 ` rguenth at gcc dot gnu.org
                   ` (21 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: rguenth at gcc dot gnu.org @ 2015-01-26 10:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #17 from Richard Biener <rguenth at gcc dot gnu.org> ---
I really wonder why IVOPTs calls convert_affine_scev with
!use_overflow_semantics.

Note that for the original testcase 'i' may be negative or zero and thus 'd'
may be zero.  We do a bad analysis here because IVOPTs follows complete
peeling immediately...  but at least we have range information that looks
useful:

  <bb 16>:
  # RANGE [0, 10] NONZERO 15
  # d_26 = PHI <i_6(D)(15), d_13(17)>
  # RANGE [0, 9] NONZERO 15
  d_13 = d_26 + -1;
  _14 = A[d_26];
  # RANGE [0, 255] NONZERO 255
  _15 = (int) _14;
  # USE = nonlocal
  # CLB = nonlocal
  foo (_15);
  if (d_13 != 0)
    goto <bb 17>;
  else
    goto <bb 3>;

  <bb 17>:
  goto <bb 16>;

but unfortunately we expand the initial value of the IV for d all the way
to i_6(D) so we don't see that i_6(D) is constrained by the range for d_26.

So when we are in idx_find_step before we replace *idx with iv->base
we could check range-information on whether it wrapped.  Hmm, I think
we can't really compute this.  But we can transfer range information
(temporarily) from d_26 to iv->base i_6(D) and make use of that in
scev_probably_wraps_p.  There we currently compute whether
(unsigned) i_6(D) + 2147483648 (??) > 9 using fold_binary but with
range information [0, 10] it would compute as false (huh, so what is it
actually testing?!).  I think the computation of 'delta' should instead
be adjusted to use range information - max for negative step and min
for positive step.  Like the following:

Index: gcc/tree-ssa-loop-niter.c
===================================================================
--- gcc/tree-ssa-loop-niter.c   (revision 220038)
+++ gcc/tree-ssa-loop-niter.c   (working copy)
@@ -3863,12 +3863,17 @@ scev_probably_wraps_p (tree base, tree s
      bound of the type, and verify that the loop is exited before this
      occurs.  */
   unsigned_type = unsigned_type_for (type);
-  base = fold_convert (unsigned_type, base);
-
   if (tree_int_cst_sign_bit (step))
     {
       tree extreme = fold_convert (unsigned_type,
                                   lower_bound_in_type (type, type));
+      wide_int min, max;
+      if (TREE_CODE (base) == SSA_NAME
+         && INTEGRAL_TYPE_P (TREE_TYPE (base))
+         && get_range_info (base, &min, &max) == VR_RANGE)
+       base = wide_int_to_tree (unsigned_type, max);
+      else
+       base = fold_convert (unsigned_type, base);
       delta = fold_build2 (MINUS_EXPR, unsigned_type, base, extreme);
       step_abs = fold_build1 (NEGATE_EXPR, unsigned_type,
                              fold_convert (unsigned_type, step));
@@ -3877,6 +3882,13 @@ scev_probably_wraps_p (tree base, tree s
     {
       tree extreme = fold_convert (unsigned_type,
                                   upper_bound_in_type (type, type));
+      wide_int min, max;
+      if (TREE_CODE (base) == SSA_NAME
+         && INTEGRAL_TYPE_P (TREE_TYPE (base))
+         && get_range_info (base, &min, &max) == VR_RANGE)
+       base = wide_int_to_tree (unsigned_type, min);
+      else
+       base = fold_convert (unsigned_type, base);
       delta = fold_build2 (MINUS_EXPR, unsigned_type, extreme, base);
       step_abs = fold_convert (unsigned_type, step);
     }

doesn't really help this case unless i_6(D) gets range information transferred
temporarily, as I said above, of course.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
                   ` (16 preceding siblings ...)
  2015-01-26 10:30 ` rguenth at gcc dot gnu.org
@ 2015-01-26 11:10 ` rguenth at gcc dot gnu.org
  2015-01-26 13:48 ` ramana at gcc dot gnu.org
                   ` (20 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: rguenth at gcc dot gnu.org @ 2015-01-26 11:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #18 from Richard Biener <rguenth at gcc dot gnu.org> ---
It's probably not correct to simply transfer range info from *idx to iv->base.
Instead SCEV analysis needs to track the range of CHREC_LEFT when it analyzes
the SSA use-def chain.  That's of course a much bigger change :/

The patch may still help in some cases - I suppose the original testcase is
reduced from something else?


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
                   ` (17 preceding siblings ...)
  2015-01-26 11:10 ` rguenth at gcc dot gnu.org
@ 2015-01-26 13:48 ` ramana at gcc dot gnu.org
  2015-01-26 14:19 ` amker at gcc dot gnu.org
                   ` (19 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: ramana at gcc dot gnu.org @ 2015-01-26 13:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #19 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #18)
> It's probably not correct to simply transfer range info from *idx to
> iv->base.
> Instead SCEV analysis needs to track the range of CHREC_LEFT when it analyzes
> the SSA use-def chain.  That's of course a much bigger change :/
> 
> The patch may still help in some cases - I suppose the original testcase is
> reduced from sth else?

Not sure if this is related to comment #c2 where the reference is to a 15%
regression in bzip2 compress at O3. Sebastian could probably confirm.

I don't think we can ignore the regression though, can we ?

regards
Ramana


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
                   ` (18 preceding siblings ...)
  2015-01-26 13:48 ` ramana at gcc dot gnu.org
@ 2015-01-26 14:19 ` amker at gcc dot gnu.org
  2015-01-26 14:51 ` rguenther at suse dot de
                   ` (18 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: amker at gcc dot gnu.org @ 2015-01-26 14:19 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #20 from amker at gcc dot gnu.org ---
(In reply to Richard Biener from comment #18)
> It's probably not correct to simply transfer range info from *idx to
> iv->base.
> Instead SCEV analysis needs to track the range of CHREC_LEFT when it analyzes
> the SSA use-def chain.  That's of course a much bigger change :/
> 
> The patch may still help in some cases - I suppose the original testcase is
> reduced from sth else?

I see it's a tricky problem, and I have to admit that I don't understand it
very well yet.  The question is, is relaxing the POINTER_PLUS_EXPR constraint
the right way of fixing this?  I do remember some other PRs (other than this
one or bug 52563) are caused by this constraint, according to your comments.

Of course, taking range information into consideration is always an
improvement.  Actually I have another possible example in iv elimination.
Currently GCC simply rejects elimination of an iv use with a candidate if the
candidate is of a smaller type, but as long as we can prove that the value of
the iv use can be represented by the smaller candidate, elimination is
actually safe.  But it seems to me that fixing this issue with value
information is more like a side-effect?


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
                   ` (19 preceding siblings ...)
  2015-01-26 14:19 ` amker at gcc dot gnu.org
@ 2015-01-26 14:51 ` rguenther at suse dot de
  2015-01-26 14:53 ` rguenther at suse dot de
                   ` (17 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: rguenther at suse dot de @ 2015-01-26 14:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #21 from rguenther at suse dot de <rguenther at suse dot de> ---
On Mon, 26 Jan 2015, ramana at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173
> 
> --- Comment #19 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #18)
> > It's probably not correct to simply transfer range info from *idx to
> > iv->base.
> > Instead SCEV analysis needs to track the range of CHREC_LEFT when it analyzes
> > the SSA use-def chain.  That's of course a much bigger change :/
> > 
> > The patch may still help in some cases - I suppose the original testcase is
> > reduced from sth else?
> 
> Not sure if this is related to comment #c2 where the reference is to a 15%
> regression in bzip2 compress at O3. Sebastian could probably confirm.
> 
> I don't think we can ignore the regression though, can we ?

We should try not to.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
                   ` (20 preceding siblings ...)
  2015-01-26 14:51 ` rguenther at suse dot de
@ 2015-01-26 14:53 ` rguenther at suse dot de
  2015-01-26 15:03 ` amker at gcc dot gnu.org
                   ` (16 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: rguenther at suse dot de @ 2015-01-26 14:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #22 from rguenther at suse dot de <rguenther at suse dot de> ---
On Mon, 26 Jan 2015, amker at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173
> 
> --- Comment #20 from amker at gcc dot gnu.org ---
> (In reply to Richard Biener from comment #18)
> > It's probably not correct to simply transfer range info from *idx to
> > iv->base.
> > Instead SCEV analysis needs to track the range of CHREC_LEFT when it analyzes
> > the SSA use-def chain.  That's of course a much bigger change :/
> > 
> > The patch may still help in some cases - I suppose the original testcase is
> > reduced from sth else?
> 
> I see it's a tricky problem, and I have to admit that I don't understand it
> very well yet.  The question is, is relax of POINTER_PLUS_EXPR constraint the
> right way fixing this?  I do remember some other PRs (other than this one or
> bug 52563) are caused by this constraint according to your comments.

Well, it's not certain that relaxing POINTER_PLUS_EXPR will help in the
end...

> Of course, take range information into consideration is always an improvement. 
> Actually I have another possible example in iv elimination.  Curently GCC
> simply rejects elimination of an iv use with a candidate if the cand is of
> smaller type, but as long as we can prove value of iv use can be represented by
> the smaller candidate, elimination is actually safe.  But seems to me, fixing
> this issue with value information is like a side-effect?

Might be - but it's a matter of fixing the observable regression ;)


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
                   ` (21 preceding siblings ...)
  2015-01-26 14:53 ` rguenther at suse dot de
@ 2015-01-26 15:03 ` amker at gcc dot gnu.org
  2015-01-26 15:38 ` jiwang at gcc dot gnu.org
                   ` (15 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: amker at gcc dot gnu.org @ 2015-01-26 15:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #23 from amker at gcc dot gnu.org ---
Now I am less convinced that it's a tree ivopt issue.  The tree optimizers
have no knowledge of stack frame information for local array variables.  With
the original test, on 32-bit targets, tree ivopts happens to choose an IV
based on the base address, like below:

  <bb 16>:
  # ivtmp.11_82 = PHI <ivtmp.11_84(15), ivtmp.11_83(17)>
  _87 = (void *) ivtmp.11_82;
  _13 = MEM[base: _87, offset: 0B];
  ivtmp.11_83 = ivtmp.11_82 - 1;
  _14 = (int) _13;
  foo (_14);
  if (ivtmp.11_83 != _88)
    goto <bb 17>;
  else
    goto <bb 3>;

  <bb 17>:
  goto <bb 16>;

IMHO, we shouldn't depend on this.  It's much clearer to consider exactly the
multiple-use example from a previous comment, but this time in a loop context:

void bar(int i) {
  char A[10];
  char B[10];
  char C[10];
  int d = 0;
  while (i > 0)
    A[d++] = i--;

  while (d > 0)
  {
    foo(A[d]);
    foo(B[d]);
    foo(C[d]);
    d--;
  }
}

Tree ivopt has no knowledge that the references to A/B/C in the loop share a
common sub-expression.  Most likely (as suggested by previous comments), GCC
will generate the IR below:

  <bb 15>:
  _93 = (sizetype) i_6(D);
  _94 = &A + _93;
  ivtmp.11_92 = (unsigned int) _94;
  _98 = (sizetype) i_6(D);
  _99 = _98 + 1;
  _100 = &B + _99;
  ivtmp.15_97 = (unsigned int) _100;
  _104 = (sizetype) i_6(D);
  _105 = _104 + 1;
  _106 = &C + _105;
  ivtmp.20_103 = (unsigned int) _106;
  _110 = (unsigned int) &A;

  <bb 16>:
  # ivtmp.11_90 = PHI <ivtmp.11_92(15), ivtmp.11_91(17)>
  # ivtmp.15_95 = PHI <ivtmp.15_97(15), ivtmp.15_96(17)>
  # ivtmp.20_101 = PHI <ivtmp.20_103(15), ivtmp.20_102(17)>
  _107 = (void *) ivtmp.11_90;
  _13 = MEM[base: _107, offset: 0B];
  ivtmp.11_91 = ivtmp.11_90 - 1;
  _14 = (int) _13;
  foo (_14);
  ivtmp.15_96 = ivtmp.15_95 - 1;
  _108 = (void *) ivtmp.15_96;
  _16 = MEM[base: _108, offset: 0B];
  _17 = (int) _16;
  foo (_17);
  ivtmp.20_102 = ivtmp.20_101 - 1;
  _109 = (void *) ivtmp.20_102;
  _19 = MEM[base: _109, offset: 0B];
  _20 = (int) _19;
  foo (_20);
  if (ivtmp.11_91 != _110)
    goto <bb 17>;
  else
    goto <bb 3>;

  <bb 17>:
  goto <bb 16>;

This is not good, at least on targets without auto-increment addressing modes.
Even on ARM/AArch64 with auto-increment addressing modes, it's not always
practical because of the bloated loop setup, or the register pressure caused
by duplicated induction variables.

In this case, we should associate "virtual_frame + offset1" and hoist it out
of the loop, while inside the loop we should choose only one induction
variable (the biv), then refer to A/B/C using an offset addressing mode, like
below:

  <bb 15>
    base_1 = virtual_frame + off1

  <bb 16>:
  # d_26 = PHI <init, d_13>
    base_2 = base_1 + d_26
    d_13 = d_26 - 1;
    foo (MEM[base_2, off_A])
    foo (MEM[base_2, off_B])
    foo (MEM[base_2, off_C])
    goto <bb 16>


So maybe this is an RTL issue; we should first do a reassociation like
"virtual_frame + reg + offset" to "(virtual_frame + offset) + reg".  As for
the missed CSE opportunities in address expressions, maybe they can be fixed
by an additional strength reduction pass on RTL, just like the gimple strength
reduction pass.  The RTL pass is necessary simply because the gimple one has
no knowledge of stack frame information, just like tree IVOPT.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
                   ` (22 preceding siblings ...)
  2015-01-26 15:03 ` amker at gcc dot gnu.org
@ 2015-01-26 15:38 ` jiwang at gcc dot gnu.org
  2015-01-27  3:21 ` amker at gcc dot gnu.org
                   ` (14 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: jiwang at gcc dot gnu.org @ 2015-01-26 15:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #24 from Jiong Wang <jiwang at gcc dot gnu.org> ---
(In reply to amker from comment #23)

Partially agree.

At least for the single-use case given by Seb, I think tree ivopt should
handle it.  (I verified that clang does the ivopt correctly for that case.)

For the RTL re-association, it's a little bit painful from my experimental
experience.  It's not always good to reassociate virtual_frame + offset; we
only benefit if it's in a loop, because the re-association will increase
register pressure, there will be situations where more callee-saved regs are
used, and finally we run into unnecessary push/pop in the pro/epilogue... and
I haven't found a good place where we can safely re-use the existing RTL info
and do the RTL re-association, as I am afraid rebuilding that RTL info would
cause a compile-time penalty.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
                   ` (23 preceding siblings ...)
  2015-01-26 15:38 ` jiwang at gcc dot gnu.org
@ 2015-01-27  3:21 ` amker at gcc dot gnu.org
  2015-01-27  7:56 ` amker at gcc dot gnu.org
                   ` (13 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: amker at gcc dot gnu.org @ 2015-01-27  3:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #25 from amker at gcc dot gnu.org ---
(In reply to Jiong Wang from comment #24)
> (In reply to amker from comment #23)
> 
> partially agree.
> 
> at least for the single use case given by Seb, I think tree ivopt should do
> it. (I verified clang do ivopt correctly for the case)
> 
> for the rtl re-associate, it's a little bit painful from my experiment
> experiences. as it's not always good to reassociate virtual_frame + offset,
> we can only benefit if it's in loop, because the re-associate will increase
> register pressure, there will be situations that more callee-saved regs
> used, and finally we run into unncessary push/pop in pro/epilogue... and I
> haven't found a good place where we can safely re-use existed rtl info and
> do the rtl re-association as I am afraid rebuild those rtl info will cause
> compile time penalty.

Yes.  The ivopt issue that it doesn't treat "A[d]" as an address-type IV use
should be fixed anyway.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
                   ` (24 preceding siblings ...)
  2015-01-27  3:21 ` amker at gcc dot gnu.org
@ 2015-01-27  7:56 ` amker at gcc dot gnu.org
  2015-01-27  9:11 ` rguenther at suse dot de
                   ` (12 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: amker at gcc dot gnu.org @ 2015-01-27  7:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #27 from amker at gcc dot gnu.org ---
(In reply to Jiong Wang from comment #24)
> (In reply to amker from comment #23)
> 
> Partially agree.
> 
> At least for the single-use case given by Seb, I think tree ivopt should
> handle it (I verified that clang performs ivopt correctly for this case).

LLVM generates the expected code, but I am not sure it's because of ivopt.
The dump after LLVM's ivopt is as below:

; Function Attrs: nounwind
define void @bar(i32 %d) #0 {
entry:
  %A = alloca [10 x i8], align 1
  %cmp2 = icmp sgt i32 %d, 0
  br i1 %cmp2, label %while.body.lr.ph, label %while.end

while.body.lr.ph:                                 ; preds = %entry
  %0 = sext i32 %d to i64
  br label %while.body

while.body:                                       ; preds = %while.body.lr.ph,
%while.body
  %indvars.iv = phi i64 [ %0, %while.body.lr.ph ], [ %indvars.iv.next,
%while.body ]
  %indvars.iv.next = add nsw i64 %indvars.iv, -1
  %scevgep = getelementptr i8* %A4, i64 %indvars.iv
  %1 = load i8* %scevgep, align 1, !tbaa !1
  tail call void @foo(i8 %1) #2
  %2 = add i64 %indvars.iv.next, 1
  %cmp = icmp sgt i64 %2, 1
  br i1 %cmp, label %while.body, label %while.end.loopexit

The induction variable chosen is actually the original biv (d), just like in
GCC.

So even if we fix the idx_find_step issue, GCC's ivopt can still generate
code like the following:

Loop-preheader
  ...
Loop-body:
  iv = phi<d, -1>
  tmp = (POINTER_TYPE)&A;
  foo(MEM[base:tmp, index:iv]);

Without proper RTL optimization, the issue in the calculation of A's base
address very likely still exists.
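To make the two shapes concrete, here is a source-level sketch (not compiler
output; the function names are invented) of the two forms IVOPTS could choose
for the second loop of the testcase:

extern void foo(int);

/* Index form: the base address of A has to be combined with d on every
   iteration, which is where the frame-based address computation comes from.  */
void index_form(char *A, int d)
{
  while (d > 0)
    foo(A[d--]);
}

/* Pointer form: the address itself is the induction variable; A + d is
   computed once and the loop body is a simple decrementing load.  */
void pointer_form(char *A, int d)
{
  char *p = A + d;
  while (p > A)
    foo(*p--);
}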

> 
> For the RTL re-association, it's a bit painful in my experience: it's not
> always good to reassociate virtual_frame + offset, and we only benefit when
> it happens inside a loop, because the re-association increases register
> pressure.  There will be situations where more callee-saved registers are
> used, and we end up with unnecessary push/pop in the prologue/epilogue.  I
> also haven't found a good place where we can safely reuse the existing RTL
> info and do the RTL re-association, as I am afraid rebuilding that info
> would incur a compile-time penalty.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
                   ` (25 preceding siblings ...)
  2015-01-27  7:56 ` amker at gcc dot gnu.org
@ 2015-01-27  9:11 ` rguenther at suse dot de
  2015-01-28 18:26 ` LpSolit at netscape dot net
                   ` (11 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: rguenther at suse dot de @ 2015-01-27  9:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #28 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 27 Jan 2015, amker at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173
> 
> --- Comment #26 from amker at gcc dot gnu.org ---
> (In reply to Richard Biener from comment #17)
> > I really wonder why IVOPTs calls convert_affine_scev with
> > !use_overflow_semantics.
> I don't understand below code in convert_affine_scev:
> 
>   enforce_overflow_semantics = (use_overflow_semantics
>                 && nowrap_type_p (type));
> According to comments, 
> 
>    "USE_OVERFLOW_SEMANTICS is true if this function should assume that
>    the rules for overflow of the given language apply (e.g., that signed
>    arithmetics in C does not overflow) -- i.e., to use them to avoid
> unnecessary
>    tests, but also to enforce that the result follows them."
> 
> It seems to me that we need to enforce the overflow check on the result if
> we take advantage of USE_OVERFLOW_SEMANTICS to prove there is no overflow
> for the src.  So shouldn't we set enforce_overflow_semantics according to
> "nowrap_type_p (TREE_TYPE (*base))" rather than to the result type?

Yes, I also wondered about this...

> Also, it is noted at the end of the function that we can't use the fact
> "signed variables do not overflow" when we are checking the result.
>
> But the function is used widely in scev, so there shouldn't be anything
> badly wrong.

Heh - I wouldn't count on that.

> > Note that for the original testcase 'i' may be negative or zero and thus 'd'
> > may be zero.  We do a bad analysis here because IVOPTs follows complete
> > peeling immediately...  but at least we have range information that looks
> > useful:
> 
> The case also holds at -O2; at that level gcc won't completely unroll
> the first loop.
> 
> An unrelated question: isn't cunroll too aggressive in GCC?  For cases
> like this one, the code size is bloated and may hurt icache performance,
> while only saving a few increment instructions.

Yeah - it was Honza enabling this aggressive peeling.  It makes sense
for a limited amount of code growth (like peeling two iterations) but
indeed using the same limit as for unrolling (where we know intermediate
exits are not taken) doesn't make too much sense...  I wonder if
the size estimates are correctly handling that fact...


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
                   ` (26 preceding siblings ...)
  2015-01-27  9:11 ` rguenther at suse dot de
@ 2015-01-28 18:26 ` LpSolit at netscape dot net
  2015-01-29  6:48 ` amker at gcc dot gnu.org
                   ` (10 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: LpSolit at netscape dot net @ 2015-01-28 18:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #29 from Jiong Wang <jiwang at gcc dot gnu.org> ---
(In reply to amker from comment #20)
> (In reply to Richard Biener from comment #18)
> > It's probably not correct to simply transfer range info from *idx to
> > iv->base.
> > Instead SCEV analysis needs to track the range of CHREC_LEFT when it analyzes
> > the SSA use-def chain.  That's of course a much bigger change :/
> > 
> > The patch may still help in some cases - I suppose the original testcase is
> > reduced from sth else?
> 
> Of course, take range information into consideration is always an
> improvement.

The RANGE info is a good idea, Bin; it's worth a quick exploration.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
                   ` (27 preceding siblings ...)
  2015-01-28 18:26 ` LpSolit at netscape dot net
@ 2015-01-29  6:48 ` amker at gcc dot gnu.org
  2015-01-30  6:42 ` amker at gcc dot gnu.org
                   ` (9 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: amker at gcc dot gnu.org @ 2015-01-29  6:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #30 from amker at gcc dot gnu.org ---
(In reply to Richard Biener from comment #17)
> I really wonder why IVOPTs calls convert_affine_scev with
> !use_overflow_semantics.
> 
> Note that for the original testcase 'i' may be negative or zero and thus 'd'
> may be zero.  We do a bad analysis here because IVOPTs follows complete
> peeling immediately...  but at least we have range information that looks
> useful:
> 
>   <bb 16>:
>   # RANGE [0, 10] NONZERO 15
>   # d_26 = PHI <i_6(D)(15), d_13(17)>
>   # RANGE [0, 9] NONZERO 15
>   d_13 = d_26 + -1;
>   _14 = A[d_26];
>   # RANGE [0, 255] NONZERO 255
>   _15 = (int) _14;
>   # USE = nonlocal
>   # CLB = nonlocal
>   foo (_15);
>   if (d_13 != 0)
>     goto <bb 17>;
>   else
>     goto <bb 3>;
> 
>   <bb 17>:
>   goto <bb 16>;
> 
> but unfortunately we expand the initial value of the IV for d all the way
> to i_6(D) so we don't see that i_6(D) is constrained by the range for d_26.
> 
> So when we are in idx_find_step before we replace *idx with iv->base
> we could check range-information on whether it wrapped.  Hmm, I think
> we can't really compute this.  But we can transfer range information
> (temporarily) from d_26 to iv->base i_6(D) and make use of that in
> scev_probably_wraps_p.  There we currently compute whether
> (unsigned) i_6(D) + 2147483648 (??) > 9 using fold_binary but with
> range information [0, 10] it would compute as false (huh, so what is it
> actually testing?!).  I think the computation of 'delta' should instead
> be adjusted to use range information - max for negative step and min
> for positive step.  Like the following:
> 
> Index: gcc/tree-ssa-loop-niter.c
> ===================================================================
> --- gcc/tree-ssa-loop-niter.c   (revision 220038)
> +++ gcc/tree-ssa-loop-niter.c   (working copy)
> @@ -3863,12 +3863,17 @@ scev_probably_wraps_p (tree base, tree s
>       bound of the type, and verify that the loop is exited before this
>       occurs.  */
>    unsigned_type = unsigned_type_for (type);
> -  base = fold_convert (unsigned_type, base);
> -
>    if (tree_int_cst_sign_bit (step))
>      {
>        tree extreme = fold_convert (unsigned_type,
>                                    lower_bound_in_type (type, type));
> +      wide_int min, max;
> +      if (TREE_CODE (base) == SSA_NAME
> +         && INTEGRAL_TYPE_P (TREE_TYPE (base))
> +         && get_range_info (base, &min, &max) == VR_RANGE)
> +       base = wide_int_to_tree (unsigned_type, max);
> +      else
> +       base = fold_convert (unsigned_type, base);
>        delta = fold_build2 (MINUS_EXPR, unsigned_type, base, extreme);
>        step_abs = fold_build1 (NEGATE_EXPR, unsigned_type,
>                               fold_convert (unsigned_type, step));
> @@ -3877,6 +3882,13 @@ scev_probably_wraps_p (tree base, tree s
>      {
>        tree extreme = fold_convert (unsigned_type,
>                                    upper_bound_in_type (type, type));
> +      wide_int min, max;
> +      if (TREE_CODE (base) == SSA_NAME
> +         && INTEGRAL_TYPE_P (TREE_TYPE (base))
> +         && get_range_info (base, &min, &max) == VR_RANGE)
> +       base = wide_int_to_tree (unsigned_type, min);
> +      else
> +       base = fold_convert (unsigned_type, base);
>        delta = fold_build2 (MINUS_EXPR, unsigned_type, extreme, base);
>        step_abs = fold_convert (unsigned_type, step);
>      }
> 
> doesn't really help this case unless i_6(D) gets range information transferred
> temporarily as I said above, of course.

As you said, the range information for i_6(D) is actually flow-sensitive
information; we can't simply propagate range_info(d) to i_6(D) in general.
Even if we could, the code change would not be natural, since the related
code is scattered across different functions (ivopt/chrec/niter).
So I take the other way around by passing the IV's ssa_name into
scev_probably_wraps_p along the call sequence
"idx_find_step->convert_affine_scev->scev_probably_wraps".  Since the IV's
ssa_name is tagged with the right range information, we can use it when
proving there is no overflow/wrap in the src scev.

This method works and GCC now recognizes the address IV use in A[d].

BUT the problem exists not only for address-type IV uses, but also for
compare-type ones.  The IVOPTS dump is now as below:


  <bb 8>:
  _4 = (sizetype) i_6(D);
  _3 = &A + _4;
  ivtmp.11_17 = (unsigned long) _3;
  _1 = (sizetype) i_6(D);
  _2 = (unsigned int) i_6(D);
  _22 = _2 + 4294967295;
  _21 = (sizetype) _22;
  _20 = _1 - _21;
  _29 = _20 + 18446744073709551615;
  _30 = &A + _29;
  _31 = (unsigned long) _30;

  <bb 9>:
  # ivtmp.11_18 = PHI <ivtmp.11_17(8), ivtmp.11_8(11)>
  _5 = (void *) ivtmp.11_18;
  _14 = MEM[base: _5, offset: 0B];
  foo (_14);
  ivtmp.11_8 = ivtmp.11_18 - 1;
  if (ivtmp.11_8 != _31)
    goto <bb 11>;
  else
    goto <bb 10>;

  <bb 10>:
  goto <bb 3>;

  <bb 11>:
  goto <bb 9>;

The loop preheader is bloated by computing "cand_value_at (loop, cand,
use->stmt, desc->niter, &bnd);" in function "may_eliminate_iv" and we have:

(gdb) call debug_generic_expr(cand->iv->base)
(unsigned long) ((char *) &A + (sizetype) i_6(D))
(gdb) call debug_generic_expr(cand->iv->step)
18446744073709551615
(gdb) call debug_generic_expr(desc->niter)
(unsigned int) (i_6(D) + -1)

GCC is not aware of RANGE_INFO(i_6(D)) = [1, 10], so it doesn't know that the
condition below holds:
  (unsigned long)(unsigned int)(i_6(D) + -1)  == (unsigned long)(i_6(D) + -1)
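As a small, self-contained illustration (the values are made up; only the
range [1, 10] is taken from the dump above), the equality only holds because
i is known to be positive:

#include <stdio.h>

int main(void)
{
  int i = 5;                                     /* inside the known range [1, 10] */
  printf("%lu %lu\n",
         (unsigned long)(unsigned int)(i - 1),   /* 4 */
         (unsigned long)(i - 1));                /* 4: both sides agree */

  i = 0;                                         /* outside the range */
  printf("%lu %lu\n",
         (unsigned long)(unsigned int)(i - 1),   /* 4294967295 (zero extension) */
         (unsigned long)(i - 1));                /* 18446744073709551615 (sign extension) */
  return 0;
}

Without the range information the two conversions can differ, so the equality
cannot be assumed in general.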


The problem is that niter is computed in the unsigned version of the control
IV's type, which is unsigned int because the control biv (d) has type int.
But the candidate has the larger type "unsigned long" on a 64-bit target.

I really think LLVM's front end makes the better decision by promoting the
index of the array_ref A[d] to a 64-bit signed type, while GCC keeps it at
32 bits.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
                   ` (28 preceding siblings ...)
  2015-01-29  6:48 ` amker at gcc dot gnu.org
@ 2015-01-30  6:42 ` amker at gcc dot gnu.org
  2015-01-30 12:32 ` rguenth at gcc dot gnu.org
                   ` (8 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: amker at gcc dot gnu.org @ 2015-01-30  6:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #31 from amker at gcc dot gnu.org ---
So cand_value_at (loop, cand, use->stmt, desc->niter, &bnd) is called with
arguments as below:

  cand->iv->base:
      (unsigned long) ((char *) &A + (sizetype) i_6(D))
  cand->iv->step:
      0xFFFFFFFFFFFFFFFF
  desc->niter:
      (unsigned int)(i_6(D) + -1)
  use->stmt:
      is after increment

The calculated result should be as below:
  iv->base + iv->step * (unsigned long)niter + step
    <=>
  (unsigned long) ((char *) &A + (sizetype) i_6(D))
    +
  0xFFFFFFFFFFFFFFFF * ((unsigned long)(unsigned int)(i_6(D) + -1))
    +
  0xFFFFFFFFFFFFFFFF
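As a sanity check, a purely numeric sketch (0x1000 is a made-up stand-in for
the address of A) shows that this expression does evaluate back to &A for
every i in the known range [1, 10]; the difficulty discussed below is proving
that symbolically:

#include <stdio.h>

int main(void)
{
  unsigned long A = 0x1000;                      /* stand-in for (unsigned long) &A */
  for (int i = 1; i <= 10; i++)
    {
      unsigned long base  = A + (unsigned long) i;
      unsigned long step  = 0xFFFFFFFFFFFFFFFFUL;                  /* -1 */
      unsigned long niter = (unsigned long) (unsigned int) (i - 1);
      unsigned long val   = base + step * niter + step;
      printf("i = %2d  val - A = %lu\n", i, val - A);              /* prints 0 every time */
    }
  return 0;
}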

Even with range information [1, 10] for i_6(D), from which we can prove that
niter's range is [0, 9], we can't prove that it equals:
  (unsigned long)((char *) &A)

Is this because we use an unsigned type for step and so lose the sign
information, meaning it can't be fixed even with proper range information?
Is this understanding correct, or is there anything I should do to achieve
that?

Thanks very much.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
                   ` (29 preceding siblings ...)
  2015-01-30  6:42 ` amker at gcc dot gnu.org
@ 2015-01-30 12:32 ` rguenth at gcc dot gnu.org
  2015-02-05  7:27 ` amker at gcc dot gnu.org
                   ` (7 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: rguenth at gcc dot gnu.org @ 2015-01-30 12:32 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #32 from Richard Biener <rguenth at gcc dot gnu.org> ---
"So I take the other way around by passing the IV's ssa_name into
scev_probably_wraps_p along call sequence
"idx_find_step->convert_affine_scev->scev_probably_wraps".  Since the IV's
ssa_name is tagged with right range information, we can use it when proving
there is no overflow/wrap in src scev."

I'm not sure that this is always correct - is the name we ask
scev_probably_wraps on always equal to the IV?  Is it never offset?  I think
that we need a solution that involves tracking range information all the way
through SCEV analysis and thus having it on the CHREC itself.


As for the other things - yes, there are multiple places where we lose sign
information and STEP is just one example (I _think_ I made DR_STEP for
data-ref analysis forced signed at some point).

Promoting the array index to a word_mode type sounds interesting, it of course
only makes sense for previously signed types - otherwise you just get another
conversion in the way.  Of course with GCC you will likely run into the issue
that this promotion is cancelled by the forced conversion to sizetype, so
you'll end up with unsigned word_mode again.  Btw - this was kind of my
suggestion with making get_inner_reference return *POFFSET as ssizetype type,
not sizetype (callers might not expect this, so for experimenting just
wrap get_inner_reference in sth converting that back to sizetype for all
callers but IVOPTs and tree-affine.c (and eventually tree-scalar-evolution.c)).


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
                   ` (30 preceding siblings ...)
  2015-01-30 12:32 ` rguenth at gcc dot gnu.org
@ 2015-02-05  7:27 ` amker at gcc dot gnu.org
  2015-03-11 17:30 ` [Bug tree-optimization/62173] [5 Regression] " jiwang at gcc dot gnu.org
                   ` (6 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: amker at gcc dot gnu.org @ 2015-02-05  7:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #33 from amker at gcc dot gnu.org ---
(In reply to Richard Biener from comment #32)
> "So I take the other way around by passing the IV's ssa_name into
> scev_probably_wraps_p along call sequence
> "idx_find_step->convert_affine_scev->scev_probably_wraps".  Since the IV's
> ssa_name is tagged with right range information, we can use it when proving
> there is no overflow/wrap in src scev."
> 
> I'm not sure that this is always correct - is the name we ask
> scev_probably_wraps on always equal to the IV?  Is it never offset?  I think
> that we need a solution that involves tracking range information all the way
> through SCEV analysis and thus having it on the CHREC itself.

Yes, this is wanted, but I think it's a separate improvement.  In IVOPTS, we
don't need to inherit range information from the CHREC built by SCEV.  The
structure iv has a field ssa_name pointing to the SSA var, and we can use
this var's range information in IVOPTS.  Well, the only problem is that the
field is reset in record_use.

Moreover, for this PR it isn't the range information of the IV's var that we
want; what we want is the range information of the IV's base var, as in the
dump below:

  <bb 16>:
  # RANGE [0, 10] NONZERO 15
  # d_26 = PHI <i_6(D)(15), d_13(17)>
  # RANGE [0, 9] NONZERO 15
  d_13 = d_26 + -1;
  _14 = A[d_26];
  # RANGE [0, 255] NONZERO 255
  _15 = (int) _14;
  # USE = nonlocal
  # CLB = nonlocal
  foo (_15);
  if (d_13 != 0)
    goto <bb 17>;
  else
    goto <bb 3>;

  <bb 17>:
  goto <bb 16>;

We really want to take advantage of the range information of i_6(D), which
isn't an IV.  And the range information of i_6(D) may not hold in parts of
the code other than this loop.
I have patches for the last two issues and will send them for review later.

> 
> 
> As for the other things - yes, there are multiple places where we lose sign
> information and STEP is just one example (I _think_ I made DR_STEP for
> data-ref analysis forced signed at some point).
> 
> Promoting the array index to a word_mode type sounds interesting, it of
> course
> only makes sense for previously signed types - otherwise you just get another
> conversion in the way.  Of course with GCC you will likely run into the issue
> that this promotion is cancelled by the forced conversion to sizetype, so
> you'll end up with unsigned word_mode again.  Btw - this was kind of my
> suggestion with making get_inner_reference return *POFFSET as ssizetype type,
> not sizetype (callers might not expect this, so for experimenting just
> wrap get_inner_reference in sth converting that back to sizetype for all
> callers but IVOPTs and tree-affine.c (and eventually
> tree-scalar-evolution.c)).

I no longer think this PR is caused by the difference in type precision.
However, it is related to type conversion and how SCEV handles it in GCC.  I
need more study to fully understand the issue, so I shouldn't say too much in
case it's wrong...


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug tree-optimization/62173] [5 Regression] 64bit Arch can't ivopt while 32bit Arch can
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
                   ` (31 preceding siblings ...)
  2015-02-05  7:27 ` amker at gcc dot gnu.org
@ 2015-03-11 17:30 ` jiwang at gcc dot gnu.org
  2015-03-11 17:46 ` jiwang at gcc dot gnu.org
                   ` (5 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: jiwang at gcc dot gnu.org @ 2015-03-11 17:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #35 from Jiong Wang <jiwang at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #34)
> Any progress on this?  This is a P1 PR, but no comments have been added for
> more than a month...

From what I know:

  Bin was working on a tree-level fix while I was working on an RTL-level
fix, and I was told this has been deferred to the next stage 1.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug tree-optimization/62173] [5 Regression] 64bit Arch can't ivopt while 32bit Arch can
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
                   ` (32 preceding siblings ...)
  2015-03-11 17:30 ` [Bug tree-optimization/62173] [5 Regression] " jiwang at gcc dot gnu.org
@ 2015-03-11 17:46 ` jiwang at gcc dot gnu.org
  2015-03-11 17:52 ` [Bug tree-optimization/62173] [5/6 " jakub at gcc dot gnu.org
                   ` (4 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: jiwang at gcc dot gnu.org @ 2015-03-11 17:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #36 from Jiong Wang <jiwang at gcc dot gnu.org> ---
And for the RTL-level improvement, we need to enable building DF_DU_CHAIN on
top of the existing DF_UD_CHAIN (which may cost extra compile time).

One draft patch is here, with no feedback yet:

  https://gcc.gnu.org/ml/gcc-patches/2015-02/msg00736.html

As both the tree and RTL fixes contain generic code modifications, I think
it's better to defer them to the next stage 1, given that the issue itself is
an enhancement, not a bug.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug tree-optimization/62173] [5/6 Regression] 64bit Arch can't ivopt while 32bit Arch can
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
                   ` (33 preceding siblings ...)
  2015-03-11 17:46 ` jiwang at gcc dot gnu.org
@ 2015-03-11 17:52 ` jakub at gcc dot gnu.org
  2015-03-13  8:34 ` amker at gcc dot gnu.org
                   ` (3 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: jakub at gcc dot gnu.org @ 2015-03-11 17:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|5.0                         |6.0
            Summary|[5 Regression] 64bit Arch   |[5/6 Regression] 64bit Arch
                   |can't ivopt while 32bit     |can't ivopt while 32bit
                   |Arch can                    |Arch can

--- Comment #37 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Ok.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug tree-optimization/62173] [5/6 Regression] 64bit Arch can't ivopt while 32bit Arch can
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
                   ` (34 preceding siblings ...)
  2015-03-11 17:52 ` [Bug tree-optimization/62173] [5/6 " jakub at gcc dot gnu.org
@ 2015-03-13  8:34 ` amker at gcc dot gnu.org
  2015-06-02  3:34 ` amker at gcc dot gnu.org
                   ` (2 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: amker at gcc dot gnu.org @ 2015-03-13  8:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #38 from amker at gcc dot gnu.org ---
The first patch that actually helps this case is at
https://gcc.gnu.org/ml/gcc-patches/2015-02/msg00836.html

But the real problem lies in scev/ivopt, in how type conversion and scev
overflow are handled.  I am working on this.

According to Richard, both the patch and the real fix are stage1 material.
That's why it has been quiet recently.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug tree-optimization/62173] [5/6 Regression] 64bit Arch can't ivopt while 32bit Arch can
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
                   ` (35 preceding siblings ...)
  2015-03-13  8:34 ` amker at gcc dot gnu.org
@ 2015-06-02  3:34 ` amker at gcc dot gnu.org
  2015-06-03  3:56 ` amker at gcc dot gnu.org
  2015-07-22 11:44 ` jiwang at gcc dot gnu.org
  38 siblings, 0 replies; 40+ messages in thread
From: amker at gcc dot gnu.org @ 2015-06-02  3:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #39 from amker at gcc dot gnu.org ---
Author: amker
Date: Tue Jun  2 03:33:35 2015
New Revision: 224009

URL: https://gcc.gnu.org/viewcvs?rev=224009&root=gcc&view=rev
Log:

        PR tree-optimization/52563
        PR tree-optimization/62173
        * tree-ssa-loop-ivopts.c (struct iv): New field.  Reorder fields.
        (alloc_iv, set_iv): New parameter.
        (determine_biv_step): Delete.
        (find_bivs): Inline original determine_biv_step.  Pass new
        argument to set_iv.
        (idx_find_step): Use no_overflow information for conversion.
        * tree-scalar-evolution.c (analyze_scalar_evolution_in_loop): Let
        resolve_mixers handle folded_casts.
        (instantiate_scev_name): Change bool parameter to bool pointer.
        (instantiate_scev_poly, instantiate_scev_binary): Ditto.
        (instantiate_array_ref, instantiate_scev_not): Ditto.
        (instantiate_scev_3, instantiate_scev_2): Ditto.
        (instantiate_scev_1, instantiate_scev_r): Ditto.
        (instantiate_scev_convert): Change parameter.  Pass argument
        to chrec_convert_aggressive.
        (instantiate_scev): Change argument.
        (resolve_mixers): New parameter and set it.
        (scev_const_prop): New argument.
        * tree-scalar-evolution.h (resolve_mixers): New parameter.
        * tree-chrec.c (convert_affine_scev): Call chrec_convert instead
        of chrec_convert_1.
        (chrec_convert): New parameter.  Move definition below.
        (chrec_convert_aggressive): New parameter and set it.  Call
        convert_affine_scev.
        * tree-chrec.h (chrec_convert): New parameter.
        (chrec_convert_aggressive): Ditto.

        gcc/testsuite/ChangeLog
        PR tree-optimization/52563
        PR tree-optimization/62173
        * gcc.dg/tree-ssa/scev-3.c: Remove xfail.
        * gcc.dg/tree-ssa/scev-4.c: Ditto.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/tree-ssa/scev-3.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/scev-4.c
    trunk/gcc/tree-chrec.c
    trunk/gcc/tree-chrec.h
    trunk/gcc/tree-scalar-evolution.c
    trunk/gcc/tree-scalar-evolution.h
    trunk/gcc/tree-ssa-loop-ivopts.c


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug tree-optimization/62173] [5/6 Regression] 64bit Arch can't ivopt while 32bit Arch can
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
                   ` (36 preceding siblings ...)
  2015-06-02  3:34 ` amker at gcc dot gnu.org
@ 2015-06-03  3:56 ` amker at gcc dot gnu.org
  2015-07-22 11:44 ` jiwang at gcc dot gnu.org
  38 siblings, 0 replies; 40+ messages in thread
From: amker at gcc dot gnu.org @ 2015-06-03  3:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #40 from amker at gcc dot gnu.org ---
The issue of the array reference not being recognized as an IV use is
resolved now.  From the gimple optimizer's view, there is still another issue
in which the loop header is bloated because of the loss of signedness
information.  The root cause is that we use sizetype for addresses and an
unsigned type (unsigned int in this case) for the loop niter.  We need to
prove that (sizetype)(unsigned int)(i - 1) equals (sizetype)(i - 1) in this
case by using VRP/loop initial conditions.

I am fixing this now.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Bug tree-optimization/62173] [5/6 Regression] 64bit Arch can't ivopt while 32bit Arch can
  2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
                   ` (37 preceding siblings ...)
  2015-06-03  3:56 ` amker at gcc dot gnu.org
@ 2015-07-22 11:44 ` jiwang at gcc dot gnu.org
  38 siblings, 0 replies; 40+ messages in thread
From: jiwang at gcc dot gnu.org @ 2015-07-22 11:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

Jiong Wang <jiwang at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #41 from Jiong Wang <jiwang at gcc dot gnu.org> ---
Bin has pushed the tree-level fix for this.  Marking as fixed.


^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2015-07-22 11:44 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-18 16:12 [Bug target/62173] New: [AArch64] Performance regression due to r213488 spop at gcc dot gnu.org
2014-08-18 16:39 ` [Bug target/62173] " pinskia at gcc dot gnu.org
2014-08-18 19:13 ` spop at gcc dot gnu.org
2014-08-19  1:37 ` amker at gcc dot gnu.org
2014-10-28 11:28 ` [Bug target/62173] [5.0 regression] " jiwang at gcc dot gnu.org
2014-11-14  9:37 ` jiwang at gcc dot gnu.org
2014-11-17  2:14 ` amker.cheng at gmail dot com
2014-11-24 12:15 ` jiwang at gcc dot gnu.org
2014-11-24 12:38 ` jiwang at gcc dot gnu.org
2014-11-24 13:06 ` rguenth at gcc dot gnu.org
2014-11-24 23:01 ` jiwang at gcc dot gnu.org
2014-11-26 10:54 ` [Bug target/62173] [5.0 regression] [AArch64] Can't ivopt array base address while ARM can jiwang at gcc dot gnu.org
2014-11-27  9:35 ` jiwang at gcc dot gnu.org
2014-11-27 12:00 ` [Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can rguenther at suse dot de
2014-11-27 12:16 ` rguenther at suse dot de
2014-11-27 13:34 ` jiwang at gcc dot gnu.org
2015-01-23 17:33 ` jiwang at gcc dot gnu.org
2015-01-26 10:30 ` rguenth at gcc dot gnu.org
2015-01-26 11:10 ` rguenth at gcc dot gnu.org
2015-01-26 13:48 ` ramana at gcc dot gnu.org
2015-01-26 14:19 ` amker at gcc dot gnu.org
2015-01-26 14:51 ` rguenther at suse dot de
2015-01-26 14:53 ` rguenther at suse dot de
2015-01-26 15:03 ` amker at gcc dot gnu.org
2015-01-26 15:38 ` jiwang at gcc dot gnu.org
2015-01-27  3:21 ` amker at gcc dot gnu.org
2015-01-27  7:56 ` amker at gcc dot gnu.org
2015-01-27  9:11 ` rguenther at suse dot de
2015-01-28 18:26 ` LpSolit at netscape dot net
2015-01-29  6:48 ` amker at gcc dot gnu.org
2015-01-30  6:42 ` amker at gcc dot gnu.org
2015-01-30 12:32 ` rguenth at gcc dot gnu.org
2015-02-05  7:27 ` amker at gcc dot gnu.org
2015-03-11 17:30 ` [Bug tree-optimization/62173] [5 Regression] " jiwang at gcc dot gnu.org
2015-03-11 17:46 ` jiwang at gcc dot gnu.org
2015-03-11 17:52 ` [Bug tree-optimization/62173] [5/6 " jakub at gcc dot gnu.org
2015-03-13  8:34 ` amker at gcc dot gnu.org
2015-06-02  3:34 ` amker at gcc dot gnu.org
2015-06-03  3:56 ` amker at gcc dot gnu.org
2015-07-22 11:44 ` jiwang at gcc dot gnu.org
