[Bug tree-optimization/60172] New: ARM performance regression from trunk@207239

public inbox for gcc-bugs@sourceware.org

* [Bug tree-optimization/60172] New: ARM performance regression from trunk@207239
@ 2014-02-13  9:54 joey.ye at arm dot com
  2014-02-14  8:20 ` [Bug tree-optimization/60172] " joey.ye at arm dot com
                   ` (24 more replies)
  0 siblings, 25 replies; 26+ messages in thread
From: joey.ye at arm dot com @ 2014-02-13  9:54 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

            Bug ID: 60172
           Summary: ARM performance regression from trunk@207239
           Product: gcc
           Version: 4.9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: joey.ye at arm dot com

Dhrystone on Cortex-M4 drops by 1.5% with this patch:

    2014-01-29  Richard Biener  <rguenther@suse.de>

        PR tree-optimization/58742
        * tree-ssa-forwprop.c (associate_pointerplus): Rename to
        associate_pointerplus_align.
        (associate_pointerplus_diff): New function.
        (associate_pointerplus): Likewise.  Call associate_pointerplus_align
        and associate_pointerplus_diff.

        * gcc.dg/pr58742-1.c: New testcase.
        * gcc.dg/pr58742-2.c: Likewise.
        * gcc.dg/pr58742-3.c: Likewise.


    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@207239

Options used: -O2 -fno-inline -fno-common


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/60172] ARM performance regression from trunk@207239
  2014-02-13  9:54 [Bug tree-optimization/60172] New: ARM performance regression from trunk@207239 joey.ye at arm dot com
@ 2014-02-14  8:20 ` joey.ye at arm dot com
  2014-02-14 10:22 ` rguenth at gcc dot gnu.org
                   ` (23 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: joey.ye at arm dot com @ 2014-02-14  8:20 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

--- Comment #2 from Joey Ye <joey.ye at arm dot com> ---
Created attachment 32131
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=32131&action=edit
The function that causes the regression

Attached are Proc_8 from Dhrystone, the header file, and good.s/bad.s.

It is the only function whose generated code differs with and without the
commit.



* [Bug tree-optimization/60172] ARM performance regression from trunk@207239
  2014-02-13  9:54 [Bug tree-optimization/60172] New: ARM performance regression from trunk@207239 joey.ye at arm dot com
  2014-02-14  8:20 ` [Bug tree-optimization/60172] " joey.ye at arm dot com
@ 2014-02-14 10:22 ` rguenth at gcc dot gnu.org
  2014-02-14 10:50 ` joey.ye at arm dot com
                   ` (22 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: rguenth at gcc dot gnu.org @ 2014-02-14 10:22 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
I can't really interpret the asm differences but it seems we need more
registers?

Forwprop applies the association transform (the same one fold-const.c
already does when presented with large enough GENERIC trees): it transforms
(p +p off1) +p off2 to (p +p (off1 + off2)), that is, it adds the two
offsets together first, computing the sum using unsigned integer
arithmetic, and then applies the result to the pointer.  That enables the
reassociation pass to process the offset expression and simplify it (that
pass cannot handle a pointer addition chain).
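
As a rough illustration of that transform in plain C (a sketch only, not
GCC internals; the function names chained and associated are made up for
this example, and size_t stands in for sizetype), both forms yield the same
address, but the second one exposes a single integer offset expression:

```c
#include <stddef.h>

/* (p +p off1) +p off2: two pointer additions chained together. */
char *chained(char *p, size_t off1, size_t off2)
{
    char *t = p + off1;
    return t + off2;
}

/* p +p (off1 + off2): the offsets are summed first in unsigned
   arithmetic, then applied to the pointer in one addition. */
char *associated(char *p, size_t off1, size_t off2)
{
    size_t off = off1 + off2;
    return p + off;
}
```

For any in-bounds offsets the two functions return the same pointer; only
the shape of the expression tree that later passes see is different.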

This happens in forwprop4 only - thus does -fdisable-tree-forwprop4 fix the
regression?

I really can't see a fundamental difference (other than the re-associated
adds) in the resulting code.  So I wonder which RTL transform does or does
not trigger with one of the variants.

On x86_64 the code difference with -O2 [-fno-tree-forwprop4] is

@@ -11,22 +11,25 @@
        .cfi_startproc
        leal    5(%rdx), %r8d
        movslq  %edx, %rdx
+       salq    $2, %rdx
        movslq  %r8d, %rax
        leaq    0(,%rax,4), %r9
-       addq    %r9, %rax
        leaq    (%rdi,%r9), %r10
-       leaq    (%rax,%rax,4), %rax
+       addq    %r9, %rax
        movl    %ecx, (%r10)
        movl    %ecx, 4(%rdi,%r9)
-       leaq    (%rsi,%rax,4), %rax
+       leaq    (%rax,%rax,4), %rcx
        movl    %r8d, 60(%rdi,%r9)
-       leaq    (%rax,%rdx,4), %rax
+       salq    $2, %rcx
+       leaq    (%rdx,%rcx), %rax
+       addq    %rsi, %rax
        addl    $1, 16(%rax)
        movl    %r8d, 20(%rax)
        movl    %r8d, 24(%rax)
-       movl    (%r10), %edx
+       movl    (%r10), %edi
+       leaq    1000(%rsi,%rcx), %rax
        movl    $5, Int_Glob(%rip)
-       movl    %edx, 1020(%rax)
+       movl    %edi, 20(%rdx,%rax)
        ret
        .cfi_endproc

If we look at immediate uses before RTL expansion, the relevant changes
(a single-use -> non-single-use change or vice versa, which affects whether
combine/fwprop can act) are

-_32 : --> single use.
+_32 : -->2 uses.
+_16 = _41 + _32;
 _33 = Arr_2_Par_Ref_22(D) + _32;

which happens when associating

   _32 = pretmp_20 + 1000;
   _33 = Arr_2_Par_Ref_22(D) + _32;
   _34 = *_8;
-  _51 = _33 + _41;
+  _16 = _41 + _32;
+  _51 = Arr_2_Par_Ref_22(D) + _16;
   MEM[(int[25] *)_51 + 20B] = _34;

but _33 is dead after the transform.

+_33 : --> no uses

so that's a spurious difference.  Stmts with no uses are not expanded,
but it seems to change what TER does.  Hmm.

-_32 replace with --> _32 = pretmp_20 + 1000;
-

Killing dead stmts with

Index: gcc/tree-outof-ssa.c
===================================================================
--- gcc/tree-outof-ssa.c        (revision 207757)
+++ gcc/tree-outof-ssa.c        (working copy)
@@ -876,6 +876,21 @@ eliminate_useless_phis (void)
            }
        }
     }
+
+  for (unsigned i = 1; i < num_ssa_names; ++i)
+    {
+      tree name = ssa_name (i);
+      if (!name || !has_zero_uses (name) || virtual_operand_p (name))
+       continue;
+      gimple def_stmt = SSA_NAME_DEF_STMT (name);
+      if (!is_gimple_assign (def_stmt)
+         || gimple_has_side_effects (def_stmt)
+         || stmt_could_throw_p (def_stmt))
+       continue;
+      gimple_stmt_iterator gsi = gsi_for_stmt (def_stmt);
+      gsi_remove (&gsi, true);
+      release_defs (def_stmt);
+    }
 }


fixes that (hack alert).  With that we get strictly more TER.  Does
-fno-tree-ter also make the testcase regress, even with
-fdisable-tree-forwprop4?



* [Bug tree-optimization/60172] ARM performance regression from trunk@207239
  2014-02-13  9:54 [Bug tree-optimization/60172] New: ARM performance regression from trunk@207239 joey.ye at arm dot com
  2014-02-14  8:20 ` [Bug tree-optimization/60172] " joey.ye at arm dot com
  2014-02-14 10:22 ` rguenth at gcc dot gnu.org
@ 2014-02-14 10:50 ` joey.ye at arm dot com
  2014-02-14 12:19 ` rguenth at gcc dot gnu.org
                   ` (21 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: joey.ye at arm dot com @ 2014-02-14 10:50 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

--- Comment #4 from Joey Ye <joey.ye at arm dot com> ---
-fdisable-tree-forwprop4 doesn't help. -fno-tree-ter makes it even worse.



* [Bug tree-optimization/60172] ARM performance regression from trunk@207239
  2014-02-13  9:54 [Bug tree-optimization/60172] New: ARM performance regression from trunk@207239 joey.ye at arm dot com
                   ` (2 preceding siblings ...)
  2014-02-14 10:50 ` joey.ye at arm dot com
@ 2014-02-14 12:19 ` rguenth at gcc dot gnu.org
  2014-02-14 14:03 ` rguenth at gcc dot gnu.org
                   ` (20 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: rguenth at gcc dot gnu.org @ 2014-02-14 12:19 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Joey Ye from comment #4)
> -fdisable-tree-forwprop4 doesn't help. -fno-tree-ter makes it even worse.

The former is strange, because it's the only pass that does something that
is changed by the patch?  As said, make sure to include the fix for PR59993
in your testing.

Does -fno-tree-forwprop fix the regression?



* [Bug tree-optimization/60172] ARM performance regression from trunk@207239
  2014-02-13  9:54 [Bug tree-optimization/60172] New: ARM performance regression from trunk@207239 joey.ye at arm dot com
                   ` (3 preceding siblings ...)
  2014-02-14 12:19 ` rguenth at gcc dot gnu.org
@ 2014-02-14 14:03 ` rguenth at gcc dot gnu.org
  2014-02-17  9:56 ` joey.ye at arm dot com
                   ` (19 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: rguenth at gcc dot gnu.org @ 2014-02-14 14:03 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note that we can probably avoid regressing TER by removing the dead stmt
in forwprop itself (which would be appropriate at this stage).

But as that doesn't help this still needs more analysis.



* [Bug tree-optimization/60172] ARM performance regression from trunk@207239
  2014-02-13  9:54 [Bug tree-optimization/60172] New: ARM performance regression from trunk@207239 joey.ye at arm dot com
                   ` (4 preceding siblings ...)
  2014-02-14 14:03 ` rguenth at gcc dot gnu.org
@ 2014-02-17  9:56 ` joey.ye at arm dot com
  2014-02-17 10:07 ` rguenther at suse dot de
                   ` (18 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: joey.ye at arm dot com @ 2014-02-17  9:56 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

--- Comment #7 from Joey Ye <joey.ye at arm dot com> ---
(In reply to Richard Biener from comment #5)
> (In reply to Joey Ye from comment #4)
> > -fdisable-tree-forwprop4 doesn't help. -fno-tree-ter makes it even worse.
> 
> The former is strange because it's the only pass that does sth that is
> changed by the patch?  As said, make sure to include the fix for PR59993
> in your testing.
> 
> Does -fno-tree-forwprop fix the regression?

I'm sorry, what I meant was: -fdisable-tree-forwprop4 didn't make the
benchmark faster. Actually, with -fdisable-tree-forwprop4 both revisions,
before and after 207239, get the same lower score.

207239 O2: low
207238 O2: high
207239 O2 -fdisable-tree-forwprop4: low
207238 O2 -fdisable-tree-forwprop4: low



* [Bug tree-optimization/60172] ARM performance regression from trunk@207239
  2014-02-13  9:54 [Bug tree-optimization/60172] New: ARM performance regression from trunk@207239 joey.ye at arm dot com
                   ` (5 preceding siblings ...)
  2014-02-17  9:56 ` joey.ye at arm dot com
@ 2014-02-17 10:07 ` rguenther at suse dot de
  2014-02-19 11:19 ` joey.ye at arm dot com
                   ` (17 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: rguenther at suse dot de @ 2014-02-17 10:07 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

--- Comment #9 from rguenther at suse dot de <rguenther at suse dot de> ---
On Mon, 17 Feb 2014, joey.ye at arm dot com wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172
> 
> --- Comment #8 from Joey Ye <joey.ye at arm dot com> ---
> Here is tree dump and diff of 133t.forwprop4
>   <bb 2>:
>   Int_Index_4 = Int_1_Par_Val_3(D) + 5;
>   Int_Loc.0_5 = (unsigned int) Int_Index_4;
>   _6 = Int_Loc.0_5 * 4;
>   _8 = Arr_1_Par_Ref_7(D) + _6;
>   *_8 = Int_2_Par_Val_10(D);
>   _13 = _6 + 4;
>   _14 = Arr_1_Par_Ref_7(D) + _13;
>   *_14 = Int_2_Par_Val_10(D);
>   _17 = _6 + 60;
>   _18 = Arr_1_Par_Ref_7(D) + _17;
>   *_18 = Int_Index_4;
>   pretmp_20 = Int_Loc.0_5 * 100;
>   pretmp_2 = Arr_2_Par_Ref_22(D) + pretmp_20;
>   _42 = (sizetype) Int_1_Par_Val_3(D);
>   _41 = _42 * 4;
> -  _40 = pretmp_2 + _41; // good
> +  _12 = _41 + pretmp_20; // bad
> +  _40 = Arr_2_Par_Ref_22(D) + _12;  // bad
>   MEM[(int[25] *)_40 + 20B] = Int_Index_4;
>   MEM[(int[25] *)_40 + 24B] = Int_Index_4;
>   _29 = MEM[(int[25] *)_40 + 16B];
>   _30 = _29 + 1;
>   MEM[(int[25] *)_40 + 16B] = _30;
>   _32 = pretmp_20 + 1000;
>   _33 = Arr_2_Par_Ref_22(D) + _32;
>   _34 = *_8;
> -  _51 = _33 + _41;  // good
> +  _16 = _41 + _32;  // bad
> +  _51 = Arr_2_Par_Ref_22(D) + _16;  // bad
> 
>   MEM[(int[25] *)_51 + 20B] = _34;
>   Int_Glob = 5;
>   return;

But that doesn't make sense - it means that -fdisable-tree-forwprop4
should get numbers back to good speed, no?  Because that's the
only change forwprop4 does.

For completeness please base checks on r207316 (it contains a fix
for the blamed revision, but as far as I can see it shouldn't make
a difference for the testcase).

Did you check whether my hackish patch fixes things?



* [Bug tree-optimization/60172] ARM performance regression from trunk@207239
  2014-02-13  9:54 [Bug tree-optimization/60172] New: ARM performance regression from trunk@207239 joey.ye at arm dot com
                   ` (6 preceding siblings ...)
  2014-02-17 10:07 ` rguenther at suse dot de
@ 2014-02-19 11:19 ` joey.ye at arm dot com
  2014-02-19 11:21 ` joey.ye at arm dot com
                   ` (16 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: joey.ye at arm dot com @ 2014-02-19 11:19 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

--- Comment #10 from Joey Ye <joey.ye at arm dot com> ---
(In reply to rguenther@suse.de from comment #9)
> On Mon, 17 Feb 2014, joey.ye at arm dot com wrote:
> 
> 
> But that doesn't make sense - it means that -fdisable-tree-forwprop4
> should get numbers back to good speed, no?  Because that's the
> only change forwprop4 does.
-fdisable-tree-forwprop4 prevents other transformations and results in
slightly worse code than before. So the number isn't back to the best. I
think forwprop4 does some good stuff here and disabling it isn't the solution.
> 
> For completeness please base checks on r207316 (it contains a fix
> for the blamed revision, but as far as I can see it shouldn't make
> a difference for the testcase).
I'm playing with r207686 and it is the same for this case.
> 
> Did you check whether my hackish patch fixes things?
I did, with trunk 20140208, but it didn't make any difference to Proc_8.



* [Bug tree-optimization/60172] ARM performance regression from trunk@207239
  2014-02-13  9:54 [Bug tree-optimization/60172] New: ARM performance regression from trunk@207239 joey.ye at arm dot com
                   ` (7 preceding siblings ...)
  2014-02-19 11:19 ` joey.ye at arm dot com
@ 2014-02-19 11:21 ` joey.ye at arm dot com
  2014-02-19 23:06 ` steven at gcc dot gnu.org
                   ` (15 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: joey.ye at arm dot com @ 2014-02-19 11:21 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

--- Comment #11 from Joey Ye <joey.ye at arm dot com> ---
Repost from another record. It is annoying that after commenting one record it
automatically jumps to the next.

Here is good expansion:
;; _41 = _42 * 4;

(insn 20 19 0 (set (reg:SI 126 [ D.5038 ])
        (ashift:SI (reg/v:SI 131 [ Int_1_Par_Val ])
            (const_int 2 [0x2]))) -1
     (nil))

;; _40 = _2 + _41;

(insn 21 20 22 (set (reg:SI 136 [ D.5035 ])
        (plus:SI (reg/v/f:SI 130 [ Arr_2_Par_Ref ])
            (reg:SI 119 [ D.5036 ]))) -1
     (nil))

(insn 22 21 0 (set (reg/f:SI 125 [ D.5035 ])
        (plus:SI (reg:SI 136 [ D.5035 ])
            (reg:SI 126 [ D.5038 ]))) -1
     (nil))


;; MEM[(int[25] *)_51 + 20B] = _34;

(insn 29 28 30 (set (reg:SI 139)
        (plus:SI (reg/v/f:SI 130 [ Arr_2_Par_Ref ])
            (reg:SI 119 [ D.5036 ]))) Proc_8.c:23 -1
     (nil))

(insn 30 29 31 (set (reg:SI 140)
        (plus:SI (reg:SI 139)
            (reg:SI 126 [ D.5038 ]))) Proc_8.c:23 -1
     (nil))

(insn 31 30 32 (set (reg/f:SI 141)
        (plus:SI (reg:SI 140)
            (const_int 1000 [0x3e8]))) Proc_8.c:23 -1
     (nil))

(insn 32 31 0 (set (mem:SI (plus:SI (reg/f:SI 141)
                (const_int 20 [0x14])) [2 MEM[(int[25] *)_51 + 20B]+0 S4 A32])
        (reg:SI 124 [ D.5039 ])) Proc_8.c:23 -1
     (nil))

After cse1, reg 140 can be replaced by reg 125, which leads to a series of
transformations that make the code much more efficient.

Here is bad expansion:
;; _40 = Arr_2_Par_Ref_22(D) + _12;

(insn 22 21 23 (set (reg:SI 138 [ D.5038 ])
        (plus:SI (reg:SI 128 [ D.5038 ])
            (reg:SI 121 [ D.5036 ]))) -1
     (nil))

(insn 23 22 0 (set (reg/f:SI 127 [ D.5035 ])
        (plus:SI (reg/v/f:SI 132 [ Arr_2_Par_Ref ])
            (reg:SI 138 [ D.5038 ]))) -1
     (nil))

;; _32 = _20 + 1000;

(insn 29 28 0 (set (reg:SI 124 [ D.5038 ])
        (plus:SI (reg:SI 121 [ D.5036 ])
            (const_int 1000 [0x3e8]))) Proc_8.c:23 -1
     (nil))

;; MEM[(int[25] *)_51 + 20B] = _34;

(insn 32 31 33 (set (reg:SI 141)
        (plus:SI (reg/v/f:SI 132 [ Arr_2_Par_Ref ])
            (reg:SI 124 [ D.5038 ]))) Proc_8.c:23 -1
     (nil))

(insn 33 32 34 (set (reg/f:SI 142)
        (plus:SI (reg:SI 141)
            (reg:SI 128 [ D.5038 ]))) Proc_8.c:23 -1
     (nil))

(insn 34 33 0 (set (mem:SI (plus:SI (reg/f:SI 142)
                (const_int 20 [0x14])) [2 MEM[(int[25] *)_51 + 20B]+0 S4 A32])
        (reg:SI 126 [ D.5039 ])) Proc_8.c:23 -1
     (nil))

Here CSE doesn't happen, resulting in less optimal insns. The reason why CSE
doesn't happen is not yet clear.



* [Bug tree-optimization/60172] ARM performance regression from trunk@207239
  2014-02-13  9:54 [Bug tree-optimization/60172] New: ARM performance regression from trunk@207239 joey.ye at arm dot com
                   ` (8 preceding siblings ...)
  2014-02-19 11:21 ` joey.ye at arm dot com
@ 2014-02-19 23:06 ` steven at gcc dot gnu.org
  2014-02-20 10:02 ` rguenther at suse dot de
                   ` (14 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: steven at gcc dot gnu.org @ 2014-02-19 23:06 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

Steven Bosscher <steven at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |steven at gcc dot gnu.org

--- Comment #12 from Steven Bosscher <steven at gcc dot gnu.org> ---
(In reply to Joey Ye from comment #11)

Sometimes it helps to use -fdump-rtl-slim. Matter of taste but I find
that much easier to interpret than LISP-like RTL dumps.

Annotated "good expansion":
;; _41 = _42 * 4;
20: r126=r131<<2

;; _40 = _2 + _41;
21: r136=r130+r119  // r136=Arr_2_Par_Ref+r119
22: r125=r136+r126  // r125=Arr_2_Par_Ref+r119+r131<<2

;; MEM[(int[25] *)_51 + 20B] = _34;
29: r139=r130+r119  // r139=Arr_2_Par_Ref+r119
30: r140=r139+r126  // r140=Arr_2_Par_Ref+r119+r131<<2 (==r125)
31: r141=r140+1000  // r141=Arr_2_Par_Ref+r119+r131<<2+1000 (==r125+1000)
32: [r141+20]=r124

In this case, the RHS for the SETs of r140 and r125 are lexically
identical for value numbering, so the job for CSE is easy.


Annotated "bad expansion":
;; _40 = Arr_2_Par_Ref_22(D) + _12;
22: r138=r128+r121        
23: r127=r132+r138  // r127=Arr_2_Par_Ref+r128+r121

;; _32 = _20 + 1000;
29: r124=r121+1000

;; MEM[(int[25] *)_51 + 20B] = _34;
32: r141=r132+r124  // r141=Arr_2_Par_Ref+r121+1000
33: r142=r141+r128  // r142=Arr_2_Par_Ref+r128+r121+1000 (==r127+1000)
34: [r142+20]=r126

Here, the "+1000" confuses CSE. The sets of r127 and r142 have a common
sub-expression as value, but none of the sub-expressions are lexically 
identical.  RTL CSE has limited ability to look through sub-expressions
to identify "same value" sub-expressions (anchors, base regs, etc.) but
apparently this case is too complex for it to handle.
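
To make the difference concrete, here is the same arithmetic transcribed
into C (a sketch only; good_addr, bad_addr and the parameter names base,
off and idx4 are invented for illustration, with the register and insn
numbers from the dumps kept in the comments):

```c
#include <stdint.h>

/* Good expansion: the store address is built on top of the very same
   subexpression (base + off + idx4) that was computed for r125. */
uintptr_t good_addr(uintptr_t base, uintptr_t off, uintptr_t idx4)
{
    uintptr_t r139 = base + off;     /* insn 29 */
    uintptr_t r140 = r139 + idx4;    /* insn 30, same form as r125 */
    uintptr_t r141 = r140 + 1000;    /* insn 31 */
    return r141 + 20;                /* address used by insn 32 */
}

/* Bad expansion: 1000 is folded into the offset first, so no
   intermediate value has the same form as r127 = base + (idx4 + off). */
uintptr_t bad_addr(uintptr_t base, uintptr_t off, uintptr_t idx4)
{
    uintptr_t r124 = off + 1000;     /* insn 29 */
    uintptr_t r141 = base + r124;    /* insn 32 */
    uintptr_t r142 = r141 + idx4;    /* insn 33 */
    return r142 + 20;                /* address used by insn 34 */
}
```

Both functions return the same address for the same inputs; what differs
is only which intermediate values exist for a lexical matcher to find.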



* [Bug tree-optimization/60172] ARM performance regression from trunk@207239
  2014-02-13  9:54 [Bug tree-optimization/60172] New: ARM performance regression from trunk@207239 joey.ye at arm dot com
                   ` (9 preceding siblings ...)
  2014-02-19 23:06 ` steven at gcc dot gnu.org
@ 2014-02-20 10:02 ` rguenther at suse dot de
  2014-04-14  7:58 ` [Bug tree-optimization/60172] [4.9/4.10 Regression] " rguenth at gcc dot gnu.org
                   ` (13 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: rguenther at suse dot de @ 2014-02-20 10:02 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

--- Comment #13 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 19 Feb 2014, steven at gcc dot gnu.org wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172
> 
> Steven Bosscher <steven at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |steven at gcc dot gnu.org
> 
> --- Comment #12 from Steven Bosscher <steven at gcc dot gnu.org> ---
> (In reply to Joey Ye from comment #11)
> 
> Sometimes it helps to use -fdump-rtl-slim. Matter of taste but I find
> that much easier to interpret than LISP-like RTL dumps.
> 
> Annotated "good expansion":
> ;; _41 = _42 * 4;
> 20: r126=r131<<2
> 
> ;; _40 = _2 + _41;
> 21: r136=r130+r119  // r136=Arr_2_Par_Ref+r119
> 22: r125=r136+r126  // r125=Arr_2_Par_Ref+r119+r131<<2
> 
> ;; MEM[(int[25] *)_51 + 20B] = _34;
> 29: r139=r130+r119  // r139=Arr_2_Par_Ref+r119
> 30: r140=r139+r126  // r140=Arr_2_Par_Ref+r119+r131<<2 (==r125)
> 31: r141=r140+1000  // r141=Arr_2_Par_Ref+r119+r131<<2+1000 (==r125+1000)
> 32: [r141+20]=r124
> 
> In this case, the RHS for the SETs of r140 and r125 are lexically
> identical for value numbering, so the job for CSE is easy.
> 
> 
> Annotated "bad expansion":
> ;; _40 = Arr_2_Par_Ref_22(D) + _12;
> 22: r138=r128+r121        
> 23: r127=r132+r138  // r127=Arr_2_Par_Ref+r128+r121
> 
> ;; _32 = _20 + 1000;
> 29: r124=r121+1000
> 
> ;; MEM[(int[25] *)_51 + 20B] = _34;
> 32: r141=r132+r124  // r141=Arr_2_Par_Ref+r121+1000
> 33: r142=r141+r128  // r142=Arr_2_Par_Ref+r128+r121+1000 (==r127+1000)

(==r138+1000)

> 34: [r142+20]=r126
> 
> Here, the "+1000" confuses CSE. The sets of r127 and r142 have a common
> sub-expression as value, but none of the sub-expressions are lexically 
> identical.  RTL CSE has limited ability to look through sub-expressions
> to identify "same value" sub-expressions (anchors, base regs, etc.) but
> apparently this case is too complex for it to handle.

So expansion generates "better" code (a single insn covering the
two adds), caused by expanding a chain of two regular PLUS_EXPR
rather than a chain of two POINTER_PLUS_EXPRs.

That's of course unfortunate - but I can't see how this is anything
other than a missed optimization in CSE ...

On the GIMPLE level before expansion we have

 _40 = Arr_2_Par_Ref_22(D) + (_41 + pretmp_20);

 _51 = Arr_2_Par_Ref_22(D) + (_41 + (pretmp_20 + 1000));

thus a similar issue - missed CSE due to bad association (and to
not having a CSE pass after forwprop exposed the opportunity).
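
In C terms, the association that would expose the common subexpression
looks roughly like this (a sketch; the helper names are hypothetical and
the parameters only mirror the SSA names above, with size_t standing in
for the unsigned offset type):

```c
#include <stddef.h>

/* _40 = Arr_2_Par_Ref + (_41 + pretmp_20) */
size_t addr_40(size_t arr, size_t v41, size_t pretmp20)
{
    return arr + (v41 + pretmp20);
}

/* _51 as it stands before expansion: no subexpression here has the
   same form as _40. */
size_t addr_51_as_is(size_t arr, size_t v41, size_t pretmp20)
{
    return arr + (v41 + (pretmp20 + 1000));
}

/* _51 with the constant moved outermost: _40 reappears as a whole
   subexpression and could be reused instead of recomputed. */
size_t addr_51_reassociated(size_t arr, size_t v41, size_t pretmp20)
{
    return addr_40(arr, v41, pretmp20) + 1000;
}
```

Because the arithmetic is unsigned, the two forms of _51 always agree; the
reassociated one simply makes the shared work visible.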

Unfortunately we expose the opportunity by late complete unrolling
only because early unrolling says

size: 7-2, last_iteration: 3-0
  Loop size: 7
  Estimated size after unrolling: 8
Not unrolling loop 1: size would grow.

and you can't make it unroll that loop (outer loops are only ever
unrolled early if doing so doesn't increase code-size).

Now the order is late unroll - reassoc - DOM - forwprop, which is
exactly the wrong way around to eventually catch the CSE opportunity
at the GIMPLE level; it would need to be late unroll - forwprop -
reassoc - DOM.

Richard.



* [Bug tree-optimization/60172] [4.9/4.10 Regression] ARM performance regression from trunk@207239
  2014-02-13  9:54 [Bug tree-optimization/60172] New: ARM performance regression from trunk@207239 joey.ye at arm dot com
                   ` (10 preceding siblings ...)
  2014-02-20 10:02 ` rguenther at suse dot de
@ 2014-04-14  7:58 ` rguenth at gcc dot gnu.org
  2014-05-09  8:51 ` thomas.preudhomme at arm dot com
                   ` (12 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: rguenth at gcc dot gnu.org @ 2014-04-14  7:58 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |4.9.1
            Summary|[4.9 regression] ARM        |[4.9/4.10 Regression] ARM
                   |performance regression from |performance regression from
                   |trunk@207239                |trunk@207239



* [Bug tree-optimization/60172] [4.9/4.10 Regression] ARM performance regression from trunk@207239
  2014-02-13  9:54 [Bug tree-optimization/60172] New: ARM performance regression from trunk@207239 joey.ye at arm dot com
                   ` (11 preceding siblings ...)
  2014-04-14  7:58 ` [Bug tree-optimization/60172] [4.9/4.10 Regression] " rguenth at gcc dot gnu.org
@ 2014-05-09  8:51 ` thomas.preudhomme at arm dot com
  2014-05-15  3:29 ` thomas.preudhomme at arm dot com
                   ` (11 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: thomas.preudhomme at arm dot com @ 2014-05-09  8:51 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

Thomas Preud'homme <thomas.preudhomme at arm dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |thomas.preudhomme at arm dot com

--- Comment #14 from Thomas Preud'homme <thomas.preudhomme at arm dot com> ---
(In reply to Steven Bosscher from comment #12)
> Annotated "bad expansion":
> ;; _40 = Arr_2_Par_Ref_22(D) + _12;
> 22: r138=r128+r121		
> 23: r127=r132+r138  // r127=Arr_2_Par_Ref+r128+r121
> 
> ;; _32 = _20 + 1000;
> 29: r124=r121+1000
> 
> ;; MEM[(int[25] *)_51 + 20B] = _34;
> 32: r141=r132+r124  // r141=Arr_2_Par_Ref+r121+1000
> 33: r142=r141+r128  // r142=Arr_2_Par_Ref+r128+r121+1000 (==r127+1000)
> 34: [r142+20]=r126

So in gimple the two offsets are added first and then added to the pointer
while after expansion the first offset is added to the pointer and then the
second offset. Is it normal that the order of operations seems to change?

* [Bug tree-optimization/60172] [4.9/4.10 Regression] ARM performance regression from trunk@207239
From: rguenther at suse dot de @ 2014-05-09  8:56 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

--- Comment #15 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 9 May 2014, thomas.preudhomme at arm dot com wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172
> 
> Thomas Preud'homme <thomas.preudhomme at arm dot com> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |thomas.preudhomme at arm dot com
> 
> --- Comment #14 from Thomas Preud'homme <thomas.preudhomme at arm dot com> ---
> (In reply to Steven Bosscher from comment #12)
> > Annotated "bad expansion":
> > ;; _40 = Arr_2_Par_Ref_22(D) + _12;
> > 22: r138=r128+r121		
> > 23: r127=r132+r138  // r127=Arr_2_Par_Ref+r128+r121
> > 
> > ;; _32 = _20 + 1000;
> > 29: r124=r121+1000
> > 
> > ;; MEM[(int[25] *)_51 + 20B] = _34;
> > 32: r141=r132+r124  // r141=Arr_2_Par_Ref+r121+1000
> > 33: r142=r141+r128  // r142=Arr_2_Par_Ref+r128+r121+1000 (==r127+1000)
> > 34: [r142+20]=r126
> 
> So in gimple the two offsets are added first and then added to the pointer
> while after expansion the first offset is added to the pointer and then the
> second offset. Is it normal that the order of operations seems to change?

Yes, that's TER at work

* [Bug other/61124] New: GCC manual has 68HC11/68HC12 info
From: john.s.kallal at gmail dot com @ 2014-05-09  9:32 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=61124

            Bug ID: 61124
           Summary: GCC manual has 68HC11/68HC12 info
           Product: gcc
           Version: 4.9.0
            Status: UNCONFIRMED
          Severity: minor
          Priority: P3
         Component: other
          Assignee: unassigned at gcc dot gnu.org
          Reporter: john.s.kallal at gmail dot com

Bug description:
  The GCC version 4.8.3 manual, pages 379 and 389 (PDF file version), talks
about the 68HC11/68HC12 micro-controllers.
  Support for the 68HC11/68HC12 micro-controllers was declared obsolete in
GCC v4.6.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/60172] [4.9/4.10 Regression] ARM performance regression from trunk@207239
  2014-02-13  9:54 [Bug tree-optimization/60172] New: ARM performance regression from trunk@207239 joey.ye at arm dot com
                   ` (12 preceding siblings ...)
  2014-05-09  8:51 ` thomas.preudhomme at arm dot com
@ 2014-05-15  3:29 ` thomas.preudhomme at arm dot com
  2014-05-15  8:01 ` rguenth at gcc dot gnu.org
                   ` (10 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: thomas.preudhomme at arm dot com @ 2014-05-15  3:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

--- Comment #16 from Thomas Preud'homme <thomas.preudhomme at arm dot com> ---
Hi Richard,

could you expand on what you said in comment #13? I don't see how reassoc could
help cse here. From what I understood, reassoc tries to group per rank. In our
case, we have (view of the source with loop unrolling):

Arr_2_Par_Ref [Int_Loc] [Int_Loc] = Int_Loc;
/* some stmts */
Arr_2_Par_Ref [Int_Loc+10] [Int_Loc] = Arr_1_Par_Ref [Int_Loc];

If I'm not mistaken, in the first case you'd have:

Int_Loc * 4
Int_Loc * 100
Arr_2_Par_Ref

that would be added together with several statements. However in the second
case you'd have:

Int_Loc * 4
Int_Loc * 100
1000
Arr_2_Par_Ref

that would be added together with several statements. I don't see how 1000
could be added first or last; it seems to me that it's always going to be in
an intermediate statement and thus not all redundancy would be eliminated by
CSE.
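
To make the operand lists above concrete, here is a minimal sketch of the
address arithmetic, assuming rows of 25 ints (as the int[25] in the dump
suggests). The names row_t, elem_offset and elem_offset_measured are
hypothetical and only serve the illustration; with a 4-byte int a row is 100
bytes, so moving 10 rows down adds 1000 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the array row type: 25 ints per row,
   as suggested by the int[25] in the GIMPLE dump.  */
typedef int row_t[25];

/* Byte offset of arr[row][col] from arr, built from the same terms
   as the operand lists: row * sizeof(row) + col * sizeof(int).  */
static size_t elem_offset(size_t row, size_t col)
{
    assert(col < 25);
    return row * sizeof(row_t) + col * sizeof(int);
}

/* The same offset, measured through real pointer arithmetic.  */
static size_t elem_offset_measured(row_t *arr, size_t row, size_t col)
{
    return (size_t)((char *)&arr[row][col] - (char *)arr);
}
```

For any in-bounds index i, elem_offset(i + 10, i) - elem_offset(i, i) equals
10 * sizeof(row_t), which is the constant 1000 when int is 4 bytes wide.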

Please let me know if my reasoning is flawed so that I can progress toward a
solution.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/60172] [4.9/4.10 Regression] ARM performance regression from trunk@207239
  2014-02-13  9:54 [Bug tree-optimization/60172] New: ARM performance regression from trunk@207239 joey.ye at arm dot com
                   ` (13 preceding siblings ...)
  2014-05-15  3:29 ` thomas.preudhomme at arm dot com
@ 2014-05-15  8:01 ` rguenth at gcc dot gnu.org
  2014-05-15  8:54 ` thomas.preudhomme at arm dot com
                   ` (9 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: rguenth at gcc dot gnu.org @ 2014-05-15  8:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

--- Comment #17 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Thomas Preud'homme from comment #16)
> Hi Richard,
> 
> could you expand on what you said in comment #13? I don't see how reassoc
> could help cse here. From what I understood, reassoc tries to group per
> rank. In our case, we have (view of the source with loop unrolling):
> 
> Arr_2_Par_Ref [Int_Loc] [Int_Loc] = Int_Loc;
> /* some stmts */
> Arr_2_Par_Ref [Int_Loc+10] [Int_Loc] = Arr_1_Par_Ref [Int_Loc];
> 
> If I'm not mistaken, in the first case you'd have:
> 
> Int_Loc * 4
> Int_Loc * 100
> Arr_2_Par_Ref
> 
> that would be added together with several statements. However in the second
> case you'd have:
> 
> Int_Loc * 4
> Int_Loc * 100
> 1000
> Arr_2_Par_Ref
> 
> that would be added together with several statements. I don't see how 1000
> could be added first or last; it seems to me that it's always going to be
> in an intermediate statement and thus not all redundancy would be eliminated
> by CSE.
> 
> Please let me know if my reasoning is flawed so that I can progress toward
> a solution.

Citing myself:

On the GIMPLE level before expansion we have

 _40 = Arr_2_Par_Ref_22(D) + (_41 + pretmp_20);

 _51 = Arr_2_Par_Ref_22(D) + (_41 + (pretmp_20 + 1000));

so if _51 were Arr_2_Par_Ref_22(D) + ((_41 + pretmp_20) + 1000);

then _41 + pretmp_20 would be fully redundant with the expression needed
by _40.
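
A small sketch of the two associations, modelling the addresses as unsigned
integers so that the arithmetic is well defined. The function names are
hypothetical; only the shape of the expressions comes from the dump above.
It is meant to show which intermediate value becomes shareable, not how
reassoc or forwprop would actually produce it.

```c
#include <assert.h>
#include <stdint.h>

/* _40 as in the dump: base + (x41 + pretmp20).  */
static uintptr_t addr_40(uintptr_t base, uintptr_t x41, uintptr_t pretmp20)
{
    return base + (x41 + pretmp20);
}

/* _51 as in the dump: the constant is folded into the inner sum, so no
   intermediate value is shared with the computation of _40.  */
static uintptr_t addr_51_as_dumped(uintptr_t base, uintptr_t x41,
                                   uintptr_t pretmp20)
{
    return base + (x41 + (pretmp20 + 1000u));
}

/* _51 with the proposed association: (x41 + pretmp20) is now exactly the
   expression needed by _40 and could be computed once.  */
static uintptr_t addr_51_proposed(uintptr_t base, uintptr_t x41,
                                  uintptr_t pretmp20)
{
    uintptr_t shared = x41 + pretmp20;
    return base + (shared + 1000u);
}

/* Both associations name the same address; only the sharing differs.  */
static void check_associations(uintptr_t base, uintptr_t x41,
                               uintptr_t pretmp20)
{
    assert(addr_51_as_dumped(base, x41, pretmp20)
           == addr_51_proposed(base, x41, pretmp20));
    assert(addr_51_proposed(base, x41, pretmp20)
           == addr_40(base, x41, pretmp20) + 1000u);
}
```

The second assertion is the stronger fact (_51 == _40 + 1000); sharing only
the inner sum does not by itself expose it.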

Note that IIRC one issue with TER is that it is no longer happening as
there are dead stmts around that confuse its has_single_use logic.  Thus
placing a dce pass right before expand would fix that and might be a good
idea anyway (see comment #3).  Implementing a "proper" poor man's SSA-based
DCE would be a good way out (out-of-SSA already has one to remove dead
PHIs).


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/60172] [4.9/4.10 Regression] ARM performance regression from trunk@207239
  2014-02-13  9:54 [Bug tree-optimization/60172] New: ARM performance regression from trunk@207239 joey.ye at arm dot com
                   ` (14 preceding siblings ...)
  2014-05-15  8:01 ` rguenth at gcc dot gnu.org
@ 2014-05-15  8:54 ` thomas.preudhomme at arm dot com
  2014-05-15  9:51 ` thomas.preudhomme at arm dot com
                   ` (8 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: thomas.preudhomme at arm dot com @ 2014-05-15  8:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

--- Comment #18 from Thomas Preud'homme <thomas.preudhomme at arm dot com> ---
(In reply to Richard Biener from comment #17)
> 
> Citing myself:
> 
> On the GIMPLE level before expansion we have
> 
>  _40 = Arr_2_Par_Ref_22(D) + (_41 + pretmp_20);
> 
>  _51 = Arr_2_Par_Ref_22(D) + (_41 + (pretmp_20 + 1000));
> 
> so if _51 were Arr_2_Par_Ref_22(D) + ((_41 + pretmp_20) + 1000);
> 
> then _41 + pretmp_20 would be fully redundant with the expression needed
> by _40.

Yes, I saw that, but I was wondering why reassoc would try this association
rather than another, since the header of the file doesn't mention any special
treatment of explicit integer constants.

Besides, wouldn't it still miss the fact that _51 = _40 + 1000?

> 
> Note that IIRC one issue with TER is that it is no longer happening as
> there are dead stmts around that confuse its has_single_use logic.  Thus
> placing a dce pass right before expand would fix that and might be a good
> idea anyway (see comment #3).  Implementing a "proper" poor-mans SSA-based
> DCE would be a good way out (out-of-SSA already has one to remove dead
> PHIs).

Ok


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/60172] [4.9/4.10 Regression] ARM performance regression from trunk@207239
  2014-02-13  9:54 [Bug tree-optimization/60172] New: ARM performance regression from trunk@207239 joey.ye at arm dot com
                   ` (15 preceding siblings ...)
  2014-05-15  8:54 ` thomas.preudhomme at arm dot com
@ 2014-05-15  9:51 ` thomas.preudhomme at arm dot com
  2014-05-15 10:12 ` rguenther at suse dot de
                   ` (7 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: thomas.preudhomme at arm dot com @ 2014-05-15  9:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

--- Comment #20 from Thomas Preud'homme <thomas.preudhomme at arm dot com> ---
(In reply to rguenther@suse.de from comment #19)
> On Thu, 15 May 2014, thomas.preudhomme at arm dot com wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172
> > 
> > --- Comment #18 from Thomas Preud'homme <thomas.preudhomme at arm dot com> ---
> > (In reply to Richard Biener from comment #17)
> > > 
> > > Citing myself:
> > > 
> > > On the GIMPLE level before expansion we have
> > > 
> > >  _40 = Arr_2_Par_Ref_22(D) + (_41 + pretmp_20);
> > > 
> > >  _51 = Arr_2_Par_Ref_22(D) + (_41 + (pretmp_20 + 1000));
> > > 
> > > so if _51 were Arr_2_Par_Ref_22(D) + ((_41 + pretmp_20) + 1000);
> > > 
> > > then _41 + pretmp_20 would be fully redundant with the expression needed
> > > by _40.
> > 
> > Yes I saw that but I was wondering why would reassoc try this association
> > rather than another since the header of the file doesn't mention any special
> > treatment of explicit integer constants.
> > 
> > Besides, wouldn't it still miss the fact that _51 = _40 + 1000?
> 
> Yes.  But reassoc doesn't associate across POINTER_PLUS_EXPRs.

Is there a reason for that?

> 
> RTL CSE could catch it, but for it the association would have to
> be the same for both.  If we start from the proposed form
> then at RTL expansion time we could associate
> pointer + (X + CST) to (pointer + X) + CST.

Right.

> 
> Feels all somewhat hacky, of course (and relies on TER).  There
> may be cases where doing the opposite is better (for example
> if you have ptr1 + (X + 1000) and ptr2 + (X + 1000)).  Association
> to make CSE possible is always hard if CSE itself cannot associate
> to maximize the number of CSE opportunities.  So at the moment
> any choice is just canonicalization.

Exactly my thought. I'm not sure if that's what you have in mind when you write
association for CSE, but I was thinking about a scheme that resembles what
tree_to_aff_combination_expand does and organizes all expanded expressions so
that they can be compared easily (read: efficiently). With such a capability it
would then not be necessary to do the first replacement with
forwprop+reassoc+dom, as everything could be done in CSE.
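
As a rough illustration of that idea (and not of GCC's actual tree-affine
code), the sketch below flattens an expression into a constant plus one
coefficient per variable. Everything here is hypothetical: struct aff, the
helper names and the variable ids 0 = Arr_2_Par_Ref, 1 = _41, 2 = pretmp_20.
Because the flattened form does not depend on how the additions were
associated, two expressions can be compared directly.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_VARS 3  /* hypothetical ids: 0 = Arr_2_Par_Ref, 1 = _41, 2 = pretmp_20 */

/* constant + sum over v of coeff[v] * var[v] */
struct aff {
    long constant;
    long coeff[NUM_VARS];
};

static struct aff aff_const(long c)
{
    struct aff r = { 0 };
    r.constant = c;
    return r;
}

static struct aff aff_var(int id)
{
    struct aff r = { 0 };
    assert(id >= 0 && id < NUM_VARS);
    r.coeff[id] = 1;
    return r;
}

/* Adding two combinations just adds constants and coefficients, so the
   result is the same whatever the association of the original tree.  */
static struct aff aff_add(struct aff a, struct aff b)
{
    a.constant += b.constant;
    for (int i = 0; i < NUM_VARS; i++)
        a.coeff[i] += b.coeff[i];
    return a;
}

/* True if b == a + constant; the constant is stored in *delta.  */
static bool aff_constant_delta(struct aff a, struct aff b, long *delta)
{
    for (int i = 0; i < NUM_VARS; i++)
        if (a.coeff[i] != b.coeff[i])
            return false;
    *delta = b.constant - a.constant;
    return true;
}
```

Building _40 and _51 with the association from the dump and calling
aff_constant_delta on them reports a delta of 1000, i.e. _51 = _40 + 1000,
without any prior reassociation.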


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/60172] [4.9/4.10 Regression] ARM performance regression from trunk@207239
  2014-02-13  9:54 [Bug tree-optimization/60172] New: ARM performance regression from trunk@207239 joey.ye at arm dot com
                   ` (16 preceding siblings ...)
  2014-05-15  9:51 ` thomas.preudhomme at arm dot com
@ 2014-05-15 10:12 ` rguenther at suse dot de
  2014-06-18 14:21 ` bpringlemeir at gmail dot com
                   ` (6 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: rguenther at suse dot de @ 2014-05-15 10:12 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

--- Comment #21 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 15 May 2014, thomas.preudhomme at arm dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172
> 
> --- Comment #20 from Thomas Preud'homme <thomas.preudhomme at arm dot com> ---
> (In reply to rguenther@suse.de from comment #19)
> > On Thu, 15 May 2014, thomas.preudhomme at arm dot com wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172
> > > 
> > > --- Comment #18 from Thomas Preud'homme <thomas.preudhomme at arm dot com> ---
> > > (In reply to Richard Biener from comment #17)
> > > > 
> > > > Citing myself:
> > > > 
> > > > On the GIMPLE level before expansion we have
> > > > 
> > > >  _40 = Arr_2_Par_Ref_22(D) + (_41 + pretmp_20);
> > > > 
> > > >  _51 = Arr_2_Par_Ref_22(D) + (_41 + (pretmp_20 + 1000));
> > > > 
> > > > so if _51 were Arr_2_Par_Ref_22(D) + ((_41 + pretmp_20) + 1000);
> > > > 
> > > > then _41 + pretmp_20 would be fully redundant with the expression needed
> > > > by _40.
> > > 
> > > Yes I saw that but I was wondering why would reassoc try this association
> > > rather than another since the header of the file doesn't mention any special
> > > treatment of explicit integer constants.
> > > 
> > > Besides, wouldn't it still miss the fact that _51 = _40 + 1000?
> > 
> > Yes.  But reassoc doesn't associate across POINTER_PLUS_EXPRs.
> 
> Is there a reason for that?

Yes.  It's not easy and it involves undefined overflow (reassoc
doesn't associate signed arithmetic either).
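
A minimal sketch of the overflow hazard, using plain signed int rather than
pointers. The helper names are hypothetical. The code never performs an
overflowing addition itself; it only checks whether each association would
have an overflowing intermediate result.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* True if a + b would overflow int (checked without overflowing).  */
static bool add_overflows(int a, int b)
{
    return (b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b);
}

/* True if (a + b) + c can be evaluated without any signed overflow.  */
static bool left_assoc_ok(int a, int b, int c)
{
    if (add_overflows(a, b))
        return false;
    return !add_overflows(a + b, c);
}

/* True if a + (b + c) can be evaluated without any signed overflow.  */
static bool right_assoc_ok(int a, int b, int c)
{
    if (add_overflows(b, c))
        return false;
    return !add_overflows(a, b + c);
}

/* INT_MAX + (1 + -1) is fine, but (INT_MAX + 1) + -1 overflows in the
   intermediate sum, so rewriting one into the other introduces undefined
   behaviour that the original expression did not have.  */
static void check_reassociation_hazard(void)
{
    assert(right_assoc_ok(INT_MAX, 1, -1));
    assert(!left_assoc_ok(INT_MAX, 1, -1));
}
```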

> > 
> > RTL CSE could catch it, but for it the association would have to
> > be the same for both.  If we start from the proposed form
> > then at RTL expansion time we could associate
> > pointer + (X + CST) to (pointer + X) + CST.
> 
> Right.
> 
> > 
> > Feels all somewhat hacky, of course (and relies on TER).  There
> > may be cases where doing the opposite is better (for example
> > if you have ptr1 + (X + 1000) and ptr2 + (X + 1000)).  Association
> > to make CSE possible is always hard if CSE itself cannot associate
> > to maximize the number of CSE opportunities.  So at the moment
> > any choice is just canonicalization.
> 
> Exactly my thought. I'm not sure if that's what you have in mind when you write
> association for CSE, but I was thinking about a scheme that resembles what
> tree_to_aff_combination_expand does and organizes all expanded expressions so
> that they can be compared easily (read: efficiently). With such a capability it
> would then not be necessary to do the first replacement with
> forwprop+reassoc+dom, as everything could be done in CSE.

Yeah, but that's not how CSE on GIMPLE or RTL works right now ;)
(patches welcome?)  I suppose teaching reassoc to look for CSE
opportunities may be easier (needs separating analysis and transform
stages for the whole function).

Richard.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/60172] [4.9/4.10 Regression] ARM performance regression from trunk@207239
  2014-02-13  9:54 [Bug tree-optimization/60172] New: ARM performance regression from trunk@207239 joey.ye at arm dot com
                   ` (17 preceding siblings ...)
  2014-05-15 10:12 ` rguenther at suse dot de
@ 2014-06-18 14:21 ` bpringlemeir at gmail dot com
  2014-06-18 15:15 ` bpringlemeir at gmail dot com
                   ` (5 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: bpringlemeir at gmail dot com @ 2014-06-18 14:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

Bill Pringlemeir <bpringlemeir at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bpringlemeir at gmail dot com

--- Comment #22 from Bill Pringlemeir <bpringlemeir at gmail dot com> ---
The good ARM assembly uses the 'mla' instruction, which is a 'multiply and
accumulate'.  Since this is not recognized, the multiply result needs a
temporary register to do the add with, and I think this causes the extra
registers.  I believe you should look to see why the 'mla' is not matched.  I
don't know if the x86 has an op-code like this.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/60172] [4.9/4.10 Regression] ARM performance regression from trunk@207239
  2014-02-13  9:54 [Bug tree-optimization/60172] New: ARM performance regression from trunk@207239 joey.ye at arm dot com
                   ` (18 preceding siblings ...)
  2014-06-18 14:21 ` bpringlemeir at gmail dot com
@ 2014-06-18 15:15 ` bpringlemeir at gmail dot com
  2014-07-16 13:28 ` jakub at gcc dot gnu.org
                   ` (4 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: bpringlemeir at gmail dot com @ 2014-06-18 15:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

--- Comment #23 from Bill Pringlemeir <bpringlemeir at gmail dot com> ---
(In reply to Bill Pringlemeir from comment #22)
> The good ARM assembly uses the 'mla' instruction, which is a 'multiply and
> accumulate'.  Since this is not recognized, the multiply result needs a
> temporary register to do the add with, and I think this causes the extra
> registers.  I believe you should look to see why the 'mla' is not matched.
> I don't know if the x86 has an op-code like this.

Er, I see.  The 'mla' comes as a result of seeing that the array index
calculations can be reused.  Sorry for the noise.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/60172] [4.9/4.10 Regression] ARM performance regression from trunk@207239
  2014-02-13  9:54 [Bug tree-optimization/60172] New: ARM performance regression from trunk@207239 joey.ye at arm dot com
                   ` (19 preceding siblings ...)
  2014-06-18 15:15 ` bpringlemeir at gmail dot com
@ 2014-07-16 13:28 ` jakub at gcc dot gnu.org
  2014-10-30 10:40 ` [Bug tree-optimization/60172] [4.9/5 " jakub at gcc dot gnu.org
                   ` (3 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: jakub at gcc dot gnu.org @ 2014-07-16 13:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|4.9.1                       |4.9.2

--- Comment #24 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 4.9.1 has been released.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/60172] [4.9/5 Regression] ARM performance regression from trunk@207239
  2014-02-13  9:54 [Bug tree-optimization/60172] New: ARM performance regression from trunk@207239 joey.ye at arm dot com
                   ` (20 preceding siblings ...)
  2014-07-16 13:28 ` jakub at gcc dot gnu.org
@ 2014-10-30 10:40 ` jakub at gcc dot gnu.org
  2015-03-13 14:55 ` joey.ye at arm dot com
                   ` (2 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: jakub at gcc dot gnu.org @ 2014-10-30 10:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|4.9.2                       |4.9.3

--- Comment #25 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 4.9.2 has been released.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/60172] [4.9/5 Regression] ARM performance regression from trunk@207239
  2014-02-13  9:54 [Bug tree-optimization/60172] New: ARM performance regression from trunk@207239 joey.ye at arm dot com
                   ` (21 preceding siblings ...)
  2014-10-30 10:40 ` [Bug tree-optimization/60172] [4.9/5 " jakub at gcc dot gnu.org
@ 2015-03-13 14:55 ` joey.ye at arm dot com
  2015-06-26 19:59 ` [Bug tree-optimization/60172] [4.9/5/6 " jakub at gcc dot gnu.org
  2015-06-26 20:30 ` jakub at gcc dot gnu.org
  24 siblings, 0 replies; 26+ messages in thread
From: joey.ye at arm dot com @ 2015-03-13 14:55 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

--- Comment #26 from Joey Ye <joey.ye at arm dot com> ---
The regression has disappeared from the 4.9 branch since Aug 2014, though the
problem discussed here is not yet confirmed as solved.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/60172] [4.9/5/6 Regression] ARM performance regression from trunk@207239
  2014-02-13  9:54 [Bug tree-optimization/60172] New: ARM performance regression from trunk@207239 joey.ye at arm dot com
                   ` (22 preceding siblings ...)
  2015-03-13 14:55 ` joey.ye at arm dot com
@ 2015-06-26 19:59 ` jakub at gcc dot gnu.org
  2015-06-26 20:30 ` jakub at gcc dot gnu.org
  24 siblings, 0 replies; 26+ messages in thread
From: jakub at gcc dot gnu.org @ 2015-06-26 19:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

--- Comment #27 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 4.9.3 has been released.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/60172] [4.9/5/6 Regression] ARM performance regression from trunk@207239
  2014-02-13  9:54 [Bug tree-optimization/60172] New: ARM performance regression from trunk@207239 joey.ye at arm dot com
                   ` (23 preceding siblings ...)
  2015-06-26 19:59 ` [Bug tree-optimization/60172] [4.9/5/6 " jakub at gcc dot gnu.org
@ 2015-06-26 20:30 ` jakub at gcc dot gnu.org
  24 siblings, 0 replies; 26+ messages in thread
From: jakub at gcc dot gnu.org @ 2015-06-26 20:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|4.9.3                       |4.9.4


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2015-06-26 20:30 UTC | newest]

Thread overview: 26+ messages
2014-02-13  9:54 [Bug tree-optimization/60172] New: ARM performance regression from trunk@207239 joey.ye at arm dot com
2014-02-14  8:20 ` [Bug tree-optimization/60172] " joey.ye at arm dot com
2014-02-14 10:22 ` rguenth at gcc dot gnu.org
2014-02-14 10:50 ` joey.ye at arm dot com
2014-02-14 12:19 ` rguenth at gcc dot gnu.org
2014-02-14 14:03 ` rguenth at gcc dot gnu.org
2014-02-17  9:56 ` joey.ye at arm dot com
2014-02-17 10:07 ` rguenther at suse dot de
2014-02-19 11:19 ` joey.ye at arm dot com
2014-02-19 11:21 ` joey.ye at arm dot com
2014-02-19 23:06 ` steven at gcc dot gnu.org
2014-02-20 10:02 ` rguenther at suse dot de
2014-04-14  7:58 ` [Bug tree-optimization/60172] [4.9/4.10 Regression] " rguenth at gcc dot gnu.org
2014-05-09  8:51 ` thomas.preudhomme at arm dot com
2014-05-15  3:29 ` thomas.preudhomme at arm dot com
2014-05-15  8:01 ` rguenth at gcc dot gnu.org
2014-05-15  8:54 ` thomas.preudhomme at arm dot com
2014-05-15  9:51 ` thomas.preudhomme at arm dot com
2014-05-15 10:12 ` rguenther at suse dot de
2014-06-18 14:21 ` bpringlemeir at gmail dot com
2014-06-18 15:15 ` bpringlemeir at gmail dot com
2014-07-16 13:28 ` jakub at gcc dot gnu.org
2014-10-30 10:40 ` [Bug tree-optimization/60172] [4.9/5 " jakub at gcc dot gnu.org
2015-03-13 14:55 ` joey.ye at arm dot com
2015-06-26 19:59 ` [Bug tree-optimization/60172] [4.9/5/6 " jakub at gcc dot gnu.org
2015-06-26 20:30 ` jakub at gcc dot gnu.org
