public inbox for gcc-bugs@sourceware.org
* [Bug middle-end/29256] [4.3/4.4/4.5/4.6/4.7 regression] loop performance regression
       [not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
@ 2011-06-27 13:58 ` rguenth at gcc dot gnu.org
  2012-01-12 12:28 ` [Bug target/29256] [4.4/4.5/4.6/4.7 " rguenth at gcc dot gnu.org
                   ` (28 subsequent siblings)
  29 siblings, 0 replies; 30+ messages in thread
From: rguenth at gcc dot gnu.org @ 2011-06-27 13:58 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|4.3.6                       |4.4.7

--- Comment #40 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-06-27 12:13:46 UTC ---
4.3 branch is being closed, moving to 4.4.7 target.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/29256] [4.4/4.5/4.6/4.7 regression] loop performance regression
       [not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
  2011-06-27 13:58 ` [Bug middle-end/29256] [4.3/4.4/4.5/4.6/4.7 regression] loop performance regression rguenth at gcc dot gnu.org
@ 2012-01-12 12:28 ` rguenth at gcc dot gnu.org
  2012-03-13 14:23 ` [Bug target/29256] [4.5/4.6/4.7/4.8 " jakub at gcc dot gnu.org
                   ` (27 subsequent siblings)
  29 siblings, 0 replies; 30+ messages in thread
From: rguenth at gcc dot gnu.org @ 2012-01-12 12:28 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |powerpc-*-*
          Component|middle-end                  |target

--- Comment #41 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-01-12 12:26:59 UTC ---
Seems to be more-or-less target dependent, so adding a list of targets
affected.

For powerpc we still use N induction variables after unrolling instead of just
one.  Which I suppose was the point of this regression report.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/29256] [4.5/4.6/4.7/4.8 regression] loop performance regression
       [not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
  2011-06-27 13:58 ` [Bug middle-end/29256] [4.3/4.4/4.5/4.6/4.7 regression] loop performance regression rguenth at gcc dot gnu.org
  2012-01-12 12:28 ` [Bug target/29256] [4.4/4.5/4.6/4.7 " rguenth at gcc dot gnu.org
@ 2012-03-13 14:23 ` jakub at gcc dot gnu.org
  2012-07-02 12:25 ` rguenth at gcc dot gnu.org
                   ` (26 subsequent siblings)
  29 siblings, 0 replies; 30+ messages in thread
From: jakub at gcc dot gnu.org @ 2012-03-13 14:23 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|4.4.7                       |4.5.4

--- Comment #42 from Jakub Jelinek <jakub at gcc dot gnu.org> 2012-03-13 12:46:45 UTC ---
4.4 branch is being closed, moving to 4.5.4 target.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/29256] [4.5/4.6/4.7/4.8 regression] loop performance regression
       [not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
                   ` (2 preceding siblings ...)
  2012-03-13 14:23 ` [Bug target/29256] [4.5/4.6/4.7/4.8 " jakub at gcc dot gnu.org
@ 2012-07-02 12:25 ` rguenth at gcc dot gnu.org
  2013-04-12 15:17 ` [Bug target/29256] [4.7/4.8/4.9 " jakub at gcc dot gnu.org
                   ` (25 subsequent siblings)
  29 siblings, 0 replies; 30+ messages in thread
From: rguenth at gcc dot gnu.org @ 2012-07-02 12:25 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|4.5.4                       |4.6.4

--- Comment #43 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-07-02 12:23:04 UTC ---
The 4.5 branch is being closed, adjusting target milestone.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/29256] [4.7/4.8/4.9 regression] loop performance regression
       [not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
                   ` (3 preceding siblings ...)
  2012-07-02 12:25 ` rguenth at gcc dot gnu.org
@ 2013-04-12 15:17 ` jakub at gcc dot gnu.org
  2013-12-09  4:50 ` law at redhat dot com
                   ` (24 subsequent siblings)
  29 siblings, 0 replies; 30+ messages in thread
From: jakub at gcc dot gnu.org @ 2013-04-12 15:17 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|4.6.4                       |4.7.4

--- Comment #44 from Jakub Jelinek <jakub at gcc dot gnu.org> 2013-04-12 15:16:34 UTC ---
GCC 4.6.4 has been released and the branch has been closed.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/29256] [4.7/4.8/4.9 regression] loop performance regression
       [not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
                   ` (4 preceding siblings ...)
  2013-04-12 15:17 ` [Bug target/29256] [4.7/4.8/4.9 " jakub at gcc dot gnu.org
@ 2013-12-09  4:50 ` law at redhat dot com
  2014-06-12 13:45 ` [Bug target/29256] [4.7/4.8/4.9/4.10 " rguenth at gcc dot gnu.org
                   ` (23 subsequent siblings)
  29 siblings, 0 replies; 30+ messages in thread
From: law at redhat dot com @ 2013-12-09  4:50 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

Jeffrey A. Law <law at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |law at redhat dot com

--- Comment #45 from Jeffrey A. Law <law at redhat dot com> ---
This problem still exists and can be seen by making the arrays external and
using -fno-tree-loop-distribute-patterns.

.L2:
        evlddx 31,10,9
        addi 7,9,8
        addi 0,9,16
        addi 11,9,24
        addi 3,9,32
        evstddx 31,8,9
        addi 4,9,40
        evlddx 31,10,7
        addi 5,9,48
        addi 6,9,56
        evlddx 12,10,6
        addi 9,9,64
        evstddx 31,8,7
        evlddx 7,10,0
        evstddx 7,8,0
        evlddx 0,10,11
        evstddx 0,8,11
        evlddx 11,10,3
        evstddx 11,8,3
        evlddx 3,10,4
        evstddx 3,8,4
        evlddx 4,10,5
        evstddx 4,8,5
        evstddx 12,8,6
        bdnz .L2
        evldd 31,8(1)
        addi 1,1,16
        blr


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/29256] [4.7/4.8/4.9/4.10 regression] loop performance regression
       [not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
                   ` (5 preceding siblings ...)
  2013-12-09  4:50 ` law at redhat dot com
@ 2014-06-12 13:45 ` rguenth at gcc dot gnu.org
  2014-12-19 13:28 ` [Bug target/29256] [4.8/4.9/5 " jakub at gcc dot gnu.org
                   ` (22 subsequent siblings)
  29 siblings, 0 replies; 30+ messages in thread
From: rguenth at gcc dot gnu.org @ 2014-06-12 13:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|4.7.4                       |4.8.4

--- Comment #46 from Richard Biener <rguenth at gcc dot gnu.org> ---
The 4.7 branch is being closed, moving target milestone to 4.8.4.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/29256] [4.8/4.9/5 regression] loop performance regression
       [not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
                   ` (6 preceding siblings ...)
  2014-06-12 13:45 ` [Bug target/29256] [4.7/4.8/4.9/4.10 " rguenth at gcc dot gnu.org
@ 2014-12-19 13:28 ` jakub at gcc dot gnu.org
  2015-04-07 17:36 ` law at redhat dot com
                   ` (21 subsequent siblings)
  29 siblings, 0 replies; 30+ messages in thread
From: jakub at gcc dot gnu.org @ 2014-12-19 13:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|4.8.4                       |4.8.5

--- Comment #47 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 4.8.4 has been released.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/29256] [4.8/4.9/5 regression] loop performance regression
       [not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
                   ` (7 preceding siblings ...)
  2014-12-19 13:28 ` [Bug target/29256] [4.8/4.9/5 " jakub at gcc dot gnu.org
@ 2015-04-07 17:36 ` law at redhat dot com
  2015-04-08  7:20 ` rguenther at suse dot de
                   ` (20 subsequent siblings)
  29 siblings, 0 replies; 30+ messages in thread
From: law at redhat dot com @ 2015-04-07 17:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

--- Comment #49 from Jeffrey A. Law <law at redhat dot com> ---
Richi, see c#45.  Basically the regression is "gone" for the testcase as-is... 
But it's pretty easy to twiddle it slightly and show the regression.  It's also
important to note this is e500 code, so you need to configure your toolchain
appropriately.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/29256] [4.8/4.9/5 regression] loop performance regression
       [not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
                   ` (8 preceding siblings ...)
  2015-04-07 17:36 ` law at redhat dot com
@ 2015-04-08  7:20 ` rguenther at suse dot de
  2015-05-19 17:51 ` [Bug target/29256] [4.8/4.9/5/6 " law at redhat dot com
                   ` (19 subsequent siblings)
  29 siblings, 0 replies; 30+ messages in thread
From: rguenther at suse dot de @ 2015-04-08  7:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

--- Comment #50 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 7 Apr 2015, law at redhat dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
> 
> --- Comment #49 from Jeffrey A. Law <law at redhat dot com> ---
> Richi, see c#45.  Basically the regression is "gone" for the testcase as-is... 
> But it's pretty easy to twiddle it slightly and show the regression.  It's also
> important to note this is e500 code, so you need to configure your toolchain
> appropriately.

Please provide a testcase that shows the regression then and instructions
how to configure a cross cc1.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/29256] [4.8/4.9/5/6 regression] loop performance regression
       [not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
                   ` (9 preceding siblings ...)
  2015-04-08  7:20 ` rguenther at suse dot de
@ 2015-05-19 17:51 ` law at redhat dot com
  2015-05-20  2:21 ` amker at gcc dot gnu.org
                   ` (18 subsequent siblings)
  29 siblings, 0 replies; 30+ messages in thread
From: law at redhat dot com @ 2015-05-19 17:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

--- Comment #51 from Jeffrey A. Law <law at redhat dot com> ---
Configure for the powerpc-linux-gnuspe target with the --enable-e500_double
option:


/home/gcc/GIT-2/gcc/configure powerpc-linux-gnuspe --enable-e500_double

Testcase:
# define N      2000000
extern double   a[N],c[N];
void tuned_STREAM_Copy()
{
        int j;
        for (j=0; j<N; j++)
            c[j] = a[j];
}


./cc1 -O3 -funroll-loops -funroll-all-loops -fno-tree-loop-distribute-patterns
j.c -I./ -mspe

Results in:

tuned_STREAM_Copy:
        stwu 1,-16(1)
        lis 7,0x3
        lis 8,c@ha
        lis 10,a@ha
        ori 0,7,0xd090
        evstdd 31,8(1)
        li 9,0
        la 8,c@l(8)
        la 10,a@l(10)
        mtctr 0
.L2:
        evlddx 31,10,9
        addi 7,9,8
        addi 0,9,16
        addi 11,9,24
        addi 3,9,32
        evstddx 31,8,9
        addi 4,9,40
        evlddx 31,10,7
        addi 5,9,48
        addi 6,9,56
        evlddx 12,10,6
        addi 9,9,64
        evstddx 31,8,7
        evlddx 7,10,0
        evstddx 7,8,0
        evlddx 0,10,11
        evstddx 0,8,11
        evlddx 11,10,3
        evstddx 11,8,3
        evlddx 3,10,4
        evstddx 3,8,4
        evlddx 4,10,5
        evstddx 4,8,5
        evstddx 12,8,6
        bdnz .L2
        evldd 31,8(1)
        addi 1,1,16
        blr
        .size   tuned_STREAM_Copy, .-tuned_STREAM_Copy

Which looks to me like ivopts has mucked things up badly.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/29256] [4.8/4.9/5/6 regression] loop performance regression
       [not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
                   ` (10 preceding siblings ...)
  2015-05-19 17:51 ` [Bug target/29256] [4.8/4.9/5/6 " law at redhat dot com
@ 2015-05-20  2:21 ` amker at gcc dot gnu.org
  2015-05-20 12:45 ` wschmidt at gcc dot gnu.org
                   ` (17 subsequent siblings)
  29 siblings, 0 replies; 30+ messages in thread
From: amker at gcc dot gnu.org @ 2015-05-20  2:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

amker at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amker at gcc dot gnu.org

--- Comment #52 from amker at gcc dot gnu.org ---
I don't understand powerpc assembly well, but this looks like the same problem
we see on aarch64/arm.  Ah, and we are even looking at the same function...

I think this is a general issue caused by an inconsistency between the
tree-level ivopts pass and the rtl-level loop unroller.  Specifically, the
problem is how we handle induction variable registers after unrolling.

The core loop on aarch64, compiled with "-O3 -funroll-all-loops
-mcpu=cortex-a57", looks like this:

.L3:
        add     x2, x0, 16
        ldr     q16, [x17, x0]
        add     x10, x0, 32
        add     x9, x0, 48
        add     x8, x0, 64
        ldr     q17, [x17, x2]
        add     x3, x0, 80
        add     x6, x0, 96
        add     x5, x0, 112
        add     w1, w1, 8
        ldr     q19, [x17, x10]
        cmp     w1, w14
        ldr     q18, [x17, x9]
        ldr     q20, [x17, x8]
        ldr     q21, [x17, x3]
        ldr     q22, [x17, x6]
        ldr     q23, [x17, x5]
        str     q16, [x18, x0]
        add     x0, x0, 128
        str     q17, [x18, x2]
        str     q19, [x18, x10]
        str     q18, [x18, x9]
        str     q20, [x18, x8]
        str     q21, [x18, x3]
        str     q22, [x18, x6]
        str     q23, [x18, x5]
        bcc     .L3 

The tree ivopt dump is quite neat:

  <bb 6>:
  # ivtmp.16_28 = PHI <ivtmp.16_25(9), 0(5)>
  # ivtmp.19_42 = PHI <ivtmp.19_41(9), 0(5)>
  vect__4.13_62 = MEM[base: vectp_a.12_58, index: ivtmp.19_42, offset: 0B];
  MEM[base: vectp_c.15_63, index: ivtmp.19_42, offset: 0B] = vect__4.13_62;
  ivtmp.16_25 = ivtmp.16_28 + 1;
  ivtmp.19_41 = ivtmp.19_42 + 16;
  if (ivtmp.16_25 < bnd.7_36)
    goto <bb 9>;
  else
    goto <bb 7>;

  ...

  <bb 9>:
  goto <bb 6>;

But after the rtl unroller, options like "-fsplit-ivs-in-unroller" and "-fweb"
come into play.  These two options try to split the long live range of each
induction variable into separate ones.  Eventually, after the following fwprop
and IRA passes, we end up with multiple ivs for each original iv.
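
The effect described above can be modelled at the source level.  This is only
a rough sketch, not what the rtl passes literally emit: the function names
copy_rolled and copy_split_ivs and the fixed unroll factor of 8 are
assumptions made for illustration, and a real compiler is free to fold the
derived indices again.  The point is that each unrolled copy uses its own
index value, mirroring the separate addi/add results in the assembly listings.

```c
#include <assert.h>
#include <stddef.h>

/* The rolled loop from the testcase: one index serves both arrays. */
void copy_rolled(double *c, const double *a, size_t n)
{
    for (size_t j = 0; j < n; j++)
        c[j] = a[j];
}

/* Hypothetical model of the unrolled-by-8 body after the iv is split:
   every copy addresses memory through its own derived index, so eight
   index values are live at once instead of one. */
void copy_split_ivs(double *c, const double *a, size_t n)
{
    assert(n % 8 == 0);   /* sketch only: no remainder loop */
    for (size_t j = 0; j < n; j += 8) {
        size_t j1 = j + 1, j2 = j + 2, j3 = j + 3, j4 = j + 4;
        size_t j5 = j + 5, j6 = j + 6, j7 = j + 7;
        c[j]  = a[j];
        c[j1] = a[j1];
        c[j2] = a[j2];
        c[j3] = a[j3];
        c[j4] = a[j4];
        c[j5] = a[j5];
        c[j6] = a[j6];
        c[j7] = a[j7];
    }
}
```

Both functions compute the same result; they differ only in how many index
values must be kept in registers inside the loop body.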

I see two possible fixes here.  One is to implement a tree-level unroller
before IVOPT and remove the rtl one.  The rtl one is so aggressive that we
don't enable it by default at "O3".
Another is to change how we handle unrolled ivs in the rtl unroller.  It splits
unrolled ivs to avoid pseudo registers with long live ranges, since those may
affect rtl optimizers.  This assumption may have held before, but it no longer
seems true to me, especially for induction variables: in tree-level ivopts we
already assume that each iv occupies a register, and ivs are used intensively,
so each should live in one single hard register.  For this specific case, we
can refactor [base+index] out of the memory references and use [new_base],
[new_base+4], [new_base+8], ... etc. in unrolling.  If tree ivopts chooses the
[reg+offset] addressing mode, we only need to generate an instruction sequence
like "[reg+offset], [reg+(offset+4)], [reg+(offset+8)]... reg = reg +
unrolled_times*step".
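
A source-level sketch of the second approach, under stated assumptions: the
name copy_single_iv and the unroll factor of 8 are hypothetical, and the
offsets below are element offsets (for double that means byte offsets 0, 8,
..., 56).  Each array keeps one running base pointer, every unrolled copy uses
a constant offset from it, and the bases are bumped once per iteration by
unrolled_times*step.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of unrolling with [reg+offset] addressing: one
   base pointer per array, constant offsets for the unrolled copies,
   and a single update of each base at the end of the body. */
void copy_single_iv(double *c, const double *a, size_t n)
{
    assert(n % 8 == 0);   /* sketch only: no remainder loop */
    const double *src = a;
    const double *end = a + n;
    double *dst = c;
    while (src != end) {
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
        dst[3] = src[3];
        dst[4] = src[4];
        dst[5] = src[5];
        dst[6] = src[6];
        dst[7] = src[7];
        src += 8;         /* reg = reg + unrolled_times*step */
        dst += 8;
    }
}
```

Two base registers are still needed here because there are two arrays, but
that number no longer grows with the unroll factor.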

Thanks,
bin


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/29256] [4.8/4.9/5/6 regression] loop performance regression
       [not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
                   ` (11 preceding siblings ...)
  2015-05-20  2:21 ` amker at gcc dot gnu.org
@ 2015-05-20 12:45 ` wschmidt at gcc dot gnu.org
  2015-06-23  8:22 ` rguenth at gcc dot gnu.org
                   ` (16 subsequent siblings)
  29 siblings, 0 replies; 30+ messages in thread
From: wschmidt at gcc dot gnu.org @ 2015-05-20 12:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

--- Comment #53 from Bill Schmidt <wschmidt at gcc dot gnu.org> ---
I'm not a fan of a tree-level unroller.  It's impossible to make good decisions
about unroll factors that early.  But your second approach sounds quite
promising to me.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/29256] [4.8/4.9/5/6 regression] loop performance regression
       [not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
                   ` (12 preceding siblings ...)
  2015-05-20 12:45 ` wschmidt at gcc dot gnu.org
@ 2015-06-23  8:22 ` rguenth at gcc dot gnu.org
  2015-06-26 19:59 ` [Bug target/29256] [4.9/5/6 " jakub at gcc dot gnu.org
                   ` (15 subsequent siblings)
  29 siblings, 0 replies; 30+ messages in thread
From: rguenth at gcc dot gnu.org @ 2015-06-23  8:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|4.8.5                       |4.9.3

--- Comment #54 from Richard Biener <rguenth at gcc dot gnu.org> ---
The gcc-4_8-branch is being closed, re-targeting regressions to 4.9.3.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/29256] [4.9/5/6 regression] loop performance regression
       [not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
                   ` (13 preceding siblings ...)
  2015-06-23  8:22 ` rguenth at gcc dot gnu.org
@ 2015-06-26 19:59 ` jakub at gcc dot gnu.org
  2015-06-26 20:29 ` jakub at gcc dot gnu.org
                   ` (14 subsequent siblings)
  29 siblings, 0 replies; 30+ messages in thread
From: jakub at gcc dot gnu.org @ 2015-06-26 19:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

--- Comment #55 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 4.9.3 has been released.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/29256] [4.9/5/6 regression] loop performance regression
       [not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
                   ` (14 preceding siblings ...)
  2015-06-26 19:59 ` [Bug target/29256] [4.9/5/6 " jakub at gcc dot gnu.org
@ 2015-06-26 20:29 ` jakub at gcc dot gnu.org
  2015-08-11 18:36 ` wschmidt at gcc dot gnu.org
                   ` (13 subsequent siblings)
  29 siblings, 0 replies; 30+ messages in thread
From: jakub at gcc dot gnu.org @ 2015-06-26 20:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|4.9.3                       |4.9.4


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/29256] [4.9/5/6 regression] loop performance regression
       [not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
                   ` (15 preceding siblings ...)
  2015-06-26 20:29 ` jakub at gcc dot gnu.org
@ 2015-08-11 18:36 ` wschmidt at gcc dot gnu.org
  2015-08-12  7:12 ` rguenther at suse dot de
                   ` (12 subsequent siblings)
  29 siblings, 0 replies; 30+ messages in thread
From: wschmidt at gcc dot gnu.org @ 2015-08-11 18:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

--- Comment #56 from Bill Schmidt <wschmidt at gcc dot gnu.org> ---
(In reply to Bill Schmidt from comment #53)
> I'm not a fan of a tree-level unroller.  It's impossible to make good
> decisions about unroll factors that early.  But your second approach sounds
> quite promising to me.

I would be willing to soften this statement.  I think that an early unroller
might well be a profitable approach for most systems with large caches and so
forth, where if the unrolling heuristics are not completely accurate we are
still likely to make a reasonably good decision.  However, I would expect to
see ports with limited caches/memory to want more accurate control over
unrolling decisions.  So I could see allowing ports to select between a GIMPLE
unroller and an RTL unroller (I doubt anybody would want both).

In general it seems like PowerPC could benefit from more aggressive unrolling
much of the time, provided we can also solve the related IVOPTS problems that
cause too much register spill.

I may have an interest in working on a GIMPLE unroller, depending on how
quickly I can complete or shed some other projects...


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/29256] [4.9/5/6 regression] loop performance regression
       [not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
                   ` (16 preceding siblings ...)
  2015-08-11 18:36 ` wschmidt at gcc dot gnu.org
@ 2015-08-12  7:12 ` rguenther at suse dot de
  2015-08-12  7:34 ` amker at gcc dot gnu.org
                   ` (11 subsequent siblings)
  29 siblings, 0 replies; 30+ messages in thread
From: rguenther at suse dot de @ 2015-08-12  7:12 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

--- Comment #57 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 11 Aug 2015, wschmidt at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
> 
> --- Comment #56 from Bill Schmidt <wschmidt at gcc dot gnu.org> ---
> (In reply to Bill Schmidt from comment #53)
> > I'm not a fan of a tree-level unroller.  It's impossible to make good
> > decisions about unroll factors that early.  But your second approach sounds
> > quite promising to me.
> 
> I would be willing to soften this statement.  I think that an early unroller
> might well be a profitable approach for most systems with large caches and so
> forth, where if the unrolling heuristics are not completely accurate we are
> still likely to make a reasonably good decision.  However, I would expect to
> see ports with limited caches/memory to want more accurate control over
> unrolling decisions.  So I could see allowing ports to select between a GIMPLE
> unroller and an RTL unroller (I doubt anybody would want both).
> 
> In general it seems like PowerPC could benefit from more aggressive unrolling
> much of the time, provided we can also solve the related IVOPTS problems that
> cause too much register spill.
> 
> I may have an interest in working on a GIMPLE unroller, depending on how
> quickly I can complete or shed some other projects...

I think that a separate unrolling on GIMPLE would be a hard sell
due to the lack of a good cost model.  _But_ doing unrolling as part
of another transform like we are doing now makes sense.  So does
eventually moving parts of an RTL pass involving unrolling to
GIMPLE, like modulo scheduling or SMS (leaving the scheduling part
to RTL).

Note that the RTL unroller is not enabled by default by any optimization
level and note that unfortunately the RTL unroller shares flags with
the GIMPLE level complete peeling (where it mainly controls cost 
modeling).  Oh, but it's enabled with -fprofile-use.

It's been a long time since I've done SPEC measuring with/without
-funroll-loops (or/and -fpeel-loops).  Note that these flags have
secondary effects as well:

toplev.c:    flag_web = flag_unroll_loops || flag_peel_loops;
toplev.c:    flag_rename_registers = flag_unroll_loops || flag_peel_loops;


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/29256] [4.9/5/6 regression] loop performance regression
       [not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
                   ` (17 preceding siblings ...)
  2015-08-12  7:12 ` rguenther at suse dot de
@ 2015-08-12  7:34 ` amker at gcc dot gnu.org
  2015-08-12 13:24 ` wschmidt at gcc dot gnu.org
                   ` (10 subsequent siblings)
  29 siblings, 0 replies; 30+ messages in thread
From: amker at gcc dot gnu.org @ 2015-08-12  7:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

--- Comment #58 from amker at gcc dot gnu.org ---
(In reply to Bill Schmidt from comment #56)
> (In reply to Bill Schmidt from comment #53)
> > I'm not a fan of a tree-level unroller.  It's impossible to make good
> > decisions about unroll factors that early.  But your second approach sounds
> > quite promising to me.
> 
> I would be willing to soften this statement.  I think that an early unroller
> might well be a profitable approach for most systems with large caches and
> so forth, where if the unrolling heuristics are not completely accurate we
> are still likely to make a reasonably good decision.  However, I would
> expect to see ports with limited caches/memory to want more accurate control
> over unrolling decisions.  So I could see allowing ports to select between a
> GIMPLE unroller and an RTL unroller (I doubt anybody would want both).

Thanks for the comments.
As David suggested, we can try to implement a relatively conservative unroller
and make sure it's a win in most unrolled cases, even with some opportunities
missed.  Then we can enable it at the O3/Ofast level, which I think would be
welcome since we currently don't have a general unroller enabled by default.

> 
> In general it seems like PowerPC could benefit from more aggressive
> unrolling much of the time, provided we can also solve the related IVOPTS
> problems that cause too much register spill.
> 
> I may have an interest in working on a GIMPLE unroller, depending on how
> quickly I can complete or shed some other projects...

(In reply to rguenther@suse.de from comment #57)
> On Tue, 11 Aug 2015, wschmidt at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
> > 
> > --- Comment #56 from Bill Schmidt <wschmidt at gcc dot gnu.org> ---
> > (In reply to Bill Schmidt from comment #53)
> > > I'm not a fan of a tree-level unroller.  It's impossible to make good
> > > decisions about unroll factors that early.  But your second approach sounds
> > > quite promising to me.
> > 
> > I would be willing to soften this statement.  I think that an early unroller
> > might well be a profitable approach for most systems with large caches and so
> > forth, where if the unrolling heuristics are not completely accurate we are
> > still likely to make a reasonably good decision.  However, I would expect to
> > see ports with limited caches/memory to want more accurate control over
> > unrolling decisions.  So I could see allowing ports to select between a GIMPLE
> > unroller and an RTL unroller (I doubt anybody would want both).
> > 
> > In general it seems like PowerPC could benefit from more aggressive unrolling
> > much of the time, provided we can also solve the related IVOPTS problems that
> > cause too much register spill.
> > 
> > I may have an interest in working on a GIMPLE unroller, depending on how
> > quickly I can complete or shed some other projects...
> 
> I think that a separate unrolling on GIMPLE would be a hard sell
> due to the lack of a good cost model.  _But_ doing unrolling as part
> of another transform like we are doing now makes sense.  So does
> eventually moving parts of an RTL pass involving unrolling to
> GIMPLE, like modulo scheduling or SMS (leaving the scheduling part
> to RTL).

About the cost model: is it possible to introduce a cache information model in
GCC?  I don't think it's a difficult problem, and it could be a starting point
for possible cache-sensitive optimizations in the future.  Another general
question is: what kinds of cost do we need in a good unroller, besides
cache/branch ones?


> 
> Note that the RTL unroller is not enabled by default by any optimization
> level and note that unfortunately the RTL unroller shares flags with
> the GIMPLE level complete peeling (where it mainly controls cost 
> modeling).  Oh, but it's enabled with -fprofile-use.
> 
> It's been a long time since I've done SPEC measuring with/without
> -funroll-loops (or/and -fpeel-loops).  Note that these flags have
> secondary effects as well:
> 
> toplev.c:    flag_web = flag_unroll_loops || flag_peel_loops;
> toplev.c:    flag_rename_registers = flag_unroll_loops || flag_peel_loops;


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Bug target/29256] [4.9/5/6 regression] loop performance regression
       [not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
                   ` (18 preceding siblings ...)
  2015-08-12  7:34 ` amker at gcc dot gnu.org
@ 2015-08-12 13:24 ` wschmidt at gcc dot gnu.org
  2015-08-13  2:11 ` amker at gcc dot gnu.org
                   ` (9 subsequent siblings)
  29 siblings, 0 replies; 30+ messages in thread
From: wschmidt at gcc dot gnu.org @ 2015-08-12 13:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

--- Comment #59 from Bill Schmidt <wschmidt at gcc dot gnu.org> ---
(In reply to rguenther@suse.de from comment #57)
> 
> It's been a long time since I've done SPEC measuring with/without
> -funroll-loops (or/and -fpeel-loops).  Note that these flags have
> secondary effects as well:
> 
> toplev.c:    flag_web = flag_unroll_loops || flag_peel_loops;
> toplev.c:    flag_rename_registers = flag_unroll_loops || flag_peel_loops;

We don't have a lot of data yet, but we have seen several examples in SPEC and
other benchmarks where turning on -funroll-loops is helpful, but should be much
more helpful -- in many cases performance improves with a much higher unroll
factor.  However, the effectiveness of unrolling is very much tied up with
these issues in IVOPTS, where we currently end up with too many separate base
registers for IVs.  As we increase the unroll factor, we eventually hit this as
a limiting factor, so fixing this IVOPTS issue would be very helpful for POWER.

As a side note, with -fprofile-use a GIMPLE unroller could peel and unroll hot
loop traces in loops that would otherwise be too complex to unroll.  I.e., if
there is a single hot trace through a loop, you can do tail duplication on the
trace to force it into superblock form, and then peel and unroll that
superblock while falling into the original loop if the trace is left.  Complete
unrolling and unrolling by a factor are both possible.  I don't know of
specific benchmarks that would be helped by this, though.

(An RTL unroller could do this as well, but it seems much more natural and
implementable in GIMPLE.)
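
A hand-written C sketch of the idea, for illustration only.  The function names
and the choice of an unroll factor of 2 are assumptions, and the trace test is
simplified to a single check up front rather than side exits inside the trace.
sum_clamped is the original loop; sum_clamped_trace runs an unrolled copy of
the hot path and falls into the original loop once the trace is left.

```c
#include <stddef.h>

/* Original loop: the "v[i] > limit" branch is assumed to be cold. */
long sum_clamped(const int *v, size_t n, int limit)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] > limit)
            sum += limit;   /* cold path */
        else
            sum += v[i];    /* hot path */
    }
    return sum;
}

/* Hot trace duplicated and unrolled by 2, with the original loop kept
   as the fallback. */
long sum_clamped_trace(const int *v, size_t n, int limit)
{
    long sum = 0;
    size_t i = 0;

    /* Unrolled hot trace: stay here only while two more elements exist
       and both of them take the hot path. */
    while (n - i >= 2 && v[i] <= limit && v[i + 1] <= limit) {
        sum += v[i];
        sum += v[i + 1];
        i += 2;
    }

    /* Trace left (cold element or fewer than two elements remaining):
       finish in the original loop. */
    for (; i < n; i++) {
        if (v[i] > limit)
            sum += limit;
        else
            sum += v[i];
    }
    return sum;
}
```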



* [Bug target/29256] [4.9/5/6 regression] loop performance regression
       [not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
                   ` (19 preceding siblings ...)
  2015-08-12 13:24 ` wschmidt at gcc dot gnu.org
@ 2015-08-13  2:11 ` amker at gcc dot gnu.org
  2015-08-13  3:28 ` wschmidt at gcc dot gnu.org
                   ` (8 subsequent siblings)
  29 siblings, 0 replies; 30+ messages in thread
From: amker at gcc dot gnu.org @ 2015-08-13  2:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

--- Comment #60 from amker at gcc dot gnu.org ---
(In reply to Bill Schmidt from comment #59)
> (In reply to rguenther@suse.de from comment #57)
> > 
> > It's been a long time since I've done SPEC measuring with/without
> > -funroll-loops (or/and -fpeel-loops).  Note that these flags have
> > secondary effects as well:
> > 
> > toplev.c:    flag_web = flag_unroll_loops || flag_peel_loops;
> > toplev.c:    flag_rename_registers = flag_unroll_loops || flag_peel_loops;
> 
> We don't have a lot of data yet, but we have seen several examples in SPEC
> and other benchmarks where turning on -funroll-loops is helpful, but should
> be much more helpful -- in many cases performance improves with a much
> higher unroll factor.  However, the effectiveness of unrolling is very much
> tied up with these issues in IVOPTS, where we currently end up with too many
> separate base registers for IVs.  As we increase the unroll factor, we
By this, do you mean that too many candidates are chosen?  Or is it just the
issue this PR describes?  Thanks.

> eventually hit this as a limiting factor, so fixing this IVOPTS issue would
> be very helpful for POWER.
> 
> As a side note, with -fprofile-use a GIMPLE unroller could peel and unroll
> hot loop traces in loops that would otherwise be too complex to unroll. 
> I.e., if there is a single hot trace through a loop, you can do tail
> duplication on the trace to force it into superblock form, and then peel and
> unroll that superblock while falling into the original loop if the trace is
> left.  Complete unrolling and unrolling by a factor are both possible.  I
> don't know of specific benchmarks that would be helped by this, though.
> 
> (An RTL unroller could do this as well, but it seems much more natural and
> implementable in GIMPLE.)



* [Bug target/29256] [4.9/5/6 regression] loop performance regression
       [not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
                   ` (20 preceding siblings ...)
  2015-08-13  2:11 ` amker at gcc dot gnu.org
@ 2015-08-13  3:28 ` wschmidt at gcc dot gnu.org
  2015-08-13  5:01 ` amker at gcc dot gnu.org
                   ` (7 subsequent siblings)
  29 siblings, 0 replies; 30+ messages in thread
From: wschmidt at gcc dot gnu.org @ 2015-08-13  3:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

--- Comment #61 from Bill Schmidt <wschmidt at gcc dot gnu.org> ---
(In reply to amker from comment #60)
> (In reply to Bill Schmidt from comment #59)
> > We don't have a lot of data yet, but we have seen several examples in SPEC
> > and other benchmarks where turning on -funroll-loops is helpful, but should
> > be much more helpful -- in many cases performance improves with a much
> > higher unroll factor.  However, the effectiveness of unrolling is very much
> > tied up with these issues in IVOPTS, where we currently end up with too many
> > separate base registers for IVs.  As we increase the unroll factor, we
> By this, do you mean too many candidates are chosen?  Or the issue just like
> this PR describes?  Thanks.
> 

On the surface, it's the issue from this PR where we have lots of separate
induction variables with their own index registers each requiring an add during
each iteration.  The presence of this issue masks whether we have too many
candidates, but in the sense that we often see register spill associated with
this kind of code, we do have too many.  I.e., the register pressure model may
not be in tune with the kind of addressing mode that's being selected, but
that's just a theory.  Or perhaps pressure is just being generically
under-predicted for POWER.
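
To illustrate the two shapes in source form, here is a hand-written C sketch
(not compiler output; both function names are made up).  copy_separate_ivs
mimics code where each memory stream has its own pointer that needs an add
every iteration; copy_shared_iv uses one index shared by both accesses.  What
the compiler actually emits for either form still depends on IVOPTS.

```c
#include <stddef.h>

/* One induction variable per stream: two pointer adds per iteration. */
void copy_separate_ivs(double *dst, const double *src, size_t n)
{
    const double *end = src + n;

    while (src != end) {
        *dst = *src;
        src++;  /* add for the load stream */
        dst++;  /* add for the store stream */
    }
}

/* One shared induction variable: a single add per iteration, with both
   accesses addressed as base + index. */
void copy_shared_iv(double *dst, const double *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

With unrolling, the first shape grows by one add per access per copy unless the
accesses are rewritten as constant offsets from a shared base, which is why the
problem becomes more visible as the unroll factor increases.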

Up till now we haven't done a lot of detailed analysis.  Hopefully we can free
somebody up to start looking at some of our unrolling issues soon.



* [Bug target/29256] [4.9/5/6 regression] loop performance regression
       [not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
                   ` (21 preceding siblings ...)
  2015-08-13  3:28 ` wschmidt at gcc dot gnu.org
@ 2015-08-13  5:01 ` amker at gcc dot gnu.org
  2021-05-14  9:45 ` [Bug target/29256] [9/10/11/12 " jakub at gcc dot gnu.org
                   ` (6 subsequent siblings)
  29 siblings, 0 replies; 30+ messages in thread
From: amker at gcc dot gnu.org @ 2015-08-13  5:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

--- Comment #62 from amker at gcc dot gnu.org ---
(In reply to Bill Schmidt from comment #61)
> (In reply to amker from comment #60)
> > (In reply to Bill Schmidt from comment #59)
> > > We don't have a lot of data yet, but we have seen several examples in SPEC
> > > and other benchmarks where turning on -funroll-loops is helpful, but should
> > > be much more helpful -- in many cases performance improves with a much
> > > higher unroll factor.  However, the effectiveness of unrolling is very much
> > > tied up with these issues in IVOPTS, where we currently end up with too many
> > > separate base registers for IVs.  As we increase the unroll factor, we
> > By this, do you mean too many candidates are chosen?  Or the issue just like
> > this PR describes?  Thanks.
> > 
> 
> On the surface, it's the issue from this PR where we have lots of separate
> induction variables with their own index registers each requiring an add
> during each iteration.  The presence of this issue masks whether we have too
IMHO, this issue should be fixed by a GIMPLE unroller before IVO, or in the RTL
unroller.  It's not that practical to fix it in IVO.

> many candidates, but in the sense that we often see register spill
> associated with this kind of code, we do have too many.  I.e., the register
> pressure model may not be in tune with the kind of addressing mode that's
> being selected, but that's just a theory.  Or perhaps pressure is just being
> generically under-predicted for POWER.
IVO's reg-pressure model sometimes fails to preserve a small IV set on aarch64
too.  I have this issue on my list.  On the other hand, the loops I saw are
generally very big, so it might be inappropriate for the RTL unroller to unroll
them in the first place.

> 
> Up till now we haven't done a lot of detailed analysis.  Hopefully we can
> free somebody up to start looking at some of our unrolling issues soon.



* [Bug target/29256] [9/10/11/12 regression] loop performance regression
       [not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
                   ` (22 preceding siblings ...)
  2015-08-13  5:01 ` amker at gcc dot gnu.org
@ 2021-05-14  9:45 ` jakub at gcc dot gnu.org
  2021-06-01  8:04 ` rguenth at gcc dot gnu.org
                   ` (5 subsequent siblings)
  29 siblings, 0 replies; 30+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-05-14  9:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|8.5                         |9.4

--- Comment #70 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 8 branch is being closed.


* [Bug target/29256] [9/10/11/12 regression] loop performance regression
       [not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
                   ` (23 preceding siblings ...)
  2021-05-14  9:45 ` [Bug target/29256] [9/10/11/12 " jakub at gcc dot gnu.org
@ 2021-06-01  8:04 ` rguenth at gcc dot gnu.org
  2022-05-27  9:33 ` [Bug target/29256] [10/11/12/13 " rguenth at gcc dot gnu.org
                   ` (4 subsequent siblings)
  29 siblings, 0 replies; 30+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-06-01  8:04 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|9.4                         |9.5

--- Comment #71 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 9.4 is being released, retargeting bugs to GCC 9.5.


* [Bug target/29256] [10/11/12/13 regression] loop performance regression
       [not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
                   ` (24 preceding siblings ...)
  2021-06-01  8:04 ` rguenth at gcc dot gnu.org
@ 2022-05-27  9:33 ` rguenth at gcc dot gnu.org
  2022-06-28 10:29 ` jakub at gcc dot gnu.org
                   ` (3 subsequent siblings)
  29 siblings, 0 replies; 30+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-05-27  9:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|9.5                         |10.4

--- Comment #72 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 9 branch is being closed


* [Bug target/29256] [10/11/12/13 regression] loop performance regression
       [not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
                   ` (25 preceding siblings ...)
  2022-05-27  9:33 ` [Bug target/29256] [10/11/12/13 " rguenth at gcc dot gnu.org
@ 2022-06-28 10:29 ` jakub at gcc dot gnu.org
  2023-07-07 10:28 ` [Bug target/29256] [11/12/13/14 " rguenth at gcc dot gnu.org
                   ` (2 subsequent siblings)
  29 siblings, 0 replies; 30+ messages in thread
From: jakub at gcc dot gnu.org @ 2022-06-28 10:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|10.4                        |10.5

--- Comment #73 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 10.4 is being released, retargeting bugs to GCC 10.5.


* [Bug target/29256] [11/12/13/14 regression] loop performance regression
       [not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
                   ` (26 preceding siblings ...)
  2022-06-28 10:29 ` jakub at gcc dot gnu.org
@ 2023-07-07 10:28 ` rguenth at gcc dot gnu.org
  2023-07-15  7:57 ` pinskia at gcc dot gnu.org
  2023-07-17  8:29 ` rguenth at gcc dot gnu.org
  29 siblings, 0 replies; 30+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-07-07 10:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|10.5                        |11.5

--- Comment #74 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 10 branch is being closed.


* [Bug target/29256] [11/12/13/14 regression] loop performance regression
       [not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
                   ` (27 preceding siblings ...)
  2023-07-07 10:28 ` [Bug target/29256] [11/12/13/14 " rguenth at gcc dot gnu.org
@ 2023-07-15  7:57 ` pinskia at gcc dot gnu.org
  2023-07-17  8:29 ` rguenth at gcc dot gnu.org
  29 siblings, 0 replies; 30+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-07-15  7:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

--- Comment #75 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
This looks fixed in GCC 11+; I tried x86_64, i686, powerpc (powerpc-spe is no
longer supported).

For 32-bit powerpc we get:
tuned_STREAM_Copy:
.LFB0:
        .cfi_startproc
        lis 9,.LANCHOR0@ha
        lis 10,0x3
        la 3,.LANCHOR0@l(9)
        ori 0,10,0xd090
        addis 4,3,0xf4
        mtctr 0
        addi 5,3,-8
        addi 8,4,9208
.L2:
        lwz 6,8(5)
        lwz 7,12(5)
        lfd 2,16(5)
        lfd 4,24(5)
        lfd 6,32(5)
        lfd 8,40(5)
        lfd 10,48(5)
        lfd 12,56(5)
        lfdu 0,64(5)
        stw 6,8(8)
        stw 7,12(8)
        stfd 2,16(8)
        stfd 4,24(8)
        stfd 6,32(8)
        stfd 8,40(8)
        stfd 10,48(8)
        stfd 12,56(8)
        stfdu 0,64(8)
        bdnz .L2
        blr

Which seems to be the best.

The GIMPLE level for the loop is:
  <bb 3> [local count: 1063004409]:
  # ivtmp.10_8 = PHI <ivtmp.10_7(3), ivtmp.10_12(2)>
  # ivtmp.12_14 = PHI <ivtmp.12_15(3), ivtmp.12_16(2)>
  ivtmp.10_7 = ivtmp.10_8 + 8;
  _18 = (void *) ivtmp.10_7;
  _1 = MEM[(double *)_18];
  ivtmp.12_15 = ivtmp.12_14 + 8;
  _19 = (void *) ivtmp.12_15;
  MEM[(double *)_19] = _1;
  if (ivtmp.10_7 != _21)
    goto <bb 3>; [99.00%]
  else
    goto <bb 4>; [1.00%]
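
For reference, a minimal sketch of the kind of kernel these dumps correspond
to.  The function name comes from the assembly above and the array names a and
c from the x86_64 dump later in the thread.  The element count N is an
assumption inferred from the loop bounds, not taken from the original test
case: the counter register is loaded with 0x3d090 (250000) and each iteration
of the unrolled loop copies 8 doubles.

```c
#include <stddef.h>

/* Assumed size: 250000 loop iterations x 8 doubles per iteration. */
#define N 2000000

double a[N], c[N];

/* Plain copy loop; unrolling and the choice of addressing modes are
   left entirely to the compiler. */
void tuned_STREAM_Copy(void)
{
    for (size_t j = 0; j < N; j++)
        c[j] = a[j];
}
```

The same size is consistent with the x86_64 loop bound of 16000000 bytes, i.e.
2000000 elements of 8 bytes each.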


* [Bug target/29256] [11/12/13/14 regression] loop performance regression
       [not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
                   ` (28 preceding siblings ...)
  2023-07-15  7:57 ` pinskia at gcc dot gnu.org
@ 2023-07-17  8:29 ` rguenth at gcc dot gnu.org
  29 siblings, 0 replies; 30+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-07-17  8:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #76 from Richard Biener <rguenth at gcc dot gnu.org> ---
x86_64 has

  <bb 3> [local count: 536870800]:
  # ivtmp.13_3 = PHI <ivtmp.13_9(3), 0(2)>
  vect__1.6_12 = MEM <vector(2) double> [(double *)&a + ivtmp.13_3 * 1];
  MEM <vector(2) double> [(double *)&c + ivtmp.13_3 * 1] = vect__1.6_12;
  ivtmp.13_9 = ivtmp.13_3 + 16;
  if (ivtmp.13_9 != 16000000)

and

.L2:
        movapd  a(%rax), %xmm0
        addq    $16, %rax
        movaps  %xmm0, c-16(%rax)
        cmpq    $16000000, %rax
        jne     .L2

which I think is optimal.  With -fPIC we get

.L2:
        movapd  (%rax,%rdx), %xmm0
        addq    $16, %rax
        movaps  %xmm0, -16(%rax,%rcx)
        cmpq    $16000000, %rax
        jne     .L2

Let's close this.


Thread overview: 30+ messages
     [not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
2011-06-27 13:58 ` [Bug middle-end/29256] [4.3/4.4/4.5/4.6/4.7 regression] loop performance regression rguenth at gcc dot gnu.org
2012-01-12 12:28 ` [Bug target/29256] [4.4/4.5/4.6/4.7 " rguenth at gcc dot gnu.org
2012-03-13 14:23 ` [Bug target/29256] [4.5/4.6/4.7/4.8 " jakub at gcc dot gnu.org
2012-07-02 12:25 ` rguenth at gcc dot gnu.org
2013-04-12 15:17 ` [Bug target/29256] [4.7/4.8/4.9 " jakub at gcc dot gnu.org
2013-12-09  4:50 ` law at redhat dot com
2014-06-12 13:45 ` [Bug target/29256] [4.7/4.8/4.9/4.10 " rguenth at gcc dot gnu.org
2014-12-19 13:28 ` [Bug target/29256] [4.8/4.9/5 " jakub at gcc dot gnu.org
2015-04-07 17:36 ` law at redhat dot com
2015-04-08  7:20 ` rguenther at suse dot de
2015-05-19 17:51 ` [Bug target/29256] [4.8/4.9/5/6 " law at redhat dot com
2015-05-20  2:21 ` amker at gcc dot gnu.org
2015-05-20 12:45 ` wschmidt at gcc dot gnu.org
2015-06-23  8:22 ` rguenth at gcc dot gnu.org
2015-06-26 19:59 ` [Bug target/29256] [4.9/5/6 " jakub at gcc dot gnu.org
2015-06-26 20:29 ` jakub at gcc dot gnu.org
2015-08-11 18:36 ` wschmidt at gcc dot gnu.org
2015-08-12  7:12 ` rguenther at suse dot de
2015-08-12  7:34 ` amker at gcc dot gnu.org
2015-08-12 13:24 ` wschmidt at gcc dot gnu.org
2015-08-13  2:11 ` amker at gcc dot gnu.org
2015-08-13  3:28 ` wschmidt at gcc dot gnu.org
2015-08-13  5:01 ` amker at gcc dot gnu.org
2021-05-14  9:45 ` [Bug target/29256] [9/10/11/12 " jakub at gcc dot gnu.org
2021-06-01  8:04 ` rguenth at gcc dot gnu.org
2022-05-27  9:33 ` [Bug target/29256] [10/11/12/13 " rguenth at gcc dot gnu.org
2022-06-28 10:29 ` jakub at gcc dot gnu.org
2023-07-07 10:28 ` [Bug target/29256] [11/12/13/14 " rguenth at gcc dot gnu.org
2023-07-15  7:57 ` pinskia at gcc dot gnu.org
2023-07-17  8:29 ` rguenth at gcc dot gnu.org
