public inbox for gcc-bugs@sourceware.org
* [Bug middle-end/29256] [4.3/4.4/4.5/4.6/4.7 regression] loop performance regression
[not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
@ 2011-06-27 13:58 ` rguenth at gcc dot gnu.org
2012-01-12 12:28 ` [Bug target/29256] [4.4/4.5/4.6/4.7 " rguenth at gcc dot gnu.org
` (28 subsequent siblings)
29 siblings, 0 replies; 30+ messages in thread
From: rguenth at gcc dot gnu.org @ 2011-06-27 13:58 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
Richard Guenther <rguenth at gcc dot gnu.org> changed:
What             |Removed |Added
--------------------------------
Target Milestone |4.3.6   |4.4.7
--- Comment #40 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-06-27 12:13:46 UTC ---
4.3 branch is being closed, moving to 4.4.7 target.
^ permalink raw reply [flat|nested] 30+ messages in thread
* [Bug target/29256] [4.4/4.5/4.6/4.7 regression] loop performance regression
From: rguenth at gcc dot gnu.org @ 2012-01-12 12:28 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
Richard Guenther <rguenth at gcc dot gnu.org> changed:
What      |Removed    |Added
----------------------------
Target    |           |powerpc-*-*
Component |middle-end |target
--- Comment #41 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-01-12 12:26:59 UTC ---
This seems to be more or less target dependent, so adding a list of affected
targets.
For powerpc we still use N induction variables after unrolling instead of just
one, which I suppose was the point of this regression report.
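In C terms, the difference can be pictured roughly as follows. This is a
minimal sketch with illustrative array and function names (not the bug's
testcase): the first function keeps a single induction variable with the
unrolled accesses at fixed offsets from it, while the second drags along a
separate incremented pointer per unrolled copy, which is the shape being
complained about.

```c
#include <stddef.h>

#define LEN 16
double src[LEN], dst[LEN];

/* Desired shape: one induction variable; the unrolled bodies
   are fixed offsets from it. */
void copy_one_iv(void)
{
    for (size_t i = 0; i < LEN; i += 4) {
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
}

/* Regressed shape: each unrolled body keeps its own incremented
   induction variable (here, its own pointer pair), so many more
   registers stay live across the loop. */
void copy_many_ivs(void)
{
    double *s0 = src,     *d0 = dst;
    double *s1 = src + 1, *d1 = dst + 1;
    double *s2 = src + 2, *d2 = dst + 2;
    double *s3 = src + 3, *d3 = dst + 3;
    for (size_t n = LEN / 4; n > 0; n--) {
        *d0 = *s0; s0 += 4; d0 += 4;
        *d1 = *s1; s1 += 4; d1 += 4;
        *d2 = *s2; s2 += 4; d2 += 4;
        *d3 = *s3; s3 += 4; d3 += 4;
    }
}
```

Both functions compute the same result; the difference is purely in how many
induction registers stay live across the loop body.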
* [Bug target/29256] [4.5/4.6/4.7/4.8 regression] loop performance regression
From: jakub at gcc dot gnu.org @ 2012-03-13 14:23 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What             |Removed |Added
--------------------------------
Target Milestone |4.4.7   |4.5.4
--- Comment #42 from Jakub Jelinek <jakub at gcc dot gnu.org> 2012-03-13 12:46:45 UTC ---
4.4 branch is being closed, moving to 4.5.4 target.
* [Bug target/29256] [4.5/4.6/4.7/4.8 regression] loop performance regression
From: rguenth at gcc dot gnu.org @ 2012-07-02 12:25 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
Richard Guenther <rguenth at gcc dot gnu.org> changed:
What             |Removed |Added
--------------------------------
Target Milestone |4.5.4   |4.6.4
--- Comment #43 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-07-02 12:23:04 UTC ---
The 4.5 branch is being closed, adjusting target milestone.
* [Bug target/29256] [4.7/4.8/4.9 regression] loop performance regression
From: jakub at gcc dot gnu.org @ 2013-04-12 15:17 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What             |Removed |Added
--------------------------------
Target Milestone |4.6.4   |4.7.4
--- Comment #44 from Jakub Jelinek <jakub at gcc dot gnu.org> 2013-04-12 15:16:34 UTC ---
GCC 4.6.4 has been released and the branch has been closed.
* [Bug target/29256] [4.7/4.8/4.9 regression] loop performance regression
From: law at redhat dot com @ 2013-12-09 4:50 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
Jeffrey A. Law <law at redhat dot com> changed:
What |Removed |Added
--------------------
CC   |        |law at redhat dot com
--- Comment #45 from Jeffrey A. Law <law at redhat dot com> ---
This problem still exists and can be seen by making the arrays external and
using -fno-tree-loop-distribute-patterns.
.L2:
evlddx 31,10,9
addi 7,9,8
addi 0,9,16
addi 11,9,24
addi 3,9,32
evstddx 31,8,9
addi 4,9,40
evlddx 31,10,7
addi 5,9,48
addi 6,9,56
evlddx 12,10,6
addi 9,9,64
evstddx 31,8,7
evlddx 7,10,0
evstddx 7,8,0
evlddx 0,10,11
evstddx 0,8,11
evlddx 11,10,3
evstddx 11,8,3
evlddx 3,10,4
evstddx 3,8,4
evlddx 4,10,5
evstddx 4,8,5
evstddx 12,8,6
bdnz .L2
evldd 31,8(1)
addi 1,1,16
blr
* [Bug target/29256] [4.7/4.8/4.9/4.10 regression] loop performance regression
From: rguenth at gcc dot gnu.org @ 2014-06-12 13:45 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
Richard Biener <rguenth at gcc dot gnu.org> changed:
What             |Removed |Added
--------------------------------
Target Milestone |4.7.4   |4.8.4
--- Comment #46 from Richard Biener <rguenth at gcc dot gnu.org> ---
The 4.7 branch is being closed, moving target milestone to 4.8.4.
* [Bug target/29256] [4.8/4.9/5 regression] loop performance regression
From: jakub at gcc dot gnu.org @ 2014-12-19 13:28 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What             |Removed |Added
--------------------------------
Target Milestone |4.8.4   |4.8.5
--- Comment #47 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 4.8.4 has been released.
* [Bug target/29256] [4.8/4.9/5 regression] loop performance regression
From: law at redhat dot com @ 2015-04-07 17:36 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
--- Comment #49 from Jeffrey A. Law <law at redhat dot com> ---
Richi, see c#45. Basically the regression is "gone" for the testcase as-is...
But it's pretty easy to twiddle it slightly and show the regression. It's also
important to note this is e500 code, so you need to configure your toolchain
appropriately.
* [Bug target/29256] [4.8/4.9/5 regression] loop performance regression
From: rguenther at suse dot de @ 2015-04-08 7:20 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
--- Comment #50 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 7 Apr 2015, law at redhat dot com wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
>
> --- Comment #49 from Jeffrey A. Law <law at redhat dot com> ---
> Richi, see c#45. Basically the regression is "gone" for the testcase as-is...
> But it's pretty easy to twiddle it slightly and show the regression. It's also
> important to note this is e500 code, so you need to configure your toolchain
> appropriately.
Please provide a testcase that shows the regression, then, and instructions
for configuring a cross cc1.
* [Bug target/29256] [4.8/4.9/5/6 regression] loop performance regression
From: law at redhat dot com @ 2015-05-19 17:51 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
--- Comment #51 from Jeffrey A. Law <law at redhat dot com> ---
Configure for the powerpc-linux-gnuspe target with the --enable-e500_double
option:
/home/gcc/GIT-2/gcc/configure powerpc-linux-gnuspe --enable-e500_double
Testcase:
#define N 2000000
extern double a[N], c[N];

void tuned_STREAM_Copy()
{
  int j;
  for (j = 0; j < N; j++)
    c[j] = a[j];
}
./cc1 -O3 -funroll-loops -funroll-all-loops -fno-tree-loop-distribute-patterns
j.c -I./ -mspe
Results in:
tuned_STREAM_Copy:
stwu 1,-16(1)
lis 7,0x3
lis 8,c@ha
lis 10,a@ha
ori 0,7,0xd090
evstdd 31,8(1)
li 9,0
la 8,c@l(8)
la 10,a@l(10)
mtctr 0
.L2:
evlddx 31,10,9
addi 7,9,8
addi 0,9,16
addi 11,9,24
addi 3,9,32
evstddx 31,8,9
addi 4,9,40
evlddx 31,10,7
addi 5,9,48
addi 6,9,56
evlddx 12,10,6
addi 9,9,64
evstddx 31,8,7
evlddx 7,10,0
evstddx 7,8,0
evlddx 0,10,11
evstddx 0,8,11
evlddx 11,10,3
evstddx 11,8,3
evlddx 3,10,4
evstddx 3,8,4
evlddx 4,10,5
evstddx 4,8,5
evstddx 12,8,6
bdnz .L2
evldd 31,8(1)
addi 1,1,16
blr
.size tuned_STREAM_Copy, .-tuned_STREAM_Copy
Which looks to me like ivopts has mucked things up badly.
* [Bug target/29256] [4.8/4.9/5/6 regression] loop performance regression
From: amker at gcc dot gnu.org @ 2015-05-20 2:21 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
amker at gcc dot gnu.org changed:
What |Removed |Added
--------------------
CC   |        |amker at gcc dot gnu.org
--- Comment #52 from amker at gcc dot gnu.org ---
I don't understand powerpc assembly well, but this looks like the same problem
as on aarch64/arm. Ah, and we are even looking at the same function...
I think this is a general issue caused by an inconsistency between the
tree-level ivopts and the RTL-level loop unroller; specifically, in how we
handle unrolled induction variable registers after unrolling.
The core loop on aarch64 with options "-O3 -funroll-all-loops -mcpu=cortex-a57"
gave below output:
.L3:
add x2, x0, 16
ldr q16, [x17, x0]
add x10, x0, 32
add x9, x0, 48
add x8, x0, 64
ldr q17, [x17, x2]
add x3, x0, 80
add x6, x0, 96
add x5, x0, 112
add w1, w1, 8
ldr q19, [x17, x10]
cmp w1, w14
ldr q18, [x17, x9]
ldr q20, [x17, x8]
ldr q21, [x17, x3]
ldr q22, [x17, x6]
ldr q23, [x17, x5]
str q16, [x18, x0]
add x0, x0, 128
str q17, [x18, x2]
str q19, [x18, x10]
str q18, [x18, x9]
str q20, [x18, x8]
str q21, [x18, x3]
str q22, [x18, x6]
str q23, [x18, x5]
bcc .L3
The tree ivopt dump is quite neat:
<bb 6>:
# ivtmp.16_28 = PHI <ivtmp.16_25(9), 0(5)>
# ivtmp.19_42 = PHI <ivtmp.19_41(9), 0(5)>
vect__4.13_62 = MEM[base: vectp_a.12_58, index: ivtmp.19_42, offset: 0B];
MEM[base: vectp_c.15_63, index: ivtmp.19_42, offset: 0B] = vect__4.13_62;
ivtmp.16_25 = ivtmp.16_28 + 1;
ivtmp.19_41 = ivtmp.19_42 + 16;
if (ivtmp.16_25 < bnd.7_36)
goto <bb 9>;
else
goto <bb 7>;
...
<bb 9>:
goto <bb 6>;
But after the RTL unroller we have options like "-fsplit-ivs-in-unroller" and
"-fweb". These two options try to split the long live ranges of induction
variables into separate ones. Eventually, with the following fwprop and IRA
passes, we end up with multiple ivs for each original iv.
I see two possible fixes here. One is to implement a tree-level unroller
before IVOPTS and remove the RTL one; the RTL unroller is aggressive enough
that we do not enable it by default even at -O3.
The other is to change how the RTL unroller handles unrolled ivs. It splits
an unrolled iv to avoid a pseudo register with a long live range, since that
may hurt the RTL optimizers. That assumption may have held once, but it no
longer seems true to me, especially for induction variables: tree-level
ivopts already assumes that each iv occupies a register, and ivs are used
intensively, so each should live in a single hard register. For this
specific case, we can factor [base+index] out of the memory references and
use [new_base], [new_base+4], [new_base+8], ... in the unrolled body. If
tree ivopts chooses a [reg+offset] addressing mode, we only need to generate
an instruction sequence like "[reg+offset], [reg+(offset+4)],
[reg+(offset+8)], ...; reg = reg + unrolled_times*step"
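A C-level sketch of that proposed shape (illustrative function and variable
names; offsets of 8 bytes assumed since this testcase copies doubles): the
unrolled accesses reuse one base register at small fixed offsets, and the
base is bumped once per iteration.

```c
#include <stddef.h>

/* Proposed shape: factor [base+index] into a single advancing base.
   The four copies become [new_base], [new_base+8], [new_base+16],
   [new_base+24], and the base register is bumped once per iteration
   by unrolled_times * step. */
void copy_single_base(double *restrict dst, const double *restrict src,
                      size_t n)
{
    const double *s = src;
    double *d = dst;
    const double *stop = src + (n & ~(size_t)3);
    while (s != stop) {
        d[0] = s[0];   /* [new_base]      */
        d[1] = s[1];   /* [new_base + 8]  */
        d[2] = s[2];   /* [new_base + 16] */
        d[3] = s[3];   /* [new_base + 24] */
        s += 4;        /* reg = reg + unrolled_times * step */
        d += 4;
    }
    for (size_t i = n & ~(size_t)3; i < n; i++)  /* remainder loop */
        dst[i] = src[i];
}
```

Only the two base pointers stay live across the loop, instead of one
incremented index per unrolled copy.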
Thanks,
bin
* [Bug target/29256] [4.8/4.9/5/6 regression] loop performance regression
From: wschmidt at gcc dot gnu.org @ 2015-05-20 12:45 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
--- Comment #53 from Bill Schmidt <wschmidt at gcc dot gnu.org> ---
I'm not a fan of a tree-level unroller. It's impossible to make good decisions
about unroll factors that early. But your second approach sounds quite
promising to me.
* [Bug target/29256] [4.8/4.9/5/6 regression] loop performance regression
From: rguenth at gcc dot gnu.org @ 2015-06-23 8:22 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
Richard Biener <rguenth at gcc dot gnu.org> changed:
What             |Removed |Added
--------------------------------
Target Milestone |4.8.5   |4.9.3
--- Comment #54 from Richard Biener <rguenth at gcc dot gnu.org> ---
The gcc-4_8-branch is being closed, re-targeting regressions to 4.9.3.
* [Bug target/29256] [4.9/5/6 regression] loop performance regression
From: jakub at gcc dot gnu.org @ 2015-06-26 19:59 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
--- Comment #55 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 4.9.3 has been released.
* [Bug target/29256] [4.9/5/6 regression] loop performance regression
From: jakub at gcc dot gnu.org @ 2015-06-26 20:29 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What             |Removed |Added
--------------------------------
Target Milestone |4.9.3   |4.9.4
* [Bug target/29256] [4.9/5/6 regression] loop performance regression
From: wschmidt at gcc dot gnu.org @ 2015-08-11 18:36 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
--- Comment #56 from Bill Schmidt <wschmidt at gcc dot gnu.org> ---
(In reply to Bill Schmidt from comment #53)
> I'm not a fan of a tree-level unroller. It's impossible to make good
> decisions about unroll factors that early. But your second approach sounds
> quite promising to me.
I would be willing to soften this statement. I think that an early unroller
might well be a profitable approach for most systems with large caches and so
forth, where if the unrolling heuristics are not completely accurate we are
still likely to make a reasonably good decision. However, I would expect to
see ports with limited caches/memory to want more accurate control over
unrolling decisions. So I could see allowing ports to select between a GIMPLE
unroller and an RTL unroller (I doubt anybody would want both).
In general it seems like PowerPC could benefit from more aggressive unrolling
much of the time, provided we can also solve the related IVOPTS problems that
cause too much register spill.
I may have an interest in working on a GIMPLE unroller, depending on how
quickly I can complete or shed some other projects...
* [Bug target/29256] [4.9/5/6 regression] loop performance regression
From: rguenther at suse dot de @ 2015-08-12 7:12 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
--- Comment #57 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 11 Aug 2015, wschmidt at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
>
> --- Comment #56 from Bill Schmidt <wschmidt at gcc dot gnu.org> ---
> (In reply to Bill Schmidt from comment #53)
> > I'm not a fan of a tree-level unroller. It's impossible to make good
> > decisions about unroll factors that early. But your second approach sounds
> > quite promising to me.
>
> I would be willing to soften this statement. I think that an early unroller
> might well be a profitable approach for most systems with large caches and so
> forth, where if the unrolling heuristics are not completely accurate we are
> still likely to make a reasonably good decision. However, I would expect to
> see ports with limited caches/memory to want more accurate control over
> unrolling decisions. So I could see allowing ports to select between a GIMPLE
> unroller and an RTL unroller (I doubt anybody would want both).
>
> In general it seems like PowerPC could benefit from more aggressive unrolling
> much of the time, provided we can also solve the related IVOPTS problems that
> cause too much register spill.
>
> I may have an interest in working on a GIMPLE unroller, depending on how
> quickly I can complete or shed some other projects...
I think that a separate unrolling pass on GIMPLE would be a hard sell
due to the lack of a good cost model. _But_ doing unrolling as part
of another transform like we are doing now makes sense. So does
eventually moving parts of an RTL pass involving unrolling to
GIMPLE, like modulo scheduling or SMS (leaving the scheduling part
to RTL).
Note that the RTL unroller is not enabled by default at any optimization
level, and note that unfortunately it shares flags with the GIMPLE-level
complete peeling (where they mainly control cost modeling). Oh, but it is
enabled with -fprofile-use.
It's been a long time since I've done SPEC measuring with/without
-funroll-loops (or/and -fpeel-loops). Note that these flags have
secondary effects as well:
toplev.c: flag_web = flag_unroll_loops || flag_peel_loops;
toplev.c: flag_rename_registers = flag_unroll_loops || flag_peel_loops;
* [Bug target/29256] [4.9/5/6 regression] loop performance regression
From: amker at gcc dot gnu.org @ 2015-08-12 7:34 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
--- Comment #58 from amker at gcc dot gnu.org ---
(In reply to Bill Schmidt from comment #56)
> (In reply to Bill Schmidt from comment #53)
> > I'm not a fan of a tree-level unroller. It's impossible to make good
> > decisions about unroll factors that early. But your second approach sounds
> > quite promising to me.
>
> I would be willing to soften this statement. I think that an early unroller
> might well be a profitable approach for most systems with large caches and
> so forth, where if the unrolling heuristics are not completely accurate we
> are still likely to make a reasonably good decision. However, I would
> expect to see ports with limited caches/memory to want more accurate control
> over unrolling decisions. So I could see allowing ports to select between a
> GIMPLE unroller and an RTL unroller (I doubt anybody would want both).
Thanks for the comments.
As David suggested, we can try to implement a relatively conservative unroller
and make sure it's a win in most unrolled cases, even with some opportunities
missed. Then we could enable it at the -O3/-Ofast levels, which I think would
be welcome since we currently have no general unroller enabled by default.
>
> In general it seems like PowerPC could benefit from more aggressive
> unrolling much of the time, provided we can also solve the related IVOPTS
> problems that cause too much register spill.
>
> I may have an interest in working on a GIMPLE unroller, depending on how
> quickly I can complete or shed some other projects...
(In reply to rguenther@suse.de from comment #57)
> On Tue, 11 Aug 2015, wschmidt at gcc dot gnu.org wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
> >
> > --- Comment #56 from Bill Schmidt <wschmidt at gcc dot gnu.org> ---
> > (In reply to Bill Schmidt from comment #53)
> > > I'm not a fan of a tree-level unroller. It's impossible to make good
> > > decisions about unroll factors that early. But your second approach sounds
> > > quite promising to me.
> >
> > I would be willing to soften this statement. I think that an early unroller
> > might well be a profitable approach for most systems with large caches and so
> > forth, where if the unrolling heuristics are not completely accurate we are
> > still likely to make a reasonably good decision. However, I would expect to
> > see ports with limited caches/memory to want more accurate control over
> > unrolling decisions. So I could see allowing ports to select between a GIMPLE
> > unroller and an RTL unroller (I doubt anybody would want both).
> >
> > In general it seems like PowerPC could benefit from more aggressive unrolling
> > much of the time, provided we can also solve the related IVOPTS problems that
> > cause too much register spill.
> >
> > I may have an interest in working on a GIMPLE unroller, depending on how
> > quickly I can complete or shed some other projects...
>
> I think that a separate unrolling on GIMPLE would be a hard sell
> due to the lack of a good cost mode. _But_ doing unrolling as part
> of another transform like we are doing now makes sense. So does
> eventually moving parts of an RTL pass involving unrolling to
> GIMPLE, like modulo scheduling or SMS (leaving the scheduling part
> to RTL).
About the cost model: is it possible to introduce a cache-information model in
GCC? I don't see it as a difficult problem, and it could be a start for
possible cache-sensitive optimizations in the future. Another general
question: what kinds of costs do we need in a good unroller, besides
cache and branch costs?
>
> In general it seems like PowerPC could benefit from more aggressive
> unrolling much of the time, provided we can also solve the related IVOPTS
> problems that cause too much register spill.
>
> I may have an interest in working on a GIMPLE unroller, depending on how
> quickly I can complete or shed some other projects...
>
> Note that the RTL unroller is not enabled by default by any optimization
> level and note that unfortunately the RTL unroller shares flags with
> the GIMPLE level complete peeling (where it mainly controls cost
> modeling). Oh, but it's enabled with -fprofile-use.
>
> It's been a long time since I've done SPEC measuring with/without
> -funroll-loops (or/and -fpeel-loops). Note that these flags have
> secondary effects as well:
>
> toplev.c: flag_web = flag_unroll_loops || flag_peel_loops;
> toplev.c: flag_rename_registers = flag_unroll_loops || flag_peel_loops;
* [Bug target/29256] [4.9/5/6 regression] loop performance regression
From: wschmidt at gcc dot gnu.org @ 2015-08-12 13:24 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
--- Comment #59 from Bill Schmidt <wschmidt at gcc dot gnu.org> ---
(In reply to rguenther@suse.de from comment #57)
>
> It's been a long time since I've done SPEC measuring with/without
> -funroll-loops (or/and -fpeel-loops). Note that these flags have
> secondary effects as well:
>
> toplev.c: flag_web = flag_unroll_loops || flag_peel_loops;
> toplev.c: flag_rename_registers = flag_unroll_loops || flag_peel_loops;
We don't have a lot of data yet, but we have seen several examples in SPEC and
other benchmarks where turning on -funroll-loops is helpful, but should be much
more helpful -- in many cases performance improves with a much higher unroll
factor. However, the effectiveness of unrolling is very much tied up with
these issues in IVOPTS, where we currently end up with too many separate base
registers for IVs. As we increase the unroll factor, we eventually hit this as
a limiting factor, so fixing this IVOPTS issue would be very helpful for POWER.
As a side note, with -fprofile-use a GIMPLE unroller could peel and unroll hot
loop traces in loops that would otherwise be too complex to unroll. I.e., if
there is a single hot trace through a loop, you can do tail duplication on the
trace to force it into superblock form, and then peel and unroll that
superblock while falling into the original loop if the trace is left. Complete
unrolling and unrolling by a factor are both possible. I don't know of
specific benchmarks that would be helped by this, though.
(An RTL unroller could do this as well, but it seems much more natural and
implementable in GIMPLE.)
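The idea above can be sketched by hand in C. This is a minimal,
hypothetical example (the hot/cold split and helper are invented for
illustration, not taken from the PR): the hot trace is duplicated and
unrolled by 2 as straight-line code, and leaving the trace falls back
into the original loop.

```c
#include <stddef.h>

static double handle_rare(double v) { return -v; }  /* hypothetical cold path */

/* Original loop: one hot path (v >= 0), one cold path. */
double sum_orig(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += (x[i] >= 0.0) ? x[i] : handle_rare(x[i]);
    return s;
}

/* Superblock form: the hot trace is tail-duplicated and unrolled by 2;
   if any iteration would take the cold path, control leaves the
   superblock and the original loop finishes the remaining elements. */
double sum_superblock(const double *x, size_t n)
{
    double s = 0.0;
    size_t i = 0;
    while (i + 2 <= n) {
        if (x[i] < 0.0 || x[i + 1] < 0.0)
            break;             /* trace left: fall into the original loop */
        s += x[i];             /* unrolled hot trace, copy 1 */
        s += x[i + 1];         /* unrolled hot trace, copy 2 */
        i += 2;
    }
    for (; i < n; i++)         /* fallback covers cold path + remainder */
        s += (x[i] >= 0.0) ? x[i] : handle_rare(x[i]);
    return s;
}
```

The transformed version does the same work; it just gives the hot,
branch-free trace a form that unrolling and scheduling can exploit.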
* [Bug target/29256] [4.9/5/6 regression] loop performance regression
From: amker at gcc dot gnu.org @ 2015-08-13 2:11 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
--- Comment #60 from amker at gcc dot gnu.org ---
(In reply to Bill Schmidt from comment #59)
> (In reply to rguenther@suse.de from comment #57)
> >
> > It's been a long time since I've done SPEC measuring with/without
> > -funroll-loops (or/and -fpeel-loops). Note that these flags have
> > secondary effects as well:
> >
> > toplev.c: flag_web = flag_unroll_loops || flag_peel_loops;
> > toplev.c: flag_rename_registers = flag_unroll_loops || flag_peel_loops;
>
> We don't have a lot of data yet, but we have seen several examples in SPEC
> and other benchmarks where turning on -funroll-loops is helpful, but should
> be much more helpful -- in many cases performance improves with a much
> higher unroll factor. However, the effectiveness of unrolling is very much
> tied up with these issues in IVOPTS, where we currently end up with too many
> separate base registers for IVs. As we increase the unroll factor, we
By this, do you mean that too many candidates are chosen? Or is it the issue
this PR describes? Thanks.
> eventually hit this as a limiting factor, so fixing this IVOPTS issue would
> be very helpful for POWER.
>
> As a side note, with -fprofile-use a GIMPLE unroller could peel and unroll
> hot loop traces in loops that would otherwise be too complex to unroll.
> I.e., if there is a single hot trace through a loop, you can do tail
> duplication on the trace to force it into superblock form, and then peel and
> unroll that superblock while falling into the original loop if the trace is
> left. Complete unrolling and unrolling by a factor are both possible. I
> don't know of specific benchmarks that would be helped by this, though.
>
> (An RTL unroller could do this as well, but it seems much more natural and
> implementable in GIMPLE.)
* [Bug target/29256] [4.9/5/6 regression] loop performance regression
[not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
` (20 preceding siblings ...)
2015-08-13 2:11 ` amker at gcc dot gnu.org
@ 2015-08-13 3:28 ` wschmidt at gcc dot gnu.org
2015-08-13 5:01 ` amker at gcc dot gnu.org
` (7 subsequent siblings)
29 siblings, 0 replies; 30+ messages in thread
From: wschmidt at gcc dot gnu.org @ 2015-08-13 3:28 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
--- Comment #61 from Bill Schmidt <wschmidt at gcc dot gnu.org> ---
(In reply to amker from comment #60)
> (In reply to Bill Schmidt from comment #59)
> > We don't have a lot of data yet, but we have seen several examples in SPEC
> > and other benchmarks where turning on -funroll-loops is helpful, but should
> > be much more helpful -- in many cases performance improves with a much
> > higher unroll factor. However, the effectiveness of unrolling is very much
> > tied up with these issues in IVOPTS, where we currently end up with too many
> > separate base registers for IVs. As we increase the unroll factor, we
> By this, do you mean that too many candidates are chosen? Or is it the issue
> this PR describes? Thanks.
>
On the surface, it's the issue from this PR: we have lots of separate
induction variables, each with its own index register requiring an add during
each iteration. The presence of this issue masks whether we have too many
candidates; but given that we often see register spills associated with this
kind of code, we do have too many. I.e., the register-pressure model may not
be in tune with the kind of addressing mode being selected, but that's just a
theory. Or perhaps pressure is just being generically under-predicted for
POWER.
Up until now we haven't done a lot of detailed analysis. Hopefully we can
free somebody up to start looking at some of our unrolling issues soon.
* [Bug target/29256] [4.9/5/6 regression] loop performance regression
[not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
` (21 preceding siblings ...)
2015-08-13 3:28 ` wschmidt at gcc dot gnu.org
@ 2015-08-13 5:01 ` amker at gcc dot gnu.org
2021-05-14 9:45 ` [Bug target/29256] [9/10/11/12 " jakub at gcc dot gnu.org
` (6 subsequent siblings)
29 siblings, 0 replies; 30+ messages in thread
From: amker at gcc dot gnu.org @ 2015-08-13 5:01 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
--- Comment #62 from amker at gcc dot gnu.org ---
(In reply to Bill Schmidt from comment #61)
> (In reply to amker from comment #60)
> > (In reply to Bill Schmidt from comment #59)
> > > We don't have a lot of data yet, but we have seen several examples in SPEC
> > > and other benchmarks where turning on -funroll-loops is helpful, but should
> > > be much more helpful -- in many cases performance improves with a much
> > > higher unroll factor. However, the effectiveness of unrolling is very much
> > > tied up with these issues in IVOPTS, where we currently end up with too many
> > > separate base registers for IVs. As we increase the unroll factor, we
> > By this, do you mean that too many candidates are chosen? Or is it the
> > issue this PR describes? Thanks.
> >
>
> On the surface, it's the issue from this PR where we have lots of separate
> induction variables with their own index registers each requiring an add
> during each iteration. The presence of this issue masks whether we have too
IMHO, this issue should be fixed by a GIMPLE unroller before IVOPTS, or in the
RTL unroller; it's not really practical to fix it in IVOPTS.
> many candidates, but in the sense that we often see register spill
> associated with this kind of code, we do have too many. I.e., the register
> pressure model may not be in tune with the kind of addressing mode that's
> being selected, but that's just a theory. Or perhaps pressure is just being
> generically under-predicted for POWER.
IVOPTS's register-pressure model sometimes fails to preserve a small IV set on
aarch64 too; I have this issue on my list. On the other hand, the loops I saw
are generally very big, so it might be inappropriate for the RTL unroller to
decide to unroll them in the first place.
>
> Up till now we haven't done a lot of detailed analysis. Hopefully we can
> free somebody up to start looking at some of our unrolling issues soon.
* [Bug target/29256] [9/10/11/12 regression] loop performance regression
[not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
` (22 preceding siblings ...)
2015-08-13 5:01 ` amker at gcc dot gnu.org
@ 2021-05-14 9:45 ` jakub at gcc dot gnu.org
2021-06-01 8:04 ` rguenth at gcc dot gnu.org
` (5 subsequent siblings)
29 siblings, 0 replies; 30+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-05-14 9:45 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|8.5 |9.4
--- Comment #70 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 8 branch is being closed.
* [Bug target/29256] [9/10/11/12 regression] loop performance regression
[not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
` (23 preceding siblings ...)
2021-05-14 9:45 ` [Bug target/29256] [9/10/11/12 " jakub at gcc dot gnu.org
@ 2021-06-01 8:04 ` rguenth at gcc dot gnu.org
2022-05-27 9:33 ` [Bug target/29256] [10/11/12/13 " rguenth at gcc dot gnu.org
` (4 subsequent siblings)
29 siblings, 0 replies; 30+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-06-01 8:04 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|9.4 |9.5
--- Comment #71 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
* [Bug target/29256] [10/11/12/13 regression] loop performance regression
[not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
` (24 preceding siblings ...)
2021-06-01 8:04 ` rguenth at gcc dot gnu.org
@ 2022-05-27 9:33 ` rguenth at gcc dot gnu.org
2022-06-28 10:29 ` jakub at gcc dot gnu.org
` (3 subsequent siblings)
29 siblings, 0 replies; 30+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-05-27 9:33 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|9.5 |10.4
--- Comment #72 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 9 branch is being closed.
* [Bug target/29256] [10/11/12/13 regression] loop performance regression
[not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
` (25 preceding siblings ...)
2022-05-27 9:33 ` [Bug target/29256] [10/11/12/13 " rguenth at gcc dot gnu.org
@ 2022-06-28 10:29 ` jakub at gcc dot gnu.org
2023-07-07 10:28 ` [Bug target/29256] [11/12/13/14 " rguenth at gcc dot gnu.org
` (2 subsequent siblings)
29 siblings, 0 replies; 30+ messages in thread
From: jakub at gcc dot gnu.org @ 2022-06-28 10:29 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|10.4 |10.5
--- Comment #73 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
* [Bug target/29256] [11/12/13/14 regression] loop performance regression
[not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
` (26 preceding siblings ...)
2022-06-28 10:29 ` jakub at gcc dot gnu.org
@ 2023-07-07 10:28 ` rguenth at gcc dot gnu.org
2023-07-15 7:57 ` pinskia at gcc dot gnu.org
2023-07-17 8:29 ` rguenth at gcc dot gnu.org
29 siblings, 0 replies; 30+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-07-07 10:28 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|10.5 |11.5
--- Comment #74 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 10 branch is being closed.
* [Bug target/29256] [11/12/13/14 regression] loop performance regression
[not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
` (27 preceding siblings ...)
2023-07-07 10:28 ` [Bug target/29256] [11/12/13/14 " rguenth at gcc dot gnu.org
@ 2023-07-15 7:57 ` pinskia at gcc dot gnu.org
2023-07-17 8:29 ` rguenth at gcc dot gnu.org
29 siblings, 0 replies; 30+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-07-15 7:57 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
--- Comment #75 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
This looks fixed in GCC 11+; I tried x86_64, i686, powerpc (powerpc-spe is no
longer supported).
For 32bit powerpc we get:
tuned_STREAM_Copy:
.LFB0:
.cfi_startproc
lis 9,.LANCHOR0@ha
lis 10,0x3
la 3,.LANCHOR0@l(9)
ori 0,10,0xd090
addis 4,3,0xf4
mtctr 0
addi 5,3,-8
addi 8,4,9208
.L2:
lwz 6,8(5)
lwz 7,12(5)
lfd 2,16(5)
lfd 4,24(5)
lfd 6,32(5)
lfd 8,40(5)
lfd 10,48(5)
lfd 12,56(5)
lfdu 0,64(5)
stw 6,8(8)
stw 7,12(8)
stfd 2,16(8)
stfd 4,24(8)
stfd 6,32(8)
stfd 8,40(8)
stfd 10,48(8)
stfd 12,56(8)
stfdu 0,64(8)
bdnz .L2
blr
Which seems to be the best.
gimple level for the loop is:
<bb 3> [local count: 1063004409]:
# ivtmp.10_8 = PHI <ivtmp.10_7(3), ivtmp.10_12(2)>
# ivtmp.12_14 = PHI <ivtmp.12_15(3), ivtmp.12_16(2)>
ivtmp.10_7 = ivtmp.10_8 + 8;
_18 = (void *) ivtmp.10_7;
_1 = MEM[(double *)_18];
ivtmp.12_15 = ivtmp.12_14 + 8;
_19 = (void *) ivtmp.12_15;
MEM[(double *)_19] = _1;
if (ivtmp.10_7 != _21)
goto <bb 3>; [99.00%]
else
goto <bb 4>; [1.00%]
* [Bug target/29256] [11/12/13/14 regression] loop performance regression
[not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
` (28 preceding siblings ...)
2023-07-15 7:57 ` pinskia at gcc dot gnu.org
@ 2023-07-17 8:29 ` rguenth at gcc dot gnu.org
29 siblings, 0 replies; 30+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-07-17 8:29 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution|--- |FIXED
--- Comment #76 from Richard Biener <rguenth at gcc dot gnu.org> ---
x86_64 has
<bb 3> [local count: 536870800]:
# ivtmp.13_3 = PHI <ivtmp.13_9(3), 0(2)>
vect__1.6_12 = MEM <vector(2) double> [(double *)&a + ivtmp.13_3 * 1];
MEM <vector(2) double> [(double *)&c + ivtmp.13_3 * 1] = vect__1.6_12;
ivtmp.13_9 = ivtmp.13_3 + 16;
if (ivtmp.13_9 != 16000000)
and
.L2:
movapd a(%rax), %xmm0
addq $16, %rax
movaps %xmm0, c-16(%rax)
cmpq $16000000, %rax
jne .L2
which I think is optimal. With -fPIC we get
.L2:
movapd (%rax,%rdx), %xmm0
addq $16, %rax
movaps %xmm0, -16(%rax,%rcx)
cmpq $16000000, %rax
jne .L2
let's close this.
end of thread, other threads:[~2023-07-17 8:30 UTC | newest]
Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <bug-29256-4@http.gcc.gnu.org/bugzilla/>
2011-06-27 13:58 ` [Bug middle-end/29256] [4.3/4.4/4.5/4.6/4.7 regression] loop performance regression rguenth at gcc dot gnu.org
2012-01-12 12:28 ` [Bug target/29256] [4.4/4.5/4.6/4.7 " rguenth at gcc dot gnu.org
2012-03-13 14:23 ` [Bug target/29256] [4.5/4.6/4.7/4.8 " jakub at gcc dot gnu.org
2012-07-02 12:25 ` rguenth at gcc dot gnu.org
2013-04-12 15:17 ` [Bug target/29256] [4.7/4.8/4.9 " jakub at gcc dot gnu.org
2013-12-09 4:50 ` law at redhat dot com
2014-06-12 13:45 ` [Bug target/29256] [4.7/4.8/4.9/4.10 " rguenth at gcc dot gnu.org
2014-12-19 13:28 ` [Bug target/29256] [4.8/4.9/5 " jakub at gcc dot gnu.org
2015-04-07 17:36 ` law at redhat dot com
2015-04-08 7:20 ` rguenther at suse dot de
2015-05-19 17:51 ` [Bug target/29256] [4.8/4.9/5/6 " law at redhat dot com
2015-05-20 2:21 ` amker at gcc dot gnu.org
2015-05-20 12:45 ` wschmidt at gcc dot gnu.org
2015-06-23 8:22 ` rguenth at gcc dot gnu.org
2015-06-26 19:59 ` [Bug target/29256] [4.9/5/6 " jakub at gcc dot gnu.org
2015-06-26 20:29 ` jakub at gcc dot gnu.org
2015-08-11 18:36 ` wschmidt at gcc dot gnu.org
2015-08-12 7:12 ` rguenther at suse dot de
2015-08-12 7:34 ` amker at gcc dot gnu.org
2015-08-12 13:24 ` wschmidt at gcc dot gnu.org
2015-08-13 2:11 ` amker at gcc dot gnu.org
2015-08-13 3:28 ` wschmidt at gcc dot gnu.org
2015-08-13 5:01 ` amker at gcc dot gnu.org
2021-05-14 9:45 ` [Bug target/29256] [9/10/11/12 " jakub at gcc dot gnu.org
2021-06-01 8:04 ` rguenth at gcc dot gnu.org
2022-05-27 9:33 ` [Bug target/29256] [10/11/12/13 " rguenth at gcc dot gnu.org
2022-06-28 10:29 ` jakub at gcc dot gnu.org
2023-07-07 10:28 ` [Bug target/29256] [11/12/13/14 " rguenth at gcc dot gnu.org
2023-07-15 7:57 ` pinskia at gcc dot gnu.org
2023-07-17 8:29 ` rguenth at gcc dot gnu.org
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).