* [Bug middle-end/53533] [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6
2012-05-31 0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
@ 2012-05-31 0:58 ` matt at use dot net
2012-05-31 9:59 ` rguenth at gcc dot gnu.org
` (45 subsequent siblings)
46 siblings, 0 replies; 48+ messages in thread
From: matt at use dot net @ 2012-05-31 0:58 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
--- Comment #1 from Matt Hargett <matt at use dot net> 2012-05-31 00:55:36 UTC ---
Created attachment 27526
--> http://gcc.gnu.org/bugzilla/attachment.cgi?id=27526
tarball containing buildable sources and binaries that demonstrate the severe
performance regression on amdfam10
* [Bug middle-end/53533] [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6
From: rguenth at gcc dot gnu.org @ 2012-05-31 9:59 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
Richard Guenther <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |WAITING
   Last reconfirmed|                            |2012-05-31
     Ever Confirmed|0                           |1
--- Comment #2 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-05-31 09:21:09 UTC ---
Please do not use any of the Graphite optimization flags. Can you produce a
simple testcase please?
* [Bug middle-end/53533] [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6
From: matt at use dot net @ 2012-06-11 19:56 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
--- Comment #3 from Matt Hargett <matt at use dot net> 2012-06-11 19:56:14 UTC ---
Created attachment 27603
--> http://gcc.gnu.org/bugzilla/attachment.cgi?id=27603
ZIP with pre-processed shorter example, callgrind output, and smaller binaries
* [Bug middle-end/53533] [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6
From: matt at use dot net @ 2012-06-11 19:57 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
--- Comment #4 from Matt Hargett <matt at use dot net> 2012-06-11 19:57:12 UTC ---
Created attachment 27604
--> http://gcc.gnu.org/bugzilla/attachment.cgi?id=27604
shorter source example, ~150 lines w/o comments
* [Bug middle-end/53533] [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6
From: matt at use dot net @ 2012-06-11 20:02 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
--- Comment #5 from Matt Hargett <matt at use dot net> 2012-06-11 20:02:41 UTC ---
Got rid of the Graphite options; it made no difference. I reduced the original
test from the suite and attached its source, preprocessor output from 4.6 and
4.7 (no major difference), and callgrind output. To keep things simple, I'm
just using -O3 and -fwhole-program.
According to callgrind, 4.7's instruction references went up by 60% and D1
misses went up by 15% at -O3 versus 4.6 at -O3.
Let me know if you need any more information to continue triaging.
Thanks!
* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: rguenth at gcc dot gnu.org @ 2012-06-12 9:54 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
Richard Guenther <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |x86_64-*-*
             Status|WAITING                     |NEW
      Known to work|                            |4.6.3
           Keywords|                            |missed-optimization
          Component|middle-end                  |rtl-optimization
                 CC|                            |jakub at gcc dot gnu.org,
                   |                            |uros at gcc dot gnu.org
            Summary|[4.7 regression] loop       |[4.7/4.8 regression]
                   |unrolling as measured by    |vectorization causes loop
                   |Adobe's C++Benchmark is     |unrolling test slowdown as
                   |twice as slow versus        |measured by Adobe's
                   |4.4-4.6                     |C++Benchmark
      Known to fail|                            |4.7.1, 4.8.0
           Severity|major                       |normal
--- Comment #6 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-06-12 09:54:02 UTC ---
Ok, it seems to me that this has template-metaprogramming loop unrolling. With
GCC 4.7 we unroll and vectorize all loops, for example unroll factor 8 looks
like
<bb 50>:
# vect_var_.941_3474 = PHI <vect_var_.941_3472(50), {0, 0, 0, 0}(64)>
# vect_var_.941_3473 = PHI <vect_var_.941_3471(50), {0, 0, 0, 0}(64)>
# ivtmp.1325_970 = PHI <ivtmp.1325_812(50), ivtmp.1325_813(64)>
D.9934_819 = (void *) ivtmp.1325_970;
vect_var_.918_323 = MEM[base: D.9934_819, offset: 0B];
vect_var_.919_325 = MEM[base: D.9934_819, offset: 16B];
vect_var_.920_328 = vect_var_.918_323 + { 12345, 12345, 12345, 12345 };
vect_var_.920_330 = vect_var_.919_325 + { 12345, 12345, 12345, 12345 };
vect_var_.923_480 = vect_var_.920_328 * { 914237, 914237, 914237, 914237 };
vect_var_.923_895 = vect_var_.920_330 * { 914237, 914237, 914237, 914237 };
vect_var_.926_231 = vect_var_.923_480 + { 12332, 12332, 12332, 12332 };
vect_var_.926_232 = vect_var_.923_895 + { 12332, 12332, 12332, 12332 };
vect_var_.929_235 = vect_var_.926_231 * { 914237, 914237, 914237, 914237 };
vect_var_.929_236 = vect_var_.926_232 * { 914237, 914237, 914237, 914237 };
vect_var_.932_239 = vect_var_.929_235 + { 12332, 12332, 12332, 12332 };
vect_var_.932_240 = vect_var_.929_236 + { 12332, 12332, 12332, 12332 };
vect_var_.935_113 = vect_var_.932_239 * { 914237, 914237, 914237, 914237 };
vect_var_.935_247 = vect_var_.932_240 * { 914237, 914237, 914237, 914237 };
vect_var_.938_582 = vect_var_.935_113 + { -13, -13, -13, -13 };
vect_var_.938_839 = vect_var_.935_247 + { -13, -13, -13, -13 };
vect_var_.941_3472 = vect_var_.938_582 + vect_var_.941_3474;
vect_var_.941_3471 = vect_var_.938_839 + vect_var_.941_3473;
ivtmp.1325_812 = ivtmp.1325_970 + 32;
if (ivtmp.1325_812 != D.9937_388)
goto <bb 50>;
else
goto <bb 51>;
<bb 51>:
# vect_var_.941_3468 = PHI <vect_var_.941_3472(50)>
# vect_var_.941_3467 = PHI <vect_var_.941_3471(50)>
vect_var_.945_3466 = vect_var_.941_3468 + vect_var_.941_3467;
vect_var_.946_3465 = vect_var_.945_3466 v>> 64;
vect_var_.946_3464 = vect_var_.946_3465 + vect_var_.945_3466;
vect_var_.946_3463 = vect_var_.946_3464 v>> 32;
vect_var_.946_3462 = vect_var_.946_3463 + vect_var_.946_3464;
stmp_var_.944_3461 = BIT_FIELD_REF <vect_var_.946_3462, 32, 0>;
init_value.7_795 = init_value;
D.8606_796 = (int) init_value.7_795;
D.8600_797 = D.8606_796 + 12345;
D.8599_798 = D.8600_797 * 914237;
D.8602_799 = D.8599_798 + 12332;
D.8601_800 = D.8602_799 * 914237;
D.8604_801 = D.8601_800 + 12332;
D.8603_802 = D.8604_801 * 914237;
D.8605_803 = D.8603_802 + -13;
temp_804 = D.8605_803 * 8000;
if (temp_804 != stmp_var_.944_3461)
goto <bb 52>;
else
goto <bb 53>;
With GCC 4.6, on the other hand, the above loop is not vectorized; only the
(slow) non-unrolled loop is.
<bb 49>:
# result_622 = PHI <result_704(49), 0(63)>
# ivtmp.852_1026 = PHI <ivtmp.852_842(49), ivtmp.852_844(63)>
D.9283_3302 = (void *) ivtmp.852_1026;
temp_801 = MEM[base: D.9283_3302, offset: 0B];
D.8366_802 = temp_801 + 12345;
D.8365_803 = D.8366_802 * 914237;
D.8368_804 = D.8365_803 + 12332;
D.8367_805 = D.8368_804 * 914237;
D.8370_806 = D.8367_805 + 12332;
D.8369_807 = D.8370_806 * 914237;
temp_808 = D.8369_807 + -13;
result_810 = temp_808 + result_622;
temp_815 = MEM[base: D.9283_3302, offset: 4B];
D.8381_816 = temp_815 + 12345;
D.8382_817 = D.8381_816 * 914237;
D.8378_818 = D.8382_817 + 12332;
D.8379_819 = D.8378_818 * 914237;
D.8376_820 = D.8379_819 + 12332;
D.8377_821 = D.8376_820 * 914237;
temp_822 = D.8377_821 + -13;
result_824 = result_810 + temp_822;
temp_788 = MEM[base: D.9283_3302, offset: 8B];
D.8351_789 = temp_788 + 12345;
D.8352_790 = D.8351_789 * 914237;
D.8348_791 = D.8352_790 + 12332;
D.8349_792 = D.8348_791 * 914237;
D.8346_793 = D.8349_792 + 12332;
D.8347_794 = D.8346_793 * 914237;
temp_795 = D.8347_794 + -13;
result_797 = temp_795 + result_824;
temp_774 = MEM[base: D.9283_3302, offset: 12B];
D.8333_775 = temp_774 + 12345;
D.8334_776 = D.8333_775 * 914237;
D.8330_777 = D.8334_776 + 12332;
D.8331_778 = D.8330_777 * 914237;
D.8328_779 = D.8331_778 + 12332;
D.8329_780 = D.8328_779 * 914237;
temp_781 = D.8329_780 + -13;
result_783 = temp_781 + result_797;
temp_760 = MEM[base: D.9283_3302, offset: 16B];
D.8315_761 = temp_760 + 12345;
D.8316_762 = D.8315_761 * 914237;
D.8312_763 = D.8316_762 + 12332;
D.8313_764 = D.8312_763 * 914237;
D.8310_765 = D.8313_764 + 12332;
D.8311_766 = D.8310_765 * 914237;
temp_767 = D.8311_766 + -13;
result_769 = temp_767 + result_783;
temp_746 = MEM[base: D.9283_3302, offset: 20B];
D.8297_747 = temp_746 + 12345;
D.8298_748 = D.8297_747 * 914237;
D.8294_749 = D.8298_748 + 12332;
D.8295_750 = D.8294_749 * 914237;
D.8292_751 = D.8295_750 + 12332;
D.8293_752 = D.8292_751 * 914237;
temp_753 = D.8293_752 + -13;
result_755 = temp_753 + result_769;
temp_732 = MEM[base: D.9283_3302, offset: 24B];
D.8279_733 = temp_732 + 12345;
D.8280_734 = D.8279_733 * 914237;
D.8276_735 = D.8280_734 + 12332;
D.8277_736 = D.8276_735 * 914237;
D.8274_737 = D.8277_736 + 12332;
D.8275_738 = D.8274_737 * 914237;
temp_739 = D.8275_738 + -13;
result_741 = temp_739 + result_755;
temp_695 = MEM[base: D.9283_3302, offset: 28B];
D.8246_696 = temp_695 + 12345;
D.8245_697 = D.8246_696 * 914237;
D.8248_698 = D.8245_697 + 12332;
D.8247_699 = D.8248_698 * 914237;
D.8250_700 = D.8247_699 + 12332;
D.8249_701 = D.8250_700 * 914237;
temp_702 = D.8249_701 + -13;
result_704 = temp_702 + result_741;
ivtmp.852_842 = ivtmp.852_1026 + 32;
if (ivtmp.852_842 != D.9292_3369)
goto <bb 49>;
else
goto <bb 50>;
<bb 50>:
# result_3198 = PHI <result_704(49)>
init_value.7_825 = init_value;
D.8393_826 = (int) init_value.7_825;
D.8387_827 = D.8393_826 + 12345;
D.8386_828 = D.8387_827 * 914237;
D.8389_829 = D.8386_828 + 12332;
D.8388_830 = D.8389_829 * 914237;
D.8391_831 = D.8388_830 + 12332;
D.8390_832 = D.8391_831 * 914237;
D.8392_833 = D.8390_832 + -13;
temp_834 = D.8392_833 * 8000;
if (temp_834 != result_3198)
goto <bb 51>;
else
goto <bb 52>;
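For readers who do not want to decode the dumps, here is a rough C
reconstruction of what both GIMPLE blocks compute. It is only a sketch read
back from the dump, not the benchmark's actual source: the names kernel,
sum_unrolled8, check and SIZE are made up for illustration, and uint32_t is
used instead of the int seen in the dump so that the wraparound is well
defined in C.

```c
#include <stddef.h>
#include <stdint.h>

/* Element count implied by the "temp * 8000" check in the dump. */
#define SIZE 8000

/* Per-element arithmetic, constants copied from the GIMPLE dump. */
static uint32_t kernel(uint32_t x)
{
    x = (x + 12345u) * 914237u;
    x = (x + 12332u) * 914237u;
    x = (x + 12332u) * 914237u;
    return x - 13u;                 /* the dump shows "+ -13" */
}

/* Unroll factor 8: eight elements (32 bytes) per outer iteration, as in
   the scalar 4.6 loop. n is assumed to be a multiple of 8. */
static uint32_t sum_unrolled8(const uint32_t *data, size_t n)
{
    uint32_t result = 0;
    for (size_t i = 0; i < n; i += 8)
        for (size_t j = 0; j < 8; ++j)   /* fully expanded in the dump */
            result += kernel(data[i + j]);
    return result;
}

/* The final comparison: every element holds init_value, so the sum must
   equal kernel(init_value) * SIZE (modulo 2^32). */
static int check(uint32_t result, uint32_t init_value)
{
    return result == kernel(init_value) * (uint32_t)SIZE;
}
```

The vectorized 4.7 version computes the same thing with two four-lane
accumulators that are summed after the loop.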
With -fno-tree-vectorize the performance is the same. It seems that
vectorization is not profitable here for some reason. Same behavior
can be observed with GCC 4.8.
I used the preprocessed source for 4.7 from the ZIP file.
The generated code is odd, to say the least; the inner loop looks like
movdqa .LC6(%rip), %xmm3
xorl %ebx, %ebx
movdqa .LC7(%rip), %xmm0
movdqa .LC8(%rip), %xmm1
movdqa .LC9(%rip), %xmm2
.p2align 4,,10
.p2align 3
.L51:
pxor %xmm6, %xmm6
movl $data32, %eax
movdqa %xmm6, %xmm7
.p2align 4,,10
.p2align 3
.L53:
movdqa (%rax), %xmm4
movdqa %xmm0, %xmm8
paddd %xmm3, %xmm4
movdqa %xmm4, %xmm5
psrldq $4, %xmm4
psrldq $4, %xmm8
pmuludq %xmm8, %xmm4
pshufd $8, %xmm4, %xmm4
pmuludq %xmm0, %xmm5
pshufd $8, %xmm5, %xmm5
movdqa %xmm0, %xmm8
psrldq $4, %xmm8
punpckldq %xmm4, %xmm5
paddd %xmm1, %xmm5
movdqa %xmm5, %xmm4
psrldq $4, %xmm5
pmuludq %xmm8, %xmm5
pshufd $8, %xmm5, %xmm5
pmuludq %xmm0, %xmm4
pshufd $8, %xmm4, %xmm4
punpckldq %xmm5, %xmm4
movdqa %xmm0, %xmm5
paddd %xmm1, %xmm4
movdqa %xmm4, %xmm8
psrldq $4, %xmm5
psrldq $4, %xmm4
pmuludq %xmm4, %xmm5
pshufd $8, %xmm5, %xmm5
pmuludq %xmm0, %xmm8
pshufd $8, %xmm8, %xmm4
movdqa %xmm0, %xmm8
psrldq $4, %xmm8
punpckldq %xmm5, %xmm4
paddd %xmm2, %xmm4
paddd %xmm4, %xmm7
movdqa 16(%rax), %xmm4
addq $32, %rax
paddd %xmm3, %xmm4
movdqa %xmm4, %xmm5
psrldq $4, %xmm4
pmuludq %xmm8, %xmm4
pshufd $8, %xmm4, %xmm4
movdqa %xmm0, %xmm8
pmuludq %xmm0, %xmm5
pshufd $8, %xmm5, %xmm5
cmpq $data32+32000, %rax
psrldq $4, %xmm8
punpckldq %xmm4, %xmm5
paddd %xmm1, %xmm5
movdqa %xmm5, %xmm4
psrldq $4, %xmm5
pmuludq %xmm8, %xmm5
pshufd $8, %xmm5, %xmm5
pmuludq %xmm0, %xmm4
pshufd $8, %xmm4, %xmm4
punpckldq %xmm5, %xmm4
movdqa %xmm0, %xmm5
paddd %xmm1, %xmm4
movdqa %xmm4, %xmm8
psrldq $4, %xmm5
psrldq $4, %xmm4
pmuludq %xmm4, %xmm5
pshufd $8, %xmm5, %xmm5
pmuludq %xmm0, %xmm8
pshufd $8, %xmm8, %xmm4
punpckldq %xmm5, %xmm4
paddd %xmm2, %xmm4
paddd %xmm4, %xmm6
jne .L53
which means we expand the multiplications with the constants in an odd way.
* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: rguenth at gcc dot gnu.org @ 2012-06-12 10:12 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
--- Comment #7 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-06-12 10:11:51 UTC ---
Btw, when I run the benchmark with the addition of -march=native (for me,
that's -march=corei7) then GCC 4.7 performs better than 4.6:
4.6:
./t 100000
test   description                    absolute   operations   ratio with
number                                time       per second   test0

 0     "int32_t for loop unroll 1"    0.41 sec   1951.22 M    1.00
 1     "int32_t for loop unroll 2"    0.51 sec   1568.63 M    1.24
 2     "int32_t for loop unroll 3"    0.47 sec   1702.13 M    1.15
 3     "int32_t for loop unroll 4"    0.48 sec   1666.67 M    1.17
 4     "int32_t for loop unroll 5"    0.47 sec   1702.13 M    1.15
 5     "int32_t for loop unroll 6"    0.51 sec   1568.63 M    1.24
 6     "int32_t for loop unroll 7"    0.47 sec   1702.13 M    1.15
 7     "int32_t for loop unroll 8"    0.47 sec   1702.13 M    1.15

Total absolute time for int32_t for loop unrolling: 3.79 sec
4.7:
./t 100000
test   description                    absolute   operations   ratio with
number                                time       per second   test0

 0     "int32_t for loop unroll 1"    0.39 sec   2051.28 M    1.00
 1     "int32_t for loop unroll 2"    0.40 sec   2000.00 M    1.03
 2     "int32_t for loop unroll 3"    0.39 sec   2051.28 M    1.00
 3     "int32_t for loop unroll 4"    0.39 sec   2051.28 M    1.00
 4     "int32_t for loop unroll 5"    0.38 sec   2105.26 M    0.97
 5     "int32_t for loop unroll 6"    0.41 sec   1951.22 M    1.05
 6     "int32_t for loop unroll 7"    0.37 sec   2162.16 M    0.95
 7     "int32_t for loop unroll 8"    0.36 sec   2222.22 M    0.92

Total absolute time for int32_t for loop unrolling: 3.09 sec
The loop then looks like (the expected)
.L53:
movdqa (%rax), %xmm4
paddd %xmm3, %xmm4
pmulld %xmm0, %xmm4
paddd %xmm1, %xmm4
pmulld %xmm0, %xmm4
paddd %xmm1, %xmm4
pmulld %xmm0, %xmm4
paddd %xmm2, %xmm4
paddd %xmm4, %xmm6
movdqa 16(%rax), %xmm4
addq $32, %rax
cmpq $data32+32000, %rax
paddd %xmm3, %xmm4
pmulld %xmm0, %xmm4
paddd %xmm1, %xmm4
pmulld %xmm0, %xmm4
paddd %xmm1, %xmm4
pmulld %xmm0, %xmm4
paddd %xmm2, %xmm4
paddd %xmm4, %xmm5
jne .L53
It looks like pmulld is only available with SSE 4.1; otherwise we fall back
to the define_insn_and_split "*sse2_mulv4si3". But that complexity is not
reflected in the vectorizer cost model (which needs improvement ...).
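To make the cost difference concrete, here is a portable scalar model of the
SSE2 fallback sequence visible in the earlier assembly (psrldq $4, pmuludq,
pshufd $8, punpckldq). It is only an illustrative sketch: the v4si struct and
the model_* helper names are invented here, and it models lane movement
rather than reproducing GCC's actual splitter.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit register holding four 32-bit lanes. */
typedef struct { uint32_t lane[4]; } v4si;

/* pmuludq: unsigned 32x32->64 multiply of lanes 0 and 2; the two 64-bit
   products occupy lanes {0,1} and {2,3} of the result. */
static v4si model_pmuludq(v4si a, v4si b)
{
    uint64_t p0 = (uint64_t)a.lane[0] * b.lane[0];
    uint64_t p2 = (uint64_t)a.lane[2] * b.lane[2];
    v4si r = {{ (uint32_t)p0, (uint32_t)(p0 >> 32),
                (uint32_t)p2, (uint32_t)(p2 >> 32) }};
    return r;
}

/* psrldq $4: shift the whole register right by 4 bytes (one lane). */
static v4si model_psrldq4(v4si a)
{
    v4si r = {{ a.lane[1], a.lane[2], a.lane[3], 0 }};
    return r;
}

/* pshufd $8: result lanes are {src0, src2, src0, src0}. */
static v4si model_pshufd8(v4si a)
{
    v4si r = {{ a.lane[0], a.lane[2], a.lane[0], a.lane[0] }};
    return r;
}

/* punpckldq: interleave the low two lanes of each operand. */
static v4si model_punpckldq(v4si a, v4si b)
{
    v4si r = {{ a.lane[0], b.lane[0], a.lane[1], b.lane[1] }};
    return r;
}

/* One V4SI multiply without pmulld: two pmuludq, two psrldq, two pshufd
   and one punpckldq, versus a single pmulld with SSE 4.1. */
static v4si model_sse2_mulv4si3(v4si a, v4si b)
{
    v4si even = model_pshufd8(model_pmuludq(a, b));
    v4si odd  = model_pshufd8(model_pmuludq(model_psrldq4(a),
                                            model_psrldq4(b)));
    return model_punpckldq(even, odd);
}
```

Counting the helpers gives seven data-processing instructions per vector
multiply in this model, which is the complexity the cost model does not see.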
* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: rguenth at gcc dot gnu.org @ 2012-06-12 10:27 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
--- Comment #8 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-06-12 10:27:15 UTC ---
Small testcase:
int a[256];
int b[256];
void foo (void)
{
int i;
for (i = 0; i < 256; ++i)
{
b[i] = a[i] * 23;
}
}
You can see that we shuffle even the vector of constants around, not taking
into account the REG_EQUAL note, which is gone at split1 time, removed by
either loop2_invariant or loop2_unswitch.
(insn 26 24 27 3 (set (reg:V4SI 82 [ vect_var_.10 ])
        (mult:V4SI (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9, offset: 0B] ])
            (reg:V4SI 85))) t.c:9 1496 {*sse2_mulv4si3}
     (expr_list:REG_EQUAL (mult:V4SI (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9, offset: 0B] ])
            (const_vector:V4SI [
                    (const_int 23 [0x17])
                    (const_int 23 [0x17])
                    (const_int 23 [0x17])
                    (const_int 23 [0x17])
                ]))
        (expr_list:REG_DEAD (reg:V4SI 84)
            (expr_list:REG_DEAD (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9, offset: 0B] ])
                (nil)))))
* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: rguenth at gcc dot gnu.org @ 2012-06-12 10:39 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
--- Comment #9 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-06-12 10:39:19 UTC ---
And cprop fails to propagate
(reg:V4SI 85) := (const_vector:V4SI [
(const_int 23 [0x17])
(const_int 23 [0x17])
(const_int 23 [0x17])
(const_int 23 [0x17])
])
but it at least re-adds the REG_EQUAL note, but DSE drops it again. From
(insn 26 24 27 3 (set (reg:V4SI 82 [ vect_var_.10 ])
        (mult:V4SI (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9, offset: 0B] ])
            (reg:V4SI 85))) t.c:9 1496 {*sse2_mulv4si3}
     (expr_list:REG_EQUAL (mult:V4SI (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9, offset: 0B] ])
            (const_vector:V4SI [
                    (const_int 23 [0x17])
                    (const_int 23 [0x17])
                    (const_int 23 [0x17])
                    (const_int 23 [0x17])
                ]))
        (expr_list:REG_DEAD (reg:V4SI 85)
            (expr_list:REG_DEAD (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9, offset: 0B] ])
                (nil)))))
we go to
(insn 26 24 27 3 (set (reg:V4SI 82 [ vect_var_.10 ])
        (mult:V4SI (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9, offset: 0B] ])
            (reg:V4SI 85))) t.c:9 1496 {*sse2_mulv4si3}
     (expr_list:REG_DEAD (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9, offset: 0B] ])
        (nil)))
Unfortunately there is no cprop pass after split1 to eventually clean things
up again (because of out-of-cfg-layout-mode ...). If I force it to run
it cannot simplify
(insn 42 24 43 3 (set (subreg:V2DI (reg:V4SI 86) 0)
        (mult:V2DI (zero_extend:V2DI (vec_select:V2SI (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9, offset: 0B] ])
                    (parallel [
                            (const_int 0 [0])
                            (const_int 2 [0x2])
                        ])))
            (zero_extend:V2DI (vec_select:V2SI (reg:V4SI 85)
                    (parallel [
                            (const_int 0 [0])
                            (const_int 2 [0x2])
                        ]))))) t.c:9 -1
     (nil))
either though.
* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: rguenth at gcc dot gnu.org @ 2012-06-12 11:57 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
Richard Guenther <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |stevenb.gcc at gmail dot
                   |                            |com
--- Comment #10 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-06-12 11:57:20 UTC ---
Changing the insn_and_split to
(define_insn_and_split "*sse2_mulv4si3"
[(set (match_operand:V4SI 0 "register_operand")
(mult:V4SI (match_operand:V4SI 1 "register_operand")
(match_operand:V4SI 2 "nonmemory_vector_operand")))]
...
and defining
(define_predicate "nonmemory_vector_operand"
(ior (match_operand 0 "register_operand")
(match_code "const_vector")))
we ICE because when splitting
(insn 26 24 27 3 (set (reg:V4SI 82 [ vect_var_.10 ])
        (mult:V4SI (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9, offset: 0B] ])
            (const_vector:V4SI [
                    (const_int 23 [0x17])
                    (const_int 23 [0x17])
                    (const_int 23 [0x17])
                    (const_int 23 [0x17])
                ]))) t.c:9 1496 {*sse2_mulv4si3}
     (expr_list:REG_DEAD (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9, offset: 0B] ])
        (nil)))
we don't even try to simplify when emitting the code.
But maybe allowing const_vector in (some of) the define_insn_and_split would
be the way to go ...
* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: matt at use dot net @ 2012-06-12 18:26 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
--- Comment #11 from Matt Hargett <matt at use dot net> 2012-06-12 18:25:25 UTC ---
Richard,
Thanks for the quick analysis! Sounds like a perfect storm of sorts :/
re: cprop failure: this may be indicated by another major regression in their
suite for the "simple constant folding" tests. In GCC 4.1-4.6, those tests are
all 0.0s but in 4.7 take tens of seconds. Let me know if you want me to file a
separate bug/reduced test case for that, and then have that new bug depend on
this one. Otherwise, I'll wait until this one sees some resolution and then
retest.
re: multiple passes: if you think that feature has enough merit to be revisited
now, I can look into re-proposing Maxim's patches from October/November 2011
that integrated your feedback at the time.
re: -march workaround: our deployment platform's minimum arch is nocona, and
enabling -march=nocona doesn't work around the issue. For grins, I tried
-march=amdfam10 (another deployment target, but would require a separate
distributable binary), but that also didn't work around the issue.
I see a small improvement when using -fno-tree-vectorize, but not nearly as
dramatic as yours. For the int32_t for and while loop unrolling, the times go
from ~107s and ~105s to ~96s and ~95s, respectively. The do and goto loop
unrolling times get slightly worse (~2%), but it might be noise.
Let me know if there's any additional testing/footwork you'd like me to do.
Again, thanks for the quick turnaround on such a deep analysis!
* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: rth at gcc dot gnu.org @ 2012-06-12 18:55 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
--- Comment #12 from Richard Henderson <rth at gcc dot gnu.org> 2012-06-12 18:54:24 UTC ---
(In reply to comment #10)
> But maybe allowing const_vector in (some of) the define_insn_and_split would
> be the way to go ...
Maybe. It certainly would ease some of the simplifications.
At the moment I don't think we can go from
mem -> const -> simplify -> const ->newmem
On the other hand, for this particular test case, where all
of the vector_cst elements are the same, and a reasonably
small number of bits set, it would be great to be able to
leverage synth_mult.
The main complexity for sse2_mulv4si3 is due to the fact that
we have to decompose the operation into V8HImode multiplies.
Whereas if we decompose the multiply, we have the shifts and
adds in V4SImode.
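As a rough illustration of the shift-and-add idea for the multiplier 23 from
the small testcase, the sketch below rewrites x * 23 as ((x * 3) << 3) - x
per lane. This is one hand-picked decomposition, not necessarily the one
synth_mult would choose; the v4si struct and mul23_shift_add are hypothetical
names, and the instruction names in the comments are the SSE2 operations that
would plausibly implement each step in V4SImode.

```c
#include <stdint.h>

/* Hypothetical stand-in for a V4SImode value: four 32-bit lanes. */
typedef struct { uint32_t lane[4]; } v4si;

/* Multiply every lane by 23 using only shifts, adds and subtracts.
   23 = 24 - 1 = (3 << 3) - 1, so x * 23 = (((x << 1) + x) << 3) - x. */
static v4si mul23_shift_add(v4si x)
{
    v4si r;
    for (int i = 0; i < 4; ++i) {
        uint32_t t = (x.lane[i] << 1) + x.lane[i];  /* 3x:  pslld + paddd */
        r.lane[i] = (t << 3) - x.lane[i];           /* 23x: pslld + psubd */
    }
    return r;
}
```

All four steps stay in 32-bit lanes, so no lane shuffling is needed, in
contrast to the widening-multiply decomposition.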
* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: rguenth at gcc dot gnu.org @ 2012-06-13 9:44 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
--- Comment #13 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-06-13 09:43:15 UTC ---
(In reply to comment #12)
> (In reply to comment #10)
> > But maybe allowing const_vector in (some of) the define_insn_and_split would
> > be the way to go ...
>
> Maybe. It certainly would ease some of the simplifications.
> At the moment I don't think we can go from
>
> mem -> const -> simplify -> const ->newmem
>
> On the other hand, for this particular test case, where all
> of the vector_cst elements are the same, and a reasonably
> small number of bits set, it would be great to be able to
> leverage synth_mult.
I agree, though that should possibly be done earlier.
> The main complexity for sse2_mulv4si3 is due to the fact that
> we have to decompose the operation into V8HImode multiplies.
> Whereas if we decompose the multiply, we have the shifts and
> adds in V4SImode.
Well, for a constant multiplier one can avoid the shuffles of the
multiplier - we seem to use v2si -> v2di multiplies with sse2_mulv4si3.
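The point about the constant multiplier can be sketched with a scalar model
of the v2si -> v2di multiply. When every lane of the multiplier holds the
same constant, shifting it by one lane changes nothing in lanes 0 and 2, so
only the data operand needs the shift. The names v4si, widening_mul_lo and
mul_by_splat are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Hypothetical stand-in for a V4SImode value: four 32-bit lanes. */
typedef struct { uint32_t lane[4]; } v4si;

/* Model of the widening multiply on lanes 0 and 2, keeping only the low
   32 bits of each 64-bit product. */
static void widening_mul_lo(v4si a, v4si b, uint32_t lo[2])
{
    lo[0] = (uint32_t)((uint64_t)a.lane[0] * b.lane[0]);
    lo[1] = (uint32_t)((uint64_t)a.lane[2] * b.lane[2]);
}

/* Multiply all four lanes of a by the same constant c. */
static v4si mul_by_splat(v4si a, uint32_t c)
{
    v4si splat = {{ c, c, c, c }};
    /* Shift only the data so that lanes 1 and 3 land in lanes 0 and 2. */
    v4si a_odd = {{ a.lane[1], a.lane[2], a.lane[3], 0 }};
    uint32_t even[2], odd[2];

    widening_mul_lo(a, splat, even);      /* a0*c, a2*c */
    widening_mul_lo(a_odd, splat, odd);   /* a1*c, a3*c; splat used unshifted */

    v4si r = {{ even[0], odd[0], even[1], odd[1] }};
    return r;
}
```

In this model the multiplier is never shifted, which is the saving being
described for the constant case.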
* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: rth at gcc dot gnu.org @ 2012-06-14 14:39 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
Richard Henderson <rth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rth at gcc dot gnu.org
         AssignedTo|unassigned at gcc dot       |rth at gcc dot gnu.org
                   |gnu.org                     |
--- Comment #14 from Richard Henderson <rth at gcc dot gnu.org> 2012-06-14 14:38:43 UTC ---
Mine, at least for a 4.8 solution.
* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: matt at use dot net @ 2012-06-14 18:02 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
--- Comment #15 from Matt Hargett <matt at use dot net> 2012-06-14 18:01:31 UTC ---
(In reply to comment #14)
> Mine, at least for a 4.8 solution.
What enhancement to 4.7 caused the regression? Can whatever the change was be
(partially) reverted to lessen the impact?
* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: rth at gcc dot gnu.org @ 2012-06-14 18:39 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
Richard Henderson <rth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
--- Comment #16 from Richard Henderson <rth at gcc dot gnu.org> 2012-06-14 18:38:30 UTC ---
Dunno exactly. The pre-SSE4.1 emulation of PMULLD has been there since
at least gcc 4.5.
What's not present in *any* version so far are some proper rtx_costs for
integer vector operations. So any questions the vectorizer might be
asking about what transformations are profitable are currently being
given bogus answers.
I'm hoping just that will fix the regression, though I also plan to
address some of the other algorithmic questions raised in this PR.
* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: jakub at gcc dot gnu.org @ 2012-06-15 9:04 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
--- Comment #17 from Jakub Jelinek <jakub at gcc dot gnu.org> 2012-06-15 09:03:04 UTC ---
This started with http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=173856
The current cost model is seriously insufficient.
* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: rth at gcc dot gnu.org @ 2012-06-15 21:05 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
--- Comment #18 from Richard Henderson <rth at gcc dot gnu.org> 2012-06-15 21:04:49 UTC ---
See comments in http://gcc.gnu.org/ml/gcc-patches/2012-06/msg01081.html
It's not the vectorization costing, as previously suggested.
* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: rguenth at gcc dot gnu.org @ 2012-08-10 9:43 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
Richard Guenther <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|--- |4.7.2
* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: matt at use dot net @ 2012-08-14 17:26 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
--- Comment #19 from Matt Hargett <matt at use dot net> 2012-08-14 17:25:40 UTC ---
Does this mean there will be a fix for this regression committed for 4.7.2? If
there's a patch I can test ahead of time, please let me know. Thanks!
* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: matt at use dot net @ 2012-08-20 23:53 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
--- Comment #20 from Matt Hargett <matt at use dot net> 2012-08-20 23:52:31 UTC ---
Some additional information:
Compared to LLVM 3.1 with -O3, GCC 4.7 is twice as slow on these benchmarks.
LLVM even outperforms GCC 4.1, which previously had the best result. We are
very eager to hear about any resolution for this major regression in 4.7 so we
can deploy it. Even a return to GCC 4.1 performance levels would be fine.
Thanks!
* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: jakub at gcc dot gnu.org @ 2012-09-20 10:27 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|4.7.2 |4.7.3
--- Comment #21 from Jakub Jelinek <jakub at gcc dot gnu.org> 2012-09-20 10:21:07 UTC ---
GCC 4.7.2 has been released.
* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: rth at gcc dot gnu.org @ 2012-11-29 21:17 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
Richard Henderson <rth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|ASSIGNED |NEW
AssignedTo|rth at gcc dot gnu.org |unassigned at gcc dot gnu.org
--- Comment #22 from Richard Henderson <rth at gcc dot gnu.org> 2012-11-29 21:17:05 UTC ---
Needs long-term work in pre-vectorization folding.
* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: rguenth at gcc dot gnu.org @ 2012-12-03 15:27 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Priority|P3 |P2
* [Bug rtl-optimization/53533] [4.7/4.8/4.9 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: rguenth at gcc dot gnu.org @ 2013-04-11 8:00 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|4.7.3 |4.7.4
--- Comment #23 from Richard Biener <rguenth at gcc dot gnu.org> 2013-04-11 07:59:38 UTC ---
GCC 4.7.3 is being released, adjusting target milestone.
* [Bug rtl-optimization/53533] [4.7/4.8/4.9/4.10 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: rguenth at gcc dot gnu.org @ 2014-06-12 13:45 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|4.7.4 |4.8.4
--- Comment #24 from Richard Biener <rguenth at gcc dot gnu.org> ---
The 4.7 branch is being closed, moving target milestone to 4.8.4.
* [Bug rtl-optimization/53533] [4.8/4.9/5 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: jakub at gcc dot gnu.org @ 2014-12-19 13:28 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|4.8.4 |4.8.5
--- Comment #25 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 4.8.4 has been released.
* [Bug rtl-optimization/53533] [4.8/4.9/5/6 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: trippels at gcc dot gnu.org @ 2015-05-03 13:00 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
Markus Trippelsdorf <trippels at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Last reconfirmed|2012-05-31 00:00:00 |2015-5-3
CC| |trippels at gcc dot gnu.org
--- Comment #26 from Markus Trippelsdorf <trippels at gcc dot gnu.org> ---
For gcc-5 and gcc-6 there is an additional 50% slowdown:
% g++ -O3 loop_unroll.ii -o loop_unroll
% time ./loop_unroll 10000
./loop_unroll 10000
test number  description  absolute time  operations per second  ratio with test0
0 "int32_t for loop unroll 1" 0.14 sec 552.30 M 1.00
1 "int32_t for loop unroll 2" 0.11 sec 699.49 M 0.79
2 "int32_t for loop unroll 3" 0.14 sec 566.56 M 0.97
3 "int32_t for loop unroll 4" 0.15 sec 532.87 M 1.04
4 "int32_t for loop unroll 5" 0.10 sec 784.70 M 0.70
5 "int32_t for loop unroll 6" 0.09 sec 887.12 M 0.62
6 "int32_t for loop unroll 7" 0.09 sec 913.50 M 0.60
7 "int32_t for loop unroll 8" 0.08 sec 986.45 M 0.56
8 "int32_t for loop unroll 9" 0.23 sec 346.06 M 1.60
9 "int32_t for loop unroll 10" 0.08 sec 1040.06 M 0.53
10 "int32_t for loop unroll 11" 0.23 sec 348.02 M 1.59
11 "int32_t for loop unroll 12" 0.23 sec 353.38 M 1.56
12 "int32_t for loop unroll 13" 0.24 sec 338.32 M 1.63
13 "int32_t for loop unroll 14" 0.24 sec 332.32 M 1.66
14 "int32_t for loop unroll 15" 0.25 sec 321.15 M 1.72
15 "int32_t for loop unroll 16" 0.25 sec 318.23 M 1.74
16 "int32_t for loop unroll 17" 0.24 sec 329.43 M 1.68
17 "int32_t for loop unroll 18" 0.25 sec 321.34 M 1.72
18 "int32_t for loop unroll 19" 0.25 sec 314.53 M 1.76
19 "int32_t for loop unroll 20" 0.25 sec 325.33 M 1.70
20 "int32_t for loop unroll 21" 0.25 sec 323.67 M 1.71
21 "int32_t for loop unroll 22" 0.25 sec 316.85 M 1.74
22 "int32_t for loop unroll 23" 0.25 sec 323.51 M 1.71
23 "int32_t for loop unroll 24" 0.06 sec 1257.94 M 0.44
24 "int32_t for loop unroll 25" 0.24 sec 327.77 M 1.69
25 "int32_t for loop unroll 26" 0.06 sec 1310.44 M 0.42
26 "int32_t for loop unroll 27" 0.07 sec 1072.85 M 0.51
27 "int32_t for loop unroll 28" 0.28 sec 283.44 M 1.95
28 "int32_t for loop unroll 29" 0.30 sec 267.96 M 2.06
29 "int32_t for loop unroll 30" 0.31 sec 258.88 M 2.13
30 "int32_t for loop unroll 31" 0.06 sec 1337.64 M 0.41
31 "int32_t for loop unroll 32" 0.06 sec 1315.10 M 0.42
Total absolute time for int32_t for loop unrolling: 5.85 sec
...
./loop_unroll 10000 41.43s user 0.00s system 100% cpu 41.426 total
==============================================================================
% /usr/x86_64-pc-linux-gnu/gcc-bin/4.9.2/g++ -O3 loop_unroll.ii -o loop_unroll
% time ./loop_unroll 10000
./loop_unroll 10000
test number  description  absolute time  operations per second  ratio with test0
0 "int32_t for loop unroll 1" 0.14 sec 582.13 M 1.00
1 "int32_t for loop unroll 2" 0.13 sec 625.41 M 0.93
2 "int32_t for loop unroll 3" 0.13 sec 635.76 M 0.92
3 "int32_t for loop unroll 4" 0.13 sec 625.41 M 0.93
4 "int32_t for loop unroll 5" 0.12 sec 640.96 M 0.91
5 "int32_t for loop unroll 6" 0.09 sec 888.11 M 0.66
6 "int32_t for loop unroll 7" 0.09 sec 900.10 M 0.65
7 "int32_t for loop unroll 8" 0.10 sec 832.20 M 0.70
8 "int32_t for loop unroll 9" 0.10 sec 834.22 M 0.70
9 "int32_t for loop unroll 10" 0.09 sec 902.04 M 0.65
10 "int32_t for loop unroll 11" 0.10 sec 805.15 M 0.72
11 "int32_t for loop unroll 12" 0.10 sec 823.27 M 0.71
12 "int32_t for loop unroll 13" 0.09 sec 860.51 M 0.68
13 "int32_t for loop unroll 14" 0.11 sec 753.59 M 0.77
14 "int32_t for loop unroll 15" 0.10 sec 781.96 M 0.74
15 "int32_t for loop unroll 16" 0.09 sec 858.76 M 0.68
16 "int32_t for loop unroll 17" 0.09 sec 846.91 M 0.69
17 "int32_t for loop unroll 18" 0.10 sec 783.19 M 0.74
18 "int32_t for loop unroll 19" 0.10 sec 794.81 M 0.73
19 "int32_t for loop unroll 20" 0.10 sec 806.70 M 0.72
20 "int32_t for loop unroll 21" 0.10 sec 823.82 M 0.71
21 "int32_t for loop unroll 22" 0.09 sec 851.74 M 0.68
22 "int32_t for loop unroll 23" 0.10 sec 792.87 M 0.73
23 "int32_t for loop unroll 24" 0.10 sec 809.32 M 0.72
24 "int32_t for loop unroll 25" 0.10 sec 832.18 M 0.70
25 "int32_t for loop unroll 26" 0.10 sec 781.11 M 0.75
26 "int32_t for loop unroll 27" 0.10 sec 792.40 M 0.73
27 "int32_t for loop unroll 28" 0.10 sec 817.22 M 0.71
28 "int32_t for loop unroll 29" 0.10 sec 826.40 M 0.70
29 "int32_t for loop unroll 30" 0.10 sec 803.83 M 0.72
30 "int32_t for loop unroll 31" 0.10 sec 803.48 M 0.72
31 "int32_t for loop unroll 32" 0.10 sec 796.88 M 0.73
Total absolute time for int32_t for loop unrolling: 3.28 sec
...
./loop_unroll 10000 22.75s user 0.00s system 100% cpu 22.746 total
clang:
./loop_unroll 10000 12.93s user 0.00s system 100% cpu 12.933 total
icpc (5* faster than gcc-5):
./loop_unroll 10000 8.38s user 0.00s system 99% cpu 8.382 total
* [Bug rtl-optimization/53533] [4.8/4.9/5/6 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: trippels at gcc dot gnu.org @ 2015-05-03 13:01 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
--- Comment #27 from Markus Trippelsdorf <trippels at gcc dot gnu.org> ---
Created attachment 35448
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35448&action=edit
testcase
* [Bug rtl-optimization/53533] [4.8/4.9/5/6 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: maltsevm at gmail dot com @ 2015-05-04 14:46 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
Mikhail Maltsev <maltsevm at gmail dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |maltsevm at gmail dot com
--- Comment #28 from Mikhail Maltsev <maltsevm at gmail dot com> ---
Created attachment 35455
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35455&action=edit
testcase, inlining
This testcase marks some functions with __attribute__((always_inline/noinline))
when -DINLINE_MANUALLY is defined.
* [Bug rtl-optimization/53533] [4.8/4.9/5/6 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: maltsevm at gmail dot com @ 2015-05-04 15:00 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
--- Comment #29 from Mikhail Maltsev <maltsevm at gmail dot com> ---
Results for attached testcase:
Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (Haswell)
g++ -O3 -march=native -mtune=native
10000 iterations
Clang 3.7
Total absolute time for int32_t for loop unrolling: 0.99 sec
Total absolute time for int32_t do loop unrolling: 1.00 sec
Total absolute time for double for loop unrolling: 1.37 sec
Total absolute time for double do loop unrolling: 1.37 sec
GCC 4.7.4
Total absolute time for int32_t for loop unrolling: 5.88 sec
Total absolute time for int32_t do loop unrolling: 7.57 sec
Total absolute time for double for loop unrolling: 2.29 sec
Total absolute time for double do loop unrolling: 2.45 sec
GCC 4.8.4
Total absolute time for int32_t for loop unrolling: 3.12 sec
Total absolute time for int32_t do loop unrolling: 3.29 sec
Total absolute time for double for loop unrolling: 1.13 sec
Total absolute time for double do loop unrolling: 1.14 sec
GCC 4.9.2
Total absolute time for int32_t for loop unrolling: 3.02 sec
Total absolute time for int32_t do loop unrolling: 3.29 sec
Total absolute time for double for loop unrolling: 1.10 sec
Total absolute time for double do loop unrolling: 1.13 sec
GCC 6
Total absolute time for int32_t for loop unrolling: 5.95 sec
Total absolute time for int32_t do loop unrolling: 6.95 sec
Total absolute time for double for loop unrolling: 2.39 sec
Total absolute time for double do loop unrolling: 2.39 sec
g++ -DINLINE_MANUALLY -O3 -march=native -mtune=native
50000 iterations
Clang 3.7
Total absolute time for int32_t for loop unrolling: 2.43 sec
Total absolute time for int32_t do loop unrolling: 2.32 sec
Total absolute time for double for loop unrolling: 6.38 sec
Total absolute time for double do loop unrolling: 6.38 sec
GCC 4.9.2
Total absolute time for int32_t for loop unrolling: 10.17 sec
Total absolute time for int32_t do loop unrolling: 10.16 sec
Total absolute time for double for loop unrolling: 3.89 sec
Total absolute time for double do loop unrolling: 3.90 sec
GCC 6
Total absolute time for int32_t for loop unrolling: 10.10 sec
Total absolute time for int32_t do loop unrolling: 10.12 sec
Total absolute time for double for loop unrolling: 3.90 sec
Total absolute time for double do loop unrolling: 3.89 sec
g++ -DINLINE_MANUALLY -Ofast -march=native -mtune=native
GCC 6
Total absolute time for int32_t for loop unrolling: 10.11 sec
Total absolute time for int32_t do loop unrolling: 10.11 sec
Total absolute time for double for loop unrolling: 1.14 sec
Total absolute time for double do loop unrolling: 1.15 sec
So, IMHO there is no regression here (at least w.r.t. vectorization). Floating
point loop gets constant-folded, if reassociation is allowed. Also, GCC6 is
able to infer that "for" and "while" tests are semantically equivalent and
unifies them.
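The remark about reassociation explains why the double results only collapse under -Ofast: floating-point addition is not associative, so the compiler may regroup such a chain only when told that is acceptable (-Ofast turns on -ffast-math). A small sketch of the effect follows. The function names and the input values are made up for illustration and are not taken from the benchmark; the volatile temporaries are only there to force each intermediate result to be rounded to double.

```c
#include <assert.h>

/* (a + b) + c, evaluated in source order. */
double sum_in_order(double a, double b, double c)
{
    volatile double t = a + b;   /* round the intermediate to double */
    volatile double r = t + c;
    return r;
}

/* a + (b + c), a regrouping that a reassociating compiler may choose. */
double sum_regrouped(double a, double b, double c)
{
    volatile double t = b + c;
    volatile double r = a + t;
    return r;
}

/* Near 1e16 adjacent doubles are 2 apart, so adding 1.0 twice is lost
   to rounding while adding 2.0 once is not. */
void check_regrouping_differs(void)
{
    assert(sum_in_order(1e16, 1.0, 1.0) == 1e16);
    assert(sum_regrouped(1e16, 1.0, 1.0) == 1e16 + 2.0);
}
```

Because the two groupings can give different answers, folding a chain of floating-point constants is only legal for the compiler once value-changing transformations are permitted.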
* [Bug rtl-optimization/53533] [4.8/4.9/5/6 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: rguenth at gcc dot gnu.org @ 2015-06-23 8:22 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|4.8.5 |4.9.3
--- Comment #30 from Richard Biener <rguenth at gcc dot gnu.org> ---
The gcc-4_8-branch is being closed, re-targeting regressions to 4.9.3.
* [Bug rtl-optimization/53533] [4.9/5/6 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: jakub at gcc dot gnu.org @ 2015-06-26 19:58 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
--- Comment #31 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 4.9.3 has been released.
* [Bug rtl-optimization/53533] [4.9/5/6 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: jakub at gcc dot gnu.org @ 2015-06-26 20:29 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|4.9.3 |4.9.4
* [Bug rtl-optimization/53533] [8/9/10/11 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: rguenth at gcc dot gnu.org @ 2021-02-23 12:24 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Last reconfirmed|2015-05-03 00:00:00 |2021-2-23
--- Comment #41 from Richard Biener <rguenth at gcc dot gnu.org> ---
Re-confirmed.
* [Bug rtl-optimization/53533] [9/10/11/12 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: jakub at gcc dot gnu.org @ 2021-05-14 9:46 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|8.5 |9.4
--- Comment #42 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 8 branch is being closed.
* [Bug rtl-optimization/53533] [9/10/11/12 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: rguenth at gcc dot gnu.org @ 2021-06-01 8:05 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|9.4 |9.5
--- Comment #43 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
* [Bug rtl-optimization/53533] [10/11/12/13 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: rguenth at gcc dot gnu.org @ 2022-05-27 9:34 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|9.5 |10.4
--- Comment #44 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 9 branch is being closed
* [Bug rtl-optimization/53533] [10/11/12/13 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: crazylht at gmail dot com @ 2022-05-30 6:40 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
--- Comment #45 from Hongtao.liu <crazylht at gmail dot com> ---
A reduced testcase.
int a[256];
int b[256];
void foo (void)
{
int i;
for (i = 0; i < 256; ++i)
{
int tmp = a[i] + 12345;
tmp *= 914237;
tmp += 12332;
tmp *= 914237;
tmp += 12332;
tmp *= 914237;
tmp -= 13;
tmp *= 8000;
b[i] = tmp;
}
}
GCC now simplifies pmulld to pslld + padd + psub, and the vectorizer cost model
looks fine. For the scalar version, however, pass_combine further optimizes
4 * mult + 3 * add down to 1 * mult + 2 * add, which the vectorizer does not
take into account. The vectorized version is not simplified later.
The scalar loop ends up as:
mov eax, DWORD PTR a[rdx]
add rdx, 4
add eax, 12345
imul eax, eax, -1564285888
sub eax, 333519936
mov DWORD PTR b[rdx-4], eax
cmp rdx, 1024
jne .L2
I'm wondering whether GIMPLE could also simplify
tmp *= 914237;
tmp += 12332;
tmp *= 914237;
tmp += 12332;
tmp *= 914237;
tmp -= 13;
tmp *= 8000;
to
tmp *= -1564285888;
tmp -= 333519936;
refer to https://godbolt.org/z/qYMYMTxEY
Then the vectorized code would be more optimal.
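The folded constants can be checked independently of the compiler. The
following is a minimal sketch, assuming 32-bit wrapping arithmetic (modelled
with uint32_t so that the wraparound is well defined in C). The function names
are made up for illustration and are not part of the testcase.

```c
#include <assert.h>
#include <stdint.h>

/* The chain from the reduced testcase, applied to tmp = a[i] + 12345.
   uint32_t is used so that wraparound is well defined.  */
static uint32_t chain_steps (uint32_t tmp)
{
  tmp *= 914237u;
  tmp += 12332u;
  tmp *= 914237u;
  tmp += 12332u;
  tmp *= 914237u;
  tmp -= 13u;
  tmp *= 8000u;
  return tmp;
}

/* The folded form: one multiply and one subtract.
   2730681408u is -1564285888 modulo 2^32.  */
static uint32_t chain_folded (uint32_t tmp)
{
  tmp *= 2730681408u;
  tmp -= 333519936u;
  return tmp;
}

/* Hypothetical helper: compare both forms on a spread of inputs.  */
static void check_chains (void)
{
  uint32_t x = 0;
  for (int i = 0; i < 100000; ++i)
    {
      assert (chain_steps (x) == chain_folded (x));
      x += 2654435761u;  /* arbitrary odd stride to spread the inputs */
    }
}
```

Calling check_chains () exercises both forms on 100000 inputs. The identity
holds for every 32-bit input because both sides are the same polynomial
modulo 2^32.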
* [Bug rtl-optimization/53533] [10/11/12/13 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: rguenther at suse dot de @ 2022-05-30 8:57 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
--- Comment #46 from rguenther at suse dot de <rguenther at suse dot de> ---
On Mon, 30 May 2022, crazylht at gmail dot com wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
>
> --- Comment #45 from Hongtao.liu <crazylht at gmail dot com> ---
> A reduced testcase.
>
> int a[256];
> int b[256];
>
> void foo (void)
> {
> int i;
> for (i = 0; i < 256; ++i)
> {
> int tmp = a[i] + 12345;
> tmp *= 914237;
> tmp += 12332;
> tmp *= 914237;
> tmp += 12332;
> tmp *= 914237;
> tmp -= 13;
> tmp *= 8000;
> b[i] = tmp;
> }
> }
>
> GCC now simplifies pmulld to pslld + padd + psub. The vectorizer cost model
> looks fine, but the scalar version is additionally optimized in pass_combine
> from 4 * mult + 3 * add to 1 * mult + 2 * add, which the vectorizer does not
> take into account. The vectorized version is not simplified later.
>
> mov eax, DWORD PTR a[rdx]
> add rdx, 4
> add eax, 12345
> imul eax, eax, -1564285888
> sub eax, 333519936
> mov DWORD PTR b[rdx-4], eax
> cmp rdx, 1024
> jne .L2
>
>
> I'm wondering whether GIMPLE could also simplify
>
> tmp *= 914237;
> tmp += 12332;
> tmp *= 914237;
> tmp += 12332;
> tmp *= 914237;
> tmp -= 13;
> tmp *= 8000;
>
> to
> tmp *= -1564285888;
> tmp -= 333519936;
>
> See https://godbolt.org/z/qYMYMTxEY
>
> Then the vectorized code would be more optimal.
The issue is that the re-association pass doesn't handle operations
with undefined overflow behavior; we do have duplicate bug reports
for this.
At the RTL level, simplify-rtx (or the variants used by combine) likely
has only limited support for vector operations.
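A small example shows why undefined overflow gets in the way of
re-association. This sketch is an illustration only, not GCC's logic: it
evaluates two groupings of the same three-term sum in 64-bit arithmetic and
reports whether every intermediate result stays within the range of int. It
assumes int is narrower than 64 bits, and the helper names are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Nonzero if v is representable as int.  */
static int fits_int (int64_t v)
{
  return v >= INT_MIN && v <= INT_MAX;
}

/* (a + b) + c: nonzero if every intermediate sum fits in int.  */
static int left_grouping_ok (int a, int b, int c)
{
  int64_t t = (int64_t) a + b;
  return fits_int (t) && fits_int (t + c);
}

/* (a + c) + b: the same sum mathematically, after re-association.  */
static int swapped_grouping_ok (int a, int b, int c)
{
  int64_t t = (int64_t) a + c;
  return fits_int (t) && fits_int (t + b);
}

/* Usage: the source grouping is fine, the re-associated one is not.  */
static void reassoc_example (void)
{
  assert (left_grouping_ok (INT_MAX, -1, 1));
  assert (!swapped_grouping_ok (INT_MAX, -1, 1));
}
```

For a = INT_MAX, b = -1, c = 1 the first grouping stays in range while the
second overflows in its first addition, so reordering signed operations can
introduce undefined behavior that the source did not have. With wrapping
semantics (unsigned types, or -fwrapv) both groupings give the same result.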
* [Bug rtl-optimization/53533] [10/11/12/13 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: crazylht at gmail dot com @ 2022-05-30 9:10 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
--- Comment #47 from Hongtao.liu <crazylht at gmail dot com> ---
>
> The issue is that the re-association pass doesn't handle operations
> with undefined overflow behavior; we do have duplicate bug reports
> for this.
>
I saw the following in match.pd:
/* Combine successive multiplications.  Similar to above, but handling
   overflow is different.  */
(simplify
 (mult (mult @0 INTEGER_CST@1) INTEGER_CST@2)
 (with {
   wi::overflow_type overflow;
   wide_int mul = wi::mul (wi::to_wide (@1), wi::to_wide (@2),
                           TYPE_SIGN (type), &overflow);
  }
  /* Skip folding on overflow: the only special case is @1 * @2 == -INT_MIN,
     otherwise undefined overflow implies that @0 must be zero.  */
  (if (!overflow || TYPE_OVERFLOW_WRAPS (type))
   (mult @0 { wide_int_to_tree (type, mul); }))))
Can it be extended to (mult (plus_minus (mult @0 INTEGER_CST@1) INTEGER_CST@3)
INTEGER_CST@2), so that we can at least handle it under -fwrapv?
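The decision made by the existing pattern can be mimicked in plain C. The
sketch below is not the match.pd implementation; it only models the fold for
32-bit constants. __builtin_mul_overflow is a GCC/Clang builtin, type_wraps
stands in for TYPE_OVERFLOW_WRAPS (type), and the struct and function names
are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Result of trying to fold (x * c1) * c2 into x * (c1 * c2).  */
struct fold_result
{
  bool folded;     /* true if the fold is allowed */
  int32_t factor;  /* combined constant, meaningful only when folded */
};

static struct fold_result
fold_mult_mult (int32_t c1, int32_t c2, bool type_wraps)
{
  struct fold_result r = { false, 0 };
  int32_t product;
  /* On overflow the builtin stores the wrapped product and returns true.  */
  bool overflow = __builtin_mul_overflow (c1, c2, &product);
  if (!overflow || type_wraps)
    {
      r.folded = true;
      r.factor = product;
    }
  return r;
}

/* Usage: 914237 * 914237 does not fit in 32 bits, so the fold is only
   taken when the type wraps.  */
static void fold_example (void)
{
  assert (fold_mult_mult (3, 5, false).folded);
  assert (!fold_mult_mult (914237, 914237, false).folded);
  assert (fold_mult_mult (914237, 914237, true).folded);
}
```

The proposed extension would apply the same gate to the constants produced
when an addition or subtraction sits between the two multiplications.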
* [Bug rtl-optimization/53533] [10/11/12/13 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: rguenther at suse dot de @ 2022-05-30 9:14 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
--- Comment #48 from rguenther at suse dot de <rguenther at suse dot de> ---
On Mon, 30 May 2022, crazylht at gmail dot com wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
>
> --- Comment #47 from Hongtao.liu <crazylht at gmail dot com> ---
>
> >
> > The issue is that the re-association pass doesn't handle operations
> > with undefined overflow behavior; we do have duplicate bug reports
> > for this.
> >
>
> I saw the following in match.pd:
>
> /* Combine successive multiplications.  Similar to above, but handling
>    overflow is different.  */
> (simplify
>  (mult (mult @0 INTEGER_CST@1) INTEGER_CST@2)
>  (with {
>    wi::overflow_type overflow;
>    wide_int mul = wi::mul (wi::to_wide (@1), wi::to_wide (@2),
>                            TYPE_SIGN (type), &overflow);
>   }
>   /* Skip folding on overflow: the only special case is @1 * @2 == -INT_MIN,
>      otherwise undefined overflow implies that @0 must be zero.  */
>   (if (!overflow || TYPE_OVERFLOW_WRAPS (type))
>    (mult @0 { wide_int_to_tree (type, mul); }))))
>
> Can it be extended to (mult (plus_minus (mult @0 INTEGER_CST@1) INTEGER_CST@3)
> INTEGER_CST@2), so that we can at least handle it under -fwrapv?
With -fwrapv the reassoc pass might do this already (not sure with
mixing multiplication and addition, you'd have to try). But sure,
we could add a pattern for the above (with appropriate single-use
handling).
* [Bug rtl-optimization/53533] [10/11/12/13 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: cvs-commit at gcc dot gnu.org @ 2022-06-16 1:29 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
--- Comment #49 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:
https://gcc.gnu.org/g:1089d083117f28f3518f5ec3c7a153236cb92334
commit r13-1126-g1089d083117f28f3518f5ec3c7a153236cb92334
Author: liuhongt <hongtao.liu@intel.com>
Date: Tue May 31 17:13:21 2022 +0800
Simplify (B * v + C) * D -> BD* v + CD when B,C,D are all INTEGER_CST.
Similar for (v + B) * C + D -> C * v + BCD.
Don't simplify it when there's overflow and overflow is UB for type v.
gcc/ChangeLog:
PR tree-optimization/53533
* match.pd: Simplify (B * v + C) * D -> BD * v + CD and
(v + B) * C + D -> C * v + BCD when B,C,D are all INTEGER_CST,
and there's no overflow or !TYPE_OVERFLOW_UNDEFINED.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr53533-1.c: New test.
* gcc.target/i386/pr53533-2.c: New test.
* gcc.target/i386/pr53533-3.c: New test.
* gcc.target/i386/pr53533-4.c: New test.
* gcc.target/i386/pr53533-5.c: New test.
* gcc.dg/vect/slp-11a.c: Adjust testcase.
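Both rewrites in this commit follow from distributing the outer
multiplication over the inner sum. The sketch below checks only the algebra,
using uint32_t so that wrapping is well defined; the no-overflow condition
for signed types is not modelled. The constant term of the second rule is
taken to be B * C + D, which is what the distribution gives, and the function
names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* (B * v + C) * D  versus  (B * D) * v + (C * D).  */
static uint32_t rule1_original (uint32_t v, uint32_t b, uint32_t c, uint32_t d)
{
  return (b * v + c) * d;
}

static uint32_t rule1_folded (uint32_t v, uint32_t b, uint32_t c, uint32_t d)
{
  return (b * d) * v + c * d;
}

/* (v + B) * C + D  versus  C * v + (B * C + D).  */
static uint32_t rule2_original (uint32_t v, uint32_t b, uint32_t c, uint32_t d)
{
  return (v + b) * c + d;
}

static uint32_t rule2_folded (uint32_t v, uint32_t b, uint32_t c, uint32_t d)
{
  return c * v + (b * c + d);
}

/* Usage: try the constants from the reduced testcase on a few inputs.  */
static void rules_example (void)
{
  for (uint32_t v = 0; v < 1000; ++v)
    {
      assert (rule1_original (v, 914237u, 12332u, 914237u)
              == rule1_folded (v, 914237u, 12332u, 914237u));
      assert (rule2_original (v, 12345u, 914237u, 12332u)
              == rule2_folded (v, 12345u, 914237u, 12332u));
    }
}
```

In the folded forms the parenthesized constants can be computed at compile
time, which leaves one multiplication and one addition per element.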
* [Bug rtl-optimization/53533] [10/11/12/13 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: crazylht at gmail dot com @ 2022-06-16 2:31 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
--- Comment #50 from Hongtao.liu <crazylht at gmail dot com> ---
>
> At the RTL level, simplify-rtx (or the variants used by combine) likely
> has only limited support for vector operations.
The instruction sequence window (more than 20 shift instructions) is too big
for combine, so it is hard to fix this in RTL.
* [Bug rtl-optimization/53533] [10/11/12/13 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: jakub at gcc dot gnu.org @ 2022-06-28 10:30 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|10.4 |10.5
--- Comment #51 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
* [Bug rtl-optimization/53533] [11/12/13/14 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: rguenth at gcc dot gnu.org @ 2023-07-07 10:29 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|10.5 |11.5
--- Comment #52 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 10 branch is being closed.
* [Bug rtl-optimization/53533] [12/13/14/15 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
From: rguenth at gcc dot gnu.org @ 2024-07-19 12:55 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|11.5 |12.5
--- Comment #53 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 11 branch is being closed.