[Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6

public inbox for gcc-bugs@sourceware.org

* [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6
@ 2012-05-31  0:55 matt at use dot net
  2012-05-31  0:58 ` [Bug middle-end/53533] " matt at use dot net
                   ` (45 more replies)
  0 siblings, 46 replies; 47+ messages in thread
From: matt at use dot net @ 2012-05-31  0:55 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

             Bug #: 53533
           Summary: [4.7 regression] loop unrolling as measured by Adobe's
                    C++Benchmark is twice as slow versus 4.4-4.6
    Classification: Unclassified
           Product: gcc
           Version: 4.7.1
            Status: UNCONFIRMED
          Severity: major
          Priority: P3
         Component: middle-end
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: matt@use.net


Comparing GCC versions, branches, and optimization levels on Adobe's
C++Benchmark suite, I discovered that 4.7 has a major regression with their
loop unrolling tests. I have captured the data here:

https://docs.google.com/spreadsheet/ccc?key=0Amu19eOay72HdE1xYVRPUTFYWU1TSld3Y2FEOEt5LXc

All compilers were fresh checkouts by me from their trunk revisions as of a few
days ago. My configure command line:
/u/mhargett/src/gcc-4_7-branch/configure --program-suffix=-4.7
--prefix=/u/mhargett --enable-languages=c,c++,lto --enable-lto
--with-build-config=bootstrap-lto --with-fpmath=sse --disable-libmudflap
--disable-libssp --enable-build-with-cxx --enable-gold=yes
--with-mpc=/u/mhargett --with-cloog=/u/mhargett/ --with-ppl=/u/mhargett/
--with-gmp=/u/mhargett/ --with-mpfr=/u/mhargett/ --enable-cloog-backend=isl
--disable-cloog-version-check CC=gcc-4.7 CXX=g++-4.7

The 4.6 and 4.7 versions were both built against the same Cloog, ppl, mpfr,
etc.

Going from "-O3 -floop-block -floop-strip-mine -floop-interchange
-mtune=amdfam10" to "-Ofast -funsafe-loop-optimizations -funroll-loops
-floop-block -floop-strip-mine -floop-interchange" didn't help.

Attached is a tarball of the 4.6 and 4.7 -O3 optimized builds. 'make report'
re-runs the tests, 'make clean && make' rebuilds.



* [Bug middle-end/53533] [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
@ 2012-05-31  0:58 ` matt at use dot net
  2012-05-31  9:59 ` rguenth at gcc dot gnu.org
                   ` (44 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: matt at use dot net @ 2012-05-31  0:58 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

--- Comment #1 from Matt Hargett <matt at use dot net> 2012-05-31 00:55:36 UTC ---
Created attachment 27526
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=27526
tarball containing buildable sources and binaries that demonstrate the severe
performance regression on amdfam10



* [Bug middle-end/53533] [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
  2012-05-31  0:58 ` [Bug middle-end/53533] " matt at use dot net
@ 2012-05-31  9:59 ` rguenth at gcc dot gnu.org
  2012-06-11 19:56 ` matt at use dot net
                   ` (43 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: rguenth at gcc dot gnu.org @ 2012-05-31  9:59 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |WAITING
   Last reconfirmed|                            |2012-05-31
     Ever Confirmed|0                           |1

--- Comment #2 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-05-31 09:21:09 UTC ---
Please do not use any of the Graphite optimization flags.  Can you produce a
simple testcase please?



* [Bug middle-end/53533] [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
  2012-05-31  0:58 ` [Bug middle-end/53533] " matt at use dot net
  2012-05-31  9:59 ` rguenth at gcc dot gnu.org
@ 2012-06-11 19:56 ` matt at use dot net
  2012-06-11 19:57 ` matt at use dot net
                   ` (42 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: matt at use dot net @ 2012-06-11 19:56 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

--- Comment #3 from Matt Hargett <matt at use dot net> 2012-06-11 19:56:14 UTC ---
Created attachment 27603
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=27603
ZIP with pre-processed shorter example, callgrind output, and smaller binaries



* [Bug middle-end/53533] [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (2 preceding siblings ...)
  2012-06-11 19:56 ` matt at use dot net
@ 2012-06-11 19:57 ` matt at use dot net
  2012-06-11 20:02 ` matt at use dot net
                   ` (41 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: matt at use dot net @ 2012-06-11 19:57 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

--- Comment #4 from Matt Hargett <matt at use dot net> 2012-06-11 19:57:12 UTC ---
Created attachment 27604
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=27604
shorter source example, ~150 lines w/o comments



* [Bug middle-end/53533] [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (3 preceding siblings ...)
  2012-06-11 19:57 ` matt at use dot net
@ 2012-06-11 20:02 ` matt at use dot net
  2012-06-12  9:54 ` [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark rguenth at gcc dot gnu.org
                   ` (40 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: matt at use dot net @ 2012-06-11 20:02 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

--- Comment #5 from Matt Hargett <matt at use dot net> 2012-06-11 20:02:41 UTC ---
Got rid of the Graphite options; it made no difference. I reduced the original
test from the suite and attached its source, preprocessor output from 4.6 and
4.7 (no major difference), and callgrind output. To keep things simple, I'm just
using -O3 and -fwhole-program.

According to callgrind, 4.7's instruction references went up by 60% and D1
misses went up by 15% at -O3 versus 4.6 at -O3.

Let me know if you need any more information to continue triaging.

Thanks!



* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (4 preceding siblings ...)
  2012-06-11 20:02 ` matt at use dot net
@ 2012-06-12  9:54 ` rguenth at gcc dot gnu.org
  2012-06-12 10:12 ` rguenth at gcc dot gnu.org
                   ` (39 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: rguenth at gcc dot gnu.org @ 2012-06-12  9:54 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |x86_64-*-*
             Status|WAITING                     |NEW
      Known to work|                            |4.6.3
           Keywords|                            |missed-optimization
          Component|middle-end                  |rtl-optimization
                 CC|                            |jakub at gcc dot gnu.org,
                   |                            |uros at gcc dot gnu.org
            Summary|[4.7 regression] loop       |[4.7/4.8 regression]
                   |unrolling as measured by    |vectorization causes loop
                   |Adobe's C++Benchmark is     |unrolling test slowdown as
                   |twice as slow versus        |measured by Adobe's
                   |4.4-4.6                     |C++Benchmark
      Known to fail|                            |4.7.1, 4.8.0
           Severity|major                       |normal

--- Comment #6 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-06-12 09:54:02 UTC ---
Ok, it seems to me that this has template-metaprogramming loop unrolling.  With
GCC 4.7 we unroll and vectorize all loops, for example unroll factor 8 looks
like

<bb 50>:
  # vect_var_.941_3474 = PHI <vect_var_.941_3472(50), {0, 0, 0, 0}(64)>
  # vect_var_.941_3473 = PHI <vect_var_.941_3471(50), {0, 0, 0, 0}(64)>
  # ivtmp.1325_970 = PHI <ivtmp.1325_812(50), ivtmp.1325_813(64)>
  D.9934_819 = (void *) ivtmp.1325_970;
  vect_var_.918_323 = MEM[base: D.9934_819, offset: 0B];
  vect_var_.919_325 = MEM[base: D.9934_819, offset: 16B];
  vect_var_.920_328 = vect_var_.918_323 + { 12345, 12345, 12345, 12345 };
  vect_var_.920_330 = vect_var_.919_325 + { 12345, 12345, 12345, 12345 };
  vect_var_.923_480 = vect_var_.920_328 * { 914237, 914237, 914237, 914237 };
  vect_var_.923_895 = vect_var_.920_330 * { 914237, 914237, 914237, 914237 };
  vect_var_.926_231 = vect_var_.923_480 + { 12332, 12332, 12332, 12332 };
  vect_var_.926_232 = vect_var_.923_895 + { 12332, 12332, 12332, 12332 };
  vect_var_.929_235 = vect_var_.926_231 * { 914237, 914237, 914237, 914237 };
  vect_var_.929_236 = vect_var_.926_232 * { 914237, 914237, 914237, 914237 };
  vect_var_.932_239 = vect_var_.929_235 + { 12332, 12332, 12332, 12332 };
  vect_var_.932_240 = vect_var_.929_236 + { 12332, 12332, 12332, 12332 };
  vect_var_.935_113 = vect_var_.932_239 * { 914237, 914237, 914237, 914237 };
  vect_var_.935_247 = vect_var_.932_240 * { 914237, 914237, 914237, 914237 };
  vect_var_.938_582 = vect_var_.935_113 + { -13, -13, -13, -13 };
  vect_var_.938_839 = vect_var_.935_247 + { -13, -13, -13, -13 };
  vect_var_.941_3472 = vect_var_.938_582 + vect_var_.941_3474;
  vect_var_.941_3471 = vect_var_.938_839 + vect_var_.941_3473;
  ivtmp.1325_812 = ivtmp.1325_970 + 32;
  if (ivtmp.1325_812 != D.9937_388)
    goto <bb 50>;
  else
    goto <bb 51>;

<bb 51>:
  # vect_var_.941_3468 = PHI <vect_var_.941_3472(50)>
  # vect_var_.941_3467 = PHI <vect_var_.941_3471(50)>
  vect_var_.945_3466 = vect_var_.941_3468 + vect_var_.941_3467;
  vect_var_.946_3465 = vect_var_.945_3466 v>> 64;
  vect_var_.946_3464 = vect_var_.946_3465 + vect_var_.945_3466;
  vect_var_.946_3463 = vect_var_.946_3464 v>> 32;
  vect_var_.946_3462 = vect_var_.946_3463 + vect_var_.946_3464;
  stmp_var_.944_3461 = BIT_FIELD_REF <vect_var_.946_3462, 32, 0>;
  init_value.7_795 = init_value;
  D.8606_796 = (int) init_value.7_795;
  D.8600_797 = D.8606_796 + 12345;
  D.8599_798 = D.8600_797 * 914237;
  D.8602_799 = D.8599_798 + 12332;
  D.8601_800 = D.8602_799 * 914237;
  D.8604_801 = D.8601_800 + 12332;
  D.8603_802 = D.8604_801 * 914237;
  D.8605_803 = D.8603_802 + -13;
  temp_804 = D.8605_803 * 8000;
  if (temp_804 != stmp_var_.944_3461)
    goto <bb 52>;
  else
    goto <bb 53>;


With GCC 4.6, on the other hand, the above loop is not vectorized; only the
(slow) non-unrolled loop is.

<bb 49>:
  # result_622 = PHI <result_704(49), 0(63)>
  # ivtmp.852_1026 = PHI <ivtmp.852_842(49), ivtmp.852_844(63)>
  D.9283_3302 = (void *) ivtmp.852_1026;
  temp_801 = MEM[base: D.9283_3302, offset: 0B];
  D.8366_802 = temp_801 + 12345;
  D.8365_803 = D.8366_802 * 914237;
  D.8368_804 = D.8365_803 + 12332;
  D.8367_805 = D.8368_804 * 914237;
  D.8370_806 = D.8367_805 + 12332;
  D.8369_807 = D.8370_806 * 914237;
  temp_808 = D.8369_807 + -13;
  result_810 = temp_808 + result_622;
  temp_815 = MEM[base: D.9283_3302, offset: 4B];
  D.8381_816 = temp_815 + 12345;
  D.8382_817 = D.8381_816 * 914237;
  D.8378_818 = D.8382_817 + 12332;
  D.8379_819 = D.8378_818 * 914237;
  D.8376_820 = D.8379_819 + 12332;
  D.8377_821 = D.8376_820 * 914237;
  temp_822 = D.8377_821 + -13;
  result_824 = result_810 + temp_822;
  temp_788 = MEM[base: D.9283_3302, offset: 8B];
  D.8351_789 = temp_788 + 12345;
  D.8352_790 = D.8351_789 * 914237;
  D.8348_791 = D.8352_790 + 12332;
  D.8349_792 = D.8348_791 * 914237;
  D.8346_793 = D.8349_792 + 12332;
  D.8347_794 = D.8346_793 * 914237;
  temp_795 = D.8347_794 + -13;
  result_797 = temp_795 + result_824;
  temp_774 = MEM[base: D.9283_3302, offset: 12B];
  D.8333_775 = temp_774 + 12345;
  D.8334_776 = D.8333_775 * 914237;
  D.8330_777 = D.8334_776 + 12332;
  D.8331_778 = D.8330_777 * 914237;
  D.8328_779 = D.8331_778 + 12332;
  D.8329_780 = D.8328_779 * 914237;
  temp_781 = D.8329_780 + -13;
  result_783 = temp_781 + result_797;
  temp_760 = MEM[base: D.9283_3302, offset: 16B];
  D.8315_761 = temp_760 + 12345;
  D.8316_762 = D.8315_761 * 914237;
  D.8312_763 = D.8316_762 + 12332;
  D.8313_764 = D.8312_763 * 914237;
  D.8310_765 = D.8313_764 + 12332;
  D.8311_766 = D.8310_765 * 914237;
  temp_767 = D.8311_766 + -13;
  result_769 = temp_767 + result_783;
  temp_746 = MEM[base: D.9283_3302, offset: 20B];
  D.8297_747 = temp_746 + 12345;
  D.8298_748 = D.8297_747 * 914237;
  D.8294_749 = D.8298_748 + 12332;
  D.8295_750 = D.8294_749 * 914237;
  D.8292_751 = D.8295_750 + 12332;
  D.8293_752 = D.8292_751 * 914237;
  temp_753 = D.8293_752 + -13;
  result_755 = temp_753 + result_769;
  temp_732 = MEM[base: D.9283_3302, offset: 24B];
  D.8279_733 = temp_732 + 12345;
  D.8280_734 = D.8279_733 * 914237;
  D.8276_735 = D.8280_734 + 12332;
  D.8277_736 = D.8276_735 * 914237;
  D.8274_737 = D.8277_736 + 12332;
  D.8275_738 = D.8274_737 * 914237;
  temp_739 = D.8275_738 + -13;
  result_741 = temp_739 + result_755;
  temp_695 = MEM[base: D.9283_3302, offset: 28B];
  D.8246_696 = temp_695 + 12345;
  D.8245_697 = D.8246_696 * 914237;
  D.8248_698 = D.8245_697 + 12332;
  D.8247_699 = D.8248_698 * 914237;
  D.8250_700 = D.8247_699 + 12332;
  D.8249_701 = D.8250_700 * 914237;
  temp_702 = D.8249_701 + -13;
  result_704 = temp_702 + result_741;
  ivtmp.852_842 = ivtmp.852_1026 + 32;
  if (ivtmp.852_842 != D.9292_3369)
    goto <bb 49>;
  else
    goto <bb 50>;

<bb 50>:
  # result_3198 = PHI <result_704(49)>
  init_value.7_825 = init_value;
  D.8393_826 = (int) init_value.7_825;
  D.8387_827 = D.8393_826 + 12345;
  D.8386_828 = D.8387_827 * 914237;
  D.8389_829 = D.8386_828 + 12332;
  D.8388_830 = D.8389_829 * 914237;
  D.8391_831 = D.8388_830 + 12332;
  D.8390_832 = D.8391_831 * 914237;
  D.8392_833 = D.8390_832 + -13;
  temp_834 = D.8392_833 * 8000;
  if (temp_834 != result_3198)
    goto <bb 51>;
  else
    goto <bb 52>;

With -fno-tree-vectorize the performance is the same.  It seems that
vectorization is not profitable here for some reason.  Same behavior
can be observed with GCC 4.8.

I used the preprocessed source for 4.7 from the ZIP file.
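
To make the two dumps easier to compare, here is a rough C sketch of what each
loop shape computes. It is reconstructed only from the constants visible in the
GIMPLE above; the function names are made up for illustration, and uint32_t is
used instead of the benchmark's signed int32_t so that wraparound is well
defined. The array length of 8000 is taken from the "* 8000" in the check and
from the data32+32000 byte bound.

```c
#include <stddef.h>
#include <stdint.h>

#define SIZE 8000 /* 32000 bytes of 32-bit elements */

/* Work done per element, read off the GIMPLE constants. */
static uint32_t step(uint32_t x)
{
    x = (x + 12345u) * 914237u;
    x = (x + 12332u) * 914237u;
    x = (x + 12332u) * 914237u;
    return x - 13u;
}

/* GCC 4.6 shape: scalar body unrolled 8 times, one accumulator chain.
   Assumes n is a multiple of 8; any remainder is ignored. */
uint32_t sum_scalar_unroll8(const uint32_t *data, size_t n)
{
    uint32_t result = 0;
    for (size_t i = 0; i + 8 <= n; i += 8) {
        result += step(data[i + 0]);
        result += step(data[i + 1]);
        result += step(data[i + 2]);
        result += step(data[i + 3]);
        result += step(data[i + 4]);
        result += step(data[i + 5]);
        result += step(data[i + 6]);
        result += step(data[i + 7]);
    }
    return result;
}

/* GCC 4.7 shape: two 4-lane accumulators (offsets 0B and 16B), summed
   lane by lane after the loop. Same assumption on n as above. */
uint32_t sum_two_vector_accumulators(const uint32_t *data, size_t n)
{
    uint32_t acc0[4] = {0, 0, 0, 0};
    uint32_t acc1[4] = {0, 0, 0, 0};
    for (size_t i = 0; i + 8 <= n; i += 8) {
        for (int lane = 0; lane < 4; ++lane) {
            acc0[lane] += step(data[i + lane]);
            acc1[lane] += step(data[i + 4 + lane]);
        }
    }
    uint32_t total = 0;
    for (int lane = 0; lane < 4; ++lane)
        total += acc0[lane] + acc1[lane];
    return total;
}

/* Mirrors the post-loop check: with every element equal to init_value,
   the sum must equal step(init_value) * SIZE. Returns 1 on agreement. */
int check_constant_input(uint32_t init_value)
{
    static uint32_t data[SIZE];
    for (size_t i = 0; i < SIZE; ++i)
        data[i] = init_value;
    uint32_t expected = step(init_value) * (uint32_t)SIZE;
    return sum_scalar_unroll8(data, SIZE) == expected
        && sum_two_vector_accumulators(data, SIZE) == expected;
}
```

Both shapes do the same arithmetic, so any difference in run time comes from
how the three multiplications per element are expanded.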

The generated code is odd, to say the least; the inner loop looks like

        movdqa  .LC6(%rip), %xmm3
        xorl    %ebx, %ebx
        movdqa  .LC7(%rip), %xmm0
        movdqa  .LC8(%rip), %xmm1
        movdqa  .LC9(%rip), %xmm2
        .p2align 4,,10
        .p2align 3
.L51:
        pxor    %xmm6, %xmm6
        movl    $data32, %eax
        movdqa  %xmm6, %xmm7
        .p2align 4,,10
        .p2align 3
.L53:
        movdqa  (%rax), %xmm4
        movdqa  %xmm0, %xmm8
        paddd   %xmm3, %xmm4
        movdqa  %xmm4, %xmm5
        psrldq  $4, %xmm4
        psrldq  $4, %xmm8
        pmuludq %xmm8, %xmm4
        pshufd  $8, %xmm4, %xmm4
        pmuludq %xmm0, %xmm5
        pshufd  $8, %xmm5, %xmm5
        movdqa  %xmm0, %xmm8
        psrldq  $4, %xmm8
        punpckldq       %xmm4, %xmm5
        paddd   %xmm1, %xmm5
        movdqa  %xmm5, %xmm4
        psrldq  $4, %xmm5
        pmuludq %xmm8, %xmm5
        pshufd  $8, %xmm5, %xmm5
        pmuludq %xmm0, %xmm4
        pshufd  $8, %xmm4, %xmm4
        punpckldq       %xmm5, %xmm4
        movdqa  %xmm0, %xmm5
        paddd   %xmm1, %xmm4
        movdqa  %xmm4, %xmm8
        psrldq  $4, %xmm5
        psrldq  $4, %xmm4
        pmuludq %xmm4, %xmm5
        pshufd  $8, %xmm5, %xmm5
        pmuludq %xmm0, %xmm8
        pshufd  $8, %xmm8, %xmm4
        movdqa  %xmm0, %xmm8
        psrldq  $4, %xmm8
        punpckldq       %xmm5, %xmm4
        paddd   %xmm2, %xmm4
        paddd   %xmm4, %xmm7
        movdqa  16(%rax), %xmm4
        addq    $32, %rax
        paddd   %xmm3, %xmm4
        movdqa  %xmm4, %xmm5
        psrldq  $4, %xmm4
        pmuludq %xmm8, %xmm4
        pshufd  $8, %xmm4, %xmm4
        movdqa  %xmm0, %xmm8
        pmuludq %xmm0, %xmm5
        pshufd  $8, %xmm5, %xmm5
        cmpq    $data32+32000, %rax
        psrldq  $4, %xmm8
        punpckldq       %xmm4, %xmm5
        paddd   %xmm1, %xmm5
        movdqa  %xmm5, %xmm4
        psrldq  $4, %xmm5
        pmuludq %xmm8, %xmm5
        pshufd  $8, %xmm5, %xmm5
        pmuludq %xmm0, %xmm4
        pshufd  $8, %xmm4, %xmm4
        punpckldq       %xmm5, %xmm4
        movdqa  %xmm0, %xmm5
        paddd   %xmm1, %xmm4
        movdqa  %xmm4, %xmm8
        psrldq  $4, %xmm5
        psrldq  $4, %xmm4
        pmuludq %xmm4, %xmm5
        pshufd  $8, %xmm5, %xmm5
        pmuludq %xmm0, %xmm8
        pshufd  $8, %xmm8, %xmm4
        punpckldq       %xmm5, %xmm4
        paddd   %xmm2, %xmm4
        paddd   %xmm4, %xmm6
        jne     .L53

which means we expand the multiplications with the constants in an odd way.



* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (5 preceding siblings ...)
  2012-06-12  9:54 ` [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark rguenth at gcc dot gnu.org
@ 2012-06-12 10:12 ` rguenth at gcc dot gnu.org
  2012-06-12 10:27 ` rguenth at gcc dot gnu.org
                   ` (38 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: rguenth at gcc dot gnu.org @ 2012-06-12 10:12 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

--- Comment #7 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-06-12 10:11:51 UTC ---
Btw, when I run the benchmark with the addition of -march=native (for me,
that's -march=corei7) then GCC 4.7 performs better than 4.6:

4.6:

./t 100000 

test               description   absolute   operations   ratio with
number                           time       per second   test0

 0 "int32_t for loop unroll 1"   0.41 sec   1951.22 M     1.00
 1 "int32_t for loop unroll 2"   0.51 sec   1568.63 M     1.24
 2 "int32_t for loop unroll 3"   0.47 sec   1702.13 M     1.15
 3 "int32_t for loop unroll 4"   0.48 sec   1666.67 M     1.17
 4 "int32_t for loop unroll 5"   0.47 sec   1702.13 M     1.15
 5 "int32_t for loop unroll 6"   0.51 sec   1568.63 M     1.24
 6 "int32_t for loop unroll 7"   0.47 sec   1702.13 M     1.15
 7 "int32_t for loop unroll 8"   0.47 sec   1702.13 M     1.15

Total absolute time for int32_t for loop unrolling: 3.79 sec

4.7:

./t 100000 

test               description   absolute   operations   ratio with
number                           time       per second   test0

 0 "int32_t for loop unroll 1"   0.39 sec   2051.28 M     1.00
 1 "int32_t for loop unroll 2"   0.40 sec   2000.00 M     1.03
 2 "int32_t for loop unroll 3"   0.39 sec   2051.28 M     1.00
 3 "int32_t for loop unroll 4"   0.39 sec   2051.28 M     1.00
 4 "int32_t for loop unroll 5"   0.38 sec   2105.26 M     0.97
 5 "int32_t for loop unroll 6"   0.41 sec   1951.22 M     1.05
 6 "int32_t for loop unroll 7"   0.37 sec   2162.16 M     0.95
 7 "int32_t for loop unroll 8"   0.36 sec   2222.22 M     0.92

Total absolute time for int32_t for loop unrolling: 3.09 sec

The loop then looks like (the expected)

.L53:
        movdqa  (%rax), %xmm4
        paddd   %xmm3, %xmm4
        pmulld  %xmm0, %xmm4
        paddd   %xmm1, %xmm4
        pmulld  %xmm0, %xmm4
        paddd   %xmm1, %xmm4
        pmulld  %xmm0, %xmm4
        paddd   %xmm2, %xmm4
        paddd   %xmm4, %xmm6
        movdqa  16(%rax), %xmm4
        addq    $32, %rax
        cmpq    $data32+32000, %rax
        paddd   %xmm3, %xmm4
        pmulld  %xmm0, %xmm4
        paddd   %xmm1, %xmm4
        pmulld  %xmm0, %xmm4
        paddd   %xmm1, %xmm4
        pmulld  %xmm0, %xmm4
        paddd   %xmm2, %xmm4
        paddd   %xmm4, %xmm5
        jne     .L53

It looks like pmulld is only available with SSE 4.1; otherwise we fall back
to the define_insn_and_split "*sse2_mulv4si3".  But that complexity is not
reflected in the vectorizer cost model (which needs improvement ...).
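
As an illustration of why the SSE2 fallback is so much longer than a single
pmulld, the following portable C sketch models the instruction sequence seen in
the slow loop (psrldq $4, pmuludq, pshufd $8, punpckldq) on plain four-element
arrays. The type and function names are invented for this sketch; it is a model
of the lane movement only, not the GCC splitter itself.

```c
#include <stdint.h>

typedef struct { uint32_t lane[4]; } v4si; /* lane[0] is the lowest 32 bits */

/* psrldq $4: shift the whole register right by one 32-bit lane. */
static v4si psrldq4(v4si a)
{
    v4si r = {{a.lane[1], a.lane[2], a.lane[3], 0}};
    return r;
}

/* pmuludq: 32x32->64 multiply of lanes 0 and 2 only. */
static v4si pmuludq(v4si a, v4si b)
{
    uint64_t p0 = (uint64_t)a.lane[0] * b.lane[0];
    uint64_t p2 = (uint64_t)a.lane[2] * b.lane[2];
    v4si r = {{(uint32_t)p0, (uint32_t)(p0 >> 32),
               (uint32_t)p2, (uint32_t)(p2 >> 32)}};
    return r;
}

/* pshufd $8: pick lanes 0, 2, 0, 0 (packs the two low halves together). */
static v4si pshufd8(v4si a)
{
    v4si r = {{a.lane[0], a.lane[2], a.lane[0], a.lane[0]}};
    return r;
}

/* punpckldq: interleave the low two lanes of a and b. */
static v4si punpckldq(v4si a, v4si b)
{
    v4si r = {{a.lane[0], b.lane[0], a.lane[1], b.lane[1]}};
    return r;
}

/* Four-lane 32-bit multiply built from the pieces above:
   7 operations instead of one pmulld. */
v4si mulv4si_sse2_model(v4si a, v4si b)
{
    v4si even = pshufd8(pmuludq(a, b));                   /* lanes 0, 2 */
    v4si odd  = pshufd8(pmuludq(psrldq4(a), psrldq4(b))); /* lanes 1, 3 */
    return punpckldq(even, odd);
}

/* Compares the model against plain lane-wise multiplication by k.
   Returns 1 if all four lanes agree. */
int sse2_model_matches(uint32_t a0, uint32_t a1, uint32_t a2, uint32_t a3,
                       uint32_t k)
{
    v4si a = {{a0, a1, a2, a3}};
    v4si b = {{k, k, k, k}};
    v4si got = mulv4si_sse2_model(a, b);
    for (int i = 0; i < 4; ++i)
        if (got.lane[i] != a.lane[i] * k)
            return 0;
    return 1;
}
```

Note that psrldq4(b) depends only on the constant operand, so in a loop it
could in principle be computed once outside the loop body.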



* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (6 preceding siblings ...)
  2012-06-12 10:12 ` rguenth at gcc dot gnu.org
@ 2012-06-12 10:27 ` rguenth at gcc dot gnu.org
  2012-06-12 10:39 ` rguenth at gcc dot gnu.org
                   ` (37 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: rguenth at gcc dot gnu.org @ 2012-06-12 10:27 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

--- Comment #8 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-06-12 10:27:15 UTC ---
Small testcase:

int a[256];
int b[256];

void foo (void)
{
  int i;
  for (i = 0; i < 256; ++i)
    {
      b[i] = a[i] * 23;
    }
}

You can see that we even shuffle the constant vector around, without taking
into account the REG_EQUAL note, which is gone by split1 time (removed by
either loop2_invariant or loop2_unswitch).

(insn 26 24 27 3 (set (reg:V4SI 82 [ vect_var_.10 ])
        (mult:V4SI (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9, offset: 0B] ])
            (reg:V4SI 85))) t.c:9 1496 {*sse2_mulv4si3}
     (expr_list:REG_EQUAL (mult:V4SI (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9, offset: 0B] ])
            (const_vector:V4SI [
                    (const_int 23 [0x17])
                    (const_int 23 [0x17])
                    (const_int 23 [0x17])
                    (const_int 23 [0x17])
                ]))
        (expr_list:REG_DEAD (reg:V4SI 84)
            (expr_list:REG_DEAD (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9, offset: 0B] ])
                (nil)))))



* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (7 preceding siblings ...)
  2012-06-12 10:27 ` rguenth at gcc dot gnu.org
@ 2012-06-12 10:39 ` rguenth at gcc dot gnu.org
  2012-06-12 11:57 ` rguenth at gcc dot gnu.org
                   ` (36 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: rguenth at gcc dot gnu.org @ 2012-06-12 10:39 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

--- Comment #9 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-06-12 10:39:19 UTC ---
And cprop fails to propagate

  (reg:V4SI 85) := (const_vector:V4SI [
        (const_int 23 [0x17])
        (const_int 23 [0x17])
        (const_int 23 [0x17])
        (const_int 23 [0x17])
    ])

It does at least re-add the REG_EQUAL note, but DSE drops it again.  From

(insn 26 24 27 3 (set (reg:V4SI 82 [ vect_var_.10 ])
        (mult:V4SI (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9, offset: 0B] ])
            (reg:V4SI 85))) t.c:9 1496 {*sse2_mulv4si3}
     (expr_list:REG_EQUAL (mult:V4SI (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9, offset: 0B] ])
            (const_vector:V4SI [
                    (const_int 23 [0x17])
                    (const_int 23 [0x17])
                    (const_int 23 [0x17])
                    (const_int 23 [0x17])
                ]))
        (expr_list:REG_DEAD (reg:V4SI 85)
            (expr_list:REG_DEAD (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9, offset: 0B] ])
                (nil)))))


we go to

(insn 26 24 27 3 (set (reg:V4SI 82 [ vect_var_.10 ])
        (mult:V4SI (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9, offset: 0B] ])
            (reg:V4SI 85))) t.c:9 1496 {*sse2_mulv4si3}
     (expr_list:REG_DEAD (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9, offset: 0B] ])
        (nil)))

Unfortunately there is no cprop pass after split1 to eventually clean things
up again (because of out-of-cfg-layout-mode ...).  If I force one to run,
it cannot simplify

(insn 42 24 43 3 (set (subreg:V2DI (reg:V4SI 86) 0)
        (mult:V2DI (zero_extend:V2DI (vec_select:V2SI (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9, offset: 0B] ])
                    (parallel [
                            (const_int 0 [0])
                            (const_int 2 [0x2])
                        ])))
            (zero_extend:V2DI (vec_select:V2SI (reg:V4SI 85)
                    (parallel [
                            (const_int 0 [0])
                            (const_int 2 [0x2])
                        ]))))) t.c:9 -1
     (nil))

either though.



* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (8 preceding siblings ...)
  2012-06-12 10:39 ` rguenth at gcc dot gnu.org
@ 2012-06-12 11:57 ` rguenth at gcc dot gnu.org
  2012-06-12 18:26 ` matt at use dot net
                   ` (35 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: rguenth at gcc dot gnu.org @ 2012-06-12 11:57 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |stevenb.gcc at gmail dot
                   |                            |com

--- Comment #10 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-06-12 11:57:20 UTC ---
Changing the insn_and_split to

(define_insn_and_split "*sse2_mulv4si3"
  [(set (match_operand:V4SI 0 "register_operand")
        (mult:V4SI (match_operand:V4SI 1 "register_operand")
                   (match_operand:V4SI 2 "nonmemory_vector_operand")))]
...

and defining

(define_predicate "nonmemory_vector_operand"
    (ior (match_operand 0 "register_operand")
         (match_code "const_vector")))

we ICE because when splitting

(insn 26 24 27 3 (set (reg:V4SI 82 [ vect_var_.10 ])
        (mult:V4SI (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9, offset: 0B] ])
            (const_vector:V4SI [
                    (const_int 23 [0x17])
                    (const_int 23 [0x17])
                    (const_int 23 [0x17])
                    (const_int 23 [0x17])
                ]))) t.c:9 1496 {*sse2_mulv4si3}
     (expr_list:REG_DEAD (reg:V4SI 83 [ MEM[symbol: a, index: ivtmp.20_9, offset: 0B] ])
        (nil)))

we don't even try to simplify when emitting the code.

But maybe allowing const_vector in (some of) the define_insn_and_split would
be the way to go ...



* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (9 preceding siblings ...)
  2012-06-12 11:57 ` rguenth at gcc dot gnu.org
@ 2012-06-12 18:26 ` matt at use dot net
  2012-06-12 18:55 ` rth at gcc dot gnu.org
                   ` (34 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: matt at use dot net @ 2012-06-12 18:26 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

--- Comment #11 from Matt Hargett <matt at use dot net> 2012-06-12 18:25:25 UTC ---
Richard,

Thanks for the quick analysis! Sounds like a perfect storm of sorts :/

re: cprop failure: this may be indicated by another major regression in their
suite for the "simple constant folding" tests. In GCC 4.1-4.6, those tests are
all 0.0s, but in 4.7 they take tens of seconds. Let me know if you want me to file a
separate bug/reduced test case for that, and then have that new bug depend on
this one. Otherwise, I'll wait until this one sees some resolution and then
retest.

re: multiple passes: if you think that feature has enough merit to be revisited
now, I can look into re-proposing Maxim's patches from October/November 2011
that integrated your feedback at the time.

re: -march workaround: our deployment platform's minimum arch is nocona, and
enabling -march=nocona doesn't work around the issue. For grins, I tried
-march=amdfam10 (another deployment target, but would require a separate
distributable binary), but that also didn't work around the issue.

I see a small improvement when using -fno-tree-vectorize, but not nearly as
dramatic as yours. For the int32_t for and while loop unrolling, the times go
from ~107s and ~105s to ~96s and ~95s, respectively. The do and goto loop
unrolling times get slightly worse (~2%), but it might be noise.

Let me know if there's any additional testing/footwork you'd like me to do.
Again, thanks for the quick turnaround on such a deep analysis!


* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (10 preceding siblings ...)
  2012-06-12 18:26 ` matt at use dot net
@ 2012-06-12 18:55 ` rth at gcc dot gnu.org
  2012-06-13  9:44 ` rguenth at gcc dot gnu.org
                   ` (33 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: rth at gcc dot gnu.org @ 2012-06-12 18:55 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

--- Comment #12 from Richard Henderson <rth at gcc dot gnu.org> 2012-06-12 18:54:24 UTC ---
(In reply to comment #10)
> But maybe allowing const_vector in (some of) the define_insn_and_split would
> be the way to go ...

Maybe.  It certainly would ease some of the simplifications.
At the moment I don't think we can go from

  mem -> const -> simplify -> const -> newmem

On the other hand, for this particular test case, where all
of the vector_cst elements are the same, and a reasonably
small number of bits set, it would be great to be able to
leverage synth_mult.

The main complexity for sse2_mulv4si3 is due to the fact that
we have to decompose the operation into V8HImode multiplies.
Whereas if we decompose the multiply, we have the shifts and
adds in V4SImode.
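
To make the shifts-and-adds idea concrete, here is a minimal scalar sketch in
C++. It is not GCC's synth_mult, which as far as I know also weighs
subtractions and factoring against target costs; the names mul_shift_add and
mul10 are made up for illustration. Each step is an ordinary 32-bit shift or
add, so the same sequence could be applied to every V4SImode lane at once.

```cpp
#include <cstdint>

// Multiply x by a constant using only shifts and adds: one shifted term per
// set bit of the multiplier. Unsigned arithmetic keeps wraparound defined.
inline std::uint32_t mul_shift_add(std::uint32_t x, std::uint32_t multiplier) {
    std::uint32_t result = 0;
    for (unsigned bit = 0; bit < 32; ++bit) {
        if (multiplier & (std::uint32_t{1} << bit)) {
            result += x << bit;
        }
    }
    return result;
}

// Hand-expanded form for the multiplier 10 = 0b1010: two shifts and one add.
inline std::uint32_t mul10(std::uint32_t x) {
    return (x << 3) + (x << 1);
}
```

The fewer bits are set in the multiplier, the shorter the sequence, which is
why a splat constant with few set bits is the attractive case.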

* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (11 preceding siblings ...)
  2012-06-12 18:55 ` rth at gcc dot gnu.org
@ 2012-06-13  9:44 ` rguenth at gcc dot gnu.org
  2012-06-14 14:39 ` rth at gcc dot gnu.org
                   ` (32 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: rguenth at gcc dot gnu.org @ 2012-06-13  9:44 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

--- Comment #13 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-06-13 09:43:15 UTC ---
(In reply to comment #12)
> (In reply to comment #10)
> > But maybe allowing const_vector in (some of) the define_insn_and_split would
> > be the way to go ...
> 
> Maybe.  It certainly would ease some of the simplifications.
> At the moment I don't think we can go from
> 
>   mem -> const -> simplify -> const -> newmem
> 
> On the other hand, for this particular test case, where all
> of the vector_cst elements are the same, and a reasonably
> small number of bits set, it would be great to be able to
> leverage synth_mult.

I agree, though that should possibly be done earlier.

> The main complexity for sse2_mulv4si3 is due to the fact that
> we have to decompose the operation into V8HImode multiplies.
> Whereas if we decompose the multiply, we have the shifts and
> adds in V4SImode.

Well, for a constant multiplier one can avoid the shuffles of the
multiplier - we seem to use v2si -> v2di multiplies with sse2_mulv4si3.
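
A scalar model of that decomposition, as a rough sketch only: the low 32 bits
of a 4 x 32-bit multiply are built from two widening 32x32->64 multiplies,
one over the even lanes and one over the odd lanes moved into even position
(the operation PMULUDQ provides on SSE2). The types V4SI and V2DI and all
function names here are illustrative, not GCC internals.

```cpp
#include <array>
#include <cstdint>

using V4SI = std::array<std::uint32_t, 4>;  // models four 32-bit lanes
using V2DI = std::array<std::uint64_t, 2>;  // models two 64-bit lanes

// Widening multiply of lanes 0 and 2: 32x32 -> 64 bits each.
inline V2DI widen_mul_even(const V4SI& a, const V4SI& b) {
    return V2DI{{static_cast<std::uint64_t>(a[0]) * b[0],
                 static_cast<std::uint64_t>(a[2]) * b[2]}};
}

// Models the shuffle that moves odd lanes (1, 3) into even positions (0, 2).
inline V4SI odd_to_even(const V4SI& v) {
    return V4SI{{v[1], 0u, v[3], 0u}};
}

// Keep the low 32 bits of each product and interleave even/odd results.
inline V4SI interleave_low32(const V2DI& even, const V2DI& odd) {
    return V4SI{{static_cast<std::uint32_t>(even[0]),
                 static_cast<std::uint32_t>(odd[0]),
                 static_cast<std::uint32_t>(even[1]),
                 static_cast<std::uint32_t>(odd[1])}};
}

// General case: both operands need their odd lanes shuffled down.
inline V4SI mul_v4si(const V4SI& a, const V4SI& b) {
    return interleave_low32(widen_mul_even(a, b),
                            widen_mul_even(odd_to_even(a), odd_to_even(b)));
}

// Splat constant: every lane of the multiplier is identical, so the
// multiplier itself needs no shuffle; only `a` does.
inline V4SI mul_v4si_splat(const V4SI& a, std::uint32_t c) {
    const V4SI splat = {{c, c, c, c}};
    return interleave_low32(widen_mul_even(a, splat),
                            widen_mul_even(odd_to_even(a), splat));
}
```

In this model the saving for a constant multiplier is the one odd_to_even
call that mul_v4si_splat skips.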


* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (12 preceding siblings ...)
  2012-06-13  9:44 ` rguenth at gcc dot gnu.org
@ 2012-06-14 14:39 ` rth at gcc dot gnu.org
  2012-06-14 18:02 ` matt at use dot net
                   ` (31 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: rth at gcc dot gnu.org @ 2012-06-14 14:39 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

Richard Henderson <rth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rth at gcc dot gnu.org
         AssignedTo|unassigned at gcc dot       |rth at gcc dot gnu.org
                   |gnu.org                     |

--- Comment #14 from Richard Henderson <rth at gcc dot gnu.org> 2012-06-14 14:38:43 UTC ---
Mine, at least for a 4.8 solution.


* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (13 preceding siblings ...)
  2012-06-14 14:39 ` rth at gcc dot gnu.org
@ 2012-06-14 18:02 ` matt at use dot net
  2012-06-14 18:39 ` rth at gcc dot gnu.org
                   ` (30 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: matt at use dot net @ 2012-06-14 18:02 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

--- Comment #15 from Matt Hargett <matt at use dot net> 2012-06-14 18:01:31 UTC ---
(In reply to comment #14)
> Mine, at least for a 4.8 solution.

What enhancement to 4.7 caused the regression? Can whatever the change was be
(partially) reverted to lessen the impact?


* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (14 preceding siblings ...)
  2012-06-14 18:02 ` matt at use dot net
@ 2012-06-14 18:39 ` rth at gcc dot gnu.org
  2012-06-15  9:04 ` jakub at gcc dot gnu.org
                   ` (29 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: rth at gcc dot gnu.org @ 2012-06-14 18:39 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

Richard Henderson <rth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED

--- Comment #16 from Richard Henderson <rth at gcc dot gnu.org> 2012-06-14 18:38:30 UTC ---
Dunno exactly.  The pre-SSE4.1 emulation of PMULLD has been there since
at least gcc 4.5.

What's not present in *any* version so far are some proper rtx_costs for
integer vector operations.  So any questions the vectorizer might be
asking about what transformations are profitable are currently being
given bogus answers.

I'm hoping just that will fix the regression, though I also plan to
address some of the other algorithmic questions raised in this PR.


* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (15 preceding siblings ...)
  2012-06-14 18:39 ` rth at gcc dot gnu.org
@ 2012-06-15  9:04 ` jakub at gcc dot gnu.org
  2012-06-15 21:05 ` rth at gcc dot gnu.org
                   ` (28 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: jakub at gcc dot gnu.org @ 2012-06-15  9:04 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

--- Comment #17 from Jakub Jelinek <jakub at gcc dot gnu.org> 2012-06-15 09:03:04 UTC ---
This started with http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=173856
The current cost model is seriously insufficient.


* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (16 preceding siblings ...)
  2012-06-15  9:04 ` jakub at gcc dot gnu.org
@ 2012-06-15 21:05 ` rth at gcc dot gnu.org
  2012-08-10  9:43 ` rguenth at gcc dot gnu.org
                   ` (27 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: rth at gcc dot gnu.org @ 2012-06-15 21:05 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

--- Comment #18 from Richard Henderson <rth at gcc dot gnu.org> 2012-06-15 21:04:49 UTC ---
See comments in http://gcc.gnu.org/ml/gcc-patches/2012-06/msg01081.html

It's not the vectorization costing, as previously suggested.


* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (17 preceding siblings ...)
  2012-06-15 21:05 ` rth at gcc dot gnu.org
@ 2012-08-10  9:43 ` rguenth at gcc dot gnu.org
  2012-08-14 17:26 ` matt at use dot net
                   ` (26 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: rguenth at gcc dot gnu.org @ 2012-08-10  9:43 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |4.7.2


* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (18 preceding siblings ...)
  2012-08-10  9:43 ` rguenth at gcc dot gnu.org
@ 2012-08-14 17:26 ` matt at use dot net
  2012-08-20 23:53 ` matt at use dot net
                   ` (25 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: matt at use dot net @ 2012-08-14 17:26 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

--- Comment #19 from Matt Hargett <matt at use dot net> 2012-08-14 17:25:40 UTC ---
Does this mean there will be a fix for this regression committed for 4.7.2? If
there's a patch I can test ahead of time, please let me know. Thanks!


* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (19 preceding siblings ...)
  2012-08-14 17:26 ` matt at use dot net
@ 2012-08-20 23:53 ` matt at use dot net
  2012-09-20 10:27 ` jakub at gcc dot gnu.org
                   ` (24 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: matt at use dot net @ 2012-08-20 23:53 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

--- Comment #20 from Matt Hargett <matt at use dot net> 2012-08-20 23:52:31 UTC ---
Some additional information:
Compared to LLVM 3.1 with -O3, GCC 4.7 is twice as slow on these benchmarks.
LLVM even outperforms GCC 4.1, which previously had the best result. We are
very eager to hear about any resolution for this major regression in 4.7 so we
can deploy it. Even a return to GCC 4.1 performance levels would be fine.

Thanks!


* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (20 preceding siblings ...)
  2012-08-20 23:53 ` matt at use dot net
@ 2012-09-20 10:27 ` jakub at gcc dot gnu.org
  2012-11-29 21:17 ` rth at gcc dot gnu.org
                   ` (23 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: jakub at gcc dot gnu.org @ 2012-09-20 10:27 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|4.7.2                       |4.7.3

--- Comment #21 from Jakub Jelinek <jakub at gcc dot gnu.org> 2012-09-20 10:21:07 UTC ---
GCC 4.7.2 has been released.


* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (21 preceding siblings ...)
  2012-09-20 10:27 ` jakub at gcc dot gnu.org
@ 2012-11-29 21:17 ` rth at gcc dot gnu.org
  2012-12-03 15:27 ` rguenth at gcc dot gnu.org
                   ` (22 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: rth at gcc dot gnu.org @ 2012-11-29 21:17 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

Richard Henderson <rth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |NEW
         AssignedTo|rth at gcc dot gnu.org      |unassigned at gcc dot
                   |                            |gnu.org

--- Comment #22 from Richard Henderson <rth at gcc dot gnu.org> 2012-11-29 21:17:05 UTC ---
Needs long-term work in pre-vectorization folding.


* [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (22 preceding siblings ...)
  2012-11-29 21:17 ` rth at gcc dot gnu.org
@ 2012-12-03 15:27 ` rguenth at gcc dot gnu.org
  2013-04-11  8:00 ` [Bug rtl-optimization/53533] [4.7/4.8/4.9 " rguenth at gcc dot gnu.org
                   ` (21 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: rguenth at gcc dot gnu.org @ 2012-12-03 15:27 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2


* [Bug rtl-optimization/53533] [4.7/4.8/4.9 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (23 preceding siblings ...)
  2012-12-03 15:27 ` rguenth at gcc dot gnu.org
@ 2013-04-11  8:00 ` rguenth at gcc dot gnu.org
  2014-06-12 13:45 ` [Bug rtl-optimization/53533] [4.7/4.8/4.9/4.10 " rguenth at gcc dot gnu.org
                   ` (20 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: rguenth at gcc dot gnu.org @ 2013-04-11  8:00 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|4.7.3                       |4.7.4

--- Comment #23 from Richard Biener <rguenth at gcc dot gnu.org> 2013-04-11 07:59:38 UTC ---
GCC 4.7.3 is being released, adjusting target milestone.


* [Bug rtl-optimization/53533] [4.7/4.8/4.9/4.10 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (24 preceding siblings ...)
  2013-04-11  8:00 ` [Bug rtl-optimization/53533] [4.7/4.8/4.9 " rguenth at gcc dot gnu.org
@ 2014-06-12 13:45 ` rguenth at gcc dot gnu.org
  2014-12-19 13:28 ` [Bug rtl-optimization/53533] [4.8/4.9/5 " jakub at gcc dot gnu.org
                   ` (19 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: rguenth at gcc dot gnu.org @ 2014-06-12 13:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|4.7.4                       |4.8.4

--- Comment #24 from Richard Biener <rguenth at gcc dot gnu.org> ---
The 4.7 branch is being closed, moving target milestone to 4.8.4.


* [Bug rtl-optimization/53533] [4.8/4.9/5 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (25 preceding siblings ...)
  2014-06-12 13:45 ` [Bug rtl-optimization/53533] [4.7/4.8/4.9/4.10 " rguenth at gcc dot gnu.org
@ 2014-12-19 13:28 ` jakub at gcc dot gnu.org
  2015-05-03 13:00 ` [Bug rtl-optimization/53533] [4.8/4.9/5/6 " trippels at gcc dot gnu.org
                   ` (18 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: jakub at gcc dot gnu.org @ 2014-12-19 13:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|4.8.4                       |4.8.5

--- Comment #25 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 4.8.4 has been released.


* [Bug rtl-optimization/53533] [4.8/4.9/5/6 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (26 preceding siblings ...)
  2014-12-19 13:28 ` [Bug rtl-optimization/53533] [4.8/4.9/5 " jakub at gcc dot gnu.org
@ 2015-05-03 13:00 ` trippels at gcc dot gnu.org
  2015-05-03 13:01 ` trippels at gcc dot gnu.org
                   ` (17 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: trippels at gcc dot gnu.org @ 2015-05-03 13:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

Markus Trippelsdorf <trippels at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|2012-05-31 00:00:00         |2015-5-3
                 CC|                            |trippels at gcc dot gnu.org

--- Comment #26 from Markus Trippelsdorf <trippels at gcc dot gnu.org> ---
For gcc-5 and gcc-6 there is an additional 50% slowdown:

 % g++ -O3 loop_unroll.ii -o loop_unroll
 % time ./loop_unroll 10000
./loop_unroll 10000 

test                description   absolute   operations   ratio with
number                            time       per second   test0

 0  "int32_t for loop unroll 1"   0.14 sec   552.30 M     1.00
 1  "int32_t for loop unroll 2"   0.11 sec   699.49 M     0.79
 2  "int32_t for loop unroll 3"   0.14 sec   566.56 M     0.97
 3  "int32_t for loop unroll 4"   0.15 sec   532.87 M     1.04
 4  "int32_t for loop unroll 5"   0.10 sec   784.70 M     0.70
 5  "int32_t for loop unroll 6"   0.09 sec   887.12 M     0.62
 6  "int32_t for loop unroll 7"   0.09 sec   913.50 M     0.60
 7  "int32_t for loop unroll 8"   0.08 sec   986.45 M     0.56
 8  "int32_t for loop unroll 9"   0.23 sec   346.06 M     1.60
 9 "int32_t for loop unroll 10"   0.08 sec  1040.06 M     0.53
10 "int32_t for loop unroll 11"   0.23 sec   348.02 M     1.59
11 "int32_t for loop unroll 12"   0.23 sec   353.38 M     1.56
12 "int32_t for loop unroll 13"   0.24 sec   338.32 M     1.63
13 "int32_t for loop unroll 14"   0.24 sec   332.32 M     1.66
14 "int32_t for loop unroll 15"   0.25 sec   321.15 M     1.72
15 "int32_t for loop unroll 16"   0.25 sec   318.23 M     1.74
16 "int32_t for loop unroll 17"   0.24 sec   329.43 M     1.68
17 "int32_t for loop unroll 18"   0.25 sec   321.34 M     1.72
18 "int32_t for loop unroll 19"   0.25 sec   314.53 M     1.76
19 "int32_t for loop unroll 20"   0.25 sec   325.33 M     1.70
20 "int32_t for loop unroll 21"   0.25 sec   323.67 M     1.71
21 "int32_t for loop unroll 22"   0.25 sec   316.85 M     1.74
22 "int32_t for loop unroll 23"   0.25 sec   323.51 M     1.71
23 "int32_t for loop unroll 24"   0.06 sec  1257.94 M     0.44
24 "int32_t for loop unroll 25"   0.24 sec   327.77 M     1.69
25 "int32_t for loop unroll 26"   0.06 sec  1310.44 M     0.42
26 "int32_t for loop unroll 27"   0.07 sec  1072.85 M     0.51
27 "int32_t for loop unroll 28"   0.28 sec   283.44 M     1.95
28 "int32_t for loop unroll 29"   0.30 sec   267.96 M     2.06
29 "int32_t for loop unroll 30"   0.31 sec   258.88 M     2.13
30 "int32_t for loop unroll 31"   0.06 sec  1337.64 M     0.41
31 "int32_t for loop unroll 32"   0.06 sec  1315.10 M     0.42

Total absolute time for int32_t for loop unrolling: 5.85 sec
...
./loop_unroll 10000  41.43s user 0.00s system 100% cpu 41.426 total

==============================================================================

 % /usr/x86_64-pc-linux-gnu/gcc-bin/4.9.2/g++ -O3 loop_unroll.ii -o loop_unroll
 % time ./loop_unroll 10000
./loop_unroll 10000 

test                description   absolute   operations   ratio with
number                            time       per second   test0

 0  "int32_t for loop unroll 1"   0.14 sec   582.13 M     1.00
 1  "int32_t for loop unroll 2"   0.13 sec   625.41 M     0.93
 2  "int32_t for loop unroll 3"   0.13 sec   635.76 M     0.92
 3  "int32_t for loop unroll 4"   0.13 sec   625.41 M     0.93
 4  "int32_t for loop unroll 5"   0.12 sec   640.96 M     0.91
 5  "int32_t for loop unroll 6"   0.09 sec   888.11 M     0.66
 6  "int32_t for loop unroll 7"   0.09 sec   900.10 M     0.65
 7  "int32_t for loop unroll 8"   0.10 sec   832.20 M     0.70
 8  "int32_t for loop unroll 9"   0.10 sec   834.22 M     0.70
 9 "int32_t for loop unroll 10"   0.09 sec   902.04 M     0.65
10 "int32_t for loop unroll 11"   0.10 sec   805.15 M     0.72
11 "int32_t for loop unroll 12"   0.10 sec   823.27 M     0.71
12 "int32_t for loop unroll 13"   0.09 sec   860.51 M     0.68
13 "int32_t for loop unroll 14"   0.11 sec   753.59 M     0.77
14 "int32_t for loop unroll 15"   0.10 sec   781.96 M     0.74
15 "int32_t for loop unroll 16"   0.09 sec   858.76 M     0.68
16 "int32_t for loop unroll 17"   0.09 sec   846.91 M     0.69
17 "int32_t for loop unroll 18"   0.10 sec   783.19 M     0.74
18 "int32_t for loop unroll 19"   0.10 sec   794.81 M     0.73
19 "int32_t for loop unroll 20"   0.10 sec   806.70 M     0.72
20 "int32_t for loop unroll 21"   0.10 sec   823.82 M     0.71
21 "int32_t for loop unroll 22"   0.09 sec   851.74 M     0.68
22 "int32_t for loop unroll 23"   0.10 sec   792.87 M     0.73
23 "int32_t for loop unroll 24"   0.10 sec   809.32 M     0.72
24 "int32_t for loop unroll 25"   0.10 sec   832.18 M     0.70
25 "int32_t for loop unroll 26"   0.10 sec   781.11 M     0.75
26 "int32_t for loop unroll 27"   0.10 sec   792.40 M     0.73
27 "int32_t for loop unroll 28"   0.10 sec   817.22 M     0.71
28 "int32_t for loop unroll 29"   0.10 sec   826.40 M     0.70
29 "int32_t for loop unroll 30"   0.10 sec   803.83 M     0.72
30 "int32_t for loop unroll 31"   0.10 sec   803.48 M     0.72
31 "int32_t for loop unroll 32"   0.10 sec   796.88 M     0.73

Total absolute time for int32_t for loop unrolling: 3.28 sec
...
./loop_unroll 10000  22.75s user 0.00s system 100% cpu 22.746 total

clang:
./loop_unroll 10000  12.93s user 0.00s system 100% cpu 12.933 total

icpc (about 5x faster than gcc-5):
./loop_unroll 10000  8.38s user 0.00s system 99% cpu 8.382 total


* [Bug rtl-optimization/53533] [4.8/4.9/5/6 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (27 preceding siblings ...)
  2015-05-03 13:00 ` [Bug rtl-optimization/53533] [4.8/4.9/5/6 " trippels at gcc dot gnu.org
@ 2015-05-03 13:01 ` trippels at gcc dot gnu.org
  2015-05-04 14:46 ` maltsevm at gmail dot com
                   ` (16 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: trippels at gcc dot gnu.org @ 2015-05-03 13:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

--- Comment #27 from Markus Trippelsdorf <trippels at gcc dot gnu.org> ---
Created attachment 35448
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35448&action=edit
testcase


* [Bug rtl-optimization/53533] [4.8/4.9/5/6 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (28 preceding siblings ...)
  2015-05-03 13:01 ` trippels at gcc dot gnu.org
@ 2015-05-04 14:46 ` maltsevm at gmail dot com
  2015-05-04 15:00 ` maltsevm at gmail dot com
                   ` (15 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: maltsevm at gmail dot com @ 2015-05-04 14:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

Mikhail Maltsev <maltsevm at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |maltsevm at gmail dot com

--- Comment #28 from Mikhail Maltsev <maltsevm at gmail dot com> ---
Created attachment 35455
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35455&action=edit
testcase, inlining

This testcase marks some functions with __attribute__((always_inline/noinline))
when -DINLINE_MANUALLY is defined.
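
For readers without the attachment, the general shape of such a switch might
look like the sketch below. This is an assumption about the pattern, not a
copy of the attached file: the macro names FORCE_INLINE and NO_INLINE and the
functions add_step and sum_array are hypothetical.

```cpp
#include <cstdint>

// With -DINLINE_MANUALLY, inlining decisions are pinned by attributes;
// without it, the compiler's own heuristics decide.
#if defined(INLINE_MANUALLY) && (defined(__GNUC__) || defined(__clang__))
#  define FORCE_INLINE inline __attribute__((always_inline))
#  define NO_INLINE __attribute__((noinline))
#else
#  define FORCE_INLINE inline
#  define NO_INLINE
#endif

// Small per-element step: forced into its caller when INLINE_MANUALLY is set.
FORCE_INLINE std::int32_t add_step(std::int32_t acc, std::int32_t value) {
    return acc + value;
}

// Loop driver: kept out of line when INLINE_MANUALLY is set, so the timed
// loop is compiled as its own function rather than merged into the harness.
NO_INLINE std::int32_t sum_array(const std::int32_t* first, int count) {
    std::int32_t acc = 0;
    for (int i = 0; i < count; ++i) {
        acc = add_step(acc, first[i]);
    }
    return acc;
}
```

Building once with and once without -DINLINE_MANUALLY then separates inlining
effects from the loop code generation itself.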


* [Bug rtl-optimization/53533] [4.8/4.9/5/6 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (29 preceding siblings ...)
  2015-05-04 14:46 ` maltsevm at gmail dot com
@ 2015-05-04 15:00 ` maltsevm at gmail dot com
  2015-06-23  8:22 ` rguenth at gcc dot gnu.org
                   ` (14 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: maltsevm at gmail dot com @ 2015-05-04 15:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

--- Comment #29 from Mikhail Maltsev <maltsevm at gmail dot com> ---
Results for attached testcase:

Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (Haswell)
g++ -O3 -march=native -mtune=native
10000 iterations

Clang 3.7
Total absolute time for int32_t for loop unrolling: 0.99 sec
Total absolute time for int32_t do loop unrolling: 1.00 sec
Total absolute time for double for loop unrolling: 1.37 sec
Total absolute time for double do loop unrolling: 1.37 sec

GCC 4.7.4
Total absolute time for int32_t for loop unrolling: 5.88 sec
Total absolute time for int32_t do loop unrolling: 7.57 sec
Total absolute time for double for loop unrolling: 2.29 sec
Total absolute time for double do loop unrolling: 2.45 sec

GCC 4.8.4
Total absolute time for int32_t for loop unrolling: 3.12 sec
Total absolute time for int32_t do loop unrolling: 3.29 sec
Total absolute time for double for loop unrolling: 1.13 sec
Total absolute time for double do loop unrolling: 1.14 sec

GCC 4.9.2
Total absolute time for int32_t for loop unrolling: 3.02 sec
Total absolute time for int32_t do loop unrolling: 3.29 sec
Total absolute time for double for loop unrolling: 1.10 sec
Total absolute time for double do loop unrolling: 1.13 sec

GCC 6
Total absolute time for int32_t for loop unrolling: 5.95 sec
Total absolute time for int32_t do loop unrolling: 6.95 sec
Total absolute time for double for loop unrolling: 2.39 sec
Total absolute time for double do loop unrolling: 2.39 sec

g++ -DINLINE_MANUALLY -O3 -march=native -mtune=native
50000 iterations

Clang 3.7
Total absolute time for int32_t for loop unrolling: 2.43 sec
Total absolute time for int32_t do loop unrolling: 2.32 sec
Total absolute time for double for loop unrolling: 6.38 sec
Total absolute time for double do loop unrolling: 6.38 sec

GCC 4.9.2
Total absolute time for int32_t for loop unrolling: 10.17 sec
Total absolute time for int32_t do loop unrolling: 10.16 sec
Total absolute time for double for loop unrolling: 3.89 sec
Total absolute time for double do loop unrolling: 3.90 sec

GCC 6
Total absolute time for int32_t for loop unrolling: 10.10 sec
Total absolute time for int32_t do loop unrolling: 10.12 sec
Total absolute time for double for loop unrolling: 3.90 sec
Total absolute time for double do loop unrolling: 3.89 sec

g++ -DINLINE_MANUALLY -Ofast -march=native -mtune=native
GCC 6
Total absolute time for int32_t for loop unrolling: 10.11 sec
Total absolute time for int32_t do loop unrolling: 10.11 sec
Total absolute time for double for loop unrolling: 1.14 sec
Total absolute time for double do loop unrolling: 1.15 sec

So, IMHO there is no regression here (at least w.r.t. vectorization). The
floating-point loop gets constant-folded if reassociation is allowed. Also,
GCC 6 is able to infer that the "for" and "while" tests are semantically
equivalent and unifies them.
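
A small illustration of why reassociation has to be explicitly allowed: the
grouping of floating-point additions can change the result, so at -O3 the
compiler must keep the source order, while -Ofast (which turns on
-ffast-math) may regroup and fold. The function names are illustrative and
not taken from the benchmark.

```cpp
// The same three-term sum, grouped two different ways.
inline double sum_left(double a, double b, double c) {
    return (a + b) + c;
}

inline double sum_right(double a, double b, double c) {
    return a + (b + c);
}
```

With a = 1e20, b = -1e20 and c = 1.0 under IEEE double arithmetic, sum_left
gives 1.0 but sum_right gives 0.0, because 1.0 is lost when it is added to
-1e20 first.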


* [Bug rtl-optimization/53533] [4.8/4.9/5/6 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (30 preceding siblings ...)
  2015-05-04 15:00 ` maltsevm at gmail dot com
@ 2015-06-23  8:22 ` rguenth at gcc dot gnu.org
  2015-06-26 19:58 ` [Bug rtl-optimization/53533] [4.9/5/6 " jakub at gcc dot gnu.org
                   ` (13 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: rguenth at gcc dot gnu.org @ 2015-06-23  8:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|4.8.5                       |4.9.3

--- Comment #30 from Richard Biener <rguenth at gcc dot gnu.org> ---
The gcc-4_8-branch is being closed, re-targeting regressions to 4.9.3.



* [Bug rtl-optimization/53533] [4.9/5/6 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (31 preceding siblings ...)
  2015-06-23  8:22 ` rguenth at gcc dot gnu.org
@ 2015-06-26 19:58 ` jakub at gcc dot gnu.org
  2015-06-26 20:29 ` jakub at gcc dot gnu.org
                   ` (12 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: jakub at gcc dot gnu.org @ 2015-06-26 19:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

--- Comment #31 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 4.9.3 has been released.



* [Bug rtl-optimization/53533] [4.9/5/6 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (32 preceding siblings ...)
  2015-06-26 19:58 ` [Bug rtl-optimization/53533] [4.9/5/6 " jakub at gcc dot gnu.org
@ 2015-06-26 20:29 ` jakub at gcc dot gnu.org
  2021-02-23 12:24 ` [Bug rtl-optimization/53533] [8/9/10/11 " rguenth at gcc dot gnu.org
                   ` (11 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: jakub at gcc dot gnu.org @ 2015-06-26 20:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|4.9.3                       |4.9.4



* [Bug rtl-optimization/53533] [8/9/10/11 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (33 preceding siblings ...)
  2015-06-26 20:29 ` jakub at gcc dot gnu.org
@ 2021-02-23 12:24 ` rguenth at gcc dot gnu.org
  2021-05-14  9:46 ` [Bug rtl-optimization/53533] [9/10/11/12 " jakub at gcc dot gnu.org
                   ` (10 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-02-23 12:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|2015-05-03 00:00:00         |2021-2-23

--- Comment #41 from Richard Biener <rguenth at gcc dot gnu.org> ---
Re-confirmed.


* [Bug rtl-optimization/53533] [9/10/11/12 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (34 preceding siblings ...)
  2021-02-23 12:24 ` [Bug rtl-optimization/53533] [8/9/10/11 " rguenth at gcc dot gnu.org
@ 2021-05-14  9:46 ` jakub at gcc dot gnu.org
  2021-06-01  8:05 ` rguenth at gcc dot gnu.org
                   ` (9 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-05-14  9:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|8.5                         |9.4

--- Comment #42 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 8 branch is being closed.


* [Bug rtl-optimization/53533] [9/10/11/12 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (35 preceding siblings ...)
  2021-05-14  9:46 ` [Bug rtl-optimization/53533] [9/10/11/12 " jakub at gcc dot gnu.org
@ 2021-06-01  8:05 ` rguenth at gcc dot gnu.org
  2022-05-27  9:34 ` [Bug rtl-optimization/53533] [10/11/12/13 " rguenth at gcc dot gnu.org
                   ` (8 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-06-01  8:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|9.4                         |9.5

--- Comment #43 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 9.4 is being released, retargeting bugs to GCC 9.5.


* [Bug rtl-optimization/53533] [10/11/12/13 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (36 preceding siblings ...)
  2021-06-01  8:05 ` rguenth at gcc dot gnu.org
@ 2022-05-27  9:34 ` rguenth at gcc dot gnu.org
  2022-05-30  6:40 ` crazylht at gmail dot com
                   ` (7 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-05-27  9:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|9.5                         |10.4

--- Comment #44 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 9 branch is being closed


* [Bug rtl-optimization/53533] [10/11/12/13 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (37 preceding siblings ...)
  2022-05-27  9:34 ` [Bug rtl-optimization/53533] [10/11/12/13 " rguenth at gcc dot gnu.org
@ 2022-05-30  6:40 ` crazylht at gmail dot com
  2022-05-30  8:57 ` rguenther at suse dot de
                   ` (6 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: crazylht at gmail dot com @ 2022-05-30  6:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

--- Comment #45 from Hongtao.liu <crazylht at gmail dot com> ---
A reduced testcase.

int a[256];
int b[256];

void foo (void)
{
  int i;
  for (i = 0; i < 256; ++i)
    {
      int tmp = a[i] + 12345;
      tmp *= 914237;
      tmp += 12332;
      tmp *= 914237;
      tmp += 12332;
      tmp *= 914237;
      tmp -= 13;
      tmp *= 8000;
      b[i] = tmp;
    }
}

GCC now rewrites pmulld as pslld + padd + psub. The vectorizer cost model looks
fine, but the scalar version is additionally optimized in pass_combine from
4 * mult + 3 * add to 1 * mult + 2 * add, which the vectorizer does not take
into account. The vectorized version is not simplified later.

        mov     eax, DWORD PTR a[rdx]
        add     rdx, 4
        add     eax, 12345
        imul    eax, eax, -1564285888
        sub     eax, 333519936
        mov     DWORD PTR b[rdx-4], eax
        cmp     rdx, 1024
        jne     .L2


I wonder whether GIMPLE could also simplify

      tmp *= 914237;
      tmp += 12332;
      tmp *= 914237;
      tmp += 12332;
      tmp *= 914237;
      tmp -= 13;
      tmp *= 8000;

to 
     tmp *= -1564285888;
     tmp -= 333519936;

refer to https://godbolt.org/z/qYMYMTxEY

Then the vectorized code would be more optimal.
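
The two folded constants can be checked independently of the compiler. The sketch below evaluates both forms in uint32_t, where wraparound is well defined, and compares them; chain, folded and check_folding are names made up for this check, and the unsigned type is an assumption used to avoid signed overflow in the check itself.

```c
#include <assert.h>
#include <stdint.h>

/* The original sequence from the reduced testcase, in wrapping arithmetic. */
uint32_t chain(uint32_t tmp)
{
    tmp *= 914237u;
    tmp += 12332u;
    tmp *= 914237u;
    tmp += 12332u;
    tmp *= 914237u;
    tmp -= 13u;
    tmp *= 8000u;
    return tmp;
}

/* The folded form: one multiplication and one subtraction. */
uint32_t folded(uint32_t tmp)
{
    tmp *= (uint32_t)-1564285888;  /* 8000 * 914237^3 modulo 2^32 */
    tmp -= 333519936u;
    return tmp;
}

/* Compares both forms on a range of inputs and on the largest input. */
void check_folding(void)
{
    for (uint32_t x = 0; x < 100000u; ++x)
        assert(chain(x) == folded(x));
    assert(chain(0xFFFFFFFFu) == folded(0xFFFFFFFFu));
}
```

Both forms are affine functions of tmp modulo 2^32, so agreement on the multiplier and on the value at 0 already implies agreement everywhere; the loop is only a sanity check.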


* [Bug rtl-optimization/53533] [10/11/12/13 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (38 preceding siblings ...)
  2022-05-30  6:40 ` crazylht at gmail dot com
@ 2022-05-30  8:57 ` rguenther at suse dot de
  2022-05-30  9:10 ` crazylht at gmail dot com
                   ` (5 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: rguenther at suse dot de @ 2022-05-30  8:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

--- Comment #46 from rguenther at suse dot de <rguenther at suse dot de> ---
On Mon, 30 May 2022, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
> 
> --- Comment #45 from Hongtao.liu <crazylht at gmail dot com> ---
> A reduced testcase.
> 
> int a[256];
> int b[256];
> 
> void foo (void)
> {
>   int i;
>   for (i = 0; i < 256; ++i)
>     {
>       int tmp = a[i] + 12345;
>       tmp *= 914237;
>       tmp += 12332;
>       tmp *= 914237;
>       tmp += 12332;
>       tmp *= 914237;
>       tmp -= 13;
>       tmp *= 8000;
>       b[i] = tmp;
>     }
> }
> 
> GCC now simply pmulld to pslld + padd + psub, the vectorizer cost model looks
> fine,  but for scalar version, it's extraly optimized in pass_combine from 4 *
> mult + 3 * add to 1 * mult + 2 * add which is not taken in count by vectorizer.
> The vectorized version is not simplified later.
> 
>         mov     eax, DWORD PTR a[rdx]
>         add     rdx, 4
>         add     eax, 12345
>         imul    eax, eax, -1564285888
>         sub     eax, 333519936
>         mov     DWORD PTR b[rdx-4], eax
>         cmp     rdx, 1024
>         jne     .L2
> 
> 
> I'm wondering could Gimple also simplify 
> 
>       tmp *= 914237;
>       tmp += 12332;
>       tmp *= 914237;
>       tmp += 12332;
>       tmp *= 914237;
>       tmp -= 13;
>       tmp *= 8000;
> 
> to 
>      tmp *= -1564285888;
>      tmp -= 333519936;
> 
> refer to https://godbolt.org/z/qYMYMTxEY
> 
> Then the vectorized code would be more optimal.

The issue is that the re-association pass doesn't handle operations
with undefined overflow behavior; we do have duplicate bug reports
for this.

On the RTL level likely simplify-rtx (or the variants used by combine)
only have limited support for vector operations.
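
To illustrate why undefined overflow gets in the way of re-association: regrouping signed additions can create an intermediate overflow that the source order never had. This is a minimal sketch using the GCC/Clang builtin __builtin_add_overflow; the helper names are made up for the example and it is not how the reassoc pass itself is written.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Evaluates (a + b) + c and reports whether any step overflows int. */
bool sum_left_to_right_overflows(int a, int b, int c)
{
    int t;
    if (__builtin_add_overflow(a, b, &t))
        return true;
    return __builtin_add_overflow(t, c, &t);
}

/* Evaluates a + (b + c), i.e. the re-associated grouping. */
bool sum_reassociated_overflows(int a, int b, int c)
{
    int t;
    if (__builtin_add_overflow(b, c, &t))
        return true;
    return __builtin_add_overflow(a, t, &t);
}
```

For a = -1, b = INT_MAX, c = 1 the source order stays in range, while the regrouped form computes INT_MAX + 1 first. With wrapping types (unsigned, or signed under -fwrapv) both groupings give the same result, which is why the transformation is safe there.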


* [Bug rtl-optimization/53533] [10/11/12/13 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (39 preceding siblings ...)
  2022-05-30  8:57 ` rguenther at suse dot de
@ 2022-05-30  9:10 ` crazylht at gmail dot com
  2022-05-30  9:14 ` rguenther at suse dot de
                   ` (4 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: crazylht at gmail dot com @ 2022-05-30  9:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

--- Comment #47 from Hongtao.liu <crazylht at gmail dot com> ---

> 
> The issue is that the re-association pass doesn't handle operations
> with undefined overflow behavior, we do have duplicate bugreports
> for this.
> 

I saw the following in match.pd:

/* Combine successive multiplications.  Similar to above, but handling
   overflow is different.  */
(simplify
 (mult (mult @0 INTEGER_CST@1) INTEGER_CST@2)
 (with {
   wi::overflow_type overflow;
   wide_int mul = wi::mul (wi::to_wide (@1), wi::to_wide (@2),
                           TYPE_SIGN (type), &overflow);
  }
  /* Skip folding on overflow: the only special case is @1 * @2 == -INT_MIN,
     otherwise undefined overflow implies that @0 must be zero.  */
  (if (!overflow || TYPE_OVERFLOW_WRAPS (type))
   (mult @0 { wide_int_to_tree (type, mul); }))))

Can it be extended to (mult (plus_minus (mult @0 INTEGER_CST@1) INTEGER_CST@3)
INTEGER_CST@2), so that at least we can handle it under -fwrapv?


* [Bug rtl-optimization/53533] [10/11/12/13 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (40 preceding siblings ...)
  2022-05-30  9:10 ` crazylht at gmail dot com
@ 2022-05-30  9:14 ` rguenther at suse dot de
  2022-06-16  1:29 ` cvs-commit at gcc dot gnu.org
                   ` (3 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: rguenther at suse dot de @ 2022-05-30  9:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

--- Comment #48 from rguenther at suse dot de <rguenther at suse dot de> ---
On Mon, 30 May 2022, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
> 
> --- Comment #47 from Hongtao.liu <crazylht at gmail dot com> ---
> 
> > 
> > The issue is that the re-association pass doesn't handle operations
> > with undefined overflow behavior, we do have duplicate bugreports
> > for this.
> > 
> 
> I saw below in match.pd
> 
> /* Combine successive multiplications.  Similar to above, but handling
>    overflow is different.  */
> (simplify
>  (mult (mult @0 INTEGER_CST@1) INTEGER_CST@2)
>  (with {
>    wi::overflow_type overflow;
>    wide_int mul = wi::mul (wi::to_wide (@1), wi::to_wide (@2),
>                            TYPE_SIGN (type), &overflow);
>   }
>   /* Skip folding on overflow: the only special case is @1 * @2 == -INT_MIN,
>      otherwise undefined overflow implies that @0 must be zero.  */
>   (if (!overflow || TYPE_OVERFLOW_WRAPS (type))
>    (mult @0 { wide_int_to_tree (type, mul); }))))
> 
> Can it be extend to (mult (plus_minus (mult @0 INTEGER_CST@1) INTEGER_CST@3)
> INTEGER_CST@2), so at least we can handle it under -fwrapv?

With -fwrapv the reassoc pass might do this already (not sure with
mixing multiplication and addition, you'd have to try).  But sure,
we could add a pattern for the above (with appropriate single-use
handling).


* [Bug rtl-optimization/53533] [10/11/12/13 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (41 preceding siblings ...)
  2022-05-30  9:14 ` rguenther at suse dot de
@ 2022-06-16  1:29 ` cvs-commit at gcc dot gnu.org
  2022-06-16  2:31 ` crazylht at gmail dot com
                   ` (2 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-06-16  1:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

--- Comment #49 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:1089d083117f28f3518f5ec3c7a153236cb92334

commit r13-1126-g1089d083117f28f3518f5ec3c7a153236cb92334
Author: liuhongt <hongtao.liu@intel.com>
Date:   Tue May 31 17:13:21 2022 +0800

    Simplify (B * v + C) * D -> BD* v + CD when B,C,D are all INTEGER_CST.

    Similar for (v + B) * C + D -> C * v + BCD.
    Don't simplify it when there's overflow and overflow is UB for type v.

    gcc/ChangeLog:

            PR tree-optimization/53533
            * match.pd: Simplify (B * v + C) * D -> BD * v + CD and
            (v + B) * C + D -> C * v + BCD when B,C,D are all INTEGER_CST,
            and there's no overflow or !TYPE_OVERFLOW_UNDEFINED.

    gcc/testsuite/ChangeLog:

            * gcc.target/i386/pr53533-1.c: New test.
            * gcc.target/i386/pr53533-2.c: New test.
            * gcc.target/i386/pr53533-3.c: New test.
            * gcc.target/i386/pr53533-4.c: New test.
            * gcc.target/i386/pr53533-5.c: New test.
            * gcc.dg/vect/slp-11a.c: Adjust testcase.
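
The overflow condition described above can be pictured with a small C sketch for the first identity, (B * v + C) * D -> BD * v + CD. It models only the case where overflow is undefined (plain int) and uses the GCC/Clang builtin __builtin_mul_overflow; fold_mul_add_mul is a hypothetical helper for illustration, not the match.pd pattern added by the commit.

```c
#include <assert.h>
#include <stdbool.h>

/* Tries to combine the constants of (b * v + c) * d into bd and cd so that
   the expression can be rewritten as bd * v + cd.  Returns false when
   either constant product overflows int, in which case no folding is done. */
bool fold_mul_add_mul(int b, int c, int d, int *bd, int *cd)
{
    if (__builtin_mul_overflow(b, d, bd))
        return false;
    if (__builtin_mul_overflow(c, d, cd))
        return false;
    return true;
}
```

With b = 3, c = 5, d = 7 the helper yields bd = 21 and cd = 35. With the constants from the reduced testcase (b = d = 914237) the product does not fit in int, so under this model the fold would be skipped unless the type wraps.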


* [Bug rtl-optimization/53533] [10/11/12/13 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (42 preceding siblings ...)
  2022-06-16  1:29 ` cvs-commit at gcc dot gnu.org
@ 2022-06-16  2:31 ` crazylht at gmail dot com
  2022-06-28 10:30 ` jakub at gcc dot gnu.org
  2023-07-07 10:29 ` [Bug rtl-optimization/53533] [11/12/13/14 " rguenth at gcc dot gnu.org
  45 siblings, 0 replies; 47+ messages in thread
From: crazylht at gmail dot com @ 2022-06-16  2:31 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

--- Comment #50 from Hongtao.liu <crazylht at gmail dot com> ---
> 
> On the RTL level likely simplify-rtx (or the variants used by combine)
> only have limited support for vector operations.

The instruction sequence window (more than 20 shift instructions) is too big
for combine, so this is hard to fix in RTL.


* [Bug rtl-optimization/53533] [10/11/12/13 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (43 preceding siblings ...)
  2022-06-16  2:31 ` crazylht at gmail dot com
@ 2022-06-28 10:30 ` jakub at gcc dot gnu.org
  2023-07-07 10:29 ` [Bug rtl-optimization/53533] [11/12/13/14 " rguenth at gcc dot gnu.org
  45 siblings, 0 replies; 47+ messages in thread
From: jakub at gcc dot gnu.org @ 2022-06-28 10:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|10.4                        |10.5

--- Comment #51 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 10.4 is being released, retargeting bugs to GCC 10.5.


* [Bug rtl-optimization/53533] [11/12/13/14 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
  2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
                   ` (44 preceding siblings ...)
  2022-06-28 10:30 ` jakub at gcc dot gnu.org
@ 2023-07-07 10:29 ` rguenth at gcc dot gnu.org
  45 siblings, 0 replies; 47+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-07-07 10:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|10.5                        |11.5

--- Comment #52 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 10 branch is being closed.


Thread overview: 47+ messages
2012-05-31  0:55 [Bug middle-end/53533] New: [4.7 regression] loop unrolling as measured by Adobe's C++Benchmark is twice as slow versus 4.4-4.6 matt at use dot net
2012-05-31  0:58 ` [Bug middle-end/53533] " matt at use dot net
2012-05-31  9:59 ` rguenth at gcc dot gnu.org
2012-06-11 19:56 ` matt at use dot net
2012-06-11 19:57 ` matt at use dot net
2012-06-11 20:02 ` matt at use dot net
2012-06-12  9:54 ` [Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark rguenth at gcc dot gnu.org
2012-06-12 10:12 ` rguenth at gcc dot gnu.org
2012-06-12 10:27 ` rguenth at gcc dot gnu.org
2012-06-12 10:39 ` rguenth at gcc dot gnu.org
2012-06-12 11:57 ` rguenth at gcc dot gnu.org
2012-06-12 18:26 ` matt at use dot net
2012-06-12 18:55 ` rth at gcc dot gnu.org
2012-06-13  9:44 ` rguenth at gcc dot gnu.org
2012-06-14 14:39 ` rth at gcc dot gnu.org
2012-06-14 18:02 ` matt at use dot net
2012-06-14 18:39 ` rth at gcc dot gnu.org
2012-06-15  9:04 ` jakub at gcc dot gnu.org
2012-06-15 21:05 ` rth at gcc dot gnu.org
2012-08-10  9:43 ` rguenth at gcc dot gnu.org
2012-08-14 17:26 ` matt at use dot net
2012-08-20 23:53 ` matt at use dot net
2012-09-20 10:27 ` jakub at gcc dot gnu.org
2012-11-29 21:17 ` rth at gcc dot gnu.org
2012-12-03 15:27 ` rguenth at gcc dot gnu.org
2013-04-11  8:00 ` [Bug rtl-optimization/53533] [4.7/4.8/4.9 " rguenth at gcc dot gnu.org
2014-06-12 13:45 ` [Bug rtl-optimization/53533] [4.7/4.8/4.9/4.10 " rguenth at gcc dot gnu.org
2014-12-19 13:28 ` [Bug rtl-optimization/53533] [4.8/4.9/5 " jakub at gcc dot gnu.org
2015-05-03 13:00 ` [Bug rtl-optimization/53533] [4.8/4.9/5/6 " trippels at gcc dot gnu.org
2015-05-03 13:01 ` trippels at gcc dot gnu.org
2015-05-04 14:46 ` maltsevm at gmail dot com
2015-05-04 15:00 ` maltsevm at gmail dot com
2015-06-23  8:22 ` rguenth at gcc dot gnu.org
2015-06-26 19:58 ` [Bug rtl-optimization/53533] [4.9/5/6 " jakub at gcc dot gnu.org
2015-06-26 20:29 ` jakub at gcc dot gnu.org
2021-02-23 12:24 ` [Bug rtl-optimization/53533] [8/9/10/11 " rguenth at gcc dot gnu.org
2021-05-14  9:46 ` [Bug rtl-optimization/53533] [9/10/11/12 " jakub at gcc dot gnu.org
2021-06-01  8:05 ` rguenth at gcc dot gnu.org
2022-05-27  9:34 ` [Bug rtl-optimization/53533] [10/11/12/13 " rguenth at gcc dot gnu.org
2022-05-30  6:40 ` crazylht at gmail dot com
2022-05-30  8:57 ` rguenther at suse dot de
2022-05-30  9:10 ` crazylht at gmail dot com
2022-05-30  9:14 ` rguenther at suse dot de
2022-06-16  1:29 ` cvs-commit at gcc dot gnu.org
2022-06-16  2:31 ` crazylht at gmail dot com
2022-06-28 10:30 ` jakub at gcc dot gnu.org
2023-07-07 10:29 ` [Bug rtl-optimization/53533] [11/12/13/14 " rguenth at gcc dot gnu.org
