public inbox for gcc-bugs@sourceware.org
* [Bug middle-end/47298] New: -O3 destroys beautifully vectorized code obtained at -O2
@ 2011-01-14 20:53 Joost.VandeVondele at pci dot uzh.ch
  2011-01-14 21:02 ` [Bug middle-end/47298] " rguenth at gcc dot gnu.org
                   ` (11 more replies)
  0 siblings, 12 replies; 13+ messages in thread
From: Joost.VandeVondele at pci dot uzh.ch @ 2011-01-14 20:53 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47298

           Summary: -O3 destroys beautifully vectorized code obtained at
                    -O2
           Product: gcc
           Version: 4.6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: Joost.VandeVondele@pci.uzh.ch


current trunk generates really fast vectorized code for the following testcase
(a 12x12x12 matrix multiply, c=c+a*b, benchmarked with a,b,c in cache) as can
be seen from the assembly:

> cat compare.f90
   SUBROUTINE HARD_NN_12_12_12(C,A,B) 
      REAL(KIND=8), INTENT(INOUT) :: C(12,*)
      REAL(KIND=8), INTENT(IN)    :: B(12,*), A(12,*)
      INTEGER ::i,j,l
      DO j=1,12 ; DO i=1,12; DO l=1,12
         C(i,j)=C(i,j)+A(i,l)*B(l,j)
      ENDDO ; ENDDO ; ENDDO
   END SUBROUTINE HARD_NN_12_12_12

however, this only happens with:

gfortran-trunk -O2 -funroll-loops -ftree-vectorize -ffast-math -march=corei7
-msse4.2  compare.f90

while switching from -O2 to -O3 produces 'bad' code.

gfortran-trunk -O3 -funroll-loops -ftree-vectorize -ffast-math -march=corei7
-msse4.2  compare.f90

with the tester below:

-O2 runs in about 4.4s
-O3 runs in about 7.0s

> cat test_compare.f90 
      REAL(KIND=8), DIMENSION(12,12) :: A,B,C
      A=0 ; B=0 ; C=0
      DO I=1,10000000
         CALL HARD_NN_12_12_12(C,12,A,12,B,12)
      ENDDO
      END



* [Bug middle-end/47298] -O3 destroys beautifully vectorized code obtained at -O2
  2011-01-14 20:53 [Bug middle-end/47298] New: -O3 destroys beautifully vectorized code obtained at -O2 Joost.VandeVondele at pci dot uzh.ch
@ 2011-01-14 21:02 ` rguenth at gcc dot gnu.org
  2011-01-14 21:11 ` Joost.VandeVondele at pci dot uzh.ch
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2011-01-14 21:02 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47298

--- Comment #1 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-01-14 20:43:15 UTC ---
It's faster for me with -O3 (Athlon64, using -march=native).



* [Bug middle-end/47298] -O3 destroys beautifully vectorized code obtained at -O2
  2011-01-14 20:53 [Bug middle-end/47298] New: -O3 destroys beautifully vectorized code obtained at -O2 Joost.VandeVondele at pci dot uzh.ch
  2011-01-14 21:02 ` [Bug middle-end/47298] " rguenth at gcc dot gnu.org
@ 2011-01-14 21:11 ` Joost.VandeVondele at pci dot uzh.ch
  2011-01-14 21:27 ` Joost.VandeVondele at pci dot uzh.ch
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Joost.VandeVondele at pci dot uzh.ch @ 2011-01-14 21:11 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47298

--- Comment #2 from Joost VandeVondele <Joost.VandeVondele at pci dot uzh.ch> 2011-01-14 20:52:54 UTC ---
(In reply to comment #1)
> It's faster for me with -O3 (Athlon64, using -march=native).

well not on 
model name      : Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz
stepping        : 5

I have 8Gflops with -O2 and somewhat more than 4 with -O3

BTW, the proper test program is
> cat test_compare.f90 
      REAL(KIND=8), DIMENSION(12,12) :: A,B,C
      A=0 ; B=0 ; C=0
      DO I=1,10000000
         CALL HARD_NN_12_12_12(C,A,B)
      ENDDO
      END
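
The timings and the Gflops figures can be related with a little arithmetic. The sketch below assumes one multiply and one add per innermost iteration, i.e. 2*12**3 flops per call, and 10**7 calls as in the tester; the module and function names are made up for illustration.

```fortran
module flop_rate
   implicit none
contains
   ! GFLOP/s for `ncalls` calls of an n x n x n multiply-add kernel,
   ! counting one multiply and one add per innermost iteration.
   pure function gflops(n, ncalls, seconds) result(rate)
      integer, intent(in)      :: n, ncalls
      real(kind=8), intent(in) :: seconds
      real(kind=8)             :: rate
      rate = 2.0d0 * real(n, 8)**3 * real(ncalls, 8) / seconds / 1.0d9
   end function gflops
end module flop_rate
```

Under those assumptions, gflops(12, 10000000, 4.4d0) comes out at roughly 7.9 and gflops(12, 10000000, 7.0d0) at roughly 4.9, which is in line with the "8Gflops" and "somewhat more than 4" quoted above if the 4.4s and 7.0s timings were taken on the same machine.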



* [Bug middle-end/47298] -O3 destroys beautifully vectorized code obtained at -O2
  2011-01-14 20:53 [Bug middle-end/47298] New: -O3 destroys beautifully vectorized code obtained at -O2 Joost.VandeVondele at pci dot uzh.ch
  2011-01-14 21:02 ` [Bug middle-end/47298] " rguenth at gcc dot gnu.org
  2011-01-14 21:11 ` Joost.VandeVondele at pci dot uzh.ch
@ 2011-01-14 21:27 ` Joost.VandeVondele at pci dot uzh.ch
  2012-06-29 14:44 ` Joost.VandeVondele at mat dot ethz.ch
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Joost.VandeVondele at pci dot uzh.ch @ 2011-01-14 21:27 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47298

--- Comment #3 from Joost VandeVondele <Joost.VandeVondele at pci dot uzh.ch> 2011-01-14 21:02:04 UTC ---
Actually, on AMD as well I get 9.4s at -O2 and 11.8s at -O3

model           : 9
model name      : AMD Opteron(tm) Processor 6176 SE
stepping        : 1



* [Bug middle-end/47298] -O3 destroys beautifully vectorized code obtained at -O2
  2011-01-14 20:53 [Bug middle-end/47298] New: -O3 destroys beautifully vectorized code obtained at -O2 Joost.VandeVondele at pci dot uzh.ch
                   ` (2 preceding siblings ...)
  2011-01-14 21:27 ` Joost.VandeVondele at pci dot uzh.ch
@ 2012-06-29 14:44 ` Joost.VandeVondele at mat dot ethz.ch
  2012-06-29 15:01 ` rguenth at gcc dot gnu.org
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Joost.VandeVondele at mat dot ethz.ch @ 2012-06-29 14:44 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47298

Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2012-06-29

--- Comment #4 from Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> 2012-06-29 14:44:05 UTC ---
on 4.8 this still is not handled optimally. I get

4.3s for: gfortran -O2 -funroll-loops -ftree-vectorize -ffast-math
-march=native 
6.7s for: gfortran -O3 -funroll-loops -ftree-vectorize -ffast-math
-march=native

so more than 50% slowdown going from -O2 to -O3

on

-march=corei7 -mcx16 -msahf -mno-movbe -mno-aes -mno-pclmul -mpopcnt -mno-abm
-mno-lwp -mno-fma -mno-fma4 -mno-xop -mno-bmi -mno-bmi2 -mno-tbm -mno-avx
-mno-avx2 -msse4.2 -msse4.1 -mno-lzcnt -mno-rtm -mno-hle -mno-rdrnd -mno-f16c
-mno-fsgsbase --param l1-cache-size=32 --param l1-cache-line-size=64



* [Bug middle-end/47298] -O3 destroys beautifully vectorized code obtained at -O2
  2011-01-14 20:53 [Bug middle-end/47298] New: -O3 destroys beautifully vectorized code obtained at -O2 Joost.VandeVondele at pci dot uzh.ch
                   ` (3 preceding siblings ...)
  2012-06-29 14:44 ` Joost.VandeVondele at mat dot ethz.ch
@ 2012-06-29 15:01 ` rguenth at gcc dot gnu.org
  2012-07-05  7:36 ` ebotcazou at gcc dot gnu.org
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2012-06-29 15:01 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47298

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
     Ever Confirmed|0                           |1

--- Comment #5 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-06-29 15:01:28 UTC ---
The issue is we completely unroll the innermost loop at -O3 -funroll-loops.
We then vectorize the outer loop but have to peel for alignment (and are not
good at seeing we run at most once there).
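
As a source-level picture of that situation (a hand-written sketch, not compiler output), this is roughly what the testcase looks like once the l loop has been completely unrolled. The module and subroutine names are made up for illustration.

```fortran
module unrolled_sketch
   implicit none
contains
   ! The testcase with the l loop unrolled by hand: the i loop is now
   ! the innermost loop, and it walks C and A with unit stride.
   subroutine hard_nn_unrolled(C, A, B)
      real(kind=8), intent(inout) :: C(12,*)
      real(kind=8), intent(in)    :: B(12,*), A(12,*)
      integer :: i, j
      do j = 1, 12
         do i = 1, 12
            C(i,j) = C(i,j) &
                   + A(i,1)*B(1,j)   + A(i,2)*B(2,j)   + A(i,3)*B(3,j)   &
                   + A(i,4)*B(4,j)   + A(i,5)*B(5,j)   + A(i,6)*B(6,j)   &
                   + A(i,7)*B(7,j)   + A(i,8)*B(8,j)   + A(i,9)*B(9,j)   &
                   + A(i,10)*B(10,j) + A(i,11)*B(11,j) + A(i,12)*B(12,j)
         end do
      end do
   end subroutine hard_nn_unrolled
end module unrolled_sketch
```

One possible reading of "run at most once": C and A are dummy arguments whose alignment is unknown at compile time, and with 8-byte elements and 16-byte SSE vectors a peel loop over i needs at most one iteration to reach an aligned address.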



* [Bug middle-end/47298] -O3 destroys beautifully vectorized code obtained at -O2
  2011-01-14 20:53 [Bug middle-end/47298] New: -O3 destroys beautifully vectorized code obtained at -O2 Joost.VandeVondele at pci dot uzh.ch
                   ` (4 preceding siblings ...)
  2012-06-29 15:01 ` rguenth at gcc dot gnu.org
@ 2012-07-05  7:36 ` ebotcazou at gcc dot gnu.org
  2012-07-05  8:38 ` rguenth at gcc dot gnu.org
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: ebotcazou at gcc dot gnu.org @ 2012-07-05  7:36 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47298

Eric Botcazou <ebotcazou at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ebotcazou at gcc dot
                   |                            |gnu.org

--- Comment #6 from Eric Botcazou <ebotcazou at gcc dot gnu.org> 2012-07-05 07:36:15 UTC ---
> The issue is we completely unroll the innermost loop at -O3 -funroll-loops.
> We then vectorize the outer loop but have to peel for alignment (and are not
> good at seeing we run at most once there).

It's not cunroll (-funroll-loops), it's cunrolli which can have adverse effects
on vectorization and cannot be disabled.  We run into this in Ada as well.



* [Bug middle-end/47298] -O3 destroys beautifully vectorized code obtained at -O2
  2011-01-14 20:53 [Bug middle-end/47298] New: -O3 destroys beautifully vectorized code obtained at -O2 Joost.VandeVondele at pci dot uzh.ch
                   ` (5 preceding siblings ...)
  2012-07-05  7:36 ` ebotcazou at gcc dot gnu.org
@ 2012-07-05  8:38 ` rguenth at gcc dot gnu.org
  2012-07-05  8:48 ` ebotcazou at gcc dot gnu.org
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2012-07-05  8:38 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47298

--- Comment #7 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-07-05 08:38:05 UTC ---
It's a pass ordering issue, cunrolli also can tremendously help vectorization
because it enables vectorization of the loop that is then the innermost loop
after unrolling.  It also helps exposing redundancies as you can trivially
see in SPEC CPU 2006 calculix (gfortran.dg/reassoc_4.f).



* [Bug middle-end/47298] -O3 destroys beautifully vectorized code obtained at -O2
  2011-01-14 20:53 [Bug middle-end/47298] New: -O3 destroys beautifully vectorized code obtained at -O2 Joost.VandeVondele at pci dot uzh.ch
                   ` (6 preceding siblings ...)
  2012-07-05  8:38 ` rguenth at gcc dot gnu.org
@ 2012-07-05  8:48 ` ebotcazou at gcc dot gnu.org
  2012-07-05 10:10 ` rguenth at gcc dot gnu.org
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: ebotcazou at gcc dot gnu.org @ 2012-07-05  8:48 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47298

--- Comment #8 from Eric Botcazou <ebotcazou at gcc dot gnu.org> 2012-07-05 08:48:24 UTC ---
> It's a pass ordering issue, cunrolli also can tremendously help vectorization
> because it enables vectorization of the loop that is then the innermost loop
> after unrolling.  It also helps exposing redundancies as you can trivially
> see in SPEC CPU 2006 calculix (gfortran.dg/reassoc_4.f).

Sure, no disagreement here.  But we have cases where the outer loop is
trivially not vectorizable because of CFG constructs and cunrolli kills the
vectorization for the 32 innermost loops...

Possible stopgap measures are a switch to disable cunrolli or a "no vectorize"
pragma on the outer loop to thwart it.
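
For illustration only: newer gfortran releases document `!GCC$ NOVECTOR` and `!GCC$ UNROLL n` directives, which must be placed immediately before a DO loop. The sketch below applies them to the testcase from this report; the module and subroutine names are made up, and whether the directives actually change what cunrolli and the vectorizer do here is an untested assumption.

```fortran
module directive_sketch
   implicit none
contains
   subroutine hard_nn_directives(C, A, B)
      real(kind=8), intent(inout) :: C(12,*)
      real(kind=8), intent(in)    :: B(12,*), A(12,*)
      integer :: i, j, l
      ! Ask the vectorizer to leave the outer loop alone.
!GCC$ novector
      do j = 1, 12
         do i = 1, 12
            ! Per the gfortran manual, 0 and 1 block unrolling of the
            ! loop that follows.
!GCC$ unroll 0
            do l = 1, 12
               C(i,j) = C(i,j) + A(i,l)*B(l,j)
            end do
         end do
      end do
   end subroutine hard_nn_directives
end module directive_sketch
```

Compilers that do not know these directives should treat them as ordinary comments, so the kernel computes the same result either way.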



* [Bug middle-end/47298] -O3 destroys beautifully vectorized code obtained at -O2
  2011-01-14 20:53 [Bug middle-end/47298] New: -O3 destroys beautifully vectorized code obtained at -O2 Joost.VandeVondele at pci dot uzh.ch
                   ` (7 preceding siblings ...)
  2012-07-05  8:48 ` ebotcazou at gcc dot gnu.org
@ 2012-07-05 10:10 ` rguenth at gcc dot gnu.org
  2012-07-05 10:12 ` rguenth at gcc dot gnu.org
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2012-07-05 10:10 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47298

--- Comment #9 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-07-05 10:10:28 UTC ---
I have a few patches that try to estimate CSE opportunities exposed by
complete unrolling.  In this case the CSE opportunity is the reduction
into C(i,j) (possibly also detected by store motion later).
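
Written out by hand, the redundancy in question looks roughly like this: the twelve loads and stores of C(i,j) collapse into one load and one store around a scalar accumulator. This is a source-level sketch with made-up names, not a description of what the patches do internally.

```fortran
module reduction_sketch
   implicit none
contains
   ! Same kernel, with the reduction into C(i,j) carried in a scalar so
   ! that C(i,j) is loaded once and stored once per (i,j) pair.
   subroutine hard_nn_acc(C, A, B)
      real(kind=8), intent(inout) :: C(12,*)
      real(kind=8), intent(in)    :: B(12,*), A(12,*)
      real(kind=8) :: acc
      integer :: i, j, l
      do j = 1, 12
         do i = 1, 12
            acc = C(i,j)
            do l = 1, 12
               acc = acc + A(i,l)*B(l,j)
            end do
            C(i,j) = acc
         end do
      end do
   end subroutine hard_nn_acc
end module reduction_sketch
```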

Adding a patch to enable disabling of cunrolli (and cunroll - which you
also cannot disable) would be fine, but we should at least keep early
"unrolling" of loops that roll only once.

Note that we should still try to fix

"(and are not good at seeing we run at most once there)"

so we avoid messing up things here.  In theory the vectorizer should be
fully capable of vectorizing even the unrolled loop (in this particular
case) via SLP or basic-block vectorization.
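
To make the SLP idea concrete, here is a hand-written sketch of the grouping such a vectorizer would be looking for: two adjacent rows i and i+1 updated together, with B(l,j) broadcast to both. The two-element sections assume 16-byte vectors of 8-byte reals, and the names are made up for illustration.

```fortran
module slp_sketch
   implicit none
contains
   ! Same kernel, processing rows i and i+1 as one two-element group.
   ! Each element still accumulates over l in ascending order.
   subroutine hard_nn_pairs(C, A, B)
      real(kind=8), intent(inout) :: C(12,*)
      real(kind=8), intent(in)    :: B(12,*), A(12,*)
      integer :: i, j, l
      do j = 1, 12
         do i = 1, 12, 2
            do l = 1, 12
               C(i:i+1,j) = C(i:i+1,j) + A(i:i+1,l)*B(l,j)
            end do
         end do
      end do
   end subroutine hard_nn_pairs
end module slp_sketch
```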



* [Bug middle-end/47298] -O3 destroys beautifully vectorized code obtained at -O2
  2011-01-14 20:53 [Bug middle-end/47298] New: -O3 destroys beautifully vectorized code obtained at -O2 Joost.VandeVondele at pci dot uzh.ch
                   ` (8 preceding siblings ...)
  2012-07-05 10:10 ` rguenth at gcc dot gnu.org
@ 2012-07-05 10:12 ` rguenth at gcc dot gnu.org
  2012-07-05 10:30 ` ebotcazou at gcc dot gnu.org
  2013-03-27 13:02 ` rguenth at gcc dot gnu.org
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2012-07-05 10:12 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47298

--- Comment #10 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-07-05 10:11:55 UTC ---
Oh, and you can disable cunrolli already via -fdisable-tree-cunrolli.



* [Bug middle-end/47298] -O3 destroys beautifully vectorized code obtained at -O2
  2011-01-14 20:53 [Bug middle-end/47298] New: -O3 destroys beautifully vectorized code obtained at -O2 Joost.VandeVondele at pci dot uzh.ch
                   ` (9 preceding siblings ...)
  2012-07-05 10:12 ` rguenth at gcc dot gnu.org
@ 2012-07-05 10:30 ` ebotcazou at gcc dot gnu.org
  2013-03-27 13:02 ` rguenth at gcc dot gnu.org
  11 siblings, 0 replies; 13+ messages in thread
From: ebotcazou at gcc dot gnu.org @ 2012-07-05 10:30 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47298

--- Comment #11 from Eric Botcazou <ebotcazou at gcc dot gnu.org> 2012-07-05 10:30:09 UTC ---
> Oh, and you can disable cunrolli already via -fdisable-tree-cunrolli.

Indeed, I always forget that we have it in 4.7 and above.



* [Bug middle-end/47298] -O3 destroys beautifully vectorized code obtained at -O2
  2011-01-14 20:53 [Bug middle-end/47298] New: -O3 destroys beautifully vectorized code obtained at -O2 Joost.VandeVondele at pci dot uzh.ch
                   ` (10 preceding siblings ...)
  2012-07-05 10:30 ` ebotcazou at gcc dot gnu.org
@ 2013-03-27 13:02 ` rguenth at gcc dot gnu.org
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2013-03-27 13:02 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47298

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

--- Comment #12 from Richard Biener <rguenth at gcc dot gnu.org> 2013-03-27 13:02:05 UTC ---
On trunk we now vectorize the loop and then unroll it from cunroll.

4.6 -O2 -funroll-loops -ftree-vectorize -ffast-math: 10.7s
4.6 -O3 -funroll-loops -ftree-vectorize -ffast-math: 8.3s
4.7 -O2 -funroll-loops -ftree-vectorize -ffast-math: 7.4s
4.7 -O3 -funroll-loops -ftree-vectorize -ffast-math: 8.5s
4.8 -O2 -funroll-loops -ftree-vectorize -ffast-math: 6.1s
4.8 -O3 -funroll-loops -ftree-vectorize -ffast-math: 6.5s

with -march=native added (Core i5)

4.8 -O2 ... -march=native: 3.9s
4.8 -O3 ... -march=native: 4s

Apart from very minor scheduling differences I see no difference in
code generation on trunk -O2 vs. -O3.

I'd say "fixed" without more details.


