[Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5

* [Bug fortran/42108]  New: Performance drop from 4.3 to 4.4/4.5
@ 2009-11-19 16:01 sfilippone at uniroma2 dot it
  2009-11-19 16:01 ` [Bug fortran/42108] " sfilippone at uniroma2 dot it
                   ` (51 more replies)
  0 siblings, 52 replies; 53+ messages in thread
From: sfilippone at uniroma2 dot it @ 2009-11-19 16:01 UTC (permalink / raw)
  To: gcc-bugs

With the attached sample code I get a substantial performance drop from 4.3.1
to either 4.4.1 or 4.5.0, with the same compiler options on the same machine.
To reproduce, feed a size to the program (40000 in the runs below) and time
the executable.

[sfilippo@donald fgp_fmm_20091112]$ gfortran -v  
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../gcc-4.3.1/configure --prefix=/usr/local/gcc43 --with-mpfr=/usr/local/mpfr --with-gmp=/usr/local/gmp
Thread model: posix
gcc version 4.3.1 (GCC) 
[sfilippo@donald fgp_fmm_20091112]$ gfortran -O3 -o try_eval eval.f90
[sfilippo@donald fgp_fmm_20091112]$ time ./try_eval <<EOF
40000
EOF

real    0m10.871s
user    0m10.825s
sys     0m0.011s
[sfilippo@donald fgp_fmm_20091112]$ module unload gnu43
[sfilippo@donald fgp_fmm_20091112]$ module load gnu45 
        gnu45 - loads the GNU 4.5.0-pre compilers suite

        Version 1.0

[sfilippo@donald fgp_fmm_20091112]$ gfortran -v 
Using built-in specs.
COLLECT_GCC=gfortran
COLLECT_LTO_WRAPPER=/usr/local/gnu45/libexec/gcc/x86_64-unknown-linux-gnu/4.5.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ../gcc/configure --prefix=/usr/local/gnu45 --enable-languages=c,c++,fortran : (reconfigured) ../gcc/configure --prefix=/usr/local/gnu45 --enable-languages=c,c++,fortran : (reconfigured) ../gcc/configure --prefix=/usr/local/gnu45 --enable-languages=c,c++,fortran,lto --no-create --no-recursion : (reconfigured) ../gcc/configure --prefix=/usr/local/gnu45 --enable-languages=c,c++,fortran,lto --no-create --no-recursion
Thread model: posix
gcc version 4.5.0 20091119 (experimental) (GCC) 
[sfilippo@donald fgp_fmm_20091112]$ gfortran -O3 -o try_eval eval.f90
[sfilippo@donald fgp_fmm_20091112]$ time ./try_eval <<EOF
40000
EOF

real    0m23.935s
user    0m23.862s
sys     0m0.011s
[sfilippo@donald fgp_fmm_20091112]$ cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 2
model name      : AMD Athlon(tm) 7750 Dual-Core Processor
stepping        : 3
cpu MHz         : 2700.000
cache size      : 512 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat
pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm
3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid pni monitor cx16
lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch
osvw ibs
bogomips        : 5424.74
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate


-- 
           Summary: Performance drop from 4.3 to 4.4/4.5
           Product: gcc
           Version: 4.5.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: fortran
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: sfilippone at uniroma2 dot it
 GCC build triplet: x86_64-unknown-linux-gnu
  GCC host triplet: x86_64-unknown-linux-gnu
GCC target triplet: x86_64-unknown-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108



* [Bug fortran/42108] Performance drop from 4.3 to 4.4/4.5
From: sfilippone at uniroma2 dot it @ 2009-11-19 16:01 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #1 from sfilippone at uniroma2 dot it  2009-11-19 16:01 -------
Created an attachment (id=19054)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=19054&action=view)
test case



* [Bug tree-optimization/42108] [4.4/4.5 Regression] Vectorizer cannot deal with PAREN_EXPR gracefully, 50% performance regression
From: rguenth at gcc dot gnu dot org @ 2009-11-19 16:50 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #2 from rguenth at gcc dot gnu dot org  2009-11-19 16:49 -------
-ftree-vectorizer-verbose=2 tells you:

eval.f90:35: note: not vectorized: relevant stmt not supported: D.1684_73 = ((D.1683_72));

eval.f90:32: note: not vectorized: relevant stmt not supported: D.1684_58 = ((D.1683_57));

PAREN_EXPRs are new in 4.4 and I believe they cannot be turned off
right now.

The loops are

  do i=1,nnd
    x(i) = 1.d0 + (1.d0*i)/nnd
  end do
  do i=1,n
    foo4(i) = 1.d0 + (1.d0*i)/n
  end do

where the vectorizer doesn't know how to ensure evaluation order is
preserved when trying to vectorize (1.d0*i)/n.  Writing them as
1.d0*i/n vectorizes the function.
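
The reason an explicit grouping has to be honoured is that floating-point
arithmetic is not associative, so regrouping can change the rounded result.
A minimal illustration, written in Python purely as executable pseudocode
(the values are arbitrary and are not taken from eval.f90):

```python
# Floating-point addition is not associative: the grouping written by the
# programmer can change the rounded result. That is why a compiler must not
# regroup across explicit parentheses unless it has been told it may.
left_grouped = (0.1 + 0.2) + 0.3
right_grouped = 0.1 + (0.2 + 0.3)

# The two groupings differ in the last bit of the result.
print(left_grouped == right_grouped)
print(left_grouped, right_grouped)
```

The same concern applies to any reordering of a parenthesized
subexpression, which is what the vectorizer would have to rule out here.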

Still the performance is lower by a factor of two compared to 4.3
(even with -ffast-math).

Probably the bug should be split.


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |irar at il dot ibm dot com,
                   |                            |rguenth at gcc dot gnu dot
                   |                            |org
           Severity|normal                      |enhancement
             Status|UNCONFIRMED                 |NEW
          Component|fortran                     |tree-optimization
     Ever Confirmed|0                           |1
           Keywords|                            |missed-optimization
   Last reconfirmed|0000-00-00 00:00:00         |2009-11-19 16:49:51
               date|                            |
            Summary|Performance drop from 4.3 to|[4.4/4.5 Regression]
                   |4.4/4.5                     |Vectorizer cannot deal with
                   |                            |PAREN_EXPR gracefully, 50%
                   |                            |performance regression
   Target Milestone|---                         |4.4.3



* [Bug tree-optimization/42108] [4.4/4.5 Regression] Vectorizer cannot deal with PAREN_EXPR gracefully, 50% performance regression
From: sfilippone at uniroma2 dot it @ 2009-11-19 17:17 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #3 from sfilippone at uniroma2 dot it  2009-11-19 17:17 -------
(In reply to comment #2)
> -ftree-vectorizer-verbose=2 tells you:
> 
> eval.f90:35: note: not vectorized: relevant stmt not supported: D.1684_73 =
> ((D.1683_72));
> 
> eval.f90:32: note: not vectorized: relevant stmt not supported: D.1684_58 =
> ((D.1683_57));
> 
> PAREN_EXPRs are new in 4.4 and I believe they cannot be turned off
> right now.
> 
> The loops are
> 
>   do i=1,nnd
>     x(i) = 1.d0 + (1.d0*i)/nnd
>   end do
>   do i=1,n
>     foo4(i) = 1.d0 + (1.d0*i)/n
>   end do
> 
> where the vectorizer doesn't know how to ensure evaluation order is
> preserved when trying to vectorize (1.d0*i)/n.  Writing them as
> 1.d0*i/n vectorizes the function.
> 
> Still the performance is lower by a factor of two compared to 4.3
> (even with -ffast-math).
> 
> Probably the bug should be split.
> 

Well, the performance drop I am looking at is in the subroutine. The
initialization loops are (to me) irrelevant: I had posted a previous version
to the mailing list where the initialization was done with random_number, and
the situation was the same.
A run with profiling shows that more than 99% of the time is spent in eval_



* [Bug tree-optimization/42108] [4.4/4.5 Regression] Vectorizer cannot deal with PAREN_EXPR gracefully, 50% performance regression
From: rguenther at suse dot de @ 2009-11-19 17:30 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #4 from rguenther at suse dot de  2009-11-19 17:30 -------
Subject: Re:  [4.4/4.5 Regression] Vectorizer
 cannot deal with PAREN_EXPR gracefully, 50% performance regression

On Thu, 19 Nov 2009, sfilippone at uniroma2 dot it wrote:

> ------- Comment #3 from sfilippone at uniroma2 dot it  2009-11-19 17:17 -------
> (In reply to comment #2)
> > -ftree-vectorizer-verbose=2 tells you:
> > 
> > eval.f90:35: note: not vectorized: relevant stmt not supported: D.1684_73 =
> > ((D.1683_72));
> > 
> > eval.f90:32: note: not vectorized: relevant stmt not supported: D.1684_58 =
> > ((D.1683_57));
> > 
> > PAREN_EXPRs are new in 4.4 and I believe they cannot be turned off
> > right now.
> > 
> > The loops are
> > 
> >   do i=1,nnd
> >     x(i) = 1.d0 + (1.d0*i)/nnd
> >   end do
> >   do i=1,n
> >     foo4(i) = 1.d0 + (1.d0*i)/n
> >   end do
> > 
> > where the vectorizer doesn't know how to ensure evaluation order is
> > preserved when trying to vectorize (1.d0*i)/n.  Writing them as
> > 1.d0*i/n vectorizes the function.
> > 
> > Still the performance is lower by a factor of two compared to 4.3
> > (even with -ffast-math).
> > 
> > Probably the bug should be split.
> > 
> 
> Well, the performance drop I am looking at is  in the subroutine. The
> initialization loops are (to me)  irrelevant, I had posted a previous version
> to the mailing list where the initialization was done with random_number and
> the situation was the same. 
> A run with profiling shows that more than 99% of the time is spent in eval_

Heh, with -fwhole-program GCC optimizes the test away and I get 0.0s
runtime.

Well, within eval there's nothing really obvious to me.  The
innermost loop is exactly the same:

.L39:
        movsd   (%r15), %xmm0
        addq    %rsi, %r15
        subsd   (%rdx), %xmm0
        addq    %rsi, %rdx
        subl    $1, %eax
        mulsd   %xmm0, %xmm0
        addsd   %xmm0, %xmm1
        jne     .L39

The next outer loop has somewhat fewer loads in 4.5 but also different
induction variables.  So - nothing obvious to me.

Richard.



* [Bug tree-optimization/42108] [4.4/4.5 Regression] Vectorizer cannot deal with PAREN_EXPR gracefully, 50% performance regression
From: sfilippone at uniroma2 dot it @ 2009-11-19 19:42 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #5 from sfilippone at uniroma2 dot it  2009-11-19 19:42 -------
(In reply to comment #4)
> Subject: Re:  [4.4/4.5 Regression] Vectorizer
>  cannot deal with PAREN_EXPR gracefully, 50% performance regression
> 
> 
> Heh, with -fwhole-program GCC optimizes the test away and I get 0.0s
> runtime.
> 
Not too surprising: after all, this was extracted to make the test case
manageable; the original code is not pointless. :-)

> Well, within eval there's nothing really obvious to me.  The
> innermost loop is exactly the same:
> 
> .L39:
>         movsd   (%r15), %xmm0
>         addq    %rsi, %r15
>         subsd   (%rdx), %xmm0
>         addq    %rsi, %rdx
>         subl    $1, %eax
>         mulsd   %xmm0, %xmm0
>         addsd   %xmm0, %xmm1
>         jne     .L39
> 
> The next outer loop has somewhat fewer loads in 4.5 but also different
> induction variables.  So - nothing obvious to me.
> 
Exactly, it's quite surprising to see a difference with such a simple loop.
However, the size of the generated assembly is different, so there must be
something...

> Richard.
> 



* [Bug tree-optimization/42108] [4.4/4.5 Regression] Vectorizer cannot deal with PAREN_EXPR gracefully, 50% performance regression
From: toon at moene dot org @ 2009-11-19 19:53 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #6 from toon at moene dot org  2009-11-19 19:53 -------
Richard Guenther wrote:

> Well, within eval there's nothing really obvious to me.  The
> innermost loop is exactly the same:

But it is a very inefficient way of vectorizing, because the inner loop's body
is either executed twice or three times per outer loop (depending on the value
of i).



* [Bug tree-optimization/42108] [4.4/4.5 Regression] Vectorizer cannot deal with PAREN_EXPR gracefully, 50% performance regression
From: anlauf at gmx dot de @ 2009-11-19 22:33 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #7 from anlauf at gmx dot de  2009-11-19 22:33 -------
I tried the code on an x86 Core2 system (32 bit mode).

gfortran 4.3, 4.5:
22.74user 0.03system 0:22.82elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k

Intel's ifort 11.1 is only ~ 5% faster, but:

SunStudio 12.1: (sunf95 -fast)
11.50user 0.00system 0:11.51elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k

Wow, that gives a 100% improvement potential!

(I added a
  print *, foo3(n)
after the call to eval to make sure that nothing gets optimized away.)



* [Bug tree-optimization/42108] [4.4/4.5 Regression] Vectorizer cannot deal with PAREN_EXPR gracefully, 50% performance regression
From: sfilippone at uniroma2 dot it @ 2009-11-20  8:32 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #8 from sfilippone at uniroma2 dot it  2009-11-20 08:32 -------
(In reply to comment #6)
> Richard Guenther wrote:
> 
> > Well, within eval there's nothing really obvious to me.  The
> > innermost loop is exactly the same:
> 
> But it is a very inefficient way of vectorizing, because the inner loop's body
> is either executed twice or three times per outer loop (depending on the value
> of i).
> 
While I agree that I would code this differently, there is still the change
in the compiler's behaviour, although comment 7 indicates that it probably
occurs only in 64-bit mode.



* [Bug tree-optimization/42108] [4.4/4.5 Regression] Vectorizer cannot deal with PAREN_EXPR gracefully, 50% performance regression
From: dominiq at lps dot ens dot fr @ 2009-11-20 13:45 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #9 from dominiq at lps dot ens dot fr  2009-11-20 13:45 -------
I am rather confused by some comments:

(1) Although I am not fluent with x86 assembly, I am pretty sure that no code
in eval is vectorized (assembly taken from this pr or from the original post
http://gcc.gnu.org/ml/fortran/2009-11/msg00163.html).

(2) If I am not mistaken, the k loop always handles 3 elements for i, i+n, and
i+2*n.

(3) On a core2duo 2.1 GHz, I only see small changes in the timing from 4.3.4
to trunk, -O1 to -O3, and 32 or 64 bit mode.

Now if I do the following change:

--- pr42108_1_db.f90    2009-11-20 14:14:05.000000000 +0100
+++ pr42108_1_db_1.f90  2009-11-20 14:15:24.000000000 +0100
@@ -7,12 +7,10 @@ subroutine  eval(foo1,foo2,foo3,foo4,x,n
   do i=2,n
     foo3(i)=foo2*foo4(i)
     do  j=1,i-1
-      temp=0.0d0
-      jmini=j-i
-      do  k=i,nnd,n
-        temp=temp+(x(k)-x(k+jmini))**2
-      end do
-      temp = sqrt(temp+foo1)
+      temp = sqrt( (x(i) - x(j))**2 &
+                  +(x(i+n) - x(j+n))**2 &
+                  +(x(i+2*n)-x(j+2*n))**2 &
+                  +foo1)
       foo3(i)=foo3(i)+temp*foo4(j)
       foo3(j)=foo3(j)+temp*foo4(i)
     end do

I go from 9.2s to 5.5s for n=20000. So the k loop is not automatically unrolled
even with -funroll-loops.
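
A quick way to convince oneself that the patch above is a pure
restructuring is to compare the strided k loop with the three-term
expression on sample data. The sketch below does so in Python (executable
pseudocode only; the function names are made up for illustration, and it
assumes nnd = 3*n, which is what makes the k loop run exactly three times):

```python
def strided_sum(x, i, j, n, nnd):
    """Original form: DO k = i, nnd, n (1-based, inclusive upper bound)."""
    jmini = j - i
    temp = 0.0
    for k in range(i, nnd + 1, n):
        temp += (x[k - 1] - x[k + jmini - 1]) ** 2
    return temp


def unrolled_sum(x, i, j, n):
    """Hand-unrolled form from the patch; valid only when nnd == 3*n."""
    return ((x[i - 1] - x[j - 1]) ** 2
            + (x[i + n - 1] - x[j + n - 1]) ** 2
            + (x[i + 2 * n - 1] - x[j + 2 * n - 1]) ** 2)


n = 5
nnd = 3 * n  # assumption: three blocks of n elements in x
x = [1.0 + k / nnd for k in range(1, nnd + 1)]  # same shape as the init loop

# Both forms add the three squared differences in the same order, so the
# results agree exactly, not just to rounding.
for i in range(2, n + 1):
    for j in range(1, i):
        assert strided_sum(x, i, j, n, nnd) == unrolled_sum(x, i, j, n)
```

If the factor differs from 3 in a real application, the unrolled form would
need a matching number of terms.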



* [Bug tree-optimization/42108] [4.4/4.5 Regression] Vectorizer cannot deal with PAREN_EXPR gracefully, 50% performance regression
From: sfilippone at uniroma2 dot it @ 2009-11-20 14:04 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #10 from sfilippone at uniroma2 dot it  2009-11-20 14:03 -------
(In reply to comment #9)
> I am rather confused by some comments:
> 
> (1) Although I am not fluent with x86 assembly, I am pretty sure that no code
> in eval is vectorized (assembly taken from this pr or from the original post
> http://gcc.gnu.org/ml/fortran/2009-11/msg00163.html).
> 
> (2) If I am not mistaken, the k loop always handles 3 elements for i, i+n, and
> i+2*n.
> 
Yup, in the test case; in the original application the factor might be
different from 3. And yes, it may be better to declare the array as 2D.



* [Bug tree-optimization/42108] [4.4/4.5 Regression] Vectorizer cannot deal with PAREN_EXPR gracefully, 50% performance regression
From: sfilippone at uniroma2 dot it @ 2009-11-20 14:12 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #11 from sfilippone at uniroma2 dot it  2009-11-20 14:12 -------
(In reply to comment #10)
Again, I am not asking for help in writing better code (I think I know how to
handle this, and I will convince my colleague); I just thought it was worth
mentioning that the optimizer has apparently done a worse job lately (at least
on the platform I am using).



* [Bug tree-optimization/42108] [4.4/4.5 Regression] Vectorizer cannot deal with PAREN_EXPR gracefully, 50% performance regression
From: rguenth at gcc dot gnu dot org @ 2009-11-20 14:14 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #12 from rguenth at gcc dot gnu dot org  2009-11-20 14:13 -------
The loop is not unrolled because the frontend presents us with very funny
obfuscated code:

      do  k=i,nnd,n
        temp=temp+(x(k)-x(k+jmini))**2
      end do

gets translated to

{
  character(kind=4) countm1.6;
  integer(kind=4) D.1551;
  integer(kind=4) D.1550;
  integer(kind=4) D.1549;

  D.1549 = i;
  D.1550 = *nnd;
  D.1551 = *n;
  k = D.1549;
  if (D.1551 > 0)
    {
      if (D.1550 < D.1549) goto L.6;, countm1.6 = (character(kind=4)) (D.1550 - D.1549) / (character(kind=4)) D.1551;;
    }
  else
    {
      if (D.1550 > D.1549) goto L.6;, countm1.6 = (character(kind=4)) (D.1549 - D.1550) / (character(kind=4)) -D.1551;;
    }
  while (1)
    {
        {
          real(kind=8) D.1556;
          real(kind=8) D.1555;

          D.1555 = (((*x)[(integer(kind=8)) k + -1] - (*x)[(integer(kind=8)) (k + jmini) + -1]));
          D.1556 = D.1555 * D.1555;
          temp = temp + D.1556;
        }
      L.5:;
      k = k + D.1551;
      if (countm1.6 == 0) goto L.6;
      countm1.6 = countm1.6 + 4294967295;
    }
  L.6:;
}


WTF!?

The funny conditional initialization of countm1.6 makes the analysis of
the number of iterations of this loop impossible (not to mention the
conversions to character(kind=4)).

Why does the frontend do induction variable "optimization" at all and
not simply generate a loop with a non-unit counting IV?
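
Stripped of temporaries, the control flow in the dump can be sketched as
below. This is Python used as executable pseudocode, not anything the
frontend emits; the function name is made up for illustration, and Python's
unbounded integers stand in for the type that the dump prints as
character(kind=4) (apparently an unsigned 32-bit type, given the
4294967295 decrement), so wrap-around is not modelled.

```python
def do_loop_countm1(m1, m2, m3):
    """Values taken by the DO variable under the count-minus-one lowering."""
    visited = []
    k = m1
    if m3 > 0:
        if m2 < m1:
            return visited           # zero-trip loop (goto L.6)
        countm1 = (m2 - m1) // m3    # operands are non-negative here
    else:
        if m2 > m1:
            return visited           # zero-trip loop (goto L.6)
        countm1 = (m1 - m2) // (-m3)
    while True:
        visited.append(k)            # loop body
        k += m3
        if countm1 == 0:
            break
        countm1 -= 1
    return visited


# The lowering visits the same values as a plain strided range.
assert do_loop_countm1(2, 15, 5) == list(range(2, 15 + 1, 5))
assert do_loop_countm1(10, 3, -2) == list(range(10, 3 - 1, -2))
assert do_loop_countm1(5, 1, 1) == []
```

The trip count therefore depends on which branch of the sign test on the
step was taken, which is the conditional initialization referred to above.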



* [Bug tree-optimization/42108] [4.4/4.5 Regression] Vectorizer cannot deal with PAREN_EXPR gracefully, 50% performance regression
From: toon at moene dot org @ 2009-11-20 19:45 UTC (permalink / raw)
  To: gcc-bugs

------- Comment #13 from toon at moene dot org  2009-11-20 19:45 -------
> The funny conditional initialization of countm1.6 makes the analysis of
> the number of iterations of this loop impossible (not to mention the
> conversions to character(kind=4)).

> Why does the frontend do induction variable "optimization" at all and
> not simply generate a loop with a non-unit counting IV?

It's not trying to be funny - it just follows the text of the Fortran Standard
(hey, what a concept !):

8.1.6.6.1 Loop initiation

When the DO statement is executed, the DO construct becomes active. If
loop-control is
    [ , ] do-variable = scalar-int-expr1 , scalar-int-expr2 [ , scalar-int-expr3 ]
the following steps are performed in sequence.
  (1) The initial parameter m1, the terminal parameter m2, and the
      incrementation parameter m3 are of type integer with the same kind type
      parameter as the do-variable. Their values are established by evaluating
      scalar-int-expr1, scalar-int-expr2, and scalar-int-expr3, respectively,
      including, if necessary, conversion to the kind type parameter of the
      do-variable according to the rules for numeric conversion (Table 7.11).
      If scalar-int-expr3 does not appear, m3 has the value 1. The value of m3
      shall not be zero.
  (2) The DO variable becomes defined with the value of the initial
      parameter m1.
  (3) The iteration count is established and is the value of the expression
      (m2 - m1 + m3)/m3, unless that value is negative, in which case the
      iteration count is 0.

Only interprocedural analysis can tell us that this is a simple loop only
executed 3 times (I got this wrong at first - it's *always* executed 3 times).
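
The "always 3" claim can be checked directly against step (3) of the quoted
text. A small sketch in Python (executable pseudocode; the helper names are
made up for illustration, and it assumes nnd = 3*n for the k loop
DO k = i, nnd, n):

```python
def trunc_div(a, b):
    """Integer division truncating toward zero, as Fortran integer division does."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def iteration_count(m1, m2, m3):
    """Iteration count of DO var = m1, m2, m3 following step (3)."""
    if m3 == 0:
        raise ValueError("m3 shall not be zero")
    return max(trunc_div(m2 - m1 + m3, m3), 0)


n = 40000
nnd = 3 * n  # assumption: x holds three blocks of n elements

# For every i used by the outer loop, (4n - i)/n truncates to 3.
counts = {iteration_count(i, nnd, n) for i in range(2, n + 1)}
print(counts)  # prints {3}
```

Since n and nnd are only known at run time inside eval, the compiler cannot
derive this constant without looking at the caller.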


* [Bug tree-optimization/42108] [4.4/4.5 Regression] Vectorizer cannot deal with PAREN_EXPR gracefully, 50% performance regression
From: rguenth at gcc dot gnu dot org @ 2009-11-20 23:48 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #14 from rguenth at gcc dot gnu dot org  2009-11-20 23:48 -------
(In reply to comment #13)
> > The funny conditional initialization of countm1.6 makes the analysis of
> > the number of iterations of this loop impossible (not to mention the
> > conversions to character(kind=4)).
> 
> > Why does the frontend do induction variable "optimization" at all and
> > not simply generate a loop with a non-unit counting IV?
> 
> It's not trying to be funny - it just follows the text of the Fortran Standard
> (hey, what a concept !):
> 
> 8.1.6.6.1 Loop initiation
>
> When the DO statement is executed, the DO construct becomes active. If
> loop-control is
>     [ , ] do-variable = scalar-int-expr1 , scalar-int-expr2 [ , scalar-int-expr3 ]
> the following steps are performed in sequence.
>   (1) The initial parameter m1, the terminal parameter m2, and the
>       incrementation parameter m3 are of type integer with the same kind type
>       parameter as the do-variable. Their values are established by evaluating
>       scalar-int-expr1, scalar-int-expr2, and scalar-int-expr3, respectively,
>       including, if necessary, conversion to the kind type parameter of the
>       do-variable according to the rules for numeric conversion (Table 7.11).
>       If scalar-int-expr3 does not appear, m3 has the value 1. The value of m3
>       shall not be zero.
>   (2) The DO variable becomes defined with the value of the initial
>       parameter m1.
>   (3) The iteration count is established and is the value of the expression
>       (m2 - m1 + m3)/m3, unless that value is negative, in which case the
>       iteration count is 0.
> 
> Only interprocedural analysis can tell us that this is a simple loop only
> executed 3 times (I got this wrong at first - it's *always* executed 3 times).

I don't see that the standard suggests the specific code the Frontend
generates.  In fact it should be valid to increment the DO variable
by m3 and express the exit test in terms of the DO variable as well.



* [Bug tree-optimization/42108] [4.4/4.5 Regression] Vectorizer cannot deal with PAREN_EXPR gracefully, 50% performance regression
From: toon at moene dot org @ 2009-11-21 12:11 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #15 from toon at moene dot org  2009-11-21 12:11 -------
> I don't see that the standard suggests the specific code the Frontend
> generates.  In fact it should be valid to increment the DO variable
> by m3 and express the exit test in terms of the DO variable as well.

The Standard doesn't prescribe the code the Frontend generates - however, to be
sure one follows the Standard, it's easiest to simply implement the steps
given.

To illustrate this with a simple example:

DO I = M1, M2, M3
   B(I) = A(I)
ENDDO

would be most easily, and straightforwardly, implemented as follows:

     IF (M3 > 0 .AND. M1 > M2) GOTO 200  ! Loop executed zero times
     IF (M3 < 0 .AND. M1 < M2) GOTO 200  ! Ditto
     ITEMP = (M2 - M1 + M3) / M3         ! Temporary loop count
     I     = M1
 100 CONTINUE
     B(I)  = A(I)
     ITEMP = ITEMP - 1                   ! Adjust internal loop counter
     I     = I + M3                      ! Adjust DO loop variable
     IF (ITEMP > 0) GOTO 100
 200 CONTINUE

That there are two induction variables in this loop is inconsequential - one of
them should be eliminated by induction variable elimination (at least, that was
the case with g77 and the RTL loop optimization pass).
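
Why one of the two induction variables is redundant can be seen by
recomputing the internal counter from the DO variable alone. The sketch
below is Python used as executable pseudocode (function names are made up
for illustration); it models a count-based lowering with zero-trip guards
and checks that the counter is an affine function of the DO variable:

```python
def lowered_do(m1, m2, m3):
    """Return the (i, itemp) pairs seen by the loop body."""
    seen = []
    if (m3 > 0 and m1 > m2) or (m3 < 0 and m1 < m2):
        return seen                      # loop executed zero times
    # Past the guards the quotient is positive, so floor division agrees
    # with Fortran's truncating division here.
    itemp = (m2 - m1 + m3) // m3
    i = m1
    while True:
        seen.append((i, itemp))          # loop body sees both variables
        itemp -= 1                       # internal loop counter
        i += m3                          # DO variable
        if itemp <= 0:
            break
    return seen


def counter_from_do_variable(i, m1, m2, m3):
    """Recompute itemp from i alone; (i - m1) is an exact multiple of m3."""
    count = (m2 - m1 + m3) // m3
    return count - (i - m1) // m3


for m1, m2, m3 in [(1, 10, 3), (2, 15, 5), (10, 1, -4)]:
    for i, itemp in lowered_do(m1, m2, m3):
        assert itemp == counter_from_do_variable(i, m1, m2, m3)
```

Because the counter can always be derived from the DO variable in this way,
an optimizer that recognizes the relation is free to keep only one of them.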

If you think that the Frontend does something different / in addition to the
above, feel free to open a separate PR.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug tree-optimization/42108] [4.4/4.5 Regression] Vectorizer cannot deal with PAREN_EXPR gracefully, 50% performance regression
  2009-11-19 16:01 [Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5 sfilippone at uniroma2 dot it
                   ` (14 preceding siblings ...)
  2009-11-21 12:11 ` toon at moene dot org
@ 2009-11-21 12:19 ` rguenther at suse dot de
  2009-11-21 13:58 ` rguenth at gcc dot gnu dot org
                   ` (35 subsequent siblings)
  51 siblings, 0 replies; 53+ messages in thread
From: rguenther at suse dot de @ 2009-11-21 12:19 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #16 from rguenther at suse dot de  2009-11-21 12:19 -------
Subject: Re:  [4.4/4.5 Regression] Vectorizer
 cannot deal with PAREN_EXPR gracefully, 50% performance regression

On Sat, 21 Nov 2009, toon at moene dot org wrote:

> ------- Comment #15 from toon at moene dot org  2009-11-21 12:11 -------
> > I don't see that the standard suggests the specific code the Frontend
> > generates.  In fact it should be valid to increment the DO variable
> > by m3 and express the exit test in terms of the DO variable as well.
> 
> The Standard doesn't prescribe the code the Frontend generates - however, to be
> sure one follows the Standard, it's most easy to simply implement the steps
> given.
> 
> To illustrate this with a simple example:
> 
> DO I = M1, M2, M3
>    B(I) = A(I)
> ENDDO
> 
> would be most easily, and straightforwardly, implemented as follows:
> 
>      IF (M3 > 0 .AND. M1 > M2) GOTO 200  ! Loop executed zero times
>      IF (M3 < 0 .AND. M1 < M2) GOTO 200  ! Ditto
>      ITEMP = (M2 - M1 + M3) / M3         ! Temporary loop count
>      I     = M1
>  100 CONTINUE
>      B(I)  = A(I)
>      ITEMP = ITEMP - 1                   ! Adjust internal loop counter
>      I     = I + M3                      ! Adjust DO loop variable
>      IF (ITEMP > 0) GOTO 100
>  200 CONTINUE
> 
> That there are two induction variables in this loop is inconsequential - one of
> them should be eliminated by induction variable elimination (at least, that was
> the case with g77 and the RTL loop optimization pass).

Sure, but the frontend generates

  if (M3 > 0)
     ITEMP = (M2 - M1) / M3
  else
     ITEMP = (M1 - M2) / -M3
  I = M1
100 CONTINUE
  B(I) = A(I)
  I = I + M3
  if (ITEMP == 0) GOTO 200
  ITEMP = ITEMP - 1
  GOTO 100
200 CONTINUE

The conditional setting of ITEMP is what confuses GCC.  Also I don't
see the test for zero-time executing loops (but maybe I omitted it
from my pasting in comment #12).
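
To make the difference concrete, here is a small Python sketch (hypothetical
helper names, not what the frontend emits) of the form pasted above.  It
counts the trips through the loop body and compares them with the standard
count MAX((M2 - M1 + M3) / M3, 0).  Since the pasted form has no zero-trip
guard, the sketch raises an error for such inputs instead of modelling them.

```python
def trunc_div(a, b):
    """Integer division truncating toward zero."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def standard_trip_count(m1, m2, m3):
    """Iteration count as defined for a DO loop."""
    return max(trunc_div(m2 - m1 + m3, m3), 0)


def frontend_trip_count(m1, m2, m3):
    """Trips through the body in the conditional-ITEMP form shown above."""
    if m3 > 0:
        itemp = trunc_div(m2 - m1, m3)
    else:
        itemp = trunc_div(m1 - m2, -m3)
    if itemp < 0:
        # No zero-trip guard in the pasted code; a negative count would
        # never reach zero in this model, so refuse the input.
        raise ValueError("zero-trip loop is not handled by this form")
    trips = 0
    while True:
        trips += 1          # B(I) = A(I); I = I + M3
        if itemp == 0:
            break
        itemp -= 1
    return trips
```

For every loop that executes at least once, the two counts should agree: the
body runs ITEMP + 1 times.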

Richard.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug tree-optimization/42108] [4.4/4.5 Regression] Vectorizer cannot deal with PAREN_EXPR gracefully, 50% performance regression
  2009-11-19 16:01 [Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5 sfilippone at uniroma2 dot it
                   ` (15 preceding siblings ...)
  2009-11-21 12:19 ` rguenther at suse dot de
@ 2009-11-21 13:58 ` rguenth at gcc dot gnu dot org
  2009-11-23  9:02 ` irar at il dot ibm dot com
                   ` (34 subsequent siblings)
  51 siblings, 0 replies; 53+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2009-11-21 13:58 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #17 from rguenth at gcc dot gnu dot org  2009-11-21 13:58 -------
I have filed PR42131 for the DO loop translation issue.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug tree-optimization/42108] [4.4/4.5 Regression] Vectorizer cannot deal with PAREN_EXPR gracefully, 50% performance regression
  2009-11-19 16:01 [Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5 sfilippone at uniroma2 dot it
                   ` (16 preceding siblings ...)
  2009-11-21 13:58 ` rguenth at gcc dot gnu dot org
@ 2009-11-23  9:02 ` irar at il dot ibm dot com
  2009-11-27 11:23 ` rguenth at gcc dot gnu dot org
                   ` (33 subsequent siblings)
  51 siblings, 0 replies; 53+ messages in thread
From: irar at il dot ibm dot com @ 2009-11-23  9:02 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #18 from irar at il dot ibm dot com  2009-11-23 09:02 -------
I tried to vectorize eval.f90 with 4.3 and mainline on x86_64-suse-linux. In
both cases no loop gets vectorized in subroutine eval. The k loop is not
vectorizable because the step of x is unknown (function argument), and scalar
evolution analysis fails to analyze it. The j loop is not vectorized, first of
all because the loop bound of the k loop is unknown (this is on our todo list).

Ira


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug tree-optimization/42108] [4.4/4.5 Regression] Vectorizer cannot deal with PAREN_EXPR gracefully, 50% performance regression
  2009-11-19 16:01 [Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5 sfilippone at uniroma2 dot it
                   ` (17 preceding siblings ...)
  2009-11-23  9:02 ` irar at il dot ibm dot com
@ 2009-11-27 11:23 ` rguenth at gcc dot gnu dot org
  2009-11-30  8:53 ` irar at il dot ibm dot com
                   ` (32 subsequent siblings)
  51 siblings, 0 replies; 53+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2009-11-27 11:23 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #19 from rguenth at gcc dot gnu dot org  2009-11-27 11:23 -------
I guess this PR should be split further: one bug about the PAREN_EXPR wrt
vectorization and one bug about the as yet unanalyzed performance regression.


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|enhancement                 |normal
           Priority|P3                          |P2


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug tree-optimization/42108] [4.4/4.5 Regression] Vectorizer cannot deal with PAREN_EXPR gracefully, 50% performance regression
  2009-11-19 16:01 [Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5 sfilippone at uniroma2 dot it
                   ` (18 preceding siblings ...)
  2009-11-27 11:23 ` rguenth at gcc dot gnu dot org
@ 2009-11-30  8:53 ` irar at il dot ibm dot com
  2009-11-30  8:54 ` irar at il dot ibm dot com
                   ` (31 subsequent siblings)
  51 siblings, 0 replies; 53+ messages in thread
From: irar at il dot ibm dot com @ 2009-11-30  8:53 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #20 from irar at il dot ibm dot com  2009-11-30 08:52 -------
Actually, PAREN_EXPRs are vectorizable (the support was added by you, Richard,
in your original PAREN_EXPR patch
http://gcc.gnu.org/viewcvs?limit_changes=0&view=revision&revision=132515 ).

The problem here is that vectorizable_assignment does not support multiple
types. The attached patch adds this support, but I don't know if the patch is
suitable for the current stage...

Ira


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug tree-optimization/42108] [4.4/4.5 Regression] Vectorizer cannot deal with PAREN_EXPR gracefully, 50% performance regression
  2009-11-19 16:01 [Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5 sfilippone at uniroma2 dot it
                   ` (19 preceding siblings ...)
  2009-11-30  8:53 ` irar at il dot ibm dot com
@ 2009-11-30  8:54 ` irar at il dot ibm dot com
  2009-11-30 10:13 ` rguenther at suse dot de
                   ` (30 subsequent siblings)
  51 siblings, 0 replies; 53+ messages in thread
From: irar at il dot ibm dot com @ 2009-11-30  8:54 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #21 from irar at il dot ibm dot com  2009-11-30 08:54 -------
Created an attachment (id=19183)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=19183&action=view)
Multiple types support patch


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug tree-optimization/42108] [4.4/4.5 Regression] Vectorizer cannot deal with PAREN_EXPR gracefully, 50% performance regression
  2009-11-19 16:01 [Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5 sfilippone at uniroma2 dot it
                   ` (20 preceding siblings ...)
  2009-11-30  8:54 ` irar at il dot ibm dot com
@ 2009-11-30 10:13 ` rguenther at suse dot de
  2009-11-30 12:21 ` irar at il dot ibm dot com
                   ` (29 subsequent siblings)
  51 siblings, 0 replies; 53+ messages in thread
From: rguenther at suse dot de @ 2009-11-30 10:13 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #22 from rguenther at suse dot de  2009-11-30 10:13 -------
Subject: Re:  [4.4/4.5 Regression] Vectorizer
 cannot deal with PAREN_EXPR gracefully, 50% performance regression

On Mon, 30 Nov 2009, irar at il dot ibm dot com wrote:

> ------- Comment #20 from irar at il dot ibm dot com  2009-11-30 08:52 -------
> Actually, PAREN_EXPRs are vectorizable (the support was added by you, Richard,
> in your original PAREN_EXPR patch
> http://gcc.gnu.org/viewcvs?limit_changes=0&view=revision&revision=132515 ).

Oh, indeed ;)

> The problem here is that vectorizable_assignment does not support multiple
> types. The attached patch adds this support, but I don't know if the patch is
> suitable for the current stage...

Probably not (though it looks small).  If you feel confident about it
you may well apply it still though.

Thanks,
Richard.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug tree-optimization/42108] [4.4/4.5 Regression] Vectorizer cannot deal with PAREN_EXPR gracefully, 50% performance regression
  2009-11-19 16:01 [Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5 sfilippone at uniroma2 dot it
                   ` (21 preceding siblings ...)
  2009-11-30 10:13 ` rguenther at suse dot de
@ 2009-11-30 12:21 ` irar at il dot ibm dot com
  2009-12-04 14:25 ` [Bug tree-optimization/42108] [4.4/4.5 Regression] " dominiq at lps dot ens dot fr
                   ` (28 subsequent siblings)
  51 siblings, 0 replies; 53+ messages in thread
From: irar at il dot ibm dot com @ 2009-11-30 12:21 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #23 from irar at il dot ibm dot com  2009-11-30 12:20 -------
Applied:
http://gcc.gnu.org/viewcvs?limit_changes=0&view=revision&revision=154794

Thanks,
Ira


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug tree-optimization/42108] [4.4/4.5 Regression] 50% performance regression
  2009-11-19 16:01 [Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5 sfilippone at uniroma2 dot it
                   ` (22 preceding siblings ...)
  2009-11-30 12:21 ` irar at il dot ibm dot com
@ 2009-12-04 14:25 ` dominiq at lps dot ens dot fr
  2009-12-13 23:48 ` matz at gcc dot gnu dot org
                   ` (27 subsequent siblings)
  51 siblings, 0 replies; 53+ messages in thread
From: dominiq at lps dot ens dot fr @ 2009-12-04 14:25 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #24 from dominiq at lps dot ens dot fr  2009-12-04 14:25 -------
AFAICT fixing pr42131 does not help.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug tree-optimization/42108] [4.4/4.5 Regression] 50% performance regression
  2009-11-19 16:01 [Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5 sfilippone at uniroma2 dot it
                   ` (23 preceding siblings ...)
  2009-12-04 14:25 ` [Bug tree-optimization/42108] [4.4/4.5 Regression] " dominiq at lps dot ens dot fr
@ 2009-12-13 23:48 ` matz at gcc dot gnu dot org
  2009-12-14  4:55 ` matz at gcc dot gnu dot org
                   ` (26 subsequent siblings)
  51 siblings, 0 replies; 53+ messages in thread
From: matz at gcc dot gnu dot org @ 2009-12-13 23:48 UTC (permalink / raw)
  To: gcc-bugs

------- Comment #25 from matz at gcc dot gnu dot org  2009-12-13 23:48 -------
The reason that the testcase still is slow (and that the inner loop isn't
unrolled or vectorized) is still the calculation of countm1.  The division
therein stays in the second inner loop, whereas with GCC 4.3 it can be moved
into the outer loop.  In this specific testcase it's a pass ordering problem:
we start with (at .vrp1) (only parts shown):

<bb 2>:
  D.1564_45 = *n_9(D);
  if (D.1564_45 > 1)
   ...
<bb 6>:
  D.1572_60 = *n_9(D);
  if (D.1572_60 > 0)
    goto <bb 7>;
  else
    goto <bb 8>;

Here _45 and _60 are equivalent, but VRP doesn't know this, hence it doesn't
detect the goto <bb 8> as dead.  The equivalence is only detected after PRE 
(not by PRE, though :-/ ), which means VRP2 does detect the jump as dead,
and hence leaves only the step>0 case in the code.  But this is too late for
the late PRE (running before VRP2 and the loop optimizers) in order to move
the dependent division to the outer loop.

As the division isn't moved as loop invariant to the outer loop this also 
means that the loop count determination doesn't work, hence no unrolling.

But the slowness itself is due to the div instruction in the second loop,
instead of in the outer loop as with 4.3.
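
A toy Python model of the cost difference (the loop shape, the name run and
all of its parameters are assumptions for illustration, not taken from
eval.f90): it counts how often the trip-count division executes when it stays
in the second loop versus when it is hoisted into the outer loop.

```python
def run(n_outer, n_mid, lo, hi, step, hoist):
    """Return (work, divisions) for a three-level loop nest.

    Assumes step > 0 and hi >= lo, so the innermost loop always runs
    countm1 + 1 times and its bounds do not change in the outer loops.
    """
    divisions = 0
    work = 0
    for _ in range(n_outer):
        if hoist:
            countm1 = (hi - lo) // step    # once per outer iteration
            divisions += 1
        for _ in range(n_mid):
            if not hoist:
                countm1 = (hi - lo) // step    # once per second-loop iteration
                divisions += 1
            work += countm1 + 1    # stands in for the innermost loop body
    return work, divisions
```

Calling run(...) with hoist=False and hoist=True should give the same work but
n_outer * n_mid versus n_outer divisions.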

-- 

matz at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |matz at gcc dot gnu dot org

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug tree-optimization/42108] [4.4/4.5 Regression] 50% performance regression
  2009-11-19 16:01 [Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5 sfilippone at uniroma2 dot it
                   ` (24 preceding siblings ...)
  2009-12-13 23:48 ` matz at gcc dot gnu dot org
@ 2009-12-14  4:55 ` matz at gcc dot gnu dot org
  2009-12-14  5:26 ` matz at gcc dot gnu dot org
                   ` (25 subsequent siblings)
  51 siblings, 0 replies; 53+ messages in thread
From: matz at gcc dot gnu dot org @ 2009-12-14  4:55 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #26 from matz at gcc dot gnu dot org  2009-12-14 04:55 -------
And if I fix this problem (so that only one reference to *n_9 remains)
I hit the problem that the fortran frontend emits the computation of countm1
after the loop bound test.  No pass is moving code in front of that test as
this is potentially a regression (more evaluations in out-of-bound case).

And if I fix _that_ I hit the problem of the fix for PR38819.  PRE won't move
the division at all, because it could trap :-/  If I disable this I get back
the 4.3 performance.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug tree-optimization/42108] [4.4/4.5 Regression] 50% performance regression
  2009-11-19 16:01 [Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5 sfilippone at uniroma2 dot it
                   ` (25 preceding siblings ...)
  2009-12-14  4:55 ` matz at gcc dot gnu dot org
@ 2009-12-14  5:26 ` matz at gcc dot gnu dot org
  2009-12-14 10:51 ` dominiq at lps dot ens dot fr
                   ` (24 subsequent siblings)
  51 siblings, 0 replies; 53+ messages in thread
From: matz at gcc dot gnu dot org @ 2009-12-14  5:26 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #27 from matz at gcc dot gnu dot org  2009-12-14 05:25 -------
Created an attachment (id=19287)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=19287&action=view)
three hacks

My current collection of patches and hacks for this problem.  Obviously the
"if (0)" in tree-ssa-pre.c will break pr38819 again; apart from that untested,
hence probably miscompiles everything except this testcase here :-)


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug tree-optimization/42108] [4.4/4.5 Regression] 50% performance regression
  2009-11-19 16:01 [Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5 sfilippone at uniroma2 dot it
                   ` (26 preceding siblings ...)
  2009-12-14  5:26 ` matz at gcc dot gnu dot org
@ 2009-12-14 10:51 ` dominiq at lps dot ens dot fr
  2009-12-14 11:21 ` dominiq at lps dot ens dot fr
                   ` (23 subsequent siblings)
  51 siblings, 0 replies; 53+ messages in thread
From: dominiq at lps dot ens dot fr @ 2009-12-14 10:51 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #28 from dominiq at lps dot ens dot fr  2009-12-14 10:51 -------
(In reply to comment #27)
> My current collection of patches and hacks for this problem.  Obviously the
> "if (0)" in tree-ssa-pre.c will break pr38819 again; apart from that untested,
> hence probably miscompiles everything except this testcase here :-)

I have not tested the patch (yet), but it seems that replacing "if(0)" with
something such as "if(!flag_trapping_math)" could make everybody happy: if you
don't want to break pr38819, don't use -fno-trapping-math; if you want speed,
use it or use -ffast-math. Would it be acceptable?


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug tree-optimization/42108] [4.4/4.5 Regression] 50% performance regression
  2009-11-19 16:01 [Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5 sfilippone at uniroma2 dot it
                   ` (27 preceding siblings ...)
  2009-12-14 10:51 ` dominiq at lps dot ens dot fr
@ 2009-12-14 11:21 ` dominiq at lps dot ens dot fr
  2009-12-14 11:23 ` rguenth at gcc dot gnu dot org
                   ` (22 subsequent siblings)
  51 siblings, 0 replies; 53+ messages in thread
From: dominiq at lps dot ens dot fr @ 2009-12-14 11:21 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #29 from dominiq at lps dot ens dot fr  2009-12-14 11:21 -------
On x86_64-apple-darwin10, I don't see any speedup with the patch in comment #27
(not a clean bootstrap, but just an incremental build).


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug tree-optimization/42108] [4.4/4.5 Regression] 50% performance regression
  2009-11-19 16:01 [Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5 sfilippone at uniroma2 dot it
                   ` (28 preceding siblings ...)
  2009-12-14 11:21 ` dominiq at lps dot ens dot fr
@ 2009-12-14 11:23 ` rguenth at gcc dot gnu dot org
  2009-12-14 11:50 ` rguenth at gcc dot gnu dot org
                   ` (21 subsequent siblings)
  51 siblings, 0 replies; 53+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2009-12-14 11:23 UTC (permalink / raw)
  To: gcc-bugs

------- Comment #30 from rguenth at gcc dot gnu dot org  2009-12-14 11:23 -------
I fail to see why FRE does not remove the redundant load of *n_9(D).  Oh, it
is because we first value-number D.1537_58 = *n_9(D); and only after it
we value-number D.1529_45 = *n_9(D);

This is because while we visit the SCC members in RPO order we do not impose
any order on visiting SCCs and those two stmts are not dependent on each
other (we neither account for virtual operands nor control dependences there).  
Old problem.

The fix for this is to either wait for the VN rewrite or to collect all SCCs,
sort them in RPO order and only then process them.  Note that it still can
be difficult to impose a total ordering on SCCs (but at least this case
should be easy).  Another possibility is to artificially grow SCCs and
their dependencies by honoring dominating virtual operand uses, not only
defs (ugh).

For non-memory the missing ordering is not a problem as we do not rely on
walking stmts during expression lookup (and that walking only visits
dominating expressions).  Something to keep in mind for the VN
re-implementation as well.
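
A heavily simplified Python sketch of why the visiting order matters for
loads (this is not the SCCVN implementation; value_number_loads and the RPO
positions are hypothetical, and intervening stores are ignored).  A lookup
only finds an already numbered load of the same address that sits earlier in
RPO order, mimicking a walk over dominating statements.

```python
def value_number_loads(loads, visit_order):
    """Assign value numbers to loads.

    loads maps an SSA name to (address, rpo_index); visit_order is the
    order in which the loads are value-numbered.
    """
    value_of = {}
    next_value = 0
    for name in visit_order:
        address, position = loads[name]
        leader = None
        for other, value in value_of.items():
            other_address, other_position = loads[other]
            if other_address == address and other_position < position:
                leader = value      # reuse the dominating load's value
                break
        if leader is None:
            leader = next_value     # nothing found: fresh value number
            next_value += 1
        value_of[name] = leader
    return value_of


# Hypothetical RPO positions; the SSA names are the ones quoted above.
loads = {"D.1529_45": ("*n_9", 2), "D.1537_58": ("*n_9", 6)}

unsorted = value_number_loads(loads, ["D.1537_58", "D.1529_45"])
in_rpo = value_number_loads(loads, sorted(loads, key=lambda n: loads[n][1]))
```

In this model the later-first order leaves the two loads with different value
numbers, while the RPO order gives them the same one.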

-- 

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug tree-optimization/42108] [4.4/4.5 Regression] 50% performance regression
  2009-11-19 16:01 [Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5 sfilippone at uniroma2 dot it
                   ` (29 preceding siblings ...)
  2009-12-14 11:23 ` rguenth at gcc dot gnu dot org
@ 2009-12-14 11:50 ` rguenth at gcc dot gnu dot org
  2009-12-14 12:27 ` rguenther at suse dot de
                   ` (20 subsequent siblings)
  51 siblings, 0 replies; 53+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2009-12-14 11:50 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #31 from rguenth at gcc dot gnu dot org  2009-12-14 11:49 -------
Created an attachment (id=19288)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=19288&action=view)
another hack

Sorts the SCCs after collecting them all.  Breaks most of the PRE/FRE testcases
because it sorts all SCCs, not only those that have no direct dependencies.
Not easy to fix though.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug tree-optimization/42108] [4.4/4.5 Regression] 50% performance regression
  2009-11-19 16:01 [Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5 sfilippone at uniroma2 dot it
                   ` (30 preceding siblings ...)
  2009-12-14 11:50 ` rguenth at gcc dot gnu dot org
@ 2009-12-14 12:27 ` rguenther at suse dot de
  2009-12-14 12:30 ` rguenth at gcc dot gnu dot org
                   ` (19 subsequent siblings)
  51 siblings, 0 replies; 53+ messages in thread
From: rguenther at suse dot de @ 2009-12-14 12:27 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #32 from rguenther at suse dot de  2009-12-14 12:27 -------
Subject: Re:  [4.4/4.5 Regression] 50% performance
 regression

On Mon, 14 Dec 2009, matz at gcc dot gnu dot org wrote:

> ------- Comment #26 from matz at gcc dot gnu dot org  2009-12-14 04:55 -------
> And if I fix this problem (so that only one reference to *n_9 remains)
> I hit the problem that the fortran frontend emits the computation of countm1
> after the loop bound test.  No pass is moving code in front of that test as
> this is potentially a regression (more evaluations in out-of-bound case).
> 
> And if I fix _that_ I hit the problem of the fix for PR38819.  PRE won't move
> the division at all, because it could trap :-/  If I disable this I get back
> the 4.3 performance.

Well.  VRP should mark divisions as non-trapping if possible.  I see
(after fixing the FRE issue):

  # iftmp.12_8 = PHI <-1(30), 1(12)>
  D.1588_67 = iftmp.12_8 * D.1529_119;
  D.1589_68 = (character(kind=4)) D.1588_67;
  countm1.6_69 = D.1583_64 / D.1589_68;

with

D.1529_119: [2, +INF]
iftmp.12_8: [1, 1]
D.1588_67: [2, +INF]
D.1589_68: [2, 2147483647]
countm1.6_69: [0, 2147483647]

so as D.1589_68 is never -1 or zero the division doesn't trap.
Now it's easy to mark the stmt in VRP this way but non-trivial
to keep track of it in the SCCVN IL.
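
The range check itself can be sketched in a few lines of Python (this is not
the VRP code; division_may_trap is a hypothetical name, 32-bit two's
complement signed operands are assumed, and the dividend range used in the
usage note is made up since it is not shown above).

```python
INT_MIN = -2**31


def division_may_trap(dividend, divisor):
    """dividend and divisor are inclusive (lo, hi) value ranges."""
    divisor_lo, divisor_hi = divisor
    if divisor_lo <= 0 <= divisor_hi:
        return True     # division by zero is possible
    dividend_lo, _dividend_hi = dividend
    if divisor_lo <= -1 <= divisor_hi and dividend_lo == INT_MIN:
        return True     # INT_MIN / -1 overflows
    return False
```

With the divisor range [2, 2147483647] from the dump and an assumed dividend
range of [0, 2147483647], division_may_trap returns False.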

Richard.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug tree-optimization/42108] [4.4/4.5 Regression] 50% performance regression
  2009-11-19 16:01 [Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5 sfilippone at uniroma2 dot it
                   ` (31 preceding siblings ...)
  2009-12-14 12:27 ` rguenther at suse dot de
@ 2009-12-14 12:30 ` rguenth at gcc dot gnu dot org
  2009-12-14 12:58 ` rguenth at gcc dot gnu dot org
                   ` (18 subsequent siblings)
  51 siblings, 0 replies; 53+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2009-12-14 12:30 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #33 from rguenth at gcc dot gnu dot org  2009-12-14 12:30 -------
Created an attachment (id=19289)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=19289&action=view)
VRP hack

Hack marking divisions non-trapping during VRP (re-using some stmt bit, not
updating relevant places to make PRE recognize the non-trappingness).


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug tree-optimization/42108] [4.4/4.5 Regression] 50% performance regression
  2009-11-19 16:01 [Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5 sfilippone at uniroma2 dot it
                   ` (32 preceding siblings ...)
  2009-12-14 12:30 ` rguenth at gcc dot gnu dot org
@ 2009-12-14 12:58 ` rguenth at gcc dot gnu dot org
  2009-12-14 16:58 ` matz at gcc dot gnu dot org
                   ` (17 subsequent siblings)
  51 siblings, 0 replies; 53+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2009-12-14 12:58 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #34 from rguenth at gcc dot gnu dot org  2009-12-14 12:57 -------
"Another possibility is to artificially grow SCCs and
their dependencies by honoring dominating virtual operand uses, not only
defs (ugh)."

what I mean with this is that when finding SCCs we process all uses of a
stmt.  In case you have a memory load

  # VUSE <.MEM_3>
  x_2 = *p_1(D);

then you have to consider as uses of that stmt the DEFs of all dominating
stmts that use .MEM_3, not only the single def of .MEM_3 (well, you only
have to consider aliasing uses of course).  This isn't easily retrofitted
into the non-recursive DFS walk though.

Thus, for

  # .MEM_3 = VDEF <.MEM_5(D)>
  *p_1(D) = 0;
  # VUSE <.MEM_3>
  y_4 = *p_1(D);
  # VUSE <.MEM_3>
  x_2 = *p_1(D);

when visiting x_2 = *p_1(D); the uses are p_1, y_4 and .MEM_3.  y_4
is new because it's the DEF in the dominating stmt that uses .MEM_3.

This would possibly increase SCC sizes and compile-time a lot.
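
A toy Python model of that extended use set (Stmt and extended_uses are
hypothetical; it handles a single straight-line block, where earlier
statements dominate later ones, and it ignores aliasing):

```python
from collections import namedtuple

# lhs: defined SSA name (or None), operands: SSA names used,
# vuse / vdef: virtual operand used / defined (or None).
Stmt = namedtuple("Stmt", "lhs operands vuse vdef")

block = [
    Stmt(lhs=None, operands=["p_1"], vuse=".MEM_5", vdef=".MEM_3"),  # *p_1(D) = 0;
    Stmt(lhs="y_4", operands=["p_1"], vuse=".MEM_3", vdef=None),     # y_4 = *p_1(D);
    Stmt(lhs="x_2", operands=["p_1"], vuse=".MEM_3", vdef=None),     # x_2 = *p_1(D);
]


def extended_uses(block, index):
    """Uses of block[index], plus DEFs of earlier loads sharing its VUSE."""
    stmt = block[index]
    uses = list(stmt.operands)
    if stmt.vuse is not None:
        for earlier in block[:index]:
            is_load = earlier.vdef is None and earlier.lhs is not None
            if is_load and earlier.vuse == stmt.vuse:
                uses.append(earlier.lhs)
        uses.append(stmt.vuse)
    return uses
```

For the last statement of the block this yields p_1, y_4 and .MEM_3, which is
the set described above.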


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug tree-optimization/42108] [4.4/4.5 Regression] 50% performance regression
  2009-11-19 16:01 [Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5 sfilippone at uniroma2 dot it
                   ` (33 preceding siblings ...)
  2009-12-14 12:58 ` rguenth at gcc dot gnu dot org
@ 2009-12-14 16:58 ` matz at gcc dot gnu dot org
  2009-12-15  7:10 ` tkoenig at gcc dot gnu dot org
                   ` (16 subsequent siblings)
  51 siblings, 0 replies; 53+ messages in thread
From: matz at gcc dot gnu dot org @ 2009-12-14 16:58 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #35 from matz at gcc dot gnu dot org  2009-12-14 16:58 -------
Exactly my thinking (growing SCCs -> slow, sorting SCCs -> difficult).
What I thought about the trapping problem is that in this situation we could
ignore the trap test.  We start with this situation:

bb1:
  goto bb2
bb2:
  PHI<from bb1, from bbX>     ; with bbX being dominated by bb1
  a = b / c                   ; with b and c loop invariant

Now it's clear that inserting a computation b/c in bb1 does not ever introduce
additional traps, as there's no intervening statement that could stop
execution without us knowing (in PR38819 it's a call that does exit(0)).
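
The condition can be stated as a tiny Python predicate (a sketch only;
hoisting_adds_no_trap and the statement lists are hypothetical, and deciding
which statements may end execution is the hard part that is assumed here):

```python
def hoisting_adds_no_trap(path_stmts):
    """path_stmts: statements executed between the insertion point and the
    division, each a (text, may_end_execution) pair."""
    return not any(may_end for _text, may_end in path_stmts)


# The situation sketched above: nothing between bb1 and the division
# can stop execution.
loop_header_case = [("goto bb2", False), ("PHI<from bb1, from bbX>", False)]

# A PR38819-like situation: a call that may exit sits in between.
pr38819_like_case = [("call that may exit(0)", True)]
```

The predicate holds for the first list and fails for the second.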


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug tree-optimization/42108] [4.4/4.5 Regression] 50% performance regression
  2009-11-19 16:01 [Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5 sfilippone at uniroma2 dot it
                   ` (34 preceding siblings ...)
  2009-12-14 16:58 ` matz at gcc dot gnu dot org
@ 2009-12-15  7:10 ` tkoenig at gcc dot gnu dot org
  2009-12-15 11:08 ` rguenth at gcc dot gnu dot org
                   ` (15 subsequent siblings)
  51 siblings, 0 replies; 53+ messages in thread
From: tkoenig at gcc dot gnu dot org @ 2009-12-15  7:10 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #36 from tkoenig at gcc dot gnu dot org  2009-12-15 07:09 -------
If it is any help, code which traps for a do loop is illegal Fortran,
so the compiler may do anything in this case anyway.

Is there a function like
__builtin_i_dont_care_if_this_traps_or_not_if_it_traps_its_the_users_fault ?


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug tree-optimization/42108] [4.4/4.5 Regression] 50% performance regression
  2009-11-19 16:01 [Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5 sfilippone at uniroma2 dot it
                   ` (35 preceding siblings ...)
  2009-12-15  7:10 ` tkoenig at gcc dot gnu dot org
@ 2009-12-15 11:08 ` rguenth at gcc dot gnu dot org
  2009-12-18 15:43 ` rguenth at gcc dot gnu dot org
                   ` (14 subsequent siblings)
  51 siblings, 0 replies; 53+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2009-12-15 11:08 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #37 from rguenth at gcc dot gnu dot org  2009-12-15 11:08 -------
No, there isn't.  I'd simply allow TREE_THIS_NOTRAP on all expression codes
that in principle could trap.  Now of course the middle-end would still need
to make use of this (like transitioning it to a stmt flag on a tuple).


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug tree-optimization/42108] [4.4/4.5 Regression] 50% performance regression
  2009-11-19 16:01 [Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5 sfilippone at uniroma2 dot it
                   ` (36 preceding siblings ...)
  2009-12-15 11:08 ` rguenth at gcc dot gnu dot org
@ 2009-12-18 15:43 ` rguenth at gcc dot gnu dot org
  2009-12-18 21:04 ` dominiq at lps dot ens dot fr
                   ` (13 subsequent siblings)
  51 siblings, 0 replies; 53+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2009-12-18 15:43 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #38 from rguenth at gcc dot gnu dot org  2009-12-18 15:43 -------
Created an attachment (id=19346)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=19346&action=view)
patch to fix SCCVN issue

This patch fixes the SCCVN issue, I'm giving it more testing.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug tree-optimization/42108] [4.4/4.5 Regression] 50% performance regression
  2009-11-19 16:01 [Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5 sfilippone at uniroma2 dot it
                   ` (37 preceding siblings ...)
  2009-12-18 15:43 ` rguenth at gcc dot gnu dot org
@ 2009-12-18 21:04 ` dominiq at lps dot ens dot fr
  2009-12-18 21:40 ` matz at gcc dot gnu dot org
                   ` (12 subsequent siblings)
  51 siblings, 0 replies; 53+ messages in thread
From: dominiq at lps dot ens dot fr @ 2009-12-18 21:04 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #39 from dominiq at lps dot ens dot fr  2009-12-18 21:04 -------
The patch in comment #38 does not fix the speed issue: the code with the inner
loop is still 4 times slower than the code with the loop manually unrolled.

Note that the included test regtests successfully. 


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug tree-optimization/42108] [4.4/4.5 Regression] 50% performance regression
  2009-11-19 16:01 [Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5 sfilippone at uniroma2 dot it
                   ` (38 preceding siblings ...)
  2009-12-18 21:04 ` dominiq at lps dot ens dot fr
@ 2009-12-18 21:40 ` matz at gcc dot gnu dot org
  2009-12-18 23:44 ` rguenth at gcc dot gnu dot org
                   ` (11 subsequent siblings)
  51 siblings, 0 replies; 53+ messages in thread
From: matz at gcc dot gnu dot org @ 2009-12-18 21:40 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #40 from matz at gcc dot gnu dot org  2009-12-18 21:40 -------
That's expected.  There are three problems and the patch in comment #38 hacks
around only one of them.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108



* [Bug tree-optimization/42108] [4.4/4.5 Regression] 50% performance regression
  2009-11-19 16:01 [Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5 sfilippone at uniroma2 dot it
                   ` (39 preceding siblings ...)
  2009-12-18 21:40 ` matz at gcc dot gnu dot org
@ 2009-12-18 23:44 ` rguenth at gcc dot gnu dot org
  2009-12-19 11:25 ` rguenth at gcc dot gnu dot org
                   ` (10 subsequent siblings)
  51 siblings, 0 replies; 53+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2009-12-18 23:44 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #41 from rguenth at gcc dot gnu dot org  2009-12-18 23:44 -------
Indeed.  The PRE issue could be fixed by fixing PR38819 not in the way it is
done now, but by "properly" detecting the invalid situations during ANTIC
computation and simply never marking trapping expressions as such.  At the
current point it's hard to tell whether the insertion is valid on the grounds
that the original expression is always executed if the insertion point is,
simply because we no longer know where the original expression was.

Thus, the "proper" place (err, I think at least) is during translating
ANTIC_OUT through the basic-block to ANTIC_IN (thus, in clean()).  It
might be a bit expensive, though pre-computing if a basic-block possibly
exits the CFG could speed this up significantly.  Another "proper" place
would be to add fake edges to exit for each such point in the CFG
(basically split blocks at each possibly noreturn call and add an edge
to exit).  But that might be even more expensive.
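
To make the first idea concrete, here is a minimal sketch in Python (not GCC
code).  It assumes a toy CFG representation: the Block class, the
"call_maybe_noreturn" statement kind, and both helper names are hypothetical
and exist only for illustration.  It also works at whole-block granularity and
ignores expressions generated inside the block, which a real ANTIC
translation would have to handle.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """Toy basic block; stmts is a list of (kind, payload) tuples."""
    name: str
    stmts: list = field(default_factory=list)


def may_exit_cfg(block):
    """True if the block contains a call that might not return."""
    return any(kind == "call_maybe_noreturn" for kind, _ in block.stmts)


def antic_in_from_out(block, antic_out, trapping, may_exit):
    """Translate ANTIC_OUT through a block to ANTIC_IN.

    Trapping expressions are dropped when the block may leave the CFG,
    because reaching the block entry then no longer guarantees that the
    expression further down is ever evaluated.
    """
    if may_exit[block.name]:
        return {e for e in antic_out if e not in trapping}
    return set(antic_out)


# Pre-compute the "possibly exits the CFG" flag once per block.
blocks = [
    Block("bb2", [("assign", "x = a + b")]),
    Block("bb3", [("call_maybe_noreturn", "abort_if_bad ()")]),
]
may_exit = {b.name: may_exit_cfg(b) for b in blocks}

antic_out = {"a / b", "a + b"}
trapping = {"a / b"}
safe_in = antic_in_from_out(blocks[0], antic_out, trapping, may_exit)
pruned_in = antic_in_from_out(blocks[1], antic_out, trapping, may_exit)
```

With the flag pre-computed, the per-block cost during the ANTIC iteration is a
single dictionary lookup plus a set filter.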


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108



* [Bug tree-optimization/42108] [4.4/4.5 Regression] 50% performance regression
  2009-11-19 16:01 [Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5 sfilippone at uniroma2 dot it
                   ` (40 preceding siblings ...)
  2009-12-18 23:44 ` rguenth at gcc dot gnu dot org
@ 2009-12-19 11:25 ` rguenth at gcc dot gnu dot org
  2009-12-19 19:29 ` rguenth at gcc dot gnu dot org
                   ` (9 subsequent siblings)
  51 siblings, 0 replies; 53+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2009-12-19 11:25 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #42 from rguenth at gcc dot gnu dot org  2009-12-19 11:25 -------
Subject: Bug 42108

Author: rguenth
Date: Sat Dec 19 11:24:49 2009
New Revision: 155360

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=155360
Log:
2009-12-19  Richard Guenther  <rguenther@suse.de>

        PR tree-optimization/42108
        * tree-ssa-sccvn.c (last_vuse_ptr): New variable.
        (vn_reference_lookup_2): Update last seen VUSE.
        (vn_reference_lookup_3): Avoid updating last seen VUSE after
        translating.
        (visit_reference_op_load): Use last seen VUSE from the first
        lookup when entering into the table.

        * gfortran.dg/pr42108.f90: New testcase.

Added:
    trunk/gcc/testsuite/gfortran.dg/pr42108.f90
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-sccvn.c


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108



* [Bug tree-optimization/42108] [4.4/4.5 Regression] 50% performance regression
  2009-11-19 16:01 [Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5 sfilippone at uniroma2 dot it
                   ` (41 preceding siblings ...)
  2009-12-19 11:25 ` rguenth at gcc dot gnu dot org
@ 2009-12-19 19:29 ` rguenth at gcc dot gnu dot org
  2009-12-19 19:41 ` rguenth at gcc dot gnu dot org
                   ` (8 subsequent siblings)
  51 siblings, 0 replies; 53+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2009-12-19 19:29 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #43 from rguenth at gcc dot gnu dot org  2009-12-19 19:29 -------
Btw, with the patch from comment #33 LIM will now hoist the division
properly and the performance regression would be fixed(?).  The patch
will likely cause verification issues with -fnon-call-exceptions, though,
and infrastructure-wise it is still a hack.

Now testing the original testcase on i?86 doesn't show a performance
improvement, though (but it's marginally faster than with 4.3).  Interestingly,
with -mfpmath=sse -march=native trunk is a lot slower than 4.3 (but 4.3 isn't
much faster than trunk without those options).  The patch from comment #33
seems to recover that.
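
As a rough illustration of what hoisting the division buys, the following
Python model (not GCC code; the function names, parameters, and the exact loop
shape are assumptions made for this sketch) counts how often the trip-count
division runs when it is recomputed for every j versus once per i:

```python
def divisions_unhoisted(n_outer, n_mid, nnd, step):
    """Count divisions when the trip count is recomputed for every j."""
    divisions = 0
    for i in range(1, n_outer + 1):
        for j in range(1, n_mid + 1):
            if i <= nnd:
                _trip = (nnd - i) // step  # does not depend on j
                divisions += 1
    return divisions


def divisions_hoisted(n_outer, n_mid, nnd, step):
    """Count divisions when the trip count is computed once per i."""
    divisions = 0
    for i in range(1, n_outer + 1):
        trip = None
        # Keep the original guard so the division runs only if it would
        # also have run in the unhoisted version.
        if n_mid >= 1 and i <= nnd:
            trip = (nnd - i) // step
            divisions += 1
        for j in range(1, n_mid + 1):
            _ = trip  # the j loop only reuses the precomputed value
    return divisions
```

For n_outer = 4, n_mid = 10, nnd = 3 and step = 1 the first version performs
30 divisions and the second 3.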


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108



* [Bug tree-optimization/42108] [4.4/4.5 Regression] 50% performance regression
  2009-11-19 16:01 [Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5 sfilippone at uniroma2 dot it
                   ` (42 preceding siblings ...)
  2009-12-19 19:29 ` rguenth at gcc dot gnu dot org
@ 2009-12-19 19:41 ` rguenth at gcc dot gnu dot org
  2009-12-19 21:10 ` rguenth at gcc dot gnu dot org
                   ` (7 subsequent siblings)
  51 siblings, 0 replies; 53+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2009-12-19 19:41 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #44 from rguenth at gcc dot gnu dot org  2009-12-19 19:41 -------
PR42436 now tracks the possible VRP and middle-end improvement.  Only the
PRE fixing possibility would count as a regression fix IMHO.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108



* [Bug tree-optimization/42108] [4.4/4.5 Regression] 50% performance regression
  2009-11-19 16:01 [Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5 sfilippone at uniroma2 dot it
                   ` (43 preceding siblings ...)
  2009-12-19 19:41 ` rguenth at gcc dot gnu dot org
@ 2009-12-19 21:10 ` rguenth at gcc dot gnu dot org
  2010-01-21 13:15 ` jakub at gcc dot gnu dot org
                   ` (6 subsequent siblings)
  51 siblings, 0 replies; 53+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2009-12-19 21:10 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #45 from rguenth at gcc dot gnu dot org  2009-12-19 21:10 -------
(In reply to comment #41)
> Indeed.  The PRE issue could be fixed by fixing PR38819 not in the way it is
> done now but "properly" detect the invalid situations during ANTIC computation
> and simply never mark trapping expressions so.  At the current point its
> hard to tell if the insertion is valid because the original expression is
> always executed if the insertion point is - simply because we no longer
> know where the original expression was.
> 
> Thus, the "proper" place (err, I think at least) is during translating
> ANTIC_OUT through the basic-block to ANTIC_IN (thus, in clean()).  It
> might be a bit expensive, though pre-computing if a basic-block possibly
> exits the CFG could speed this up significantly.  Another "proper" place
> would be to add fake edges to exit for each such point in the CFG
> (basically split blocks at each possibly noreturn call and add an edge
> to exit).  But that might be even more expensive.

Doing this in a straight-forward way shows that the division isn't
partially redundant:

<bb 4>:
  # j_2 = PHI <1(3), j_101(7)>
  jmini_55 = j_2 - i_1;
  D.1530_57 = *nnd_28(D);
  if (i_1 > D.1530_57)
    goto <bb 7>;
  else
    goto <bb 5>;

<bb 5>:
  D.1576_60 = D.1530_57 - i_1;
  D.1577_64 = (character(kind=4)) D.1576_60;
  D.1583_68 = (character(kind=4)) D.1582_45;
  countm1.6_69 = D.1577_64 / D.1583_68;
...
  if (countm1.6_69 == 0)
    goto <bb 7>;
  else
    goto <bb 6>;

<bb 6>:
...
  if (countm1.6_81 == 0)
    goto <bb 7>;
  else
    goto <bb 6>;

<bb 7>:
...
  if (j_2 == D.1560_49)
    goto <bb 8>;
  else
    goto <bb 4>;

<bb 8>:
  i_103 = i_1 + 1;
  if (i_1 == D.1582_45)
    goto <bb 9>;
  else
    goto <bb 3>;

The division may not be executed if i > nnd is always true, which it is
if nnd is <= 2.  Thus fixing PRE is not the solution here (LIM will
still move the expensive division if VRP proves that it cannot trap, though).
That is, computing countm1 before the loop entry check, as suggested by
Michael.

But then going with the VRP solution sounds like a better idea to me
(to fix this particular regression, that is).

PR42438 tracks the PRE issue now which IMHO is unrelated to this bug.
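
The safety argument can be sketched in Python (a model of the dump above, not
GCC code).  The name step stands in for the divisor D.1582_45, and can_hoist
is a hypothetical stand-in for a VRP-style range proof; the unsigned
conversions from the dump are left out for simplicity.

```python
def inner_trip_count(i, nnd, step):
    """Original placement: the division sits behind the i > nnd guard."""
    if i > nnd:
        return None  # bb 4 branches straight to bb 7, no division
    return (nnd - i) // step


def inner_trip_count_speculated(i, nnd, step):
    """Division moved above the guard, as a naive hoist would do."""
    countm1 = (nnd - i) // step  # runs even when the guard skips the loop
    if i > nnd:
        return None
    return countm1


def can_hoist(step_range):
    """Hoisting is safe only if the divisor's value range excludes zero."""
    lo, hi = step_range
    return lo > 0 or hi < 0
```

With i = 5, nnd = 2 and step = 0 the guarded version never divides, while the
speculated one raises ZeroDivisionError, which plays the role of the trap
here.  A range such as (1, 10) for step makes the move safe; (0, 10) does not.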


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108



* [Bug tree-optimization/42108] [4.4/4.5 Regression] 50% performance regression
  2009-11-19 16:01 [Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5 sfilippone at uniroma2 dot it
                   ` (44 preceding siblings ...)
  2009-12-19 21:10 ` rguenth at gcc dot gnu dot org
@ 2010-01-21 13:15 ` jakub at gcc dot gnu dot org
  2010-04-05 12:53 ` steven at gcc dot gnu dot org
                   ` (5 subsequent siblings)
  51 siblings, 0 replies; 53+ messages in thread
From: jakub at gcc dot gnu dot org @ 2010-01-21 13:15 UTC (permalink / raw)
  To: gcc-bugs



-- 

jakub at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|4.4.3                       |4.4.4


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108



* [Bug tree-optimization/42108] [4.4/4.5 Regression] 50% performance regression
  2009-11-19 16:01 [Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5 sfilippone at uniroma2 dot it
                   ` (45 preceding siblings ...)
  2010-01-21 13:15 ` jakub at gcc dot gnu dot org
@ 2010-04-05 12:53 ` steven at gcc dot gnu dot org
  2010-04-05 12:54 ` rguenther at suse dot de
                   ` (4 subsequent siblings)
  51 siblings, 0 replies; 53+ messages in thread
From: steven at gcc dot gnu dot org @ 2010-04-05 12:53 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #46 from steven at gcc dot gnu dot org  2010-04-05 12:52 -------
What happened with the patch of comment #33?


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108



* [Bug tree-optimization/42108] [4.4/4.5 Regression] 50% performance regression
  2009-11-19 16:01 [Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5 sfilippone at uniroma2 dot it
                   ` (46 preceding siblings ...)
  2010-04-05 12:53 ` steven at gcc dot gnu dot org
@ 2010-04-05 12:54 ` rguenther at suse dot de
  2010-04-05 12:57 ` rguenther at suse dot de
                   ` (3 subsequent siblings)
  51 siblings, 0 replies; 53+ messages in thread
From: rguenther at suse dot de @ 2010-04-05 12:54 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #47 from rguenther at suse dot de  2010-04-05 12:54 -------
Subject: Re:  [4.4/4.5 Regression] 50% performance
 regression

On Mon, 5 Apr 2010, steven at gcc dot gnu dot org wrote:

> ------- Comment #46 from steven at gcc dot gnu dot org  2010-04-05 12:52 -------
> What happened with the patch of comment #33?

scheduled for stage1 (maybe, after much cleanup)


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108



* [Bug tree-optimization/42108] [4.4/4.5 Regression] 50% performance regression
  2009-11-19 16:01 [Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5 sfilippone at uniroma2 dot it
                   ` (47 preceding siblings ...)
  2010-04-05 12:54 ` rguenther at suse dot de
@ 2010-04-05 12:57 ` rguenther at suse dot de
  2010-04-05 13:02 ` steven at gcc dot gnu dot org
                   ` (2 subsequent siblings)
  51 siblings, 0 replies; 53+ messages in thread
From: rguenther at suse dot de @ 2010-04-05 12:57 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #48 from rguenther at suse dot de  2010-04-05 12:56 -------
Subject: Re:  [4.4/4.5 Regression] 50% performance
 regression

On Mon, 5 Apr 2010, rguenther at suse dot de wrote:

> ------- Comment #47 from rguenther at suse dot de  2010-04-05 12:54 -------
> Subject: Re:  [4.4/4.5 Regression] 50% performance
>  regression
> 
> On Mon, 5 Apr 2010, steven at gcc dot gnu dot org wrote:
> 
> > ------- Comment #46 from steven at gcc dot gnu dot org  2010-04-05 12:52 -------
> > What happened with the patch of comment #33?
> 
> scheduled for stage1 (maybe, after much cleanup)

No, it even got applied.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108



* [Bug tree-optimization/42108] [4.4/4.5 Regression] 50% performance regression
  2009-11-19 16:01 [Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5 sfilippone at uniroma2 dot it
                   ` (48 preceding siblings ...)
  2010-04-05 12:57 ` rguenther at suse dot de
@ 2010-04-05 13:02 ` steven at gcc dot gnu dot org
  2010-04-05 14:23 ` rguenth at gcc dot gnu dot org
  2010-04-30  8:55 ` [Bug tree-optimization/42108] [4.4/4.5/4.6 " jakub at gcc dot gnu dot org
  51 siblings, 0 replies; 53+ messages in thread
From: steven at gcc dot gnu dot org @ 2010-04-05 13:02 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #49 from steven at gcc dot gnu dot org  2010-04-05 13:01 -------
At least the tree-vrp.c bit did not get applied (as of trunk r157950)


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108



* [Bug tree-optimization/42108] [4.4/4.5 Regression] 50% performance regression
  2009-11-19 16:01 [Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5 sfilippone at uniroma2 dot it
                   ` (49 preceding siblings ...)
  2010-04-05 13:02 ` steven at gcc dot gnu dot org
@ 2010-04-05 14:23 ` rguenth at gcc dot gnu dot org
  2010-04-30  8:55 ` [Bug tree-optimization/42108] [4.4/4.5/4.6 " jakub at gcc dot gnu dot org
  51 siblings, 0 replies; 53+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2010-04-05 14:23 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #50 from rguenth at gcc dot gnu dot org  2010-04-05 14:23 -------
(In reply to comment #49)
> At least the tree-vrp.c bit did not get applied (as of trunk r157950)
> 

Yup, my fault.  I looked at the wrong patch.  Thus, the first comment
applies - maybe stage1 with lots of cleanups.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108



* [Bug tree-optimization/42108] [4.4/4.5/4.6 Regression] 50% performance regression
  2009-11-19 16:01 [Bug fortran/42108] New: Performance drop from 4.3 to 4.4/4.5 sfilippone at uniroma2 dot it
                   ` (50 preceding siblings ...)
  2010-04-05 14:23 ` rguenth at gcc dot gnu dot org
@ 2010-04-30  8:55 ` jakub at gcc dot gnu dot org
  51 siblings, 0 replies; 53+ messages in thread
From: jakub at gcc dot gnu dot org @ 2010-04-30  8:55 UTC (permalink / raw)
  To: gcc-bugs



-- 

jakub at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|4.4.4                       |4.4.5


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108



