public inbox for gcc-bugs@sourceware.org
* [Bug target/23322] New: performance regression, possibly related to caching
@ 2005-08-11  0:57 danalis at cis dot udel dot edu
  2005-08-11  0:58 ` [Bug target/23322] " danalis at cis dot udel dot edu
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: danalis at cis dot udel dot edu @ 2005-08-11  0:57 UTC (permalink / raw)
  To: gcc-bugs

We ran the bench++ suite, mentioned as well in
http://gcc.gnu.org/ml/gcc/2005-08/msg00197.html ,
using -O2 and we noticed one more interesting case.

Namely, S000005m runs slower (x4) when compiled with
g++-4.0.1 than when compiled with either g++-2.95.3,
or g++-4.1.0-20050723.
This regression does not occur when -O3 is used.
Apparently, it is related to the existence of a
*dead* cerr, as changing the cerr to printf makes
the regression go away, and the assembly shorter.

Is this a caching effect?

-- 
           Summary: performance regression, possibly related to caching
           Product: gcc
           Version: 4.0.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: target
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: danalis at cis dot udel dot edu
                CC: gcc-bugs at gcc dot gnu dot org
GCC target triplet: i686-linux


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=23322



* [Bug target/23322] performance regression, possibly related to caching
  2005-08-11  0:57 [Bug target/23322] New: performance regression, possibly related to caching danalis at cis dot udel dot edu
@ 2005-08-11  0:58 ` danalis at cis dot udel dot edu
  2005-08-11 10:30 ` [Bug rtl-optimization/23322] [4.1 regression] " rguenth at gcc dot gnu dot org
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: danalis at cis dot udel dot edu @ 2005-08-11  0:58 UTC (permalink / raw)
  To: gcc-bugs


------- Additional Comments From danalis at cis dot udel dot edu  2005-08-11 00:58 -------
Created an attachment (id=9466)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=9466&action=view)
Source code.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=23322



* [Bug rtl-optimization/23322] [4.1 regression] performance regression, possibly related to caching
  2005-08-11  0:57 [Bug target/23322] New: performance regression, possibly related to caching danalis at cis dot udel dot edu
  2005-08-11  0:58 ` [Bug target/23322] " danalis at cis dot udel dot edu
@ 2005-08-11 10:30 ` rguenth at gcc dot gnu dot org
  2005-08-11 12:01 ` [Bug target/23322] " pinskia at gcc dot gnu dot org
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2005-08-11 10:30 UTC (permalink / raw)
  To: gcc-bugs


------- Additional Comments From rguenth at gcc dot gnu dot org  2005-08-11 10:30 -------
I cannot confirm your observations.  Instead, with -O2 the timings are about
the same for 4.0.2 (20050728) and 4.1.0 (20050803), while with -O3 the 4.0.2
compiler seems to be about 2x faster, even though the tree optimizers do a
better job in the 4.1 case.

One difference is:

(4.1)
.L37:
        movl    $0, (%eax)
        movl    $1074266112, 4(%eax)
        addl    $8, %eax
        cmpl    %eax, %edx
        jne     .L37

vs.

(4.0)
        fldl    init_value
.L16:
        fstl    (%eax)
        addl    $8, %eax
        cmpl    %eax, %ecx
        jne     .L16

(known bug, I think - Andrew will know the PR)

The other one is

(4.0)
.L8:
        fldz
        xorl    %eax, %eax
        fstl    -16(%ebp)
        .p2align 4,,15
.L11:
        faddl   (%ebx,%eax,8)
        incl    %eax
        cmpl    %edx, %eax
        fstl    -16(%ebp)
        jne     .L11
        fstp    %st(0)
        jmp     .L10

vs.

(4.1)
        fldz
        xorl    %eax, %eax
        fstpl   -16(%ebp)
        jmp     .L31
        .p2align 4,,7
.L43:
        fstp    %st(0)
.L31:
        fldl    -16(%ebp)
        faddl   (%ebx,%eax,8)
        incl    %eax
        cmpl    %edx, %eax
        fstl    -16(%ebp)
        jne     .L43

which certainly explains the big difference.  This is just

<L37>:;
  result = 0.0;
  n = 0;

<L8>:;
  result = MEM[base: first, index: (double *) n, step: 8B] + result;
  n = n + 1;
  if (n != D.34008) goto <L8>; else goto <L34>;

by the way, or at the source level:

static double test0(double* first, double* last) {
    double result = 0;
    for (int n = 0; n < last - first; ++n) result += first[n];
    return result;
}

Note that when this function is compiled stand-alone, both compilers produce
identical (good) assembly:

        fldz
        xorl    %eax, %eax
        .p2align 4,,15
.L5:
        faddl   (%ecx,%eax,8)
        incl    %eax
        cmpl    %edx, %eax
        jne     .L5

so it looks to me that RTL optimization goes berserk and messes things up
here.  The cerr effect may have something to do with aliasing (though again at
the RTL level, I think).  IVOPTs dumps show

(4.0)
  # result_67 = PHI <result_32(19), 0.0(17)>;
  # n_4 = PHI <n_66(19), 0(17)>;
<L6>:;
  D.32905_127 = (unsigned int) n_4;
  D.32906_126 = (double *) D.32905_127;
  D.32907_125 = D.32906_126 * 8B;
  D.32908_124 = first_11 + D.32907_125;
  D.32848_69 = D.32908_124;
  #   VUSE <init_value_5>;
  #   VUSE <data_12>;
  #   VUSE <Data_129>;
  #   VUSE <cerr_3>;
  D.32849_64 = *D.32848_69;
  result_32 = D.32849_64 + result_67;
  n_66 = n_4 + 1;
  if (n_66 != D.32844_140) goto <L34>; else goto <L35>;

(4.1)
  # n_105 = PHI <n_66(11), 0(9)>;
  # result_103 = PHI <result_65(11), 0.0(9)>;
<L8>:;
  D.34086_22 = (double *) n_105;
  #   VUSE <cerr_13>;
  #   VUSE <data_36>;
  #   VUSE <Data_27>;
  D.34003_64 = MEM[base: first_11, index: D.34086_22, step: 8B];
  result_65 = D.34003_64 + result_103;
  n_66 = n_105 + 1;
  if (n_66 != D.34008_102) goto <L33>; else goto <L34>;

which shows there's no real difference in tree-level alias information.
For the separate function we do

  # n_27 = PHI <n_19(3), 0(1)>;
  # result_25 = PHI <result_18(3), 0.0(1)>;
<L0>:;
  D.1814_2 = (double *) n_27;
  #   VUSE <TMT.8_20>;
  D.1750_17 = MEM[base: first_7, index: D.1814_2, step: 8B];
  result_18 = D.1750_17 + result_25;
  n_19 = n_27 + 1;
  if (n_19 != D.1744_24) goto <L9>; else goto <L10>;

though.  I'll make this rtl-optimization until someone tries another architecture.
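
For anyone who wants to try this on another architecture, the situation
described above can be sketched as follows: the test0 loop called from a
larger function that also carries a std::cerr path which is never taken.
This is only a sketch under stated assumptions, not the attached source.
The names run_benchmark and the "test0 failed" message are hypothetical.
The value 3.0 is what the constant 1074266112 (0x40080000) in the 4.1 store
loop decodes to as the high word of a double, but how the attachment really
initialises its data is not shown in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <iostream>
#include <vector>

// The loop quoted above, unchanged apart from the added precondition check.
static double test0(double* first, double* last) {
    assert(first <= last);
    double result = 0;
    for (int n = 0; n < last - first; ++n) result += first[n];
    return result;
}

// Hypothetical caller (not from the attachment): fills a buffer, sums it,
// and contains a std::cerr path that is not executed when the sum is right.
double run_benchmark(std::size_t size, double init_value) {
    std::vector<double> data(size, init_value);
    double* first = data.data();
    double sum = test0(first, first + size);
    if (sum != init_value * size) {
        // Never reached for small integral values such as 3.0, but its
        // presence may still influence how the inlined loop is compiled.
        std::cerr << "test0 failed" << std::endl;
    }
    return sum;
}
```

Compiling this with -O2 -S under each compiler and comparing the inner loop
against the listings above would show whether the accumulator stays on the
x87 stack or is spilled to the frame.  Whether this reduced form reproduces
the problem at all is untested.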

-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|target                      |rtl-optimization
           Keywords|                            |missed-optimization
            Summary|performance regression,     |[4.1 regression] performance
                   |possibly related to caching |regression, possibly related
                   |                            |to caching


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=23322



* [Bug target/23322] [4.1 regression] performance regression, possibly related to caching
  2005-08-11  0:57 [Bug target/23322] New: performance regression, possibly related to caching danalis at cis dot udel dot edu
  2005-08-11  0:58 ` [Bug target/23322] " danalis at cis dot udel dot edu
  2005-08-11 10:30 ` [Bug rtl-optimization/23322] [4.1 regression] " rguenth at gcc dot gnu dot org
@ 2005-08-11 12:01 ` pinskia at gcc dot gnu dot org
  2005-08-11 14:01 ` pinskia at gcc dot gnu dot org
  2005-08-11 14:56 ` roger at eyesopen dot com
  4 siblings, 0 replies; 6+ messages in thread
From: pinskia at gcc dot gnu dot org @ 2005-08-11 12:01 UTC (permalink / raw)
  To: gcc-bugs


------- Additional Comments From pinskia at gcc dot gnu dot org  2005-08-11 12:01 -------
This is reg-stack going funny, so this is a target issue.

-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|rtl-optimization            |target


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=23322



* [Bug target/23322] [4.1 regression] performance regression, possibly related to caching
  2005-08-11  0:57 [Bug target/23322] New: performance regression, possibly related to caching danalis at cis dot udel dot edu
                   ` (2 preceding siblings ...)
  2005-08-11 12:01 ` [Bug target/23322] " pinskia at gcc dot gnu dot org
@ 2005-08-11 14:01 ` pinskia at gcc dot gnu dot org
  2005-08-11 14:56 ` roger at eyesopen dot com
  4 siblings, 0 replies; 6+ messages in thread
From: pinskia at gcc dot gnu dot org @ 2005-08-11 14:01 UTC (permalink / raw)
  To: gcc-bugs



-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |roger at eyesopen dot com
   Target Milestone|---                         |4.1.0
            Version|4.0.1                       |4.1.0


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=23322



* [Bug target/23322] [4.1 regression] performance regression, possibly related to caching
  2005-08-11  0:57 [Bug target/23322] New: performance regression, possibly related to caching danalis at cis dot udel dot edu
                   ` (3 preceding siblings ...)
  2005-08-11 14:01 ` pinskia at gcc dot gnu dot org
@ 2005-08-11 14:56 ` roger at eyesopen dot com
  4 siblings, 0 replies; 6+ messages in thread
From: roger at eyesopen dot com @ 2005-08-11 14:56 UTC (permalink / raw)
  To: gcc-bugs


------- Additional Comments From roger at eyesopen dot com  2005-08-11 14:56 -------
I'll take a look, but on first inspection this looks more like a register
allocation issue than a reg-stack problem.  In the first (4.0) case, the
accumulator "result" is assigned a hard register in the loop, whilst in
the second (4.1) it is being placed in memory, at -16(%ebp).  This may also
explain why extracting that loop into a stand-alone function produces
optimal/original code, as the register allocator gets less confused by other
influences in the function.  The extracted code is also even better than 4.0's,
as it avoids writing "result" to memory on each iteration (store sinking).
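
At the source level, the difference between the two loops can be sketched
roughly as below.  This is an illustration only: the function names are
hypothetical, and volatile is used purely to force a load and a store of the
accumulator on every iteration, which is not something the original code
does.  The first function mirrors the 4.1 loop that goes through -16(%ebp);
the second mirrors the stand-alone version, where the accumulator stays in a
register and the single store is sunk out of the loop.

```cpp
#include <cassert>
#include <cstddef>

// Accumulator kept in memory: each iteration reloads and rewrites *result.
void sum_through_memory(const double* first, std::size_t count,
                        volatile double* result) {
    assert(result != nullptr);
    *result = 0.0;
    for (std::size_t n = 0; n != count; ++n) {
        *result = *result + first[n];  // one load and one store per iteration
    }
}

// Accumulator kept in a local: the compiler is free to hold it in a
// register, and memory is written exactly once after the loop.
void sum_in_register(const double* first, std::size_t count, double* result) {
    assert(result != nullptr);
    double acc = 0.0;
    for (std::size_t n = 0; n != count; ++n) {
        acc += first[n];
    }
    *result = acc;  // the sunk store
}
```

Both functions compute the same sum; only the memory traffic inside the loop
differs.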

The second failure does show an interesting reg-stack/reg-alloc interaction
though.  The "hot" accumulator value is live on the backedge and the exit
edge of the loop but not on the incoming edge.  Clearly, the best fix is to
make this value live on the incoming edge, but failing that it is actually
better to prevent it being live on the back and exit edges, and add compensation
code after the loop.  i.e. if the store to result in the loop used fstpl, you
wouldn't need to fstp %st(0) on each loop iteration, but would instead need a
compensating fldl after the loop.

I'm not sure how easy it would be to teach GCC's register allocation to take
these considerations into account, or failing that, whether reg-stack could be
tweaked/hacked to locally fix this up.  But the fundamental problem is that
reg-alloc should assign result to a hard register as it clearly knows there
are enough available in that block.

reg-stack.c is just doing what it's told, and in this case it's being told to
do something stupid.


-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
     Ever Confirmed|                            |1
   Last reconfirmed|0000-00-00 00:00:00         |2005-08-11 14:56:31
               date|                            |


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=23322

