public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed
* Some statement counts for gcc
@ 2002-08-25 16:04 Brad Lucier
  2002-08-26  0:17 ` Zack Weinberg
  0 siblings, 1 reply; 5+ messages in thread
From: Brad Lucier @ 2002-08-25 16:04 UTC (permalink / raw)
  To: gcc; +Cc: Brad Lucier

As far as I know, the algorithm problems in -fnew-ra haven't been fixed
yet, and I wanted to help by giving some more data.

Here are some statement counts for 

banach-119% gcc/xgcc -v
Reading specs from /home/c/lucier/local/gcc-test/lib/gcc-lib/sparcv9-sun-solaris2.8/3.3/specs
Configured with: ../configure --prefix=/home/c/lucier/local/gcc-test --enable-languages=c --enable-checking=no --enable-coverage sparcv9-sun-solaris2.8
Thread model: posix
gcc version 3.3 20020823 (experimental)

I ran this job

banach-113% gcc/cc1 -fnew-ra -m64 -O1 -fschedule-insns2 -fno-strict-aliasing -fno-math-errno -mcpu=ultrasparc -mtune=ultrasparc _repl.i
 {GC 5344k -> 1146k} ___H__20___repl {GC 27541k -> 9605k} {GC 14111k -> 9356k} {GC 15065k -> 10374k} {GC 16938k -> 10335k} {GC 18900k -> 10618k} {GC 14579k -> 11902k} {GC 20160k -> 12347k} ___init_proc ____20___repl
Execution times (seconds)
 garbage collection    :   3.68 ( 1%) usr   0.01 ( 0%) sys   2.00 ( 0%) wall
 cfg construction      :   3.94 ( 1%) usr   0.35 ( 4%) sys   5.00 ( 1%) wall
 cfg cleanup           :  13.53 ( 3%) usr   0.00 ( 0%) sys  13.50 ( 3%) wall
 trivially dead code   :   1.58 ( 0%) usr   0.00 ( 0%) sys   1.50 ( 0%) wall
 life analysis         :  37.20 ( 7%) usr   0.00 ( 0%) sys  37.00 ( 7%) wall
 life info update      :  10.31 ( 2%) usr   0.00 ( 0%) sys   9.50 ( 2%) wall
 preprocessing         :   1.21 ( 0%) usr   1.17 (15%) sys   4.50 ( 1%) wall
 lexical analysis      :   0.70 ( 0%) usr   2.21 (28%) sys   4.00 ( 1%) wall
 parser                :   5.55 ( 1%) usr   1.33 (17%) sys   4.50 ( 1%) wall
 expand                :   1.46 ( 0%) usr   0.05 ( 1%) sys   1.50 ( 0%) wall
 varconst              :   0.69 ( 0%) usr   0.03 ( 0%) sys   0.50 ( 0%) wall
 integration           :   0.37 ( 0%) usr   0.01 ( 0%) sys   0.50 ( 0%) wall
 jump                  :   7.99 ( 2%) usr   0.00 ( 0%) sys   8.00 ( 2%) wall
 CSE                   :   3.98 ( 1%) usr   0.00 ( 0%) sys   4.50 ( 1%) wall
 loop analysis         :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 branch prediction     : 165.11 (33%) usr   0.07 ( 1%) sys 165.50 (33%) wall
 flow analysis         :   0.72 ( 0%) usr   0.00 ( 0%) sys   1.00 ( 0%) wall
 combiner              :   6.41 ( 1%) usr   0.00 ( 0%) sys   6.00 ( 1%) wall
 if-conversion         :   0.87 ( 0%) usr   0.00 ( 0%) sys   1.00 ( 0%) wall
 local alloc           : 204.74 (41%) usr   2.77 (34%) sys 208.50 (41%) wall
 global alloc          :   5.34 ( 1%) usr   0.00 ( 0%) sys   6.00 ( 1%) wall
 reload CSE regs       :   9.18 ( 2%) usr   0.00 ( 0%) sys   8.50 ( 2%) wall
 flow 2                :   0.34 ( 0%) usr   0.00 ( 0%) sys   0.50 ( 0%) wall
 if-conversion 2       :   0.90 ( 0%) usr   0.00 ( 0%) sys   1.00 ( 0%) wall
 rename registers      :   3.23 ( 1%) usr   0.00 ( 0%) sys   4.00 ( 1%) wall
 scheduling 2          :   3.33 ( 1%) usr   0.00 ( 0%) sys   3.50 ( 1%) wall
 delay branch sched    :   1.89 ( 0%) usr   0.00 ( 0%) sys   2.50 ( 0%) wall
 shorten branches      :   0.24 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 final                 :   1.08 ( 0%) usr   0.01 ( 0%) sys   1.00 ( 0%) wall
 rest of compilation   :   1.97 ( 0%) usr   0.00 ( 0%) sys   1.50 ( 0%) wall
 TOTAL                 : 497.61             8.03           507.00

where _repl.i.gz is at

http://www.math.purdue.edu/~lucier/_repl.i.gz

The complete gcov files are at

http://www.math.purdue.edu/~lucier/gcovs.tgz

I wanted to compile

http://www.math.purdue.edu/~lucier/_io.i.gz

but at 5-6 hours per run, using up to 3.8 GB swap, and with me making one
mistake or another three times in a row, I thought I'd start with the smaller
file ;-).

I sorted the executed lines in decreasing order, see

http://www.math.purdue.edu/~lucier/sorted-lines.gz

A surprising (to me) number of lines in real.c are executed; I might look
there to see what's going on.

Brad

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Some statement counts for gcc
  2002-08-25 16:04 Some statement counts for gcc Brad Lucier
@ 2002-08-26  0:17 ` Zack Weinberg
  2002-08-26  6:18   ` Brad Lucier
  0 siblings, 1 reply; 5+ messages in thread
From: Zack Weinberg @ 2002-08-26  0:17 UTC (permalink / raw)
  To: Brad Lucier; +Cc: gcc, Jan Hubicka

On Sun, Aug 25, 2002 at 06:04:11PM -0500, Brad Lucier wrote:

>  branch prediction     : 165.11 (33%) usr   0.07 ( 1%) sys 165.50 (33%) wall
...
> A surprising (to me) number of lines in real.c are executed; I might look
> there to see what's going on.

The branch predictor uses emulated floating point numbers internally.
Jan Hubicka has explained why this is currently necessary -- can't
find the message at the moment though.  However, real.c is indeed
quite slow; I suspect that accounts entirely for the amount of time
spent in this patch

If I remember correctly this code has a very complicated flow graph,
and branch prediction may not help much; perhaps the right thing is
to detect code like this and disable that optimization.

zw

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Some statement counts for gcc
  2002-08-26  0:17 ` Zack Weinberg
@ 2002-08-26  6:18   ` Brad Lucier
  2002-08-29 16:54     ` Zack Weinberg
  0 siblings, 1 reply; 5+ messages in thread
From: Brad Lucier @ 2002-08-26  6:18 UTC (permalink / raw)
  To: Zack Weinberg; +Cc: Brad Lucier, gcc, Jan Hubicka

> 
> On Sun, Aug 25, 2002 at 06:04:11PM -0500, Brad Lucier wrote:
> 
> >  branch prediction     : 165.11 (33%) usr   0.07 ( 1%) sys 165.50 (33%) wall
> ...
> > A surprising (to me) number of lines in real.c are executed; I might look
> > there to see what's going on.
> 
> The branch predictor uses emulated floating point numbers internally.
> Jan Hubicka has explained why this is currently necessary -- can't
> find the message at the moment though.  However, real.c is indeed
> quite slow; I suspect that accounts entirely for the amount of time
> spent in this patch

The problem is that gcc's fp code generator on x87 is broken enough that
you can get different results for the same expression, hence the use
of the simulator which does not use extended-precision arithmetic
by default.  I'm not really sure that the simulator is unreasonably slow.

> If I remember correctly this code has a very complicated flow graph,
> and branch prediction may not help much; perhaps the right thing is
> to detect code like this and disable that optimization.

This has been the response to several of my recent observations about
gcc's algorithms, etc.  I'd prefer that if there are problems they be
fixed rather than papered over by a -fbrad's_code_don't_optimize flag.

At any rate, here's some data for compiling the larger file:

banach-139% gcc/cc1 -fnew-ra -m64 -O1 -fschedule-insns2 -fno-strict-aliasing -fno-math-errno -mcpu=ultrasparc -mtune=ultrasparc _io.i
 ___H__20___io {GC 75305k -> 25172k} {GC 42902k -> 24245k} {GC 32413k -> 24597k} {GC 35635k -> 27127k} {GC 45565k -> 27034k} {GC 50659k -> 26850k} {GC 37567k -> 30050k} {GC 52691k -> 31806k} ___init_proc ____20___io
Execution times (seconds)
 garbage collection    :  10.79 ( 0%) usr   0.24 ( 0%) sys  21.50 ( 0%) wall
 cfg construction      :  29.83 ( 0%) usr   4.03 ( 1%) sys  34.50 ( 0%) wall
 cfg cleanup           :  99.46 ( 0%) usr   0.03 ( 0%) sys  99.00 ( 0%) wall
 trivially dead code   :   5.48 ( 0%) usr   0.00 ( 0%) sys   5.50 ( 0%) wall
 life analysis         : 435.03 ( 2%) usr   0.08 ( 0%) sys 442.50 ( 2%) wall
 life info update      :  58.31 ( 0%) usr   0.00 ( 0%) sys  58.50 ( 0%) wall
 preprocessing         :   2.93 ( 0%) usr   2.57 ( 1%) sys   5.00 ( 0%) wall
 lexical analysis      :   1.64 ( 0%) usr   5.18 ( 1%) sys  10.00 ( 0%) wall
 parser                :  12.39 ( 0%) usr   2.87 ( 1%) sys  13.50 ( 0%) wall
 expand                :   4.24 ( 0%) usr   0.17 ( 0%) sys   5.00 ( 0%) wall
 varconst              :   1.49 ( 0%) usr   0.01 ( 0%) sys   1.50 ( 0%) wall
 integration           :   1.04 ( 0%) usr   0.02 ( 0%) sys   1.00 ( 0%) wall
 jump                  :  64.02 ( 0%) usr   0.00 ( 0%) sys  63.50 ( 0%) wall
 CSE                   :  12.37 ( 0%) usr   0.00 ( 0%) sys  12.50 ( 0%) wall
 loop analysis         :   0.12 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 branch prediction     :1211.11 ( 5%) usr   1.25 ( 0%) sys1214.50 ( 5%) wall
 flow analysis         :   3.76 ( 0%) usr   0.00 ( 0%) sys   3.50 ( 0%) wall
 combiner              :  18.08 ( 0%) usr   0.00 ( 0%) sys  18.50 ( 0%) wall
 if-conversion         :   5.82 ( 0%) usr   0.00 ( 0%) sys   6.00 ( 0%) wall
 local alloc           :20603.26 (91%) usr 399.06 (96%) sys21695.50 (91%) wall
 global alloc          :  16.60 ( 0%) usr   0.05 ( 0%) sys  22.00 ( 0%) wall
 reload CSE regs       :  62.53 ( 0%) usr   0.06 ( 0%) sys  69.00 ( 0%) wall
 flow 2                :   1.11 ( 0%) usr   0.00 ( 0%) sys   0.50 ( 0%) wall
 if-conversion 2       :   5.83 ( 0%) usr   0.01 ( 0%) sys   6.50 ( 0%) wall
 rename registers      :  10.91 ( 0%) usr   0.00 ( 0%) sys  11.00 ( 0%) wall
 scheduling 2          :   9.92 ( 0%) usr   0.02 ( 0%) sys  10.00 ( 0%) wall
 delay branch sched    :  14.78 ( 0%) usr   0.00 ( 0%) sys  14.50 ( 0%) wall
 shorten branches      :   0.85 ( 0%) usr   0.00 ( 0%) sys   1.00 ( 0%) wall
 final                 :   4.50 ( 0%) usr   0.05 ( 0%) sys   5.00 ( 0%) wall
 rest of compilation   :   8.71 ( 0%) usr   0.02 ( 0%) sys   9.50 ( 0%) wall
 TOTAL                 :22716.96           415.73          23860.50

Although the GC statistics indicate not much memory use, this code took up
to 3.8 GB of swap when running.

This is for the file at

http://www.math.purdue.edu/~lucier/_io.i.gz

Here we see an excessive amount of time spent in the new register allocator;
the most-executed lines in the ra-* files are at the end of this message.

The complete .c.gcov files for this run are at

http://www.math.purdue.edu/~lucier/gcovs_io.tgz

The ra-*.c.gcov files are at

http://www.math.purdue.edu/~lucier/ra-gcovs_io.tgz

and the executed lines sorted in decreasing order of execution are at

http://www.math.purdue.edu/~lucier/ra-sorted-lines.gz

Brad

ra-build.c.gcov: 79915683: 1625:  if (web1 == web2 || TEST_BIT (igraph, index))
ra-build.c.gcov: 79915683: 1623:  unsigned int index = igraph_index (id1, id2);
ra-build.c.gcov: 79915683: 1622:  unsigned int id1 = web1->id, id2 = web2->id;
ra-build.c.gcov: 79915683: 1621:{
ra-colorize.c.gcov: 79134101:  741:  for (wl = v->conflict_list; wl; wl = wl->next)
ra-colorize.c.gcov: 79111468:  764:       for (i = 0; i < nregs; i++)
ra-colorize.c.gcov: 79111383:  802:       if (pweb->type != SELECT && pweb->type != COALESCED)
ra-colorize.c.gcov: 79111383:  799:           if (u->type != PRECOLORED)
ra-colorize.c.gcov: 79111383:  769:             record_conflict (web, pweb);
ra-colorize.c.gcov: 79111383:  768:           if (wl->sub == NULL)
ra-colorize.c.gcov: 79111383:  766:           if (u->type == PRECOLORED)
ra-colorize.c.gcov: 79111383:  755:       if (u->type == PRECOLORED)
ra-colorize.c.gcov: 79111383:  754:       int nregs = 1 + v->add_hardregs;
ra-colorize.c.gcov: 79111383:  753:       struct web *web = u;
ra-colorize.c.gcov: 79111383:  751:      if (1)
ra-colorize.c.gcov: 79111383:  743:      struct web *pweb = wl->t;
ra-colorize.c.gcov: 79111298:  800:             break;
ra-build.c.gcov: 69833156: 1636:  if ((web1->type == PRECOLORED
ra-build.c.gcov: 69833156: 1633:    return;
ra-build.c.gcov: 69833156: 1631:  if ((web1->regno < FIRST_PSEUDO_REGISTER && fixed_regs[web1->regno])
ra-build.c.gcov: 69833156: 1627:  if (id1 == id2)
ra-build.c.gcov: 69833156: 1626:    return;
ra-build.c.gcov: 69827720: 1652:  if (web1->type != PRECOLORED && web2->type != PRECOLORED
ra-build.c.gcov: 69827615: 1665:  add_conflict_edge (web2, web1);
ra-build.c.gcov: 69827615: 1664:  add_conflict_edge (web1, web2);
ra-build.c.gcov: 69827615: 1663:  SET_BIT (igraph, index);
ra-rewrite.c.gcov: 66002982: 1141:          for (d = (pass) ? WEBS(SPILLED) : WEBS(COALESCED); d; d = d->next)
ra-rewrite.c.gcov: 65990988: 1145:              if (aweb->type != SPILLED)
ra-rewrite.c.gcov: 65990988: 1144:              struct web *aweb = alias (web);
ra-rewrite.c.gcov: 65990988: 1143:              struct web *web = DLIST_WEB (d);ra-rewrite.c.gcov: 65972997: 1146:                continue;
ra-colorize.c.gcov: 58624158: 1759:      for (nn = web2->conflict_list; nn && !wide_p; nn = nn->next)
ra-colorize.c.gcov: 58583640: 1760:     if (alias (nn->t)->add_hardregs)
ra-colorize.c.gcov: 55463179:  431:  if (web->num_conflicts < NUM_REGS (web) && before >= NUM_REGS (web))
ra-colorize.c.gcov: 55463179:  430:  web->num_conflicts -= dec;
ra-colorize.c.gcov: 55463179:  429:  int before = web->num_conflicts;
ra-colorize.c.gcov: 55463179:  428:{
ra-colorize.c.gcov: 54992227:  803:         decrement_degree (pweb, 1 + v->add_hardregs);
ra-colorize.c.gcov: 46572957: 1306:  for (wl = web->conflict_list; wl; wl = wl->next)
ra-colorize.c.gcov: 46540347: 1313:      if (ptarget->type != COLORED && ptarget->type != PRECOLORED
ra-colorize.c.gcov: 46540347: 1312:      w = sl ? sl->t : wl->t;
ra-colorize.c.gcov: 46540347: 1311:      IOR_HARD_REG_SET (bias, ptarget->bias_colors);
ra-colorize.c.gcov: 46540347: 1310:      struct sub_conflict *sl = wl->sub;
ra-colorize.c.gcov: 46540347: 1309:      struct web *ptarget = alias (wl->t);
ra-colorize.c.gcov: 46540347: 1308:      struct web *w;
ra-colorize.c.gcov: 46512519:  480:      for (wl = web->conflict_list; wl; wl = wl->next)
ra-colorize.c.gcov: 46479912:  483:       if (pweb->type != SELECT && pweb->type != COALESCED)
ra-colorize.c.gcov: 46479912:  482:       struct web *pweb = wl->t;
ra-rewrite.c.gcov: 19823503:  814:    for (; size--;)
ra-build.c.gcov: 16465236:  546:  return r1;
ra-build.c.gcov: 16465236:  488:  if (r1 != r2)
ra-build.c.gcov: 16465236:  487:{
ra-colorize.c.gcov: 11909293: 1959:  if (web1->spill_cost > web2->spill_cost)

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Some statement counts for gcc
  2002-08-26  6:18   ` Brad Lucier
@ 2002-08-29 16:54     ` Zack Weinberg
  2002-08-30  3:54       ` Jan Hubicka
  0 siblings, 1 reply; 5+ messages in thread
From: Zack Weinberg @ 2002-08-29 16:54 UTC (permalink / raw)
  To: Brad Lucier; +Cc: gcc, Jan Hubicka

On Mon, Aug 26, 2002 at 08:18:15AM -0500, Brad Lucier wrote:
> 
> The problem is that gcc's fp code generator on x87 is broken enough that
> you can get different results for the same expression, hence the use
> of the simulator which does not use extended-precision arithmetic
> by default.

Right.

>  I'm not really sure that the simulator is unreasonably slow.

33% of user time spent in branch prediction seems like something worth
at least looking at, to me.

> > If I remember correctly this code has a very complicated flow graph,
> > and branch prediction may not help much; perhaps the right thing is
> > to detect code like this and disable that optimization.
> 
> This has been the response to several of my recent observations about
> gcc's algorithms, etc.  I'd prefer that if there are problems they be
> fixed rather than papered over by a -fbrad's_code_don't_optimize flag.

In general I agree with you.  However, do I remember correctly that
this is the inner loop of a threaded (Forth sense) interpreter?  Lots
of tiny blocks with computed-goto edges both in and out?  There really
isn't much good branch prediction can do on code like that.

> Although the GC statistics indicate not much memory use, this code took up
> to 3.8 GB of swap when running.

Treat the GC statistics with a pile of salt; they only count data
still live at end of compilation.  Also, I think ra.c has a lot of
local data allocated with xmalloc, which GC doesn't see at all.

zw

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Some statement counts for gcc
  2002-08-29 16:54     ` Zack Weinberg
@ 2002-08-30  3:54       ` Jan Hubicka
  0 siblings, 0 replies; 5+ messages in thread
From: Jan Hubicka @ 2002-08-30  3:54 UTC (permalink / raw)
  To: Zack Weinberg; +Cc: Brad Lucier, gcc, Jan Hubicka

> On Mon, Aug 26, 2002 at 08:18:15AM -0500, Brad Lucier wrote:
> > 
> > The problem is that gcc's fp code generator on x87 is broken enough that
> > you can get different results for the same expression, hence the use
> > of the simulator which does not use extended-precision arithmetic
> > by default.
> 
> Right.
> 
> >  I'm not really sure that the simulator is unreasonably slow.
> 
> 33% of user time spent in branch prediction seems like something worth
> at least looking at, to me.

We got a plan to implement the userlevel .h file that will do FP on
volatile arguments, (poor mans -fcaller-save) and replace current
emulation.  I guess it will speed up by at least 10% making the problem
mood.  I didn't have time to implement this yet, but I will try to do so
for faster compiler branch.

What happends is that the code recursivly traces the loop graph and the
Brad's testcase contains a lot of nested loops.
> 
> > > If I remember correctly this code has a very complicated flow graph,
> > > and branch prediction may not help much; perhaps the right thing is
> > > to detect code like this and disable that optimization.
> > 
> > This has been the response to several of my recent observations about
> > gcc's algorithms, etc.  I'd prefer that if there are problems they be
> > fixed rather than papered over by a -fbrad's_code_don't_optimize flag.
> 
> In general I agree with you.  However, do I remember correctly that
> this is the inner loop of a threaded (Forth sense) interpreter?  Lots
> of tiny blocks with computed-goto edges both in and out?  There really
> isn't much good branch prediction can do on code like that.

You would be surprised, but especially in the case of large complex
functions the code to identify hot blocks on basic block branch
prediction works very well.
I was tracing code in natural very similar to Brad's testcase and it did
worked with about 90% sucess (ie 90% of hot blocks of the function
(based on profile) were marked as hot by prediction code too).
THis is because the programs largery just obey the loop tree.

Honza
> 
> > Although the GC statistics indicate not much memory use, this code took up
> > to 3.8 GB of swap when running.
> 
> Treat the GC statistics with a pile of salt; they only count data
> still live at end of compilation.  Also, I think ra.c has a lot of
> local data allocated with xmalloc, which GC doesn't see at all.
> 
> zw

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2002-08-30  3:54 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-08-25 16:04 Some statement counts for gcc Brad Lucier
2002-08-26  0:17 ` Zack Weinberg
2002-08-26  6:18   ` Brad Lucier
2002-08-29 16:54     ` Zack Weinberg
2002-08-30  3:54       ` Jan Hubicka

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).