Re: if-conversion a performance bottleneck

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* Re: if-conversion a performance bottleneck
@ 2000-05-04 22:32 Mike Stump
  2000-05-04 22:35 ` Richard Henderson
  0 siblings, 1 reply; 24+ messages in thread
From: Mike Stump @ 2000-05-04 22:32 UTC (permalink / raw)
  To: lucier, rth; +Cc: gcc, matzmich

> Date: Thu, 4 May 2000 22:16:17 -0700
> From: Richard Henderson <rth@cygnus.com>

> > make -j bootstrap

> Err.. don't do that?  Try make -j3 instead.

Unfortunately -j3 gives you not three jobs, but an exponential cascade
of 3 jobs per recursive make level.  Yes, I have read the paper
`Recursive make considered harmful'.  :-)

I found that -j3 -l10 will at least try and limit them from expanding
too much, which is useful if your swap limited (just how did those 2
emacen grow to be 60M each, and netscape inflate to 90M? :-().

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-04 22:32 if-conversion a performance bottleneck Mike Stump
@ 2000-05-04 22:35 ` Richard Henderson
  2000-05-05  6:12   ` Brad Lucier
  0 siblings, 1 reply; 24+ messages in thread
From: Richard Henderson @ 2000-05-04 22:35 UTC (permalink / raw)
  To: Mike Stump; +Cc: lucier, gcc, matzmich

On Thu, May 04, 2000 at 10:31:59PM -0700, Mike Stump wrote:
> Unfortunately -j3 gives you not three jobs, but an exponential cascade
> of 3 jobs per recursive make level.

I know.  But 3 or 15 is still a lot better than 1000, which might
be what you get with just -j.

> I found that -j3 -l10 will at least try and limit them from expanding
> too much, which is useful if your swap limited (just how did those 2
> emacen grow to be 60M each, and netscape inflate to 90M? :-().

That's what gigabytes of RAM are for.  ;-)


r~

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-04 22:35 ` Richard Henderson
@ 2000-05-05  6:12   ` Brad Lucier
  2000-05-05 10:37     ` Richard Henderson
  2000-05-05 13:35     ` Gerald Pfeifer
  0 siblings, 2 replies; 24+ messages in thread
From: Brad Lucier @ 2000-05-05  6:12 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Mike Stump, lucier, gcc, matzmich

> On Thu, May 04, 2000 at 10:31:59PM -0700, Mike Stump wrote:
> > Unfortunately -j3 gives you not three jobs, but an exponential cascade
> > of 3 jobs per recursive make level.
> 
> I know.  But 3 or 15 is still a lot better than 1000, which might
> be what you get with just -j.
> 
> > I found that -j3 -l10 will at least try and limit them from expanding
> > too much, which is useful if your swap limited (just how did those 2
> > emacen grow to be 60M each, and netscape inflate to 90M? :-().
> 
> That's what gigabytes of RAM are for.  ;-)

I have gigabytes of RAM; I've watched what happens when I say, e.g.,

make -j9 bootstrap

or

make -j 9 bootstrap

or whatever, and it doesn't seem to actually set off a bunch of parallel
jobs---definitely, by the time the stage1 compiler starts compiling
things, I'm down to one job at a time.  That's with the 2.2.13 kernel
on alpha with make 3.77; is this a known problem?  Should I upgrade
make?

BTW, the maximum load with make -j is < 80 (without libgcc) and
things seem to stay rational because make gets less time as the
load rises and can't spawn jobs at such a high rate.

Brad

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-05  6:12   ` Brad Lucier
@ 2000-05-05 10:37     ` Richard Henderson
  2000-05-05 13:35     ` Gerald Pfeifer
  1 sibling, 0 replies; 24+ messages in thread
From: Richard Henderson @ 2000-05-05 10:37 UTC (permalink / raw)
  To: Brad Lucier; +Cc: Mike Stump, gcc, matzmich

On Fri, May 05, 2000 at 08:12:34AM -0500, Brad Lucier wrote:
> make -j 9 bootstrap
> 
> or whatever, and it doesn't seem to actually set off a bunch of parallel
> jobs--

make MAKE='make -j4' bootstrap -- the -j doesn't normally get
passed down to submakes.

> Should I upgrade make?

Apparently both of us should.  I'm interested in this "build server"
thingy...


r~

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-05  6:12   ` Brad Lucier
  2000-05-05 10:37     ` Richard Henderson
@ 2000-05-05 13:35     ` Gerald Pfeifer
  1 sibling, 0 replies; 24+ messages in thread
From: Gerald Pfeifer @ 2000-05-05 13:35 UTC (permalink / raw)
  To: Brad Lucier; +Cc: Richard Henderson, Mike Stump, gcc, matzmich

On Fri, 5 May 2000, Brad Lucier wrote:
> make -j 9 bootstrap
> 
> or whatever, and it doesn't seem to actually set off a bunch of parallel
> jobs---definitely, by the time the stage1 compiler starts compiling
> things, I'm down to one job at a time.  That's with the 2.2.13 kernel
> on alpha with make 3.77; is this a known problem?  Should I upgrade
> make?

Please do so and let us know the results! 

It would be very useful to document properly, how to use parallel
makes for GCC -- perhaps you could come up with a patch for our
documentation at?

  http://gcc.gnu.org/install/build.html

Gerald
-- 
Gerald "Jerry" pfeifer@dbai.tuwien.ac.at http://www.dbai.tuwien.ac.at/~pfeifer/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-10  9:36                   ` Joe Buck
  2000-05-10  9:52                     ` Jeffrey A Law
@ 2000-05-10 14:17                     ` Joern Rennecke
  1 sibling, 0 replies; 24+ messages in thread
From: Joern Rennecke @ 2000-05-10 14:17 UTC (permalink / raw)
  To: Joe Buck
  Cc: Andreas Schwab, Richard Henderson, Brad Lucier, Michael Matz, gcc

> No, -l does not work and never did.  The reason is that the load value
> increases very slowly as processes are launched; as a result, if you
> say make -j -l, a very large number of processes get launched initially,
> and it may take 10 seconds or so for the load to climb to the level
> specified in -l -- but then the load climbs to 20x that figure or more.
> At this point, no more processes get launched until at least a minute
> after all of the first batch dies, and then another process storm is
> launched.  This is becuase the number checked by -l is based on an average
> load over a full minute, not an instantaneous figure.
> 
> Solution: get make 3.79 and use -j<N>.  The GNU make folks should
> deprecate -l and eventually remove it.

I wouldn't go so far.  It seems clear that -l is not useful to make make
react to the load it creates itself, but you might use it instead or in
addition to nice to control how much a batch job taxes a system that is
used for interactive stuff as well.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-10  9:52                     ` Jeffrey A Law
@ 2000-05-10 10:49                       ` Joe Buck
  0 siblings, 0 replies; 24+ messages in thread
From: Joe Buck @ 2000-05-10 10:49 UTC (permalink / raw)
  To: law; +Cc: Andreas Schwab, Richard Henderson, Brad Lucier, Michael Matz, gcc

>   In message < 200005101628.JAA25911@possibly.synopsys.com >you write:
>   > Solution: get make 3.79 and use -j<N>.  The GNU make folks should
>   > deprecate -l and eventually remove it.
> Most definitely.  I upgraded to 3.79 a few days ago and have really liked
> the way -j works in the new version.  I strongly recommend anyone building
> gcc on an SMP machine upgrade to make-3.79.

Even a 1-processor machine can benefit from -j2, as the compiler will
spend a significant portion of its time reading or writing disk, leaving
idle CPU for the other process.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-10  9:36                   ` Joe Buck
@ 2000-05-10  9:52                     ` Jeffrey A Law
  2000-05-10 10:49                       ` Joe Buck
  2000-05-10 14:17                     ` Joern Rennecke
  1 sibling, 1 reply; 24+ messages in thread
From: Jeffrey A Law @ 2000-05-10  9:52 UTC (permalink / raw)
  To: Joe Buck
  Cc: Andreas Schwab, Richard Henderson, Brad Lucier, Michael Matz, gcc

  In message < 200005101628.JAA25911@possibly.synopsys.com >you write:
  > Solution: get make 3.79 and use -j<N>.  The GNU make folks should
  > deprecate -l and eventually remove it.
Most definitely.  I upgraded to 3.79 a few days ago and have really liked
the way -j works in the new version.  I strongly recommend anyone building
gcc on an SMP machine upgrade to make-3.79.


jeff

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-10  8:30                 ` Andreas Schwab
@ 2000-05-10  9:36                   ` Joe Buck
  2000-05-10  9:52                     ` Jeffrey A Law
  2000-05-10 14:17                     ` Joern Rennecke
  0 siblings, 2 replies; 24+ messages in thread
From: Joe Buck @ 2000-05-10  9:36 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Richard Henderson, Brad Lucier, Michael Matz, gcc

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 1475 bytes --]

> Richard Henderson <rth@cygnus.com> writes:
> 
> |> On Thu, May 04, 2000 at 09:39:36PM -0500, Brad Lucier wrote:
> |> > BTW, with the introduction of libgcc1, I can no longer
> |> > 
> |> > make -j bootstrap
> |> > 
> |> > on my Linux box---I run out of processes with the standard value of
> |> > NR_TASKS=512!  I install a new kernel tomorrow with NR_TASKS=4096.
> |> 
> |> Err.. don't do that?  Try make -j3 instead.
> 
> Or say make -j -l <some small value> to limit depending on the load.

No, -l does not work and never did.  The reason is that the load value
increases very slowly as processes are launched; as a result, if you
say make -j -l, a very large number of processes get launched initially,
and it may take 10 seconds or so for the load to climb to the level
specified in -l -- but then the load climbs to 20x that figure or more.
At this point, no more processes get launched until at least a minute
after all of the first batch dies, and then another process storm is
launched.  This is becuase the number checked by -l is based on an average
load over a full minute, not an instantaneous figure.

Solution: get make 3.79 and use -j<N>.  The GNU make folks should
deprecate -l and eventually remove it.

> Andreas.
> 
> -- 
> Andreas Schwab                                  "And now for something
> SuSE Labs                                        completely different."
> Andreas.Schwab@suse.de
> SuSE GmbH, SchanzÃ¤ckerstr. 10, D-90443 NÃ¼rnberg
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-04 22:16               ` Richard Henderson
@ 2000-05-10  8:30                 ` Andreas Schwab
  2000-05-10  9:36                   ` Joe Buck
  0 siblings, 1 reply; 24+ messages in thread
From: Andreas Schwab @ 2000-05-10  8:30 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Brad Lucier, Michael Matz, gcc

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 704 bytes --]

Richard Henderson <rth@cygnus.com> writes:

|> On Thu, May 04, 2000 at 09:39:36PM -0500, Brad Lucier wrote:
|> > BTW, with the introduction of libgcc1, I can no longer
|> > 
|> > make -j bootstrap
|> > 
|> > on my Linux box---I run out of processes with the standard value of
|> > NR_TASKS=512!  I install a new kernel tomorrow with NR_TASKS=4096.
|> 
|> Err.. don't do that?  Try make -j3 instead.

Or say make -j -l <some small value> to limit depending on the load.

Andreas.

-- 
Andreas Schwab                                  "And now for something
SuSE Labs                                        completely different."
Andreas.Schwab@suse.de
SuSE GmbH, SchanzÃ¤ckerstr. 10, D-90443 NÃ¼rnberg

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: if-conversion a performance bottleneck
@ 2000-05-05 13:46 Brad Lucier
  0 siblings, 0 replies; 24+ messages in thread
From: Brad Lucier @ 2000-05-05 13:46 UTC (permalink / raw)
  To: lucier, pfeifer; +Cc: rth, mrs, gcc, matzmich

> On Fri, 5 May 2000, Brad Lucier wrote:
> > make -j 9 bootstrap
> > 
> > or whatever, and it doesn't seem to actually set off a bunch of parallel
> > jobs---definitely, by the time the stage1 compiler starts compiling
> > things, I'm down to one job at a time.  That's with the 2.2.13 kernel
> > on alpha with make 3.77; is this a known problem?  Should I upgrade
> > make?
> 
> Please do so and let us know the results! 
> 
> It would be very useful to document properly, how to use parallel
> makes for GCC -- perhaps you could come up with a patch for our
> documentation at?
> 
>   http://gcc.gnu.org/install/build.html

After upgrading make from version 3.77 to version 3.79,

make -j 9 bootstrap

works as I think it should---it consistently has about 9 processes
running in various subdirectories throughout the build process.

Brad

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-02 22:58       ` Michael Matz
                           ` (2 preceding siblings ...)
  2000-05-04 11:55         ` Richard Henderson
@ 2000-05-05 12:01         ` Jeffrey A Law
  3 siblings, 0 replies; 24+ messages in thread
From: Jeffrey A Law @ 2000-05-05 12:01 UTC (permalink / raw)
  To: Michael Matz; +Cc: Brad Lucier, Richard Henderson, gcc

  In message < Pine.SOL.4.10.10005030741450.1121-200000@platon >you write:
  >   This message is in MIME format.  The first part should be readable text,
  >   while the remaining parts are likely unreadable without MIME-aware tools.
  >   Send mail to mime@docserver.cac.washington.edu for more info.
  > 
  > ---559023410-33463914-957333180=:1121
  > Content-Type: TEXT/PLAIN; charset=US-ASCII
  > 
  > Hi,
  > 
  > On Tue, 2 May 2000, Michael Matz wrote:
  > > >  time   seconds   seconds    calls  ms/call  ms/call  name    
  > > >  57.51 111.67 111.67 40604 2.75 2.75 sbitmap_intersection_of_succs
  > > >  12.83 136.59 24.92  15025 1.66 1.66 sbitmap_intersection_of_preds
  > > 
  > > I once had faster versions of these two functions, if I get home I'll see
  > > if they make any difference on your input data.
  > 
  > Before fiddling with sbitmap_intersection_of_xx() I first reworked
  > compute_flow_dominators() to also behave normally in calculating the
  > post_doms. At least the order of work-queue initialization was wrong (in
  > post_dom the changes are propagating from the _end_). I then also
  > implemented a poor man's topological sort for cyclic graphs ;), which
  > again gave a better performance. (I also did this once for doms, but there
  > it didn't make a great difference on Brads test cases)
  > 
  > Please try the attached diff (against actual CVS) if they make also a
  > difference for you ;)
Like Richard, I'd like to see you submit this as a function which can be
called from multiple locations in the compiler.

We've got a number of routines that _might_ benefit from this code, but
I'd also like to see some more general benchmarking.  I don't want to
see us slow down the compiler for the common cases just to make Brad's
one test run faster.

jeff

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: if-conversion a performance bottleneck
@ 2000-05-04 23:28 Mathias Froehlich
  0 siblings, 0 replies; 24+ messages in thread
From: Mathias Froehlich @ 2000-05-04 23:28 UTC (permalink / raw)
  To: gcc

> On Thu, May 04, 2000 at 10:31:59PM -0700, Mike Stump wrote:
> > Unfortunately -j3 gives you not three jobs, but an exponential cascade
> > of 3 jobs per recursive make level.

> I know.  But 3 or 15 is still a lot better than 1000, which might
> be what you get with just -j.

> > I found that -j3 -l10 will at least try and limit them from expanding
> > too much, which is useful if your swap limited (just how did those 2
> > emacen grow to be 60M each, and netscape inflate to 90M? :-().

> That's what gigabytes of RAM are for.  ;-)

Or use a recent version of gmake (>=3.78.1), which implements a so
called "jobserver mode". This limits ther _overall_ number of working
make jobs to the number given by -j. This does also work for recursive
makes. 

Try it out it works great ...

  Regards,

     Mathias Froehlich

-- 
Mathias Fr"ohlich              e-mail: frohlich@na.uni-tuebingen.de
Institut f"ur Mathematik, Universit"at T"ubingen, D-72076 T"ubingen

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-04 19:39             ` Brad Lucier
@ 2000-05-04 22:16               ` Richard Henderson
  2000-05-10  8:30                 ` Andreas Schwab
  0 siblings, 1 reply; 24+ messages in thread
From: Richard Henderson @ 2000-05-04 22:16 UTC (permalink / raw)
  To: Brad Lucier; +Cc: Michael Matz, gcc

On Thu, May 04, 2000 at 09:39:36PM -0500, Brad Lucier wrote:
> BTW, with the introduction of libgcc1, I can no longer
> 
> make -j bootstrap
> 
> on my Linux box---I run out of processes with the standard value of
> NR_TASKS=512!  I install a new kernel tomorrow with NR_TASKS=4096.

Err.. don't do that?  Try make -j3 instead.


r~

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-04 11:46           ` Michael Matz
@ 2000-05-04 19:39             ` Brad Lucier
  2000-05-04 22:16               ` Richard Henderson
  0 siblings, 1 reply; 24+ messages in thread
From: Brad Lucier @ 2000-05-04 19:39 UTC (permalink / raw)
  To: Michael Matz; +Cc: Brad Lucier, Richard Henderson, gcc

> Btw. do you have also slightly smaller test cases (a quarter or so)? At
> home I only have a slow machine and waiting 10 minutes everytime to see,
> if one change was worthwhile is, erhh... slow.

Try

http://www.math.purdue.edu/~lucier/_t-c-1.i.gz

I get

popov-1089% /export/u10/egcs-profile/lib/gcc-lib/alphaev6-unknown-linux-gnu/2.96/cc1 -O1 -mcpu=ev6 -fno-math-errno -mieee _t-c-1.i
...
 TOTAL                 :  31.78             0.36            32.13

BTW, with the introduction of libgcc1, I can no longer

make -j bootstrap

on my Linux box---I run out of processes with the standard value of
NR_TASKS=512!  I install a new kernel tomorrow with NR_TASKS=4096.

Brad

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-02 22:58       ` Michael Matz
  2000-05-03  5:50         ` Brad Lucier
  2000-05-03 20:05         ` Brad Lucier
@ 2000-05-04 11:55         ` Richard Henderson
  2000-05-05 12:01         ` Jeffrey A Law
  3 siblings, 0 replies; 24+ messages in thread
From: Richard Henderson @ 2000-05-04 11:55 UTC (permalink / raw)
  To: Michael Matz; +Cc: Brad Lucier, gcc

On Wed, May 03, 2000 at 07:53:00AM +0200, Michael Matz wrote:
> I then also implemented a poor man's topological sort for cyclic graphs
> ;), which again gave a better performance. (I also did this once for doms,
> but there it didn't make a great difference on Brads test cases)

Would you split this off into a new function?  Perhaps called
order_bb_for_backward_flow or something?  This could be useful elsewhere.



r~

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-03 20:05         ` Brad Lucier
@ 2000-05-04 11:46           ` Michael Matz
  2000-05-04 19:39             ` Brad Lucier
  0 siblings, 1 reply; 24+ messages in thread
From: Michael Matz @ 2000-05-04 11:46 UTC (permalink / raw)
  To: Brad Lucier; +Cc: Richard Henderson, gcc

Hi,

On Wed, 3 May 2000, Brad Lucier wrote:
> Your changes to flow.c have cut the number of calls to 
> sbitmap_intersection_of_succs from 40604 to 24225, so they are definitely
> worthwhile.

Yes, that was the purpose of the patch ;)
Anyway, I saw some issues of that patch with one of your other test files,
where the runtime exploded, and on my HDD at home I have another version
which even more reduces the time. If I find time, I'll rewrite the sbitmap
routines for large homogeneous bitmaps, clean up all that stuff and see if
anything comes out of it ;)

Btw. do you have also slightly smaller test cases (a quarter or so)? At
home I only have a slow machine and waiting 10 minutes everytime to see,
if one change was worthwhile is, erhh... slow.

>  rest of compilation   :   1.07 ( 0%) usr   0.00 ( 0%) sys   1.07 ( 0%) wall
>  TOTAL                 : 226.10             1.27           227.32

God! For me that was ca. 1110 seconds ;) (Now I'm down to 580)

Ciao,
Michael.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-02 22:58       ` Michael Matz
  2000-05-03  5:50         ` Brad Lucier
@ 2000-05-03 20:05         ` Brad Lucier
  2000-05-04 11:46           ` Michael Matz
  2000-05-04 11:55         ` Richard Henderson
  2000-05-05 12:01         ` Jeffrey A Law
  3 siblings, 1 reply; 24+ messages in thread
From: Brad Lucier @ 2000-05-03 20:05 UTC (permalink / raw)
  To: Michael Matz; +Cc: Brad Lucier, Richard Henderson, gcc

> Please try the attached diff (against actual CVS) if they make also a
> difference for you ;)

Your changes to flow.c have cut the number of calls to 
sbitmap_intersection_of_succs from 40604 to 24225, so they are definitely
worthwhile.  Bootstrapped on alphaev6-unknown-linux-gnu.

Brad

Here are the new statistics with your changes:
Execution times (seconds)
 garbage collection    :   1.30 ( 1%) usr   0.00 ( 0%) sys   1.30 ( 1%) wall
 parser                :   6.45 ( 3%) usr   0.19 (15%) sys   6.64 ( 3%) wall
 varconst              :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall
 integration           :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 jump                  :  20.06 ( 9%) usr   0.84 (66%) sys  20.90 ( 9%) wall
 CSE                   :   2.77 ( 1%) usr   0.00 ( 0%) sys   2.77 ( 1%) wall
 global CSE            :   5.51 ( 2%) usr   0.01 ( 1%) sys   5.52 ( 2%) wall
 loop analysis         :   0.24 ( 0%) usr   0.00 ( 0%) sys   0.24 ( 0%) wall
 CSE 2                 :   2.29 ( 1%) usr   0.00 ( 0%) sys   2.29 ( 1%) wall
 flow analysis         :  52.11 (23%) usr   0.06 ( 5%) sys  52.16 (23%) wall
 combiner              :   2.80 ( 1%) usr   0.00 ( 0%) sys   2.80 ( 1%) wall
 if-conversion         :  44.42 (20%) usr   0.02 ( 2%) sys  44.43 (20%) wall
 regmove               :   0.50 ( 0%) usr   0.00 ( 0%) sys   0.50 ( 0%) wall
 scheduling            :   6.26 ( 3%) usr   0.01 ( 1%) sys   6.27 ( 3%) wall
 local alloc           :   1.49 ( 1%) usr   0.00 ( 0%) sys   1.49 ( 1%) wall
 global alloc          :   2.78 ( 1%) usr   0.07 ( 6%) sys   2.86 ( 1%) wall
 reload CSE regs       :   7.92 ( 4%) usr   0.01 ( 1%) sys   7.93 ( 3%) wall
 flow 2                :  19.36 ( 9%) usr   0.00 ( 0%) sys  19.35 ( 9%) wall
 if-conversion 2       :  36.38 (16%) usr   0.01 ( 1%) sys  36.38 (16%) wall
 peephole 2            :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall
 schedulding 2         :   8.88 ( 4%) usr   0.00 ( 0%) sys   8.88 ( 4%) wall
 shorten branches      :   0.12 ( 0%) usr   0.00 ( 0%) sys   0.12 ( 0%) wall
 final                 :   3.22 ( 1%) usr   0.00 ( 0%) sys   3.22 ( 1%) wall
 symout                :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 rest of compilation   :   1.07 ( 0%) usr   0.00 ( 0%) sys   1.07 ( 0%) wall
 TOTAL                 : 226.10             1.27           227.32


Flat profile:

Each sample counts as 0.000976562 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 42.54     69.80    69.80    24225     2.88     2.88  sbitmap_intersection_of_succs
 17.24     98.09    28.29    15025     1.88     1.88  sbitmap_intersection_of_preds
  5.03    106.34     8.25       42   196.36   196.36  mark_critical_edges
  3.98    112.86     6.52 25085231     0.00     0.00  bitmap_operation
  3.56    118.70     5.84    10248     0.57     0.62  compute_block_backward_dependences
  3.44    124.35     5.65        9   627.60 11546.70  compute_flow_dominators
  2.37    128.24     3.89       21   185.36   185.93  delete_unreachable_blocks
  2.31    132.03     3.79        6   631.02  1760.19  calculate_global_regs_live
...
-----------------------------------------------
                1.88   32.76       3/9           flow_loops_find [9]
                3.77   65.51       6/9           if_convert [8]
[6]     63.3    5.65   98.27       9         compute_flow_dominators [6]
               69.80    0.00   24225/24225       sbitmap_intersection_of_succs [7]
               28.29    0.00   15025/15025       sbitmap_intersection_of_preds [10]
                0.16    0.00   39259/39259       sbitmap_a_and_b [135]
                0.02    0.00       9/27          sbitmap_vector_alloc [255]
                0.00    0.00       9/18          sbitmap_vector_zero [771]
                0.00    0.00       9/74716       sbitmap_zero [725]
                0.00    0.00       9/9           sbitmap_vector_ones [1365]
-----------------------------------------------
               69.80    0.00   24225/24225       compute_flow_dominators [6]
[7]     42.5   69.80    0.00   24225         sbitmap_intersection_of_succs [7]
                0.00    0.00   24225/39250       sbitmap_copy [612]
-----------------------------------------------
                0.00   69.40      12/12          rest_of_compilation [5]
[8]     42.3    0.00   69.40      12         if_convert [8]
                3.77   65.51       6/9           compute_flow_dominators [6]
                0.00    0.06      12/43          free_basic_block_vars [112]
                0.03    0.00      12/48          compute_bb_for_insn [147]
                0.00    0.01   20509/20509       find_if_header [391]
                0.01    0.00       6/27          sbitmap_vector_alloc [255]
                0.00    0.00       1/10258       update_life_info [12]
                0.00    0.00       1/37          allocate_reg_info [540]
                0.00    0.00       1/20500       count_or_remove_death_notes [36]
                0.00    0.00       1/995         sbitmap_alloc [679]
                0.00    0.00       2/145008      max_reg_num [374]
                0.00    0.00       1/74716       sbitmap_zero [725]
                0.00    0.00      12/5225        get_max_uid [1134]
-----------------------------------------------
                2.96   36.17       3/3           rest_of_compilation [5]
[9]     23.8    2.96   36.17       3         flow_loops_find [9]
                1.88   32.76       3/9           compute_flow_dominators [6]
                0.76    0.00     964/964         flow_loop_exits_find [38]
                0.75    0.00       1/1           flow_depth_first_order_compute [39]
                0.01    0.00       3/27          sbitmap_vector_alloc [255]
                0.00    0.00     964/964         flow_loop_pre_header_find [627]
                0.00    0.00     966/995         sbitmap_alloc [679]
                0.00    0.00       3/3           flow_loops_tree_build [747]
                0.00    0.00     964/964         flow_loop_nodes_find [791]
                0.00    0.00     964/964         sbitmap_last_set_bit [834]
                0.00    0.00       2/74716       sbitmap_zero [725]
                0.00    0.00     964/964         sbitmap_first_set_bit [1174]
                0.00    0.00       3/3           flow_loops_level_compute [1435]
-----------------------------------------------
               28.29    0.00   15025/15025       compute_flow_dominators [6]
[10]    17.2   28.29    0.00   15025         sbitmap_intersection_of_preds [10]
                0.00    0.00   15025/39250       sbitmap_copy [612]
-----------------------------------------------

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-02 22:58       ` Michael Matz
@ 2000-05-03  5:50         ` Brad Lucier
  2000-05-03 20:05         ` Brad Lucier
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 24+ messages in thread
From: Brad Lucier @ 2000-05-03  5:50 UTC (permalink / raw)
  To: Michael Matz; +Cc: Brad Lucier, Richard Henderson, gcc

> Please try the attached diff (against actual CVS) if they make also a
> difference for you ;)

The diff didn't apply cleanly, I don't know why, so I applied it by hand.
I didn't have time this morning to install a new profiled version, so
I ran make bootstrap with the usual BOOT_CFLAGS.

Your patch did help.  Here is a comparison between the a profiled version
without your patch and an unprofiled version with your patch:

popov-726% /export/u10/egcs-profile/lib/gcc-lib/alphaev6-unknown-linux-gnu/2.96/cc1 -O1 -mcpu=ev6 -fno-math-errno -mieee _t-c-2.i
 __copysignf copysignf __copysign copysign __fabsf fabsf __fabs fabs __floorf __floor floorf floor __fdimf fdimf __fdim fdim ___H__20___t_2d_c_2d_2 {GC 46791k -> 9596k} {GC 14695k -> 10090k} {GC 14135k -> 10558k} ___init_proc {GC 23543k -> 2096k} ____20___t_2d_c_2d_2
Execution times (seconds)
 garbage collection    :   0.71 ( 0%) usr   0.00 ( 1%) sys   0.71 ( 0%) wall
 parser                :   6.13 ( 3%) usr   0.19 (25%) sys   6.32 ( 3%) wall
 varconst              :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall
 integration           :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 jump                  :   6.45 ( 3%) usr   0.39 (51%) sys   6.84 ( 3%) wall
 CSE                   :   2.20 ( 1%) usr   0.00 ( 1%) sys   2.20 ( 1%) wall
 global CSE            :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 loop analysis         :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 CSE 2                 :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 flow analysis         :  63.17 (27%) usr   0.06 ( 8%) sys  63.21 (27%) wall
 combiner              :   2.81 ( 1%) usr   0.00 ( 0%) sys   2.81 ( 1%) wall
 if-conversion         :  56.39 (24%) usr   0.01 ( 2%) sys  56.39 (24%) wall
 local alloc           :   1.24 ( 1%) usr   0.00 ( 0%) sys   1.24 ( 1%) wall
 global alloc          :   2.35 ( 1%) usr   0.05 ( 8%) sys   2.41 ( 1%) wall
 reload CSE regs       :   3.68 ( 2%) usr   0.00 ( 0%) sys   3.68 ( 2%) wall
 flow 2                :  29.86 (13%) usr   0.00 ( 1%) sys  29.86 (13%) wall
 if-conversion 2       :  58.60 (25%) usr   0.01 ( 2%) sys  58.60 (25%) wall
 shorten branches      :   0.11 ( 0%) usr   0.00 ( 0%) sys   0.11 ( 0%) wall
 final                 :   2.68 ( 1%) usr   0.00 ( 1%) sys   2.70 ( 1%) wall
 symout                :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 rest of compilation   :   1.02 ( 0%) usr   0.00 ( 0%) sys   1.02 ( 0%) wall
 TOTAL                 : 237.52             0.77           238.47
popov-727% gcc -v
Reading specs from /export/u10/gcc-2.95.1/lib/gcc-lib/alphaev6-unknown-linux-gnu/2.95.1/specs
gcc version 2.95.1 19990816 (release)
popov-728% /export/u10/egcs-test/lib/gcc-lib/alphaev6-unknown-linux-gnu/2.96/cc1 -O1 -mcpu=ev6 -fno-math-errno -mieee _t-c-2.i
 __copysignf copysignf __copysign copysign __fabsf fabsf __fabs fabs __floorf __floor floorf floor __fdimf fdimf __fdim fdim ___H__20___t_2d_c_2d_2 {GC 46791k -> 9596k} {GC 14695k -> 10090k} {GC 14135k -> 10558k} ___init_proc {GC 23543k -> 2096k} ____20___t_2d_c_2d_2
Execution times (seconds)
 garbage collection    :   0.32 ( 0%) usr   0.00 ( 1%) sys   0.32 ( 0%) wall
 parser                :   2.92 ( 2%) usr   0.18 (25%) sys   3.10 ( 2%) wall
 varconst              :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall
 integration           :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 jump                  :   4.56 ( 3%) usr   0.38 (52%) sys   4.94 ( 3%) wall
 CSE                   :   0.81 ( 1%) usr   0.00 ( 0%) sys   0.81 ( 1%) wall
 global CSE            :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 loop analysis         :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 CSE 2                 :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 flow analysis         :  51.69 (32%) usr   0.06 ( 8%) sys  51.74 (32%) wall
 combiner              :   0.99 ( 1%) usr   0.00 ( 0%) sys   0.99 ( 1%) wall
 if-conversion         :  33.69 (21%) usr   0.01 ( 2%) sys  33.70 (21%) wall
 local alloc           :   0.54 ( 0%) usr   0.00 ( 0%) sys   0.54 ( 0%) wall
 global alloc          :   1.11 ( 1%) usr   0.05 ( 7%) sys   1.16 ( 1%) wall
 reload CSE regs       :   3.15 ( 2%) usr   0.00 ( 1%) sys   3.15 ( 2%) wall
 flow 2                :  23.85 (15%) usr   0.00 ( 0%) sys  23.84 (15%) wall
 if-conversion 2       :  33.45 (21%) usr   0.01 ( 2%) sys  33.45 (21%) wall
 shorten branches      :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.06 ( 0%) wall
 final                 :   2.46 ( 2%) usr   0.00 ( 0%) sys   2.46 ( 2%) wall
 symout                :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 rest of compilation   :   0.43 ( 0%) usr   0.00 ( 0%) sys   0.43 ( 0%) wall
 TOTAL                 : 160.09             0.73           160.78

I can get more information this afternoon (this evening in Germany :-).

Brad

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-02  5:45     ` Michael Matz
@ 2000-05-02 22:58       ` Michael Matz
  2000-05-03  5:50         ` Brad Lucier
                           ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Michael Matz @ 2000-05-02 22:58 UTC (permalink / raw)
  To: Brad Lucier; +Cc: Richard Henderson, gcc

Hi,

On Tue, 2 May 2000, Michael Matz wrote:
> >  time   seconds   seconds    calls  ms/call  ms/call  name    
> >  57.51 111.67 111.67 40604 2.75 2.75 sbitmap_intersection_of_succs
> >  12.83 136.59 24.92  15025 1.66 1.66 sbitmap_intersection_of_preds
> 
> I once had faster versions of these two functions, if I get home I'll see
> if they make any difference on your input data.

Before fiddling with sbitmap_intersection_of_xx() I first reworked
compute_flow_dominators() to also behave normally in calculating the
post_doms. At least the order of work-queue initialization was wrong (in
post_dom the changes are propagating from the _end_). I then also
implemented a poor man's topological sort for cyclic graphs ;), which
again gave a better performance. (I also did this once for doms, but there
it didn't make a great difference on Brads test cases)

Please try the attached diff (against actual CVS) if they make also a
difference for you ;)

Ciao,
Michael.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-01 18:01   ` Brad Lucier
@ 2000-05-02  5:45     ` Michael Matz
  2000-05-02 22:58       ` Michael Matz
  0 siblings, 1 reply; 24+ messages in thread
From: Michael Matz @ 2000-05-02  5:45 UTC (permalink / raw)
  To: Brad Lucier; +Cc: Richard Henderson, gcc

On Mon, 1 May 2000, Brad Lucier wrote:
> Each sample counts as 0.000976562 seconds.
>   %   cumulative   self              self     total           
>  time   seconds   seconds    calls  ms/call  ms/call  name    
>  57.51 111.67 111.67 40604 2.75 2.75 sbitmap_intersection_of_succs
>  12.83 136.59 24.92  15025 1.66 1.66 sbitmap_intersection_of_preds

I once had faster versions of these two functions, if I get home I'll see
if they make any difference on your input data.


Ciao,
Michael.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-01 15:29 ` Richard Henderson
@ 2000-05-01 18:01   ` Brad Lucier
  2000-05-02  5:45     ` Michael Matz
  0 siblings, 1 reply; 24+ messages in thread
From: Brad Lucier @ 2000-05-01 18:01 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Brad Lucier, gcc

> Oops.  That call was supposed to be ifdef ENABLE_CHECKING.
> With that fixed, are compilation times mostly sane?
> r~

Yes, that helped a lot :-), but if-conversion still takes almost half
the CPU time to compile that file:

popov-211% /export/u10/egcs-profile/lib/gcc-lib/alphaev6-unknown-linux-gnu/2.96/cc1 -O1 -mcpu=ev6 -fno-math-errno -mieee _t-c-2.i
 __copysignf copysignf __copysign copysign __fabsf fabsf __fabs fabs __floorf __floor floorf floor __fdimf fdimf __fdim fdim ___H__20___t_2d_c_2d_2 {GC 46791k -> 9596k} {GC 14695k -> 10090k} {GC 14135k -> 10558k} ___init_proc {GC 23543k -> 2096k} ____20___t_2d_c_2d_2
Execution times (seconds)
 garbage collection    :   0.70 ( 0%) usr   0.00 ( 0%) sys   0.70 ( 0%) wall
 parser                :   6.25 ( 3%) usr   0.18 (22%) sys   6.43 ( 3%) wall
 varconst              :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall
 integration           :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 jump                  :   6.50 ( 3%) usr   0.38 (48%) sys   6.88 ( 3%) wall
 CSE                   :   2.28 ( 1%) usr   0.00 ( 0%) sys   2.28 ( 1%) wall
 global CSE            :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 loop analysis         :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 CSE 2                 :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 flow analysis         :  58.86 (25%) usr   0.11 (14%) sys  58.99 (25%) wall
 combiner              :   2.87 ( 1%) usr   0.00 ( 0%) sys   2.87 ( 1%) wall
 if-conversion         :  59.53 (26%) usr   0.02 ( 3%) sys  59.59 (26%) wall
 local alloc           :   1.24 ( 1%) usr   0.00 ( 0%) sys   1.24 ( 1%) wall
 global alloc          :   2.37 ( 1%) usr   0.05 ( 6%) sys   2.42 ( 1%) wall
 reload CSE regs       :   3.88 ( 2%) usr   0.00 ( 0%) sys   3.88 ( 2%) wall
 flow 2                :  27.86 (12%) usr   0.00 ( 0%) sys  27.86 (12%) wall
 if-conversion 2       :  55.02 (24%) usr   0.01 ( 2%) sys  55.02 (24%) wall
 shorten branches      :   0.12 ( 0%) usr   0.00 ( 0%) sys   0.12 ( 0%) wall
 final                 :   2.77 ( 1%) usr   0.01 ( 1%) sys   2.78 ( 1%) wall
 symout                :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 rest of compilation   :   1.02 ( 0%) usr   0.00 ( 0%) sys   1.02 ( 0%) wall
 TOTAL                 : 231.37             0.81           232.20

Flat profile:

Each sample counts as 0.000976562 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 57.51    111.67   111.67    40604     2.75     2.75  sbitmap_intersection_of_succs
 12.83    136.59    24.92    15025     1.66     1.66  sbitmap_intersection_of_preds
  9.58    155.19    18.60    20981     0.89     0.89  for_each_successor_phi
  3.78    162.54     7.35 25079015     0.00     0.00  bitmap_operation
  1.99    166.40     3.86        6   643.72  5013.82  calculate_global_regs_live
  1.71    169.73     3.33       18   184.84   184.84  mark_critical_edges
  1.57    172.78     3.05        9   338.87 15538.80  compute_flow_dominators
  1.49    175.67     2.89        3   963.87 16982.59  flow_loops_find
...
-----------------------------------------------
                1.02   45.60       3/9           flow_loops_find [9]
                2.03   91.20       6/9           if_convert [8]
[6]     72.0    3.05  136.80       9         compute_flow_dominators [6]
              111.67    0.00   40604/40604       sbitmap_intersection_of_succs [7]
               24.92    0.00   15025/15025       sbitmap_intersection_of_preds [12]
                0.18    0.00   55638/55638       sbitmap_a_and_b [77]
                0.02    0.00       9/21          sbitmap_vector_alloc [217]
                0.00    0.00       9/9           sbitmap_vector_ones [744]
                0.00    0.00       9/12          sbitmap_vector_zero [758]
                0.00    0.00       9/31743       sbitmap_zero [752]
-----------------------------------------------
              111.67    0.00   40604/40604       compute_flow_dominators [6]
[7]     57.5  111.67    0.00   40604         sbitmap_intersection_of_succs [7]
                0.00    0.00   40604/55629       sbitmap_copy [470]
-----------------------------------------------
                0.00   96.36       9/9           rest_of_compilation [5]
[8]     49.6    0.00   96.36       9         if_convert [8]
                2.03   91.20       6/9           compute_flow_dominators [6]
                0.00    3.05       1/10          update_life_info [10]
                0.03    0.00       9/18          compute_bb_for_insn [190]
                0.00    0.02       9/37          free_basic_block_vars [135]
                0.00    0.02   15384/15384       find_if_header [297]
                0.01    0.00       6/21          sbitmap_vector_alloc [217]
                0.00    0.00       1/25          allocate_reg_info [481]
                0.00    0.00       9/5192        get_max_uid [727]
                0.00    0.00       2/64864       max_reg_num [454]
                0.00    0.00       1/31743       sbitmap_zero [752]
                0.00    0.00       1/974         sbitmap_alloc [1076]
                0.00    0.00       1/1           count_or_remove_death_notes [1385]
-----------------------------------------------
                2.89   48.06       3/3           rest_of_compilation [5]
[9]     26.2    2.89   48.06       3         flow_loops_find [9]
                1.02   45.60       3/9           compute_flow_dominators [6]
                0.71    0.00     964/964         flow_loop_exits_find [31]
                0.71    0.00       1/1           flow_depth_first_order_compute [32]
                0.01    0.00       3/21          sbitmap_vector_alloc [217]
                0.00    0.00       3/3           flow_loops_tree_build [604]
                0.00    0.00     964/964         flow_loop_nodes_find [698]
                0.00    0.00     964/964         sbitmap_first_set_bit [739]
                0.00    0.00     964/964         flow_loop_pre_header_find [738]
                0.00    0.00       2/31743       sbitmap_zero [752]
                0.00    0.00     966/974         sbitmap_alloc [1076]
                0.00    0.00     964/964         sbitmap_last_set_bit [1078]
                0.00    0.00       3/3           flow_loops_level_compute [1331]
-----------------------------------------------
                0.00    3.05       1/10          if_convert [8]
                0.00    9.14       3/10          rest_of_compilation [5]
                0.00   18.28       6/10          life_analysis [14]
[10]    15.7    0.00   30.47      10         update_life_info [10]
                3.86   26.22       6/6           calculate_global_regs_live [11]
                0.02    0.35   15374/27234       propagate_block [34]
                0.00    0.00   15374/27239       free_propagate_block_info [389]
                0.01    0.00   15374/72751       bitmap_copy [262]
                0.00    0.00    5125/5125        verify_local_live_at_start [573]
                0.00    0.00      10/257366      bitmap_clear [283]
                0.00    0.00      10/124375      bitmap_initialize [407]
-----------------------------------------------
                3.86   26.22       6/6           update_life_info [10]
[11]    15.5    3.86   26.22       6         calculate_global_regs_live [11]
               18.60    0.00   20981/20981       for_each_successor_phi [13]
                7.30    0.00 24929375/25079015     bitmap_operation [15]
                0.01    0.27   11855/27234       propagate_block [34]
                0.01    0.00   24210/72751       bitmap_copy [262]
                0.00    0.00   52694/257366      bitmap_clear [283]
                0.00    0.00   11855/27239       free_propagate_block_info [389]
                0.00    0.00   11855/11855       bitmap_equal_p [525]
                0.00    0.00   20981/488036      bitmap_set_bit [167]
                0.00    0.00   10261/124375      bitmap_initialize [407]
-----------------------------------------------
               24.92    0.00   15025/15025       compute_flow_dominators [6]
[12]    12.8   24.92    0.00   15025         sbitmap_intersection_of_preds [12]
                0.00    0.00   15025/55629       sbitmap_copy [470]
-----------------------------------------------
               18.60    0.00   20981/20981       calculate_global_regs_live [11]
[13]     9.6   18.60    0.00   20981         for_each_successor_phi [13]
-----------------------------------------------

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-01 10:54 Brad Lucier
@ 2000-05-01 15:29 ` Richard Henderson
  2000-05-01 18:01   ` Brad Lucier
  0 siblings, 1 reply; 24+ messages in thread
From: Richard Henderson @ 2000-05-01 15:29 UTC (permalink / raw)
  To: Brad Lucier; +Cc: gcc

On Mon, May 01, 2000 at 12:54:35PM -0500, Brad Lucier wrote:
> Perhaps with the calls to verify_flow_info, this is a temporary effect,
> I don't know.

Oops.  That call was supposed to be ifdef ENABLE_CHECKING.
With that fixed, are compilation times mostly sane?



r~

^ permalink raw reply	[flat|nested] 24+ messages in thread

* if-conversion a performance bottleneck
@ 2000-05-01 10:54 Brad Lucier
  2000-05-01 15:29 ` Richard Henderson
  0 siblings, 1 reply; 24+ messages in thread
From: Brad Lucier @ 2000-05-01 10:54 UTC (permalink / raw)
  To: gcc; +Cc: lucier

The patch

http://gcc.gnu.org/ml/gcc-patches/2000-04/msg00248.html

killed one of the biggest time hogs in gcc, but now Richard
Henderson's if-conversion pass has affected performance adversely.
Perhaps with the calls to verify_flow_info, this is a temporary effect,
I don't know.

On this file

http://www.math.purdue.edu/~lucier/_t-c-2.i.gz

called with these options

/export/u10/egcs-profile/lib/gcc-lib/alphaev6-unknown-linux-gnu/2.96/cc1 -fno-math-errno -mieee -g -mcpu=ev6 -O1 _t-c-2.i > & cc1.out &

with this compiler

popov-41% gcc -v
Reading specs from /export/u10/egcs-profile/lib/gcc-lib/alphaev6-unknown-linux-gnu/2.96/specs
gcc version 2.96 20000430 (experimental)

I get the following performance on a 500 MHz alpha 21264:

 __copysignf copysignf __copysign copysign __fabsf fabsf __fabs fabs __floorf __floor floorf floor __fdimf fdimf __fdim fdim ___H__20___t_2d_c_2d_2 {GC 47533k -> 10293k} {GC 15386k -> 10788k} {GC 14832k -> 11255k} ___init_proc {GC 24248k -> 2107k} ____20___t_2d_c_2d_2
Execution times (seconds)
 garbage collection    :   0.75 ( 0%) usr   0.00 ( 0%) sys   0.75 ( 0%) wall
 parser                :   6.55 ( 0%) usr   0.18 (15%) sys   6.73 ( 0%) wall
 varconst              :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall
 integration           :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 jump                  :   6.81 ( 0%) usr   0.40 (32%) sys   7.21 ( 0%) wall
 CSE                   :   2.29 ( 0%) usr   0.00 ( 0%) sys   2.29 ( 0%) wall
 global CSE            :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 loop analysis         :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 CSE 2                 :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 flow analysis         :  61.00 ( 2%) usr   0.07 ( 6%) sys  61.05 ( 2%) wall
 combiner              :   2.82 ( 0%) usr   0.00 ( 0%) sys   2.82 ( 0%) wall
 if-conversion         :2306.07 (64%) usr   0.44 (35%) sys2306.74 (64%) wall
 local alloc           :   1.31 ( 0%) usr   0.00 ( 0%) sys   1.31 ( 0%) wall
 global alloc          :   2.42 ( 0%) usr   0.04 ( 4%) sys   2.46 ( 0%) wall
 reload CSE regs       :   3.62 ( 0%) usr   0.00 ( 0%) sys   3.62 ( 0%) wall
 flow 2                :  30.20 ( 1%) usr   0.00 ( 0%) sys  30.20 ( 1%) wall
 if-conversion 2       :1196.27 (33%) usr   0.09 ( 7%) sys1196.01 (33%) wall
 shorten branches      :   0.12 ( 0%) usr   0.00 ( 0%) sys   0.13 ( 0%) wall
 final                 :   2.69 ( 0%) usr   0.00 ( 1%) sys   2.70 ( 0%) wall
 symout                :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall
 rest of compilation   :   1.06 ( 0%) usr   0.00 ( 0%) sys   1.06 ( 0%) wall
 TOTAL                 :3624.12             1.27          3625.23

Some of the details are:

Flat profile:

Each sample counts as 0.000976562 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 49.50    192.40   192.40        9 21378.26 21381.98  verify_flow_info
 28.10    301.64   109.24    40604     2.69     2.69  sbitmap_intersection_of_succs
  6.75    327.90    26.25    15025     1.75     1.75  sbitmap_intersection_of_preds
  5.38    348.79    20.89    20981     1.00     1.00  for_each_successor_phi
  1.87    356.06     7.27 25079015     0.00     0.00  bitmap_operation
...
-----------------------------------------------
                0.00  288.44       9/9           rest_of_compilation [5]
[6]     74.2    0.00  288.44       9         if_convert [6]
              192.40    0.03       9/9           verify_flow_info [7]
                2.17   90.46       6/9           compute_flow_dominators [8]
                0.00    3.27       1/10          update_life_info [11]
                0.00    0.03       9/37          free_basic_block_vars [99]
                0.03    0.00       9/18          compute_bb_for_insn [177]
                0.00    0.02   15384/15384       find_if_header [331]
                0.01    0.00       6/21          sbitmap_vector_alloc [222]
                0.00    0.00       1/25          allocate_reg_info [536]
                0.00    0.00       2/64864       max_reg_num [371]
                0.00    0.00       9/5201        get_max_uid [1093]
                0.00    0.00       1/974         sbitmap_alloc [1124]
                0.00    0.00       1/31743       sbitmap_zero [1070]
                0.00    0.00       1/1           count_or_remove_death_notes [1414]
-----------------------------------------------
              192.40    0.03       9/9           if_convert [6]
[7]     49.5  192.40    0.03       9         verify_flow_info [7]
                0.00    0.03   15008/26961       returnjump_p [183]
                0.00    0.00       3/87970       condjump_p [388]
                0.00    0.00       9/5201        get_max_uid [1093]
                0.00    0.00       9/231         get_insns [1157]
-----------------------------------------------
                1.09   45.23       3/9           flow_loops_find [10]
                2.17   90.46       6/9           if_convert [6]
[8]     35.7    3.26  135.69       9         compute_flow_dominators [8]
              109.24    0.01   40604/40604       sbitmap_intersection_of_succs [9]
               26.25    0.00   15025/15025       sbitmap_intersection_of_preds [13]
                0.17    0.00   55638/55638       sbitmap_a_and_b [88]
                0.02    0.00       9/21          sbitmap_vector_alloc [222]
                0.00    0.00       9/9           sbitmap_vector_ones [688]
                0.00    0.00       9/12          sbitmap_vector_zero [788]
                0.00    0.00       9/31743       sbitmap_zero [1070]
-----------------------------------------------
              109.24    0.01   40604/40604       compute_flow_dominators [8]
[9]     28.1  109.24    0.01   40604         sbitmap_intersection_of_succs [9]
                0.01    0.00   40604/55629       sbitmap_copy [396]
-----------------------------------------------


Brad Lucier

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2000-05-10 14:17 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2000-05-04 22:32 if-conversion a performance bottleneck Mike Stump
2000-05-04 22:35 ` Richard Henderson
2000-05-05  6:12   ` Brad Lucier
2000-05-05 10:37     ` Richard Henderson
2000-05-05 13:35     ` Gerald Pfeifer
  -- strict thread matches above, loose matches on Subject: below --
2000-05-05 13:46 Brad Lucier
2000-05-04 23:28 Mathias Froehlich
2000-05-01 10:54 Brad Lucier
2000-05-01 15:29 ` Richard Henderson
2000-05-01 18:01   ` Brad Lucier
2000-05-02  5:45     ` Michael Matz
2000-05-02 22:58       ` Michael Matz
2000-05-03  5:50         ` Brad Lucier
2000-05-03 20:05         ` Brad Lucier
2000-05-04 11:46           ` Michael Matz
2000-05-04 19:39             ` Brad Lucier
2000-05-04 22:16               ` Richard Henderson
2000-05-10  8:30                 ` Andreas Schwab
2000-05-10  9:36                   ` Joe Buck
2000-05-10  9:52                     ` Jeffrey A Law
2000-05-10 10:49                       ` Joe Buck
2000-05-10 14:17                     ` Joern Rennecke
2000-05-04 11:55         ` Richard Henderson
2000-05-05 12:01         ` Jeffrey A Law

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).