public inbox for gcc@gcc.gnu.org
* Faster compilation speed
@ 2002-08-09 12:17 Mike Stump
  2002-08-09 13:04 ` Noel Yap
                   ` (6 more replies)
  0 siblings, 7 replies; 256+ messages in thread
From: Mike Stump @ 2002-08-09 12:17 UTC (permalink / raw)
  To: gcc

I'd like to introduce a variety of changes to improve compiler 
speed.  I thought I should send out an email and see if others think 
this would be good to have in the tree.  Also, if it is, I'd like to 
solicit any ideas others have for me to pursue.  I'd be happy to do all 
the hard work, if you come up with the ideas!  The target is to be 6x 
faster.

The first realization I came to is that the only existing control for 
such things is -O[123], and having thought about it, I think it would 
be best to retain and use those flags.  For minimal user impact, I 
think it would be good to not perturb existing users of -O[0123] too 
much, or at least, not at first.  If we wanted to change them, I think 
-O0 should be the `fast' version, -O1 should be what -O0 does now with 
some additions around the edges, and -O2 and -O3 also slide over (at 
least one).  What do you think, slide them all over one or more, or 
just make -O0 do less, or...?  Maybe we have a -O0.0 to mean compile 
very quickly?

Another question would be how many knobs should we have?  At first, I 
am inclined to say just one.  If we want, we can later break them out 
into more choices.  I am mainly interested in a single knob at this 
point.

Another question is: what should the lower limit be on uglifying code 
for the sake of compilation speed?

Below are some concrete ideas so others can get a feel for the types of 
changes, and to comment on the flag and how it is used.
While I give a specific example, I'm more interested in the upper-level 
comments than in discussion of not combining temp slots.

The use of a preprocessor macro allows us to later replace it with 0 
or 1, should we want to obtain a compiler that is unconditionally 
faster, or one that doesn't have any extra code in it.
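
(To make the mechanism concrete, a minimal sketch: SPEEDCOMPILE is from
the patch below, while the function around it is invented for
illustration.)

    /* flags.h: pick one of the three definitions.  */
    #define SPEEDCOMPILE flag_speed_compile   /* run-time switch (-Of)  */
    /* #define SPEEDCOMPILE 1    unconditionally fast compiler          */
    /* #define SPEEDCOMPILE 0    no extra code; the tests fold away     */

    static void
    expensive_bookkeeping (void)
    {
      if (SPEEDCOMPILE)   /* constant-folded away when defined as 0 or 1 */
        return;
      /* ... slow, code-improving work ... */
    }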

This change yields a 0.9% speed improvement when compiling expr.c.  Not 
much, but if the compiler were 6x faster, this would be a 5.5% change in 
compilation speed (the time it saves stays fixed while everything else 
shrinks sixfold, so its relative share grows to roughly 6 x 0.9%).  The 
resulting code is worse, but not by much.

So, let the discussion begin...


Doing diffs in flags.h.~1~:
*** flags.h.~1~ Fri Aug  9 10:17:36 2002
--- flags.h     Fri Aug  9 10:37:58 2002
*************** extern int flag_signaling_nans;
*** 696,699 ****
--- 696,705 ----
  #define HONOR_SIGN_DEPENDENT_ROUNDING(MODE) \
    (MODE_HAS_SIGN_DEPENDENT_ROUNDING (MODE) && !flag_unsafe_math_optimizations)

+ /* Nonzero for compiling as fast as we can.  */
+
+ extern int flag_speed_compile;
+
+ #define SPEEDCOMPILE flag_speed_compile
+
  #endif /* ! GCC_FLAGS_H */
--------------
Doing diffs in function.c.~1~:
*** function.c.~1~      Fri Aug  9 10:17:36 2002
--- function.c  Fri Aug  9 10:37:58 2002
*************** free_temp_slots ()
*** 1198,1203 ****
--- 1198,1206 ----
  {
    struct temp_slot *p;

+   if (SPEEDCOMPILE)
+     return;
+
    for (p = temp_slots; p; p = p->next)
      if (p->in_use && p->level == temp_slot_level && ! p->keep
        && p->rtl_expr == 0)
*************** free_temps_for_rtl_expr (t)
*** 1214,1219 ****
--- 1217,1225 ----
  {
    struct temp_slot *p;

+   if (SPEEDCOMPILE)
+     return;
+
    for (p = temp_slots; p; p = p->next)
      if (p->rtl_expr == t)
        {
*************** pop_temp_slots ()
*** 1301,1311 ****
  {
    struct temp_slot *p;

!   for (p = temp_slots; p; p = p->next)
!     if (p->in_use && p->level == temp_slot_level && p->rtl_expr == 0)
!       p->in_use = 0;

!   combine_temp_slots ();

    temp_slot_level--;
  }
--- 1307,1320 ----
  {
    struct temp_slot *p;

!   if (! SPEEDCOMPILE)
!     {
!       for (p = temp_slots; p; p = p->next)
!       if (p->in_use && p->level == temp_slot_level && p->rtl_expr == 0)
!         p->in_use = 0;

!       combine_temp_slots ();
!     }

    temp_slot_level--;
  }
--------------
Doing diffs in toplev.c.~1~:
*** toplev.c.~1~        Fri Aug  9 10:17:40 2002
--- toplev.c    Fri Aug  9 11:31:50 2002
*************** int flag_new_regalloc = 0;
*** 894,899 ****
--- 894,903 ----

  int flag_tracer = 0;

+ /* If nonzero, speed-up the compile as fast as we can.  */
+
+ int flag_speed_compile = 0;
+
  /* Values of the -falign-* flags: how much to align labels in code.
     0 means `use default', 1 means `don't align'.
     For each variable, there is an _log variant which is the power
*************** display_help ()
*** 3679,3684 ****
--- 3683,3689 ----

    printf (_("  -O[number]              Set optimization level to 
[number]\n"));
    printf (_("  -Os                     Optimize for space rather than 
speed\n"));
+   printf (_("  -Of                     Compile as fast as 
possible\n"));
    for (i = LAST_PARAM; i--;)
      {
        const char *description = compiler_params[i].help;
*************** parse_options_and_default_flags (argc, a
*** 4772,4777 ****
--- 4777,4786 ----
              /* Optimizing for size forces optimize to be 2.  */
              optimize = 2;
            }
+         else if ((p[0] == 'f') && (p[1] == 0))
+           {
+             flag_speed_compile = 1;
+           }
          else
            {
            const int optimize_val = read_integral_parameter (p, p - 2, -1);
--------------



* Re: Faster compilation speed
  2002-08-09 12:17 Faster compilation speed Mike Stump
@ 2002-08-09 13:04 ` Noel Yap
  2002-08-09 13:10   ` Matt Austern
                     ` (3 more replies)
  2002-08-09 13:10 ` Aldy Hernandez
                   ` (5 subsequent siblings)
  6 siblings, 4 replies; 256+ messages in thread
From: Noel Yap @ 2002-08-09 13:04 UTC (permalink / raw)
  To: Mike Stump, gcc

Build speeds are most helped by minimizing the number
of files opened and closed during the build.  I think
a good start would be to have preprocessed header
files.  My idea would be to add options to cpp that
would have it produce preprocessed files.  Doing so
would allow it to be easily integrated into a build
system like "make".

At first, I think all that's really needed is a cpp
option, say --preprocess-includes, that just goes
through and preprocesses the #include directives (eg
it doesn't preprocess #define's, #if's, ...).
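
(Roughly, the proposed transformation; the file names are invented for
illustration.)

    /* foo.h, before: */
    #define FOO 1
    #include "bar.h"

    /* foo.h, after --preprocess-includes: only the #include has been
       expanded; the #define, and any #if, is left for the real
       compile.  */
    #define FOO 1
    /* ...contents of bar.h spliced in verbatim... */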

Conceivably, this would also require some other
option, possibly --preprocessed-header-file-path, so
that it can recognize when to use existing
preprocessed header files.

MTC,
Noel


* Re: Faster compilation speed
  2002-08-09 12:17 Faster compilation speed Mike Stump
  2002-08-09 13:04 ` Noel Yap
@ 2002-08-09 13:10 ` Aldy Hernandez
  2002-08-09 15:28   ` Mike Stump
  2002-08-09 14:29 ` Neil Booth
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 256+ messages in thread
From: Aldy Hernandez @ 2002-08-09 13:10 UTC (permalink / raw)
  To: Mike Stump; +Cc: gcc

>>>>> "Mike" == Mike Stump <mrs@apple.com> writes:

 > + /* Nonzero for compiling as fast as we can.  */
 > +
 > + extern int flag_speed_compile;
 > +
 > + #define SPEEDCOMPILE flag_speed_compile

So, you want to introduce a flag to do faster compilation?  Why not
spend your time making the current infrastructure faster?

Aldy


* Re: Faster compilation speed
  2002-08-09 13:04 ` Noel Yap
@ 2002-08-09 13:10   ` Matt Austern
  2002-08-09 14:22   ` Neil Booth
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 256+ messages in thread
From: Matt Austern @ 2002-08-09 13:10 UTC (permalink / raw)
  To: Noel Yap; +Cc: Mike Stump, gcc

On Friday, August 9, 2002, at 01:04 PM, Noel Yap wrote:


* Re: Faster compilation speed
  2002-08-09 13:04 ` Noel Yap
  2002-08-09 13:10   ` Matt Austern
@ 2002-08-09 14:22   ` Neil Booth
  2002-08-09 14:44     ` Noel Yap
  2002-08-09 15:13   ` Stan Shebs
  2002-08-09 18:57   ` Linus Torvalds
  3 siblings, 1 reply; 256+ messages in thread
From: Neil Booth @ 2002-08-09 14:22 UTC (permalink / raw)
  To: Noel Yap; +Cc: Mike Stump, gcc

Noel Yap wrote:-

> At first, I think all that's really needed is a cpp
> option, say --preprocess-includes, that just goes
> through and preprocesses the #include directives (eg
> it doesn't preprocess #define's, #if's, ...).

Heh, if only life were this easy.  If you actually think about what CPP
does, you'd realize this is a no-go.  Two immediate issues:

1) #include can take a macro as argument
2) #include can appear in preprocessor conditional blocks.  You
   only know whether they are processed if you know the correct value
   of the #if.  This often depends on macro expansions, and correct
   processing of prior includes.  Of course, #defines appear in
   conditional blocks too, so this is kind of important to get right.

There are no shortcuts here, not even trivial ones: to preprocess
something properly, you have to do *everything* the preprocessor does
"normally".
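
(Both problems in one small example; the file and macro names are
invented.)

    /* (1) the #include target is a macro; (2) which branch is taken,
       and hence which header gets included, depends on evaluating the
       #ifdef correctly, which may in turn depend on earlier includes.  */
    #ifdef USE_NEW_API
    # define CONFIG_H "new-config.h"
    #else
    # define CONFIG_H "old-config.h"
    #endif
    #include CONFIG_H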

We *do* do too many stats and opens though; when I get time I'll post
my ideas about this.

Neil.


* Re: Faster compilation speed
  2002-08-09 12:17 Faster compilation speed Mike Stump
  2002-08-09 13:04 ` Noel Yap
  2002-08-09 13:10 ` Aldy Hernandez
@ 2002-08-09 14:29 ` Neil Booth
  2002-08-09 15:02   ` Nathan Sidwell
  2002-08-12 12:11   ` Mike Stump
  2002-08-09 14:51 ` Stan Shebs
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 256+ messages in thread
From: Neil Booth @ 2002-08-09 14:29 UTC (permalink / raw)
  To: Mike Stump; +Cc: gcc

Mike Stump wrote:-

> I'd like to introduce lots of various changes to improve compiler 
> speed.

Just my opinion, Mike, but I think a lot of current slowness is due to
redo-ing too many things, and not taking advantage of ordering or whatever
technique so that conclusions deduced from internal representations are
made in a logical, efficient way.  (e.g. I think we try to constant fold
things that we've already tried to constant fold and failed, repeatedly,
and we don't do the constant folding we do do in an optimal way.  I could
be wrong, though; I've not looked in detail).  I cannot explain this
clearly, or with any specific example, but IMO we work far too hard to
do what we do.  I'd like to see this cleaned up instead.

For example, see some of Mark's recent patches.  I think we could continue
doing that for ages.  I also believe that using Bison (and our
ill-considered extensions like attributes pretty much anywhere) don't
help efficiency.  We could probably do better in the C front end with
a tree representation that is closer to C than the current
multi-language form of trees.

What worries me about PCH and similar schemes is it's too easy to fix
the symptoms, rather than the real reasons for the slowness.  As a
result, such things might never be fixed.

Neil.


* Re: Faster compilation speed
  2002-08-09 14:22   ` Neil Booth
@ 2002-08-09 14:44     ` Noel Yap
  2002-08-09 15:14       ` Neil Booth
  0 siblings, 1 reply; 256+ messages in thread
From: Noel Yap @ 2002-08-09 14:44 UTC (permalink / raw)
  To: Neil Booth; +Cc: Mike Stump, gcc

--- Neil Booth <neil@daikokuya.co.uk> wrote:
> Heh, if only life were this easy.  If you actually
> think about what CPP
> does, you'd realize this is a no-go.  Two immediate
> issues:
> 
> 1) #include can take a macro as argument

Yes, what I suggest certainly won't work for this
situation.

OTOH, how many times is this really used?  Would it be
such a sin to say that one cannot do the preprocessing
I suggested if one has macros for #include arguments?

> 2) #include can appear in preprocessor conditional
> blocks.  You
>    only know whether they are processed if you know
> the correct value
>    of the #if.  This often depends on macro
> expansions, and correct
>    processing of prior includes.  Of course,
> #defines appear in
>    conditional blocks too, so this is kind of
> important to get right.

I don't see this as too big a problem.  Just output a
file like:
#if COND
/* contents of header file */
#endif

In fact, doing it this way has the advantage that
several builds, not necessarily agreeing on the value
of COND, can use the file.

> There are no easy shortcuts here: to preprocess
> something properly,
> you have to do *everything* the preprocessor does
> "normally".  There
> are no shortcuts, not even trivial ones.

I think one needn't preprocess everything perfectly in
order to gain significant advantages.  Would you say
that what I suggest is better than what we have now?

If an ideal solution is being worked on, I'd opt for
that.  OTOH, I think this solution has been in the
works for at least a couple of years now.  I think the
--preprocess-includes option should be very simple to
do.

> We *do* do too many stats and opens though; when I
> get time I'll post
> my ideas about this.

I'm sure my ideas are far from ideal so I'm looking
forward to yours.

Noel



* Re: Faster compilation speed
  2002-08-09 12:17 Faster compilation speed Mike Stump
                   ` (2 preceding siblings ...)
  2002-08-09 14:29 ` Neil Booth
@ 2002-08-09 14:51 ` Stan Shebs
  2002-08-09 15:03   ` David Edelsohn
  2002-08-09 15:26   ` Geoff Keating
  2002-08-09 14:59 ` Timothy J. Wood
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 256+ messages in thread
From: Stan Shebs @ 2002-08-09 14:51 UTC (permalink / raw)
  To: Mike Stump; +Cc: gcc

Mike Stump wrote:


* Re: Faster compilation speed
  2002-08-09 12:17 Faster compilation speed Mike Stump
                   ` (3 preceding siblings ...)
  2002-08-09 14:51 ` Stan Shebs
@ 2002-08-09 14:59 ` Timothy J. Wood
  2002-08-16 13:31   ` Problem with PFE approach [Was: Faster compilation speed] Timothy J. Wood
  2002-08-09 16:01 ` Faster compilation speed Richard Henderson
  2002-08-10 17:48 ` Aaron Lehmann
  6 siblings, 1 reply; 256+ messages in thread
From: Timothy J. Wood @ 2002-08-09 14:59 UTC (permalink / raw)
  To: Mike Stump; +Cc: gcc

On Friday, August 9, 2002, at 12:17  PM, Mike Stump wrote:


* Re: Faster compilation speed
  2002-08-09 14:29 ` Neil Booth
@ 2002-08-09 15:02   ` Nathan Sidwell
  2002-08-09 17:05     ` Stan Shebs
  2002-08-10  2:21     ` Gabriel Dos Reis
  2002-08-12 12:11   ` Mike Stump
  1 sibling, 2 replies; 256+ messages in thread
From: Nathan Sidwell @ 2002-08-09 15:02 UTC (permalink / raw)
  To: Neil Booth; +Cc: Mike Stump, gcc

Neil Booth wrote:

> Just my opinion, Mike, but I think a lot of current slowness is due to
> redo-ing too many things, and not taking advantage of ordering or whatever
> technique so that conclusions deduced from internal representations are
> made in a logical, efficient way.  (e.g. I think we try to constant fold
> things that we've already tried to constant fold and failed, repeatedly,
> and we don't do the constant folding we do do in an optimal way.  I could
> be wrong, though; I've not looked in detail).  I cannot explain this
Yup, redoing things seems to happen a lot in the C++ front end.
The type conversion machinery seems to work a lot like
	if (complicated fn to try conversion 1)
	  complicated fn to do conversion 1
	else if (complicated fn to try conversion 2)
	  complicated fn to do conversion 2
	...
Unifying static_cast, (cast), const_cast, implicit_conversion, overload
arg resolution might be a win.

I think you might be right about fold-const. That's recursive itself,
so we should only need to call that when we really need to flatten
a const, rather than after every new operation.
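
(A sketch of the pattern, using the real build/fold interfaces; the
first form is what much of the compiler does today, the second is the
suggestion, with "constant_required" a stand-in, not a real predicate.)

    /* Eager: fold after every node built.  */
    t = fold (build (PLUS_EXPR, type, a, b));

    /* Lazy: build plainly, fold once, only when a flattened constant
       is actually needed.  */
    t = build (PLUS_EXPR, type, a, b);
    if (constant_required)
      t = fold (t);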

As you'll have noticed, I'm tweaking the coverage machinery to
try and find hotspots and deadspots. My immediate plan for this
is to
a) fix .da files so they don't grow indefinitely large - nearly done
b) add some kind of __builtin_unexpected (), to mark expected
dead code (see the sketch below)
c) write some perl scripts to munge the gcov output
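
(For (b), a guess at what a use might look like; __builtin_unexpected
does not exist yet, so its exact form here is hypothetical.)

    if (bad_input)
      {
        __builtin_unexpected ();   /* mark this arc as expectedly dead
                                      for the coverage reports */
        abort ();
      }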

I hope some of that is useful to others.

nathan

-- 
Dr Nathan Sidwell   ::   http://www.codesourcery.com   ::   CodeSourcery LLC
         'But that's a lie.' - 'Yes it is. What's your point?'
nathan@codesourcery.com : http://www.cs.bris.ac.uk/~nathan/ : nathan@acm.org


* Re: Faster compilation speed
  2002-08-09 14:51 ` Stan Shebs
@ 2002-08-09 15:03   ` David Edelsohn
  2002-08-09 15:43     ` Stan Shebs
  2002-08-09 16:43     ` Alan Lehotsky
  2002-08-09 15:26   ` Geoff Keating
  1 sibling, 2 replies; 256+ messages in thread
From: David Edelsohn @ 2002-08-09 15:03 UTC (permalink / raw)
  To: Stan Shebs; +Cc: Mike Stump, gcc

>>>>> Stan Shebs writes:

Stan> I think it suffices to have -O0 mean "go as fast as possible".
Stan> From time to time, I've noticed that there's been a temptation to
Stan> try to sneak in a little optimization even at -O0, presumably with
Stan> the assumption that the time penalty was negligible.  (There are
Stan> users who complain that -O0 should do some amount of optimization,
Stan> but IMHO we should ignore them.)

	Saying "do not run any optimization at -O0" shows a tremendous
lack of understanding or investigation.  One wants minimal optimization
even at -O0 to decrease the size of the IL representation of the function
being compiled.  The little bit of computation to perform trivial
optimization more than makes up for itself with the decreased size of the
IL that needs to be processed to generate the output.

	One needs to be careful about which optimizations are run, but
with the right choices it definitely is a net win.
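
(To make this concrete, a trivial example of trivial optimization
shrinking the IL; the example is mine, not David's.)

    int f (void) { return 2 * 3 + 4; }

    /* Folding this to "return 10" while parsing costs almost nothing,
       and every later pass then carries a single constant through the
       IL instead of two arithmetic operations.  */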

David


* Re: Faster compilation speed
  2002-08-09 13:04 ` Noel Yap
  2002-08-09 13:10   ` Matt Austern
  2002-08-09 14:22   ` Neil Booth
@ 2002-08-09 15:13   ` Stan Shebs
  2002-08-09 15:18     ` Neil Booth
                       ` (2 more replies)
  2002-08-09 18:57   ` Linus Torvalds
  3 siblings, 3 replies; 256+ messages in thread
From: Stan Shebs @ 2002-08-09 15:13 UTC (permalink / raw)
  To: Noel Yap; +Cc: Mike Stump, gcc

Noel Yap wrote:


* Re: Faster compilation speed
  2002-08-09 14:44     ` Noel Yap
@ 2002-08-09 15:14       ` Neil Booth
  2002-08-10 15:54         ` Noel Yap
  0 siblings, 1 reply; 256+ messages in thread
From: Neil Booth @ 2002-08-09 15:14 UTC (permalink / raw)
  To: Noel Yap; +Cc: Mike Stump, gcc

Noel Yap wrote:-

> I don't see this as too big a problem.  Just output a
> file like:
> #if COND
> /* contents of header file */
> #endif
> 
> In fact, doing it this way has the advantage that
> several builds, not necessarily agreeing on the value
> of COND, can use the file.

Hmm, and what about header guards?  Infinite recursion?
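
(Concretely, with invented file names: a tool that splices #includes
without evaluating #ifndef/#define never sees the guards take effect.)

    /* a.h: */
    #ifndef A_H
    #define A_H
    #include "b.h"
    #endif

    /* b.h: */
    #ifndef B_H
    #define B_H
    #include "a.h"
    #endif

    /* Expanding a.h splices in b.h, whose #include splices in a.h
       again, and so on without end.  */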

> I think one needn't preprocess everything perfectly in
> order to gain significant advantages.  Would you say
> that what I suggest is better than what we have now?

Correctness is paramount; if it's not correct it's no
good.

Neil.


* Re: Faster compilation speed
  2002-08-09 15:13   ` Stan Shebs
@ 2002-08-09 15:18     ` Neil Booth
  2002-08-10 16:12       ` Noel Yap
  2002-08-09 15:19     ` Ziemowit Laski
  2002-08-10 16:07     ` Noel Yap
  2 siblings, 1 reply; 256+ messages in thread
From: Neil Booth @ 2002-08-09 15:18 UTC (permalink / raw)
  To: Stan Shebs; +Cc: Noel Yap, Mike Stump, gcc

Stan Shebs wrote:-

> Is this assertion based on empirical measurement, and if so, for what
> source code and what system?  For instance, the longest source file
> in GCC is about 15K lines, and at -O2, only a small percentage of
> time is spent messing with files.  If I use -save-temps on cp/decl.c on
> one of my (Linux) machines, I get a total time of about 38 sec from
> source to asm.  If I just compile decl.i, it's about 37 sec, so that's
> 1 sec for *all* preprocessing, including all file opening/closing.

Yes, it's very rare that preprocessing is more than 2% of -O2 time;
it's often less than 1%.  IMO that says more about the efficiency
of the rest than of CPP.

Neil.


* Re: Faster compilation speed
  2002-08-09 15:13   ` Stan Shebs
  2002-08-09 15:18     ` Neil Booth
@ 2002-08-09 15:19     ` Ziemowit Laski
  2002-08-09 15:25       ` Neil Booth
  2002-08-10 16:16       ` Noel Yap
  2002-08-10 16:07     ` Noel Yap
  2 siblings, 2 replies; 256+ messages in thread
From: Ziemowit Laski @ 2002-08-09 15:19 UTC (permalink / raw)
  To: Stan Shebs; +Cc: Ziemowit Laski, Noel Yap, Mike Stump, gcc

On Friday, August 9, 2002, at 03:12 , Stan Shebs wrote:


* Re: Faster compilation speed
  2002-08-09 15:19     ` Ziemowit Laski
@ 2002-08-09 15:25       ` Neil Booth
  2002-08-10 16:16       ` Noel Yap
  1 sibling, 0 replies; 256+ messages in thread
From: Neil Booth @ 2002-08-09 15:25 UTC (permalink / raw)
  To: Ziemowit Laski; +Cc: Stan Shebs, Noel Yap, Mike Stump, gcc

Ziemowit Laski wrote:-

> >Is this assertion based on empirical measurement, and if so, for what
> >source code and what system?  For instance, the longest source file
> >in GCC is about 15K lines, and at -O2, only a small percentage of
> >time is spent messing with files.  If I use -save-temps on cp/decl.c on
> >one of my (Linux) machines, I get a total time of about 38 sec from
> >source to asm.  If I just compile decl.i, it's about 37 sec, so that's
> >1 sec for *all* preprocessing, including all file opening/closing.
> 
> Since the preprocessor is integrated, I don't think you can separate
> the timings in this way. :(  A 'gcc3 -E cp/decl.c -o decl.i' would
> probably be more meaningful.

It is separated with the timing stuff.

Your test is not good: it tests time to output.  It is well-known
that current CPP output is quite slow; on Linux this is largely a
Glibc problem.  CPP output can be 50% of preprocessing time, which
when you think about it is quite illogical.  However, it can be
made much faster, and I will do this eventually.

Since we use an integrated CPP, timing output is kind of irrelevant
(and vastly overstates CPP time).  Current CPP provides tokens to
the parser far, far faster than cccp did via a temporary file and
a duplicated lexer in the front end (not to mention other advantages,
like precise token location information).

Neil.


* Re: Faster compilation speed
  2002-08-09 14:51 ` Stan Shebs
  2002-08-09 15:03   ` David Edelsohn
@ 2002-08-09 15:26   ` Geoff Keating
  2002-08-09 16:06     ` Stan Shebs
  2002-08-12 15:55     ` Mike Stump
  1 sibling, 2 replies; 256+ messages in thread
From: Geoff Keating @ 2002-08-09 15:26 UTC (permalink / raw)
  To: Stan Shebs; +Cc: gcc

Stan Shebs <shebs@apple.com> writes:

> Mike Stump wrote:
> 
> >
> > The first realization I came to is that the only existing control
> > for such things is -O[123], and having thought about it, I think it
> > would be best to retain and use those flags.  For minimal user
> > impact, I think it would be good to not perturb existing users of
> > -O[0123] too much, or at least, not at first.  If we wanted to
> > change them, I think -O0 should be the `fast' version, -O1 should be
> > what -O0 does now with some additions around the edges, and -O2 and
> > -O3 also slide over (at least one).  What do you think, slide them
> > all over one or more, or just make -O0 do less, or...?  Maybe we
> > have a -O0.0 to mean compile very quickly?
> 
> I think it suffices to have -O0 mean "go as fast as possible".

Note that that's different to what it means now, which is "I want the
debugger to not surprise me."

-- 
- Geoffrey Keating <geoffk@geoffk.org> <geoffk@redhat.com>


* Re: Faster compilation speed
  2002-08-09 13:10 ` Aldy Hernandez
@ 2002-08-09 15:28   ` Mike Stump
  2002-08-09 16:00     ` Aldy Hernandez
  2002-08-09 19:07     ` David Edelsohn
  0 siblings, 2 replies; 256+ messages in thread
From: Mike Stump @ 2002-08-09 15:28 UTC (permalink / raw)
  To: Aldy Hernandez; +Cc: gcc

On Friday, August 9, 2002, at 01:15 PM, Aldy Hernandez wrote:


* Re: Faster compilation speed
  2002-08-09 15:03   ` David Edelsohn
@ 2002-08-09 15:43     ` Stan Shebs
  2002-08-09 16:43     ` Alan Lehotsky
  1 sibling, 0 replies; 256+ messages in thread
From: Stan Shebs @ 2002-08-09 15:43 UTC (permalink / raw)
  To: David Edelsohn; +Cc: Mike Stump, gcc

David Edelsohn wrote:


* Re: Faster compilation speed
  2002-08-09 15:28   ` Mike Stump
@ 2002-08-09 16:00     ` Aldy Hernandez
  2002-08-09 16:26       ` Stan Shebs
  2002-08-12 16:05       ` Mike Stump
  2002-08-09 19:07     ` David Edelsohn
  1 sibling, 2 replies; 256+ messages in thread
From: Aldy Hernandez @ 2002-08-09 16:00 UTC (permalink / raw)
  To: Mike Stump; +Cc: gcc

> Let's take my combine elision patch.  This patch makes the compiler 
> generate worse code.  The way in which it is worse, is that more stack 
> space is used.  How much more, well, my initial guess is that it is 
> less than 10% worse.  Not too bad.  Maybe users would care, maybe they 

I assume you have already looked at how horrendous the code presently
generated by -O0 is.  It's pretty unusable as it is.  Who would
really want to use gcc under the influence of "worse than -O0"?
Really.

> I hope that explains my thinking a little bit more.  Comments?  
> Anything sound wrong?  Any unforeseen dangers?

Off the top of my head, if you insist on this approach, at least
guarantee that generated code is no worse to debug.  That is the only
reason *I* use -O0, to debug.

Cheers.
Aldy


* Re: Faster compilation speed
  2002-08-09 12:17 Faster compilation speed Mike Stump
                   ` (4 preceding siblings ...)
  2002-08-09 14:59 ` Timothy J. Wood
@ 2002-08-09 16:01 ` Richard Henderson
  2002-08-10 17:48 ` Aaron Lehmann
  6 siblings, 0 replies; 256+ messages in thread
From: Richard Henderson @ 2002-08-09 16:01 UTC (permalink / raw)
  To: Mike Stump; +Cc: gcc

On Fri, Aug 09, 2002 at 12:17:32PM -0700, Mike Stump wrote:
> Another question is, what should the lower limit be on uglifying code 
> for the sake of compilation speed.

You'll find that really ugly code will compile slower than code that
has been optimized some, simply because the optimized code emits less
assembly and therefore does less I/O.

As for not re-using temp slots, sure I guess that's something
we can do at -O0.  I don't see a need for the new command-line
switch though.



r~


* Re: Faster compilation speed
  2002-08-09 15:26   ` Geoff Keating
@ 2002-08-09 16:06     ` Stan Shebs
  2002-08-09 16:14       ` Terry Flannery
  2002-08-09 16:29       ` Phil Edwards
  2002-08-12 15:55     ` Mike Stump
  1 sibling, 2 replies; 256+ messages in thread
From: Stan Shebs @ 2002-08-09 16:06 UTC (permalink / raw)
  To: Geoff Keating; +Cc: gcc

Geoff Keating wrote:


* Re: Faster compilation speed
  2002-08-09 16:06     ` Stan Shebs
@ 2002-08-09 16:14       ` Terry Flannery
  2002-08-09 16:29         ` Neil Booth
  2002-08-09 16:29       ` Phil Edwards
  1 sibling, 1 reply; 256+ messages in thread
From: Terry Flannery @ 2002-08-09 16:14 UTC (permalink / raw)
  To: Stan Shebs, Geoff Keating; +Cc: gcc

IMHO, a new flag should be introduced, for example -Of for maximum compile
speed and no surprises when debugging.  -O0 should mean minimal optimizations,
and -O[s1-3] should remain as they are.
I use the preprocessor to generate a preprocessed version of all the system
headers I use, into one header, and #include that in my program's header
(with the flags to dump macros), saving some time when building.  If there
were some support for pre-compiled headers, I'm sure that the compiler would
be much faster.
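
(The invocation is roughly this; the file names are made up, and -dD is
a guess at what "the flags to dump macros" refers to.)

    gcc -E -dD all-system-headers.h -o all-system-headers.i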

Terry

----- Original Message -----
From: "Stan Shebs" <shebs@apple.com>
To: "Geoff Keating" <geoffk@geoffk.org>
Cc: <gcc@gcc.gnu.org>
Sent: Saturday, August 10, 2002 12:05 AM
Subject: Re: Faster compilation speed


> Geoff Keating wrote:
>
> >Stan Shebs <shebs@apple.com> writes:
> >
> >>Mike Stump wrote:
> >>
> >>>The first realization I came to is that the only existing control
> >>>for such things is -O[123], and having thought about it, I think it
> >>>would be best to retain and use those flags.  For minimal user
> >>>impact, I think it would be good to not perturb existing users of
> >>>-O[0123] too much, or at least, not at first.  If we wanted to
> >>>change them, I think -O0 should be the `fast' version, -O1 should be
> >>>what -O0 does now with some additions around the edges, and -O2 and
> >>>-O3 also slide over (at least one).  What do you think, slide them
> >>>all over one or more, or just make -O0 do less, or...?  Maybe we
> >>>have a -O0.0 to mean compile very quickly?
> >>>
> >>I think it suffices to have -O0 mean "go as fast as possible".
> >>
> >
> >Note that that's different to what it means now, which is "I want the
> >debugger to not surprise me."
> >
> There's been a little bit of a drift over the years - -O0 used to be
> "no opts at all", -O1 was "not too surprising for the debugger", and
> -O2 was all-out.  I remember some pressure from Cygnus customers to
> make -O0 do more optimization, sometimes out of stupidity, but in the
> legitimate cases because the -O0 code was too slow and/or large to
> fit on the target embedded system, even for debugging.
>
> So what *should* we do with -O0 optimizations that measurably
> slow down the compiler?
>
> Stan


* Re: Faster compilation speed
  2002-08-09 16:00     ` Aldy Hernandez
@ 2002-08-09 16:26       ` Stan Shebs
  2002-08-09 16:31         ` Aldy Hernandez
                           ` (2 more replies)
  2002-08-12 16:05       ` Mike Stump
  1 sibling, 3 replies; 256+ messages in thread
From: Stan Shebs @ 2002-08-09 16:26 UTC (permalink / raw)
  To: Aldy Hernandez; +Cc: Mike Stump, gcc

Aldy Hernandez wrote:


* Re: Faster compilation speed
  2002-08-09 16:14       ` Terry Flannery
@ 2002-08-09 16:29         ` Neil Booth
  0 siblings, 0 replies; 256+ messages in thread
From: Neil Booth @ 2002-08-09 16:29 UTC (permalink / raw)
  To: Terry Flannery; +Cc: Stan Shebs, Geoff Keating, gcc

Terry Flannery wrote:-

> IMHO, a new flag should be introduced, for example, -Of for maximum compile
> speed, and no surprises when debugging. -O0 should be minimal optimizations,
> and -O[s1-3] should remain as they are.
> I use the preprocessor to generate a preprocessed version of all the system
> header I use, into one header, and #include that in my program's header
> (with the flags to dump macros) , saving some time when building. If there
> was some support for pre-compiled headers, I'm sure that the compiler would
> be much faster.

How much time (%-wise) does it save?

Neil.


* Re: Faster compilation speed
  2002-08-09 16:06     ` Stan Shebs
  2002-08-09 16:14       ` Terry Flannery
@ 2002-08-09 16:29       ` Phil Edwards
  2002-08-12 16:24         ` Mike Stump
  1 sibling, 1 reply; 256+ messages in thread
From: Phil Edwards @ 2002-08-09 16:29 UTC (permalink / raw)
  To: Stan Shebs; +Cc: gcc

On Fri, Aug 09, 2002 at 04:05:16PM -0700, Stan Shebs wrote:
> So what *should* we do with -O0 optimizations that measurably
> slow down the compiler?

How "minimal" can an optimization be, if it measurably slows down the
compiler?  If it slows things down, let's just move it to -O1/-O2.


Personally, "fastest compile possible" usually just means -fsyntax-only.
I have a hard time wanting to do anything with ad-hoc output.

Phil

-- 
I would therefore like to posit that computing's central challenge, viz. "How
not to make a mess of it," has /not/ been met.
                                                 - Edsger Dijkstra, 1930-2002


* Re: Faster compilation speed
  2002-08-09 16:26       ` Stan Shebs
@ 2002-08-09 16:31         ` Aldy Hernandez
  2002-08-09 16:51           ` Stan Shebs
  2002-08-09 17:36         ` Daniel Berlin
  2002-08-12 16:23         ` Mike Stump
  2 siblings, 1 reply; 256+ messages in thread
From: Aldy Hernandez @ 2002-08-09 16:31 UTC (permalink / raw)
  To: Stan Shebs; +Cc: Mike Stump, gcc

> OK, then to really rub it in, CW runs much faster than GCC, even on
> that slow Darwin OS :-), and that's with its non-optimizing case being

Hey, no fair.  You know my complaints are strictly in the filesystem
:).

> Sacrificing -O0 optimization is just a desperation move, since
> we don't seem to have many other ideas about how to make GCC as
> fast as CW.

Ah, the truth comes out.  So... Don't you think that if we spent more
time getting the infrastructure faster, -O0 will improve as well?

Either way, I ain't going to vote against a faster -O0.  At least
it speeds up my development cycle, since I program by building cc1,
inspecting assembly, and repeating cycle :).

Aldy


* Re: Faster compilation speed
  2002-08-09 15:03   ` David Edelsohn
  2002-08-09 15:43     ` Stan Shebs
@ 2002-08-09 16:43     ` Alan Lehotsky
  2002-08-09 16:49       ` Matt Austern
  1 sibling, 1 reply; 256+ messages in thread
From: Alan Lehotsky @ 2002-08-09 16:43 UTC (permalink / raw)
  To: David Edelsohn; +Cc: Stan Shebs, Mike Stump, gcc

At 6:03 PM -0400 8/9/02, David Edelsohn wrote:


* Re: Faster compilation speed
  2002-08-09 16:43     ` Alan Lehotsky
@ 2002-08-09 16:49       ` Matt Austern
  2002-08-10  2:24         ` Gabriel Dos Reis
  0 siblings, 1 reply; 256+ messages in thread
From: Matt Austern @ 2002-08-09 16:49 UTC (permalink / raw)
  To: Alan Lehotsky; +Cc: David Edelsohn, Stan Shebs, Mike Stump, gcc

On Friday, August 9, 2002, at 04:17 PM, Alan Lehotsky wrote:


* Re: Faster compilation speed
  2002-08-09 16:31         ` Aldy Hernandez
@ 2002-08-09 16:51           ` Stan Shebs
  2002-08-09 16:54             ` Aldy Hernandez
                               ` (3 more replies)
  0 siblings, 4 replies; 256+ messages in thread
From: Stan Shebs @ 2002-08-09 16:51 UTC (permalink / raw)
  To: Aldy Hernandez; +Cc: Mike Stump, gcc

Aldy Hernandez wrote:


* Re: Faster compilation speed
  2002-08-09 16:51           ` Stan Shebs
@ 2002-08-09 16:54             ` Aldy Hernandez
  2002-08-09 17:44             ` Daniel Berlin
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 256+ messages in thread
From: Aldy Hernandez @ 2002-08-09 16:54 UTC (permalink / raw)
  To: Stan Shebs; +Cc: Mike Stump, gcc

> I don't think Mike mentioned it, but speeding up the compiler has
> become our group's top priority, and every idea is on the table
> right now.  The 6x goal sounds extreme, but it helps keep in mind
> that one or two or even a dozen 5% improvements will not be
> sufficient to attain parity with the competition.

Fair enough.  Game on, and good luck.

And please don't keep your changes in your tree, and then have them
become obsolete in 4 months when you try to merge :)

Aldy


* Re: Faster compilation speed
  2002-08-09 15:02   ` Nathan Sidwell
@ 2002-08-09 17:05     ` Stan Shebs
  2002-08-10  2:21     ` Gabriel Dos Reis
  1 sibling, 0 replies; 256+ messages in thread
From: Stan Shebs @ 2002-08-09 17:05 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: Neil Booth, Mike Stump, gcc

Nathan Sidwell wrote:


* Re: Faster compilation speed
  2002-08-09 16:26       ` Stan Shebs
  2002-08-09 16:31         ` Aldy Hernandez
@ 2002-08-09 17:36         ` Daniel Berlin
  2002-08-12 16:23         ` Mike Stump
  2 siblings, 0 replies; 256+ messages in thread
From: Daniel Berlin @ 2002-08-09 17:36 UTC (permalink / raw)
  To: Stan Shebs; +Cc: Aldy Hernandez, Mike Stump, gcc

On Fri, 9 Aug 2002, Stan Shebs wrote:

> Aldy Hernandez wrote:
> 
> >>Let's take my combine elision patch.  This patch makes the compiler 
> >>generate worse code.  The way in which it is worse, is that more stack 
> >>space is used.  How much more, well, my initial guess is that it is 
> >>less than 10% worse.  Not too bad.  Maybe users would care, maybe they 
> >>
> >
> >I assume you have already looked at how horrendous the code presently
> >generated by -O0 is.  It's pretty unusable as it is.  Who would
> >really want to use gcc under the influence of "worse than -O0"?
> >Really.
> >
> OK, then to really rub it in, CW runs much faster than GCC, even on
> that slow Darwin OS :-), and that's with its non-optimizing case being
> about halfway between GCC's -O0 and -O1, and works well with the
> debugger still.
> 
> Sacrificing -O0 optimization is just a desperation move, since
> we don't seem to have many other ideas about how to make GCC as
> fast as CW.

Look, there are, in reality, two things that make our compiler slower 
than Metrowerks, even at -O0

First is parsing.
The bison parser is just not fast. It never will be.
Period.

The second is expansion from tree to RTL.
It's not fast either.  The timings don't always tell the real story.  There
are cases where expansion is occurring when the timevar isn't pushed (i.e.
other things that call expand_*, where * = anything but _body, which is
where the timevar is pushed).

The solution to the first is already in progress (give me a clean,
working hand-written parser that can compile libstdc++, and I'll happily
make it go real fast.  I was just starting to when the branch was
abandoned).

CodeWarrior, for comparison's sake, uses a backtracking recursive
descent parser for its C++ compiler.

The second is hard to solve in a way people would like.  The fastest way 
to solve the problem is to do native code generation off the tree at -O0, 
avoiding any optimizations whatsoever.

This is, of course, not easy to do with our current MD files.
We really would need a *burg like tool and associated descriptions.
You could do debugging output without too much difficulty. Most of the 
debug_* functions operate on trees anyway.

PFE solves our first problem as well, but not the second one.  We still 
have to *generate* the code.

But there still have to be better answers than trying to avoid the backend
entirely.  If our backend is so godawfully bad that we have to start
skipping entire "normal" phases (i.e. not optimizations to speed up the
generated code, but things that plenty of other compilers do even at -O0),
then we really *do* need to rearchitect them, and maybe more.
Not just directed speed ups.
At some point, it becomes easier to redo it from scratch well.
Particularly when nobody today understands why anyone thought it was a 
good idea to do it the way it's done now.

 --Dan




> 
> Stan


* Re: Faster compilation speed
  2002-08-09 16:51           ` Stan Shebs
  2002-08-09 16:54             ` Aldy Hernandez
@ 2002-08-09 17:44             ` Daniel Berlin
  2002-08-09 18:35               ` David S. Miller
  2002-08-09 18:25             ` David S. Miller
  2002-08-10 10:02             ` Neil Booth
  3 siblings, 1 reply; 256+ messages in thread
From: Daniel Berlin @ 2002-08-09 17:44 UTC (permalink / raw)
  To: Stan Shebs; +Cc: Aldy Hernandez, Mike Stump, gcc

On Fri, 9 Aug 2002, Stan Shebs wrote:

> Aldy Hernandez wrote:
> 
> >[...]
> >
> >   So... Don't you think that if we spent more
> >time getting the infrastructure faster, -O0 will improve as well?
> >
> Well sure, it should be part of the plan.
> 
> One of my suspicions is that the massive use of macros in tree
> and RTL is concealing excessive pointer chasing, because they
> don't show up in either profile or coverage numbers

Ding ding, you have another winner.

I actually benched this once, by functionizing some often used macros.

The timings were horrendous.
But what can we do to increase cache locality, or get rid of these 
problems?
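
(The kind of macro in question, as in GCC's tree.h of this era, in the
non-checking variants; note that each accessor is a dependent pointer
load.)

    #define TREE_CODE(NODE)       ((enum tree_code) (NODE)->common.code)
    #define TREE_TYPE(NODE)       ((NODE)->common.type)
    #define TREE_OPERAND(NODE, I) ((NODE)->exp.operands[(I)])

    /* So an innocent-looking test such as
         TREE_CODE (TREE_TYPE (TREE_OPERAND (exp, 0))) == INTEGER_TYPE
       is three dependent loads from three different nodes, each a
       potential cache miss.  */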


> is taking the macros that we function-ized for debugging purposes
> (Ira posted it to gcc-patches some time ago, but nobody wanted it
> because dwarf2 macro debugging was going to be available RSN), and
> will build a (slow) GCC that will do it all through function calls.
> That should yield a much more interesting profile.
> 
> I don't think Mike mentioned it, but speeding up the compiler has
> become our group's top priority, and every idea is on the table
> right now.  The 6x goal sounds extreme, but it helps keep in mind
> that one or two or even a dozen 5% improvements will not be
> sufficient to attain parity with the competition.

I think part of the problem is that the timings gcc itself outputs aren't 
completely accurate, because sometimes we go around the calls that would 
push the timevar.

> Stan


* Re: Faster compilation speed
  2002-08-09 16:51           ` Stan Shebs
  2002-08-09 16:54             ` Aldy Hernandez
  2002-08-09 17:44             ` Daniel Berlin
@ 2002-08-09 18:25             ` David S. Miller
  2002-08-13  0:50               ` Loren James Rittle
  2002-08-10 10:02             ` Neil Booth
  3 siblings, 1 reply; 256+ messages in thread
From: David S. Miller @ 2002-08-09 18:25 UTC (permalink / raw)
  To: gcc

All of these attempts at taking care of "low-hanging fruit"
are great.  But these efforts should not make us ignore the
real problems GCC has.

For example, I'm convinced that teaching all the RTL code "how to
count" and thus obviating garbage collection all together, would be
the biggest win ever.  (I'm saying RTL should have reference counts,
if someone didn't catch what I meant)
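
(A minimal sketch of the idea as a toy model; this is not GCC's real
rtx, and all the names are invented.)

    #include <stdlib.h>

    typedef struct rtx_def
    {
      int code;
      unsigned refcount;
      struct rtx_def *ops[2];   /* toy: pretend every rtx has two operands */
    } *rtx;

    static rtx
    rtx_ref (rtx x)
    {
      if (x)
        x->refcount++;
      return x;
    }

    static void
    rtx_unref (rtx x)
    {
      if (x && --x->refcount == 0)
        {
          rtx_unref (x->ops[0]);
          rtx_unref (x->ops[1]);
          free (x);   /* the memory comes back immediately, no GC pass */
        }
    }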

Someone, I think Stan Shebs, mentioned pointer chasing,
and that's another great area of exploration.

The problem is that most people don't want to, or don't have the time
to, sit down and make the far-reaching changes necessary to fix these
top-level problems.

This is exactly what makes things such as a "flag_go_fast" option so
appealing.  :-(


* Re: Faster compilation speed
  2002-08-09 17:44             ` Daniel Berlin
@ 2002-08-09 18:35               ` David S. Miller
  2002-08-09 18:39                 ` Aldy Hernandez
  0 siblings, 1 reply; 256+ messages in thread
From: David S. Miller @ 2002-08-09 18:35 UTC (permalink / raw)
  To: dberlin; +Cc: shebs, aldyh, mrs, gcc

   From: Daniel Berlin <dberlin@dberlin.org>
   Date: Fri, 9 Aug 2002 20:44:00 -0400 (EDT)
   
   The timings were horrendous.
   But what can we do to increase cache locality, or get rid of these
   problems?

And TLB locality...  I propose two possible solutions.

1) Reference count these objects properly, and stop being at the
   mercy of the garbage collector.

2) Make RTL/TREE layout less pointer driven.

I read elsewhere today someone saying that garbage collecting is for
people who cannot count, and after trying to beat GCC's GC into
submission for a few weeks I couldn't agree more :-)  And for this
reason if I had the time right now I'd probably tackle #1 first.


* Re: Faster compilation speed
  2002-08-09 18:35               ` David S. Miller
@ 2002-08-09 18:39                 ` Aldy Hernandez
  2002-08-09 18:59                   ` David S. Miller
  2002-08-09 20:01                   ` Per Bothner
  0 siblings, 2 replies; 256+ messages in thread
From: Aldy Hernandez @ 2002-08-09 18:39 UTC (permalink / raw)
  To: David S. Miller; +Cc: dberlin, shebs, mrs, gcc

> 2) Make RTL/TREE layout less pointer driven.

For the clueless, ahem me, could you go into more detail on this?

Thanks.

Aldy


* Re: Faster compilation speed
  2002-08-09 13:04 ` Noel Yap
                     ` (2 preceding siblings ...)
  2002-08-09 15:13   ` Stan Shebs
@ 2002-08-09 18:57   ` Linus Torvalds
  2002-08-09 19:12     ` Phil Edwards
                       ` (2 more replies)
  3 siblings, 3 replies; 256+ messages in thread
From: Linus Torvalds @ 2002-08-09 18:57 UTC (permalink / raw)
  To: yap_noel, gcc

In article <20020809200413.46719.qmail@web21403.mail.yahoo.com> you write:
>Build speeds are most helped by minimizing the number
>of files opened and closed during the build.

I _seriously_ doubt that.

Opening (and even reading) a cached file is not an expensive operation,
not compared to the kinds of run-times gcc has.  We're talking a few
microseconds per file open at a low level.  Even parsing it should not
be that expensive, especially if the preprocessor is any good (and from
all I've seen, these days it _is_ good).

I strongly suspect that what makes gcc slow is that it has absolutely
horrible cache behaviour, a big VM footprint, and chases pointers in
that badly cached area all of the time.

And that, in turn, is probably impossible to fix as long as gcc uses
garbage collection for most of its internal memory management.  There
just aren't all that many worse ways to f*ck up your cache behaviour
than by using lots of allocations and lazy GC to manage your memory. 

The problem with bad cache behaviour is that you don't get nice spikes
in specific places that you can try to optimize - the cost ends up being
spread all over the places that touch the data structures. 

The problem with trying to avoid GC is that if you do that you have to
be careful about your reference counts, and I doubt the gcc people want
to be that careful, especially considering that the code-base right now
is not likely to be very easy to convert.

(Plus the fact that GC proponents absolutely refuse to see the error of
their ways, and will flame me royally for even _daring_ to say that GC
sucks donkey brains through a straw from a performance standpoint.  In
order to work with refcounting, you need to have the mentality that
every single data structure with a non-local lifetime needs to have the
count as its major member.)

			Linus


* Re: Faster compilation speed
  2002-08-09 18:39                 ` Aldy Hernandez
@ 2002-08-09 18:59                   ` David S. Miller
  2002-08-09 20:01                   ` Per Bothner
  1 sibling, 0 replies; 256+ messages in thread
From: David S. Miller @ 2002-08-09 18:59 UTC (permalink / raw)
  To: aldyh; +Cc: dberlin, shebs, mrs, gcc

   From: Aldy Hernandez <aldyh@redhat.com>
   Date: Fri, 9 Aug 2002 18:45:00 -0700

   > 2) Make RTL/TREE layout less pointer driven.
   
   For the clueless, ahem me, could you go into more detail on this?
   
Embed RTL object info instead of using pointers to other RTL objects.

It's about as far a reaching change as reference counting RTL and
killing off garbage collection.  The reason #2 is so far reaching is
that it would require changing several of the semantics of shared RTL
and also getting rid of the places that just randomly stick new RTL
all over the place.
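
(An illustration of the layout difference, with invented structures.)

    /* Pointer-driven: every operand is a separately allocated node,
       so walking an expression hops all over the heap.  */
    struct node_p { int code; struct node_p *op0, *op1; };

    /* Embedded: small operands live inside their parent, so a walk
       mostly stays within a cache line or two.  */
    struct leaf   { int code; long value; };
    struct node_e { int code; struct leaf op0, op1; };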

Garbage collection is just an excuse to be lazy with how we manage
RTL objects in GCC.

Further consideration suggests that you can approach either solution
in at least two stages.  The first stage is somehow documenting in
the code each spot where we rewrite existing RTL.  That makes the
rest of the work a bit easier.


* Re: Faster compilation speed
  2002-08-09 15:28   ` Mike Stump
  2002-08-09 16:00     ` Aldy Hernandez
@ 2002-08-09 19:07     ` David Edelsohn
  1 sibling, 0 replies; 256+ messages in thread
From: David Edelsohn @ 2002-08-09 19:07 UTC (permalink / raw)
  To: Mike Stump, Stan Shebs; +Cc: gcc

	In regard to the benefit of some optimization at -O0, please see
http://gcc.gnu.org/ml/gcc-patches/2000-01/msg00690.html ("The Death of
Stupid").

	Other commercial compilers are able to focus on compilation speed
at -O0 with some small, appropriate optimization.  They also efficiently
produce extremely good code with full optimization enabled.  They do not
need an additional -fquick-compile flag.

	GCC does not have much low-hanging fruit left.  IMHO, playing
these speed-up games distracts interested developers from addressing the
fundamental design problems which slow down GCC.  The underlying problems
have been mentioned in this discussion.  If we begin to attack them now,
we may have them ready for GCC 3.4.  If we keep looking for easy
solutions, GCC is going to remain at a disadvantage.

David


* Re: Faster compilation speed
  2002-08-09 18:57   ` Linus Torvalds
@ 2002-08-09 19:12     ` Phil Edwards
  2002-08-09 19:34     ` Kevin Atkinson
  2002-08-10 19:20     ` Noel Yap
  2 siblings, 0 replies; 256+ messages in thread
From: Phil Edwards @ 2002-08-09 19:12 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: yap_noel, gcc

On Fri, Aug 09, 2002 at 06:56:58PM -0700, Linus Torvalds wrote:
> In article < 20020809200413.46719.qmail@web21403.mail.yahoo.com > you write:
> >Build speeds are most helped by minimizing the number
> >of files opened and closed during the build.
> 
> I _seriously_ doubt that.

To be fair, when listing "things we can do to speed up the build," most
people don't include tinkering with the guts of the compiler.  Statements
like that of the original poster are correct when the compiler cannot be
touched, and in fact many textbooks say exactly that:  minimize the number
of files opened (or more generally, system calls) to speed the build.
(The lesson is typically something about multiple include guard macros or
proper makefile dependencies.)  So let's not be too harsh.

When we're allowed to hack on the compiler source itself, of course,
those statements go right out the window.  :-)


Phil

-- 
I would therefore like to posit that computing's central challenge, viz. "How
not to make a mess of it," has /not/ been met.
                                                 - Edsger Dijkstra, 1930-2002


* Re: Faster compilation speed
  2002-08-09 18:57   ` Linus Torvalds
  2002-08-09 19:12     ` Phil Edwards
@ 2002-08-09 19:34     ` Kevin Atkinson
  2002-08-09 20:28       ` Linus Torvalds
  2002-08-10 19:20     ` Noel Yap
  2 siblings, 1 reply; 256+ messages in thread
From: Kevin Atkinson @ 2002-08-09 19:34 UTC (permalink / raw)
  To: gcc

On Fri, 9 Aug 2002, Linus Torvalds wrote:

> And that, in turn, is probably impossible to fix as long as gcc uses
> garbage collection for most of its internal memory management.  There
> just aren't all that many worse ways to f*ck up your cache behaviour
> than by using lots of allocations and lazy GC to manage your memory. 

Excuse the interruption, but from what I've read, a good generational
garbage collector can be just as fast as manually managing memory.  Is
this not the case?  If so, could someone point me to some information
explaining why?  I am not trying to argue with anyone, as I really don't
know that much about GC beyond what I've read in a few papers.

Sorry, I was reading this thread and that point took me by surprise.

--- 
http://kevin.atkinson.dhs.org

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-09 18:39                 ` Aldy Hernandez
  2002-08-09 18:59                   ` David S. Miller
@ 2002-08-09 20:01                   ` Per Bothner
  1 sibling, 0 replies; 256+ messages in thread
From: Per Bothner @ 2002-08-09 20:01 UTC (permalink / raw)
  To: Aldy Hernandez; +Cc: gcc

Aldy Hernandez wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-09 19:34     ` Kevin Atkinson
@ 2002-08-09 20:28       ` Linus Torvalds
  2002-08-09 21:12         ` Daniel Berlin
  2002-08-10  6:32         ` Robert Lipe
  0 siblings, 2 replies; 256+ messages in thread
From: Linus Torvalds @ 2002-08-09 20:28 UTC (permalink / raw)
  To: kevin, gcc

In article < Pine.LNX.4.44.0208092227500.2273-100000@kevin-pc.atkinson.dhs.org > you write:
>On Fri, 9 Aug 2002, Linus Torvalds wrote:
>
>> And that, in turn, is probably impossible to fix as long as gcc uses
>> garbage collection for most of its internal memory management.  There
>> just aren't all that many worse ways to f*ck up your cache behaviour
>> than by using lots of allocations and lazy GC to manage your memory. 
>
>Excuse the interruption, but from what I read a good generational garbage 
>collector can be just as fast as manually managing memory?

All the papers I've seen on it are total jokes.  But maybe I've looked
at the wrong ones. 

One fundamental fact on modern hardware is that data cache locality is
good, and not being in the cache sucks.  This is not likely to change. 
In particular, this means that if you allocate stuff, you want to re-use
the stuff you just freed _as_soon_as_possible_ - preferably before the
previously dirty data has ever even been evicted from the cache, so that
you can re-use the thing to avoid reading it in, but also to avoid
writing out stale data. 

This implies that any lazy de-allocation is bad. When a piece of memory
is free, you want to de-allocate it _immediately_, so that the next
allocation gets to re-use it and gets the cache footprint "for free".
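
To illustrate (a sketch, not GCC code): a LIFO free list hands the
next allocation the most recently freed object, whose cache lines are
the ones most likely to still be resident:

	#include <stdlib.h>

	struct slot { struct slot *next_free; char payload[60]; };

	static struct slot *free_list;

	static void
	slot_free (struct slot *s)
	{
	  s->next_free = free_list;	/* freed immediately, not lazily */
	  free_list = s;
	}

	static struct slot *
	slot_alloc (void)
	{
	  struct slot *s = free_list;
	  if (s)
	    {
	      free_list = s->next_free;	/* reused while still hot */
	      return s;
	    }
	  return malloc (sizeof (struct slot));
	}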

Generational garbage collectors tend to never re-use hot objects, and
often do the copying between generations, making things even worse on the
cache.  Compaction helps subsequent use somewhat, but is in itself
inherently costly, and the indirection (or fixup) implied by it can
limit other optimization. 

Sure, by being lazy you can sometimes win in icache footprint (and in
instruction count - a lot of the "GC is fast" papers seem to rely on the
fact that you can do other optimizations if you're lazy), but you lose
big in dirty dcache footprint.  And since dcache is much more expensive
than instructions, you're better off doing explicit memory management
with refcounting (optionally helped by the programming language, of
course.  You can make exact refcounting be your "GC" with some language
support). 

However, there's another, more fundamental issue.  It's the _mindset_. 
The GC mindset tends to go hand-in-hand with pointer chasing, while
people who use explicit allocators tend to be happier with doing things
like "realloc()" and trying to use arrays and indexes instead of linked
lists and just generally trying to avoid allocating lots of small
things.  Which tends to be better on the cache. 

Yes, I generalize. Don't we all?

For example, if you have an _explicit_ refcounting system, then it is
quite natural to have operations like "copy-on-write", where if you
decide to change a tree node you do something like

	node_t *copy_on_write(node_t **np)
	{
		node_t *node = *np;
		if (node->count > 1) {
			node_t *newnode = copy_alloc(node);
			*np = newnode;
			node->count--;
			node = newnode;
		}
		return node;
	}

and then before you change a tree node you do

	node = copy_on_write(&tree->node);
	.. we now know we are the exclusive owners of "node" ..

which tends to be very efficient - it allows sharing, even if sharing is
often not the common case (and doesn't do any extra allocations for the
common case of an access that was already exclusively owned).

(If you want to be thread-safe you need to be more careful yet, and have
thread-safe "get_node()/put_node()" actions etc.  Most applications
don't need to be that careful, but you'll see a _lot_ of this inside an
operating system). 

In contrast, in a GC system where you do _not_ have access to the
explicit refcounting, you tend to always copy the node, just because you
don't know if the original node might be shared through another tree or
not.  Even if sharing ends up not being the most common case.  So you do
a lot of extra work, and you end up with even more cache pressure. 

Are there GC systems that do refcounting internally _and_ expose the
information upwards to the user? I bet there are. But the fact is, the
rest of them (99.9%) give those few well-done GC's a bad name.

"So what about circular data structures? Refcounting doesn't work for
them".  Right.  Don't do them.  Or handle them very very carefully (ie
there can be a "head" that gets special handling and keeps the others
alive). Compilers certainly almost always end up working with DAG's, not
cyclic structures. Make it a rule.

Does it take more effort? Yes.  The advantage of GC is that it is
automatic.  But GC apologists should just admit that it causes bad
problems and often _encourages_ people to write code that performs
badly. 

I really think it's the mindset that is the biggest problem.  A GC
system with explicitly visible reference counts (and immediate freeing)
with language support to make it easier to get the refcounts right
(things like automatically incrementing the refcounts when passing the
object off to others) wouldn't necessarily be painful to use, and would
clearly offer all the advantages of just doing it all by hand. 

That's not the world we live in, though.

		Linus

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-09 20:28       ` Linus Torvalds
@ 2002-08-09 21:12         ` Daniel Berlin
  2002-08-09 21:52           ` Linus Torvalds
  2002-08-10  6:32         ` Robert Lipe
  1 sibling, 1 reply; 256+ messages in thread
From: Daniel Berlin @ 2002-08-09 21:12 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: kevin, gcc

> 
> "So what about circular data structures? Refcounting doesn't work for
> them".  Right.  Don't do them.  Or handle them very very carefully (ie
> there can be a "head" that gets special handling and keeps the others
> alive). Compilers certainly almost always end up working with DAG's, not
> cyclic structures. Make it a rule.
Sorry, there are cases that make this impossible to do (IOW we can't make 
it a rule).
But another option is to do what Python does.
Have a reference cycle GC that just handles breaking cycles.
Run it explicitly at times, or much like we do ggc_collect now.
Reference cycles can only possibly occur in container objects, so you 
only have to deal with the overhead of cycle-breaking there.
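
A sketch of that mixed scheme (made-up types, not Python's or GCC's
actual code; a single child pointer keeps it small):

	struct container
	{
	  int refcount;			/* ordinary immediate refcount */
	  int gc_refs;			/* scratch for the cycle pass */
	  int reachable;
	  struct container *child;	/* one internal edge */
	  struct container *all_next;	/* chain of all live containers */
	};

	static void
	mark_reachable (struct container *o)
	{
	  while (o && !o->reachable)
	    {
	      o->reachable = 1;
	      o = o->child;
	    }
	}

	/* Run now and then, much as ggc_collect is run today.  */
	static void
	collect_cycles (struct container *all)
	{
	  struct container *o;

	  for (o = all; o; o = o->all_next)
	    {
	      o->gc_refs = o->refcount;
	      o->reachable = 0;
	    }

	  /* Subtract the references that come from inside the
	     container set; whatever remains comes from outside.  */
	  for (o = all; o; o = o->all_next)
	    if (o->child)
	      o->child->gc_refs--;

	  /* Externally referenced containers are live, and so is
	     everything they reach.  */
	  for (o = all; o; o = o->all_next)
	    if (o->gc_refs > 0)
	      mark_reachable (o);

	  /* Anything still unmarked is kept alive only by a cycle;
	     unlinking and freeing it is elided in this sketch.  */
	}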

--Dan

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-09 21:12         ` Daniel Berlin
@ 2002-08-09 21:52           ` Linus Torvalds
  0 siblings, 0 replies; 256+ messages in thread
From: Linus Torvalds @ 2002-08-09 21:52 UTC (permalink / raw)
  To: Daniel Berlin; +Cc: kevin, gcc

On Sat, 10 Aug 2002, Daniel Berlin wrote:
> > 
> > "So what about circular data structures? Refcounting doesn't work for
> > them".  Right.  Don't do them.  Or handle them very very carefully (ie
> > there can be a "head" that gets special handling and keeps the others
> > alive). Compilers certainly almost always end up working with DAG's, not
> > cyclic structures. Make it a rule.
>
> Sorry, there are cases that make this impossible to do (IOW we can't make 
> it a rule).

Hmm. I can't imagine what there is that is inherently cyclic, but breaking 
the cycles might be more painful than it's worth, so I'll take your word 
for it.

Things like data structure definitions (which clearly can be cyclic thanks
to pointers to themselves) can often be resolved trivially with nesting
rules (ie if you can show that the lifetime of type A is a superset of the
lifetime of B, then you don't actually need to refcount a backpointer from
B to A).

For the obvious example that I can think of (ie just a structure
definition containing a pointer to itself - possibly indirectly via other
structures), that type lifetime nesting is inherent in the C type scopes,
for example. For type X to have been able to contain a pointer to type Y,
Y must have had a larger scope than X, so the pointer from one type
structure to another never needs refcounting in a C compiler.
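
In C, the trivially cyclic case is just:

	/* The compiler's type object for `struct node' is referenced,
	   via the field's type, by itself -- a cycle at the type
	   level.  But the type necessarily outlives every value that
	   mentions it, so the back edge needs no refcount.  */
	struct node
	{
	  struct node *next;
	};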

(This, btw, is why I don't believe in automated GC systems - even if they
use refcounting internally. It's simply fairly hard to tell a GC system
simple rules like when you need to ref-count, and when you don't.  If you
just always ref-count on assignment, you _will_ get the obvious circular
references, simply because you miss the high-level picture).

But other cases might certainly be much more painful, so I certainly agree
with you:

> But another option is to do what Python does.
> Have a reference cycle GC that just handles breaking cycles.
> Run it explicitly at times, or much like we do ggc_collect now.
> Reference cycles can only possibly occur in container objects, so you 
> only have to deal with the overhead of cycle-breaking there.

Nothing says you can't mix the two approaches, no. If the subset of
allocations you need to worry about from a GC standpoint is relatively
small, the cache efficiency advantages of refcounting clearly don't
matter, and the disadvantages can be disproportional.

			Linus

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-09 15:02   ` Nathan Sidwell
  2002-08-09 17:05     ` Stan Shebs
@ 2002-08-10  2:21     ` Gabriel Dos Reis
  1 sibling, 0 replies; 256+ messages in thread
From: Gabriel Dos Reis @ 2002-08-10  2:21 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: Neil Booth, Mike Stump, gcc

Nathan Sidwell <nathan@codesourcery.com> writes:

| unifying static_cast, (cast), const_cast, implicit_conversion, overload
| arg resolution might be a win.

We might get correctness at the same time.

[...]

| I hope some of that is useful to others.

Definitely.

-- Gaby

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-09 16:49       ` Matt Austern
@ 2002-08-10  2:24         ` Gabriel Dos Reis
  0 siblings, 0 replies; 256+ messages in thread
From: Gabriel Dos Reis @ 2002-08-10  2:24 UTC (permalink / raw)
  To: Matt Austern; +Cc: Alan Lehotsky, David Edelsohn, Stan Shebs, Mike Stump, gcc

Matt Austern <austern@apple.com> writes:

| On Friday, August 9, 2002, at 04:17 PM, Alan Lehotsky wrote:
| 
| > This is DEFINITELY TRUE!
| >
| > For example, the Bliss11 compiler ACTUALLY ran faster with 
| > optimization turned on because assembling the unoptimized code 
| > actually took longer than the time running FULL optimization required 
| > for anything but the most trivial programs.
| 
| Shall we take it as a given that nobody is going to check
| in a patch for faster compilations without benchmarking
| and making sure that it really does speed things up?

Some while ago, when the compiler slowdown was a hotter issue, it was
suggested that no new optimization-related patches should be checked
in if there was no concrete evidence that they brought noticeable
wins.  I don't know how that turned out, though.

-- Gaby

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-09 20:28       ` Linus Torvalds
  2002-08-09 21:12         ` Daniel Berlin
@ 2002-08-10  6:32         ` Robert Lipe
  2002-08-10 14:26           ` Cyrille Chepelov
  1 sibling, 1 reply; 256+ messages in thread
From: Robert Lipe @ 2002-08-10  6:32 UTC (permalink / raw)
  To: gcc

Linus Torvalds wrote:

> One fundamental fact on modern hardware is that data cache locality is
> good, and not being in the cache sucks.  This is not likely to change. 

This is a fact.


Measuring this sort of thing is possible.  (Optimizing without
measuring is seldom a good idea.)  In the absence of processor pods
and bus analyzers, has anyone thrown gcc at a tool like 'valgrind' or
cachegrind?

	http://developer.kde.org/~sewardj/

RJL

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-09 16:51           ` Stan Shebs
                               ` (2 preceding siblings ...)
  2002-08-09 18:25             ` David S. Miller
@ 2002-08-10 10:02             ` Neil Booth
  3 siblings, 0 replies; 256+ messages in thread
From: Neil Booth @ 2002-08-10 10:02 UTC (permalink / raw)
  To: Stan Shebs; +Cc: Aldy Hernandez, Mike Stump, gcc

Stan Shebs wrote:-

> One of my suspicions is that the massive use of macros in tree
> and RTL is concealing excessive pointer chasing, because they
> don't show up in either profile or coverage numbers.

Yes.  I look forward to the day when we use type-safe structures
that contain only the relevant information, rather than a "tree"
which is little more than the union of the universe, along with
compensating macros to detect type violations.

Neil.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-10  6:32         ` Robert Lipe
@ 2002-08-10 14:26           ` Cyrille Chepelov
  2002-08-10 17:33             ` Daniel Berlin
  2002-08-11  1:03             ` Florian Weimer
  0 siblings, 2 replies; 256+ messages in thread
From: Cyrille Chepelov @ 2002-08-10 14:26 UTC (permalink / raw)
  To: gcc

On Sat, Aug 10, 2002, at 08:32:26AM -0500, Robert Lipe wrote:

> Linus Torvalds wrote:
> 
> > One fundamental fact on modern hardware is that data cache locality is
> > good, and not being in the cache sucks.  This is not likely to change. 
> 
> This is a fact.

> Measuring this sort of thing is possible.  (Optimizing without
> measuring is seldom a good idea.)  In the absence of processor pods
> and bus analyzers, has anyone thrown gcc at a tool like 'valgrind' or
> cachegrind?

I just did (I was forming the idea while reading the thread, but you beat me
in suggesting it before I implemented it).

I have tried on a grand total of three files, two from today's mainline CVS
(updated from anonymous about four hours ago), and one from Linux 2.5.30; as
my machine is not exactly one of the dual-multi-gigahertz, "HT"-interconnected
(HyperTransport?) monsters with gobs of memory bandwidth (and what else? 64
bits?) that Linus has been bragging about recently, please bear with my lack
of patience to run CG over the whole of the aforementioned packages...

Some detailed results here: http://www.chepelov.org/cyrille/gcc-valgrind 

Excerpt:

	java/parse.c
==17875== I   refs:      275,598,220
==17875== I1  misses:         43,600
==17875== L2i misses:         41,948
==17875== I1  miss rate:         0.1%
==17875== L2i miss rate:         0.1%
==17875== 
==17875== D   refs:      145,894,312  (94,095,162 rd + 51,799,150 wr)
==17875== D1  misses:        322,121  (   259,431 rd +     62,690 wr)
==17875== L2d misses:        313,318  (   251,817 rd +     61,501 wr)
==17875== D1  miss rate:         0.2% (       0.2%   +        0.1%  )
==17875== L2d miss rate:         0.2% (       0.2%   +        0.1%  )
==17875== 
==17875== L2 refs:           365,721  (   303,031 rd +     62,690 wr)
==17875== L2 misses:         355,266  (   293,765 rd +     61,501 wr)
==17875== L2 miss rate:          0.0% (       0.0%   +        0.1%  )

	emit-rtl.c:
==17968== I   refs:      2,315,492,628
==17968== I1  misses:        5,888,264
==17968== L2i misses:        5,481,716
==17968== I1  miss rate:          0.25%
==17968== L2i miss rate:          0.23%
==17968== 
==17968== D   refs:      1,172,342,347  (702,376,465 rd + 469,965,882 wr)
==17968== D1  misses:        7,920,482  (  6,205,391 rd +   1,715,091 wr)
==17968== L2d misses:        7,134,597  (  5,455,816 rd +   1,678,781 wr)
==17968== D1  miss rate:           0.6% (        0.8%   +         0.3%  )
==17968== L2d miss rate:           0.6% (        0.7%   +         0.3%  )
==17968== 
==17968== L2 refs:          13,808,746  ( 12,093,655 rd +   1,715,091 wr)
==17968== L2 misses:        12,616,313  ( 10,937,532 rd +   1,678,781 wr)
==17968== L2 miss rate:            0.3% (        0.3%   +         0.3%  )

	linux/kernel/signal.c:
==22924== 
==22924== I   refs:      1,020,746
==22924== I1  misses:        1,030
==22924== L2i misses:          946
==22924== I1  miss rate:      0.10%
==22924== L2i miss rate:       0.9%
==22924== 
==22924== D   refs:        480,927  (335,166 rd + 145,761 wr)
==22924== D1  misses:        2,075  (  1,535 rd +     540 wr)
==22924== L2d misses:        2,072  (  1,532 rd +     540 wr)
==22924== D1  miss rate:       0.4% (    0.4%   +     0.3%  )
==22924== L2d miss rate:       0.4% (    0.4%   +     0.3%  )
==22924== 
==22924== L2 refs:           3,105  (  2,565 rd +     540 wr)
==22924== L2 misses:         3,018  (  2,478 rd +     540 wr)
==22924== L2 miss rate:        0.2% (    0.1%   +     0.3%  )


I don't want to fuel any kind of flamewars (after all, it's only software),
but the miss rates above don't seem too horrible (maybe they are, after all).

What cachegrind doesn't show (yet ?) is if the access pattern kills
opportunities for the memory interface to use burst transfers; after all,
SDRAM also has some form of "seek time". It is possible that something's
hidden there. Also, I didn't spend much time trying to figure out the proper
vg_annotate include path, so some functions appear as unknown in the
detailed cachegrind outputs. Well, that's a start.

	-- Cyrille

-- 
Grumpf.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-09 15:14       ` Neil Booth
@ 2002-08-10 15:54         ` Noel Yap
  0 siblings, 0 replies; 256+ messages in thread
From: Noel Yap @ 2002-08-10 15:54 UTC (permalink / raw)
  To: Neil Booth; +Cc: Mike Stump, gcc

--- Neil Booth <neil@daikokuya.co.uk> wrote:
> Noel Yap wrote:-
> 
> > I don't see this as too big a problem.  Just output a
> > file like:
> > #if COND
> > /* contents of header file */
> > #endif
> > 
> > In fact, doing it this way has the advantage that
> > several builds, not necessarily agreeing on the value
> > of COND, can use the file.
> 
> Hmm, and what about header guards?  Infinite
> recursion?

Unless I'm missing something, header guards by
themselves shouldn't pose a problem.

You're right.  Cyclic dependencies would throw this
whole thing out of whack.  OTOH, I think such practice
needs to be avoided anyhow.

Another case related to recursive includes is where
each level of recursion would have side effects (eg
redefining a macro whose value is used in the next
recursion).  Again, I've heard this usage only once
and even the creator of such a header file said it was
a tremendous hack for programmers with no proper
education in programming (IIRC, they were physicists).

> > I think one needn't preprocess everything perfectly in
> > order to gain significant advantages.  Would you say
> > that what I suggest is better than what we have now?
> 
> Correctness is paramount; if it's not correct it's no
> good.

I apologize if my post was misunderstood.  What I
meant to say was, if it's able to preprocess, then
allow it, otherwise, don't.  IOW, those already
following common practices can take advantage of a new
feature, those that don't have what they have now.

I can certainly understand the ideals of keeping the
tool and all its features pure and working for all
possible uses.  OTOH, doing so may prevent practical
avenues that possibly 99% of users can benefit from.

Noel


__________________________________________________
Do You Yahoo!?
HotJobs - Search Thousands of New Jobs
http://www.hotjobs.com

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-09 15:13   ` Stan Shebs
  2002-08-09 15:18     ` Neil Booth
  2002-08-09 15:19     ` Ziemowit Laski
@ 2002-08-10 16:07     ` Noel Yap
  2002-08-10 16:18       ` Neil Booth
  2 siblings, 1 reply; 256+ messages in thread
From: Noel Yap @ 2002-08-10 16:07 UTC (permalink / raw)
  To: Stan Shebs; +Cc: Mike Stump, gcc

--- Stan Shebs <shebs@apple.com> wrote:
> Noel Yap wrote:
> 
> >Build speeds are most helped by minimizing the number
> >of files opened and closed during the build.
> >
> Is this assertion based on empirical measurement, and if
> so, for what source code and what system?  For instance,
> the longest source file in GCC is about 15K lines, and at
> -O2, only a small percentage of time is spent messing
> with files.  If I use -save-temps on cp/decl.c on one of
> my (Linux) machines, I get a total time of about 38 sec
> from source to asm.  If I just compile decl.i, it's about
> 37 sec, so that's 1 sec for *all* preprocessing,
> including all file opening/closing.

This is a good question.

John Lakos in _Large-Scale C++ Software Design_
has performed a rudimentary case study.  If the
conclusions are true, then your example indicates that
there wasn't much of a difference between the number
of files used when compiling decl.c and decl.i.

The study also indicates that having #include's within
header files is the largest contributor to the problem
(since nested #include's would increase the number of
file accesses combinatorially).

As another indication that the conclusion is true,
Lakos added guards around the #include lines
themselves and found compile times to dramatically
decrease.  For example:
#ifndef header_h
#   include <header.h>
#endif
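
For reference, the external guard works because it tests the same
macro as the header's own internal guard; a sketch with generic names:

/* header.h -- the ordinary internal guard */
#ifndef header_h
#define header_h
/* declarations */
#endif

/* client code -- the redundant external guard means the file need
   not even be opened on re-inclusion */
#ifndef header_h
#   include <header.h>
#endif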

I can go on, but I doubt others on this list would
appreciate a reprint of the chapter.  If you don't
have the book, I suggest at least finding a copy and
reading this chapter.

> Obviously, other programs will have different
> characteristics, and if you have one for which file
> opening/closing dominates compile time, that will be
> very interesting.  But it's bad to try to optimize
> something before you have numerical evidence.

I agree.

Would you agree with Lakos's findings as evidence for
this claim?

Noel

__________________________________________________
Do You Yahoo!?
HotJobs - Search Thousands of New Jobs
http://www.hotjobs.com

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-09 15:18     ` Neil Booth
@ 2002-08-10 16:12       ` Noel Yap
  2002-08-10 18:00         ` Nix
  0 siblings, 1 reply; 256+ messages in thread
From: Noel Yap @ 2002-08-10 16:12 UTC (permalink / raw)
  To: Neil Booth, Stan Shebs; +Cc: Noel Yap, Mike Stump, gcc

--- Neil Booth <neil@daikokuya.co.uk> wrote:
> Stan Shebs wrote:-
> 
> > Is this assertion based on empirical measurement, and
> > if so, for what source code and what system?  For
> > instance, the longest source file in GCC is about 15K
> > lines, and at -O2, only a small percentage of time is
> > spent messing with files.  If I use -save-temps on
> > cp/decl.c on one of my (Linux) machines, I get a total
> > time of about 38 sec from source to asm.  If I just
> > compile decl.i, it's about 37 sec, so that's 1 sec for
> > *all* preprocessing, including all file opening/closing.
> 
> Yes, it's very rare that preprocessing is more than 2% of
> -O2 time; it's often less than 1%.  IMO that says more
> about the efficiency of the rest than of CPP.

I would agree if you're talking about complete builds
spanning only a few C/C++ files.  OTOH, when builds
span many hundreds of these files, build-time (not
just compile-time) starts getting bogged down on
(mostly) reopening and repreprocessing the same files
over and over.

Within our system, builds on Windows are orders of
magnitude faster since we're able to take advantage of
precompiled headers.  AFAIK, a legitimate study was
made of whether to use this feature or not.

Noel

__________________________________________________
Do You Yahoo!?
HotJobs - Search Thousands of New Jobs
http://www.hotjobs.com

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-09 15:19     ` Ziemowit Laski
  2002-08-09 15:25       ` Neil Booth
@ 2002-08-10 16:16       ` Noel Yap
  1 sibling, 0 replies; 256+ messages in thread
From: Noel Yap @ 2002-08-10 16:16 UTC (permalink / raw)
  To: Ziemowit Laski, Stan Shebs; +Cc: Ziemowit Laski, Noel Yap, Mike Stump, gcc

--- Ziemowit Laski <zlaski@apple.com> wrote:
> 
> On Friday, August 9, 2002, at 03:12 , Stan Shebs wrote:
> 
> > Noel Yap wrote:
> >
> >> Build speeds are most helped by minimizing the number
> >> of files opened and closed during the build.
> >>
> > Is this assertion based on empirical measurement, and
> > if so, for what source code and what system?  For
> > instance, the longest source file in GCC is about 15K
> > lines, and at -O2, only a small percentage of time is
> > spent messing with files.  If I use -save-temps on
> > cp/decl.c on one of my (Linux) machines, I get a total
> > time of about 38 sec from source to asm.  If I just
> > compile decl.i, it's about 37 sec, so that's 1 sec for
> > *all* preprocessing, including all file opening/closing.
> 
> Since the preprocessor is integrated, I don't think you
> can separate the timings in this way. :(  A 'gcc3 -E
> cp/decl.c -o decl.i' would probably be more meaningful.

This is a good point.

I think an even better study would be to replicate
John Lakos's study within one's own project.  I'd be
very interested to find out how many projects (other
than the ones I've seen) fit Lakos's "largeness" and
would, therefore, be able to take advantage of
preprocessed headers.

Noel

__________________________________________________
Do You Yahoo!?
HotJobs - Search Thousands of New Jobs
http://www.hotjobs.com

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-10 16:07     ` Noel Yap
@ 2002-08-10 16:18       ` Neil Booth
  2002-08-10 20:27         ` Noel Yap
  0 siblings, 1 reply; 256+ messages in thread
From: Neil Booth @ 2002-08-10 16:18 UTC (permalink / raw)
  To: Noel Yap; +Cc: Stan Shebs, Mike Stump, gcc

Noel Yap wrote:-

> The study also indicates that having #include's within
> header files is the largest contributor to the problem
> (since nested #include's would increase the number of
> file accesses combinatorially).

See below for why this isn't true for most compilers now.

> As another indication that the conclusion is true,
> Lakos added guards around the #include lines
> themselves and found compile times to dramatically
> decrease.  For example:
> #ifndef header_h
> #   include <header.h>
> #endif

This isn't the case with GCC.  I hope you're aware of that.
The first time GCC reads <header.h> it remembers if it had
header guards.  If it's ever asked to #include it again,
it checks if the guard is defined, and doesn't do anything.
The file's contents are also not cached if it has header
guards, on the assumption that the contents are unlikely to
be of interest in the future.

In other words, this kind of #include protection is ugly and
pointless (and possibly error-prone, though that would tend
to be immediately obvious).  Most compilers now implement
this optimization, but 5 or 6 years ago this wasn't the case.
I think GCC was one of the first.

Neil.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-10 14:26           ` Cyrille Chepelov
@ 2002-08-10 17:33             ` Daniel Berlin
  2002-08-10 18:21               ` Linus Torvalds
  2002-08-10 18:28               ` Cyrille Chepelov
  2002-08-11  1:03             ` Florian Weimer
  1 sibling, 2 replies; 256+ messages in thread
From: Daniel Berlin @ 2002-08-10 17:33 UTC (permalink / raw)
  To: Cyrille Chepelov; +Cc: gcc

On Sat, 10 Aug 2002, Cyrille Chepelov wrote:

> On Sat, Aug 10, 2002, at 08:32:26AM -0500, Robert Lipe wrote:
> 
> > Linus Torvalds wrote:
> > 
> > > One fundamental fact on modern hardware is that data cache locality is
> > > good, and not being in the cache sucks.  This is not likely to change. 
> > 
> > This is a fact.
> 
> > Measuring this sort of thing is possible.  (Optimizing without
> > measuring is seldom a good idea.)  In the absence of processor pods
> > and bus analyzers, has anyone thrown gcc at a tool like 'valgrind' or
> > cachegrind?
> 
> I just did (I was forming the idea while reading the thread, but you beat me
> in suggesting it before I implemented it).
> 
> I have tried on a grand total of three files, two from today's mainline CVS
> (updated from anonymous about four hours ago), and one from Linux 2.5.30; as
> my machine is not exactly the dual-multi-gigahertz, "HT"-interconnected
> (HyperTransport ?) with gobs of memory bandwith (and what else? 64 bits?) 
> monsters Linus has been bragging about recently, please bear with lack of
> patience to run CG over the whole aforementioned packages...

The numbers I get on a p4 with cachegrind are *much* worse in all cases.

The miss rates are all >2%, which is a far cry from 0.1% and 0.0%.

Are you sure you have valgrind configured right for your cache?

I'm going to do this the *real* way, using the performance monitoring 
counters on my p4, and get *real* numbers.
--Dan

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-09 12:17 Faster compilation speed Mike Stump
                   ` (5 preceding siblings ...)
  2002-08-09 16:01 ` Faster compilation speed Richard Henderson
@ 2002-08-10 17:48 ` Aaron Lehmann
  2002-08-12 10:36   ` Dale Johannesen
  6 siblings, 1 reply; 256+ messages in thread
From: Aaron Lehmann @ 2002-08-10 17:48 UTC (permalink / raw)
  To: Mike Stump; +Cc: gcc

On Fri, Aug 09, 2002 at 12:17:32PM -0700, Mike Stump wrote:
> I'd like to introduce lots of various changes to improve compiler 
> speed.

Just adding my two cents to the discussion - I saw many ideas
presented in this thread that look promising, but one thing that I
didn't see mentioned was gcc's extensive sanity checking. There are
many tests which will produce an internal compiler error when merited.
This is a great tool for debugging, but most of these errors should be
impossible to reach. Does anyone know how much overhead this sanity
checking in general causes, and whether there are any sanity checks
that are unusually expensive and should be considered for removal?
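
For reference, checks of this kind are usually wrapped in a macro
that a configure-time option can compile away entirely; a minimal
sketch of the pattern (SANITY_CHECK is made up -- GCC's own checking
macros differ in detail):

	#include <stdlib.h>

	#ifdef ENABLE_CHECKING
	#define SANITY_CHECK(expr) ((expr) ? (void) 0 : abort ())
	#else
	#define SANITY_CHECK(expr) ((void) 0)	/* costs nothing */
	#endif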

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-10 16:12       ` Noel Yap
@ 2002-08-10 18:00         ` Nix
  2002-08-10 20:36           ` Noel Yap
  2002-08-12 15:08           ` Mike Stump
  0 siblings, 2 replies; 256+ messages in thread
From: Nix @ 2002-08-10 18:00 UTC (permalink / raw)
  To: Noel Yap; +Cc: Neil Booth, gcc

[Cc: list trimmed]
On Sat, 10 Aug 2002, Noel Yap spake:
> I would agree if you're talking about complete builds
> spanning only a few C/C++ files.  OTOH, when builds
> span many hundreds of these files, build-time (not
> just compile-time) starts getting bogged down on
> (mostly) reopening and repreprocessing the same files
> over and over.
> 
> Within our system, builds on Windows are magnitudes
> faster since we're able to take advantage of
> precompiled headers.

Are you sure that this isn't because GCC is having to parse the headers
over and over again, while the precompiled system can avoid that
overhead?

Especially for C++ header files (which tend to be large, complex,
interdependent, and include a lot of code), the parsing and compilation
time *vastly* dominates the preprocessing time.

Example, with GCC-3.1, with a `hello world' iostreams-using program...

The code:

#include <iostream>

int main (void)
 {
  std::cout << "Hello world";
  return 0;
 }

Time spent preprocessing (distorted by the slowness of cpp's output
routines):

nix@loki 62 /tmp% time c++ -E -ftime-report hello.C >/dev/null

real    0m1.424s
user    0m0.710s
sys     0m0.100s

Time spent preprocessing and parsing (roughly; cpp's output routines are
still slow; on the trunk much less time will be spent preprocessing
because the integrated preprocessor doesn't have to do any output at all
there, instead feeding a token stream to the rest of the compiler):

nix@loki 60 /tmp% c++ -ftime-report -fsyntax-only hello.C 

Execution times (seconds)
 garbage collection    :   1.16 (12%) usr   0.08 ( 6%) sys   2.19 (13%) wall
 preprocessing         :   1.04 (11%) usr   0.29 (20%) sys   2.10 (12%) wall
 lexical analysis      :   0.99 (10%) usr   0.28 (20%) sys   1.87 (11%) wall
 parser                :   6.12 (65%) usr   0.75 (53%) sys  10.85 (63%) wall
 varconst              :   0.08 ( 1%) usr   0.00 ( 0%) sys   0.10 ( 1%) wall
 TOTAL                 :   9.44             1.42            17.21

(oddly, preprocessing took *longer* than it did using -E, which I'd not
 expected; but, still parsing vastly dominates preprocessing, and this isn't
 going near e.g. the STL headers)

Complete run, with optimization:

nix@loki 66 /tmp% c++ -O2 -ftime-report -o hello hello.C

Execution times (seconds)
 garbage collection    :   1.10 (11%) usr   0.11 ( 9%) sys   1.74 (11%) wall
 cfg cleanup           :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall
 life analysis         :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall
 preprocessing         :   1.12 (11%) usr   0.22 (18%) sys   2.04 (13%) wall
 lexical analysis      :   0.98 (10%) usr   0.22 (18%) sys   1.93 (12%) wall
 parser                :   6.46 (65%) usr   0.63 (53%) sys   9.98 (62%) wall
 expand                :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall
 varconst              :   0.08 ( 1%) usr   0.00 ( 0%) sys   0.12 ( 1%) wall
 CSE                   :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall
 CSE 2                 :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall
 regmove               :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall
 global alloc          :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall
 flow 2                :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall
 rename registers      :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall
 scheduling 2          :   0.00 ( 0%) usr   0.01 ( 1%) sys   0.02 ( 0%) wall
 final                 :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall
 TOTAL                 :   9.96             1.20            16.16

Now obviously with a less toy example the time consumed optimizing would
rise; but that doesn't affect my point, that the lion's share of time
spent in C++ header files is parsing time, and that speeding up the
preprocessor will have limited effect now (thanks to Zack and Neil
speeding it up so much already :) ).

-- 
`There's something satisfying about killing JWZ over and over again.'
                                        -- 1i, personal communication

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-10 17:33             ` Daniel Berlin
@ 2002-08-10 18:21               ` Linus Torvalds
  2002-08-10 18:38                 ` Daniel Berlin
  2002-08-10 18:39                 ` Cyrille Chepelov
  2002-08-10 18:28               ` Cyrille Chepelov
  1 sibling, 2 replies; 256+ messages in thread
From: Linus Torvalds @ 2002-08-10 18:21 UTC (permalink / raw)
  To: dberlin, gcc

In article < Pine.LNX.4.44.0208102031550.8641-100000@dberlin.org > you write:
>
>The numbers I get on a p4 with cachegrind are *much* worse in all cases.
>
>The miss rates are all >2%, which is a far cry from 0.1% and 0.0%.

One thing to look out for when looking at cache miss numbers is what
they actually _mean_.

That is particularly true when it comes to the percentages. Are the
percentages relative to #instructions, or #memops, or #line fetches (the
latter ends up being interesting especially for I$).

The "percentage per instruction" number is to some degree a nonsensical
number (since many instructions do not do any D$ accesses at all), but
it has the advantage that it makes the I$ and D$ misses comparable, and
it also allows you to make a quick estimation of how much time was
actually spent on cache misses. 

The _best_ number to get (and in the end, the only one that really
matters) is "cycles spent waiting on cache" and "cycles spent doing
useful work", but I don't think valgrind gives you that.  The P4
counters should do it, though. 

If you want to use the HW counters under Linux, get "oprofile" from
sourceforge.net. (I don't think it does P4 events yet, though)

			Linus

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-10 17:33             ` Daniel Berlin
  2002-08-10 18:21               ` Linus Torvalds
@ 2002-08-10 18:28               ` Cyrille Chepelov
  2002-08-10 18:30                 ` John Levon
  1 sibling, 1 reply; 256+ messages in thread
From: Cyrille Chepelov @ 2002-08-10 18:28 UTC (permalink / raw)
  To: gcc

On Sat, Aug 10, 2002, at 08:33:53PM -0400, Daniel Berlin wrote:

> On Sat, 10 Aug 2002, Cyrille Chepelov wrote:
> > I have tried on a grand total of three files, two from today's mainline CVS
> > (updated from anonymous about four hours ago), and one from Linux 2.5.30; as
> > my machine is not exactly the dual-multi-gigahertz, "HT"-interconnected
> > (HyperTransport ?) with gobs of memory bandwith (and what else? 64 bits?) 

(Some brave soul pointed out to me that HT is more probably HyperThreading.
I stand corrected (though being LT surely entitles one to cooler toys than
mere mortals).)

> The numbers I get on a p4 with cachegrind are *much* worse in all cases.
> 
> The miss rates are all >2%, which is a far cry from 0.1% and 0.0%.

a-ha!  This is interesting...  Did you run on the same sample files as I did,
or others?  Can you reproduce my numbers if you set --I1=65536,2,64
--D1=65536,2,64 --L2=65536,8,64 ?

> Are you sure you have valgrind configured right for your cache?

Sure, no. The cache spec numbers did look about rig... D'oh! Looks like 
Cachegrind trusts a little too faithfully what this old (A0-stepping) Duron 
says. CG believes L2 is 1 KB, whereas in fact it is 64KB.

I've just re-ran the java/parser.c test with forcing --L2=65536,8,64, and
uploaded the results (same place)

What are the first lines of output from vg_annotate on your system?
It certainly sounds unbelievable that a Duron's cache design beats a P4's.

(there is something curious about the L2 lines from the initial output (the
last three ones). Saying that 355266 misses for 365721 refs means a 0.0%
miss rate certainly sounds strange, I've got to ask Julian about the logic
there. Looks to me that L2 failed 97% of its mission).

> I'm going to do this the *real* way, using the performance monitoring 
> counters on my p4, and get *real* numbers.

It would be very interesting to see how far off CG falls... CG does make the
implicit assumption that the process runs uninterrupted (I tried welding
cachegrind into UML, but that didn't bring me far). The real CPU will
certainly give you a more lively picture... (the performance monitoring
counters are not per-process on Linux, are they?)

	-- Cyrille

-- 
Grumpf.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-10 18:28               ` Cyrille Chepelov
@ 2002-08-10 18:30                 ` John Levon
  0 siblings, 0 replies; 256+ messages in thread
From: John Levon @ 2002-08-10 18:30 UTC (permalink / raw)
  To: gcc

On Sun, Aug 11, 2002 at 03:28:51AM +0200, Cyrille Chepelov wrote:

> It would be very interesting to see how far off CG falls... CG does make the
> implicit assumption that the process runs uninterrupted (I tried welding
> cachegrind into UML, but that didn't bring me far). The real CPU will
> certainly give you a more lively picture.... (the performance monitoring
> counters are not per-process on Linux, are they ?)

perfctr patch supports virtual counters (google first hit). I don't
remember if it has P4 support yet.

regards
john

-- 
"It is unbecoming for young men to utter maxims."
	- Aristotle

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-10 18:21               ` Linus Torvalds
@ 2002-08-10 18:38                 ` Daniel Berlin
  2002-08-10 18:39                 ` Cyrille Chepelov
  1 sibling, 0 replies; 256+ messages in thread
From: Daniel Berlin @ 2002-08-10 18:38 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: gcc

On Sat, 10 Aug 2002, Linus Torvalds wrote:

> In article < Pine.LNX.4.44.0208102031550.8641-100000@dberlin.org > you write:
> >
> >The numbers I get on a p4 with cachegrind are *much* worse in all cases.
> >
> >The miss rates are all >2%, which is a far cry from 0.1% and 0.0%.
> 
> One thing to look out for when looking at cache miss numbers is what
> they actually _mean_.

Yeah.

> The _best_ number to get (and in the end, the only one that really
> matters) is "cycles spent waiting on cache" and "cycles spent doing
> useful work", but I don't think valgrind gives you that.  The P4
> counters should do it, though. 

Yuppers.
> 
> If you wan tto use the HW counters under Linux, get "oprofile" from
> sourceforge.net. (I don't think it does P4 events yet, though)

brink and abyss do p4 events, which is what i'm using.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-10 18:21               ` Linus Torvalds
  2002-08-10 18:38                 ` Daniel Berlin
@ 2002-08-10 18:39                 ` Cyrille Chepelov
  1 sibling, 0 replies; 256+ messages in thread
From: Cyrille Chepelov @ 2002-08-10 18:39 UTC (permalink / raw)
  To: gcc

On Sat, Aug 10, 2002, at 06:20:51PM -0700, Linus Torvalds wrote:

> >The numbers I get on a p4 with cachegrind are *much* worse in all cases.
> >
> >The miss rates are all >2%, which is a far cry from 0.1% and 0.0%.
> 
> One thing to look out for when looking at cache miss numbers is what
> they actually _mean_.
> 
> That is particularly true when it comes to the percentages. Are the
> percentages relative to #instructions, or #memops, or #line fetches (the
> latter ends up being interesting especially for I$).

These are percentages relative to the number of accesses. L2 percentages are
also relative to the original number of accesses, not to the number of L1
misses.

> The _best_ number to get (and in the end, the only one that really
> matters) is "cycles spent waiting on cache" and "cycles spent doing
> useful work", but I don't think valgrind gives you that.  The P4
> counters should do it, though. 

Indeed, cachegrind won't tell you when there was a miss but the hardware was
smart enough to do something useful while it waits for the cache.
Despite this limitation, shouldn't
	((number_of_L1_misses * N) + (number_of_L2_misses * M)) * cycle_len
[where N is roughly 10 and M roughly 200, or updated figures] be a ballpark
figure of the time lost waiting for RAM to catch up?
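
As a rough illustration, plugging the emit-rtl.c D-cache numbers from
earlier in this thread into that formula (with N=10, M=200, ballpark
penalties only):

	7,920,482 D1 misses  * 10  ~=    79 million cycles
	7,134,597 L2d misses * 200 ~= 1,427 million cycles
	                              ---------------------
	                              ~1.5 billion cycles stalled

against the roughly 2.3 billion instructions the same compile executed.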

> If you wan tto use the HW counters under Linux, get "oprofile" from
> sourceforge.net. (I don't think it does P4 events yet, though)
The site says it doesn't yet.

	-- Cyrille

-- 
Grumpf.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-09 18:57   ` Linus Torvalds
  2002-08-09 19:12     ` Phil Edwards
  2002-08-09 19:34     ` Kevin Atkinson
@ 2002-08-10 19:20     ` Noel Yap
  2 siblings, 0 replies; 256+ messages in thread
From: Noel Yap @ 2002-08-10 19:20 UTC (permalink / raw)
  To: Linus Torvalds, gcc

--- Linus Torvalds <torvalds@transmeta.com> wrote:
> In article
> < 20020809200413.46719.qmail@web21403.mail.yahoo.com >
> you write:
> >Build speeds are most helped by minimizing the number
> >of files opened and closed during the build.
> 
> I _seriously_ doubt that.

Yes, my statement is exaggerated, although it is not
completely baseless.

The study conducted by John Lakos and some testing
that I have conducted point to the fact that
minimizing file opens does speed up builds
significantly.

Of course, that's not to say that other courses of
action shouldn't be pursued.

> Opening (and even reading) a cached file is not an
> expensive operation, not compared to the kinds of
> run-times gcc has.  We're talking a few microseconds per
> file open at a low level.  Even parsing it should not be
> that expensive, especially if the preprocessor is any
> good (and from all I've seen, these days it _is_ good).

Hmm, perhaps it's time I conducted some tests again. 
I'm assuming you're talking about caching at the OS
level?

> I strongly suspect that what makes gcc slow is that it
> has absolutely horrible cache behaviour, a big VM
> footprint, and chases pointers in that badly cached area
> all of the time.

Maybe you're not talking about caching at the OS
level.  Caching at the compiler level will certainly
help with header files that are included multiple
times.  OTOH, caching at the OS level and/or
preprocessing header files will help with that /and/
header files that are included across compiles.

> And that, in turn, is probably impossible to fix as long
> as gcc uses garbage collection for most of its internal
> memory management.  There just aren't all that many worse
> ways to f*ck up your cache behaviour than by using lots
> of allocations and lazy GC to manage your memory.
> 
> The problem with bad cache behaviour is that you don't
> get nice spikes in specific places that you can try to
> optimize - the cost ends up being spread all over the
> places that touch the data structures.
> 
> The problem with trying to avoid GC is that if you do
> that you have to be careful about your reference counts,
> and I doubt the gcc people want to be that careful,
> especially considering that the code-base right now is
> not likely to be very easy to convert.
> 
> (Plus the fact that GC proponents absolutely refuse to
> see the error of their ways, and will flame me royally
> for even _daring_ to say that GC sucks donkey brains
> through a straw from a performance standpoint.  In order
> to work with refcounting, you need to have the mentality
> that every single data structure with a non-local
> lifetime needs to have the count as its major member)

I'll leave it to the experts to hash this area out.

Noel


__________________________________________________
Do You Yahoo!?
HotJobs - Search Thousands of New Jobs
http://www.hotjobs.com

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-10 16:18       ` Neil Booth
@ 2002-08-10 20:27         ` Noel Yap
  2002-08-11  0:11           ` Neil Booth
  0 siblings, 1 reply; 256+ messages in thread
From: Noel Yap @ 2002-08-10 20:27 UTC (permalink / raw)
  To: Neil Booth; +Cc: Stan Shebs, Mike Stump, gcc

--- Neil Booth <neil@daikokuya.co.uk> wrote:
> Noel Yap wrote:-
> 
> > The study also indicates that having #include's within
> > header files is the largest contributor to the problem
> > (since nested #include's would increase the number of
> > file accesses combinatorially).
> 
> See below for why this isn't true for most compilers now.
> 
> > As another indication that the conclusion is true,
> > Lakos added guards around the #include lines
> > themselves and found compile times to dramatically
> > decrease.  For example:
> > #ifndef header_h
> > #   include <header.h>
> > #endif
> 
> This isn't the case with GCC.  I hope you're aware of
> that.  The first time GCC reads <header.h> it remembers
> if it had header guards.  If it's ever asked to #include
> it again, it checks if the guard is defined, and doesn't
> do anything.  The file's contents are also not cached if
> it has header guards, on the assumption that the contents
> are unlikely to be of interest in the future.
> 
> In other words, this kind of #include protection is ugly
> and pointless (and possibly error-prone, though that
> would tend to be immediately obvious).  Most compilers
> now implement this optimization, but 5 or 6 years ago
> this wasn't the case.  I think GCC was one of the first.

I stand corrected.  (I'm assuming gcc doesn't do this
in cases where the header guard might have side
effects or if there's a matching #else for the
#ifndef).

Do you think precompiled headers would help build
speed across several compiles, since they would be
another way to eliminate repeated file opens?

Thanks,
Noel

__________________________________________________
Do You Yahoo!?
HotJobs - Search Thousands of New Jobs
http://www.hotjobs.com

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-10 18:00         ` Nix
@ 2002-08-10 20:36           ` Noel Yap
  2002-08-11  4:30             ` Nix
  2002-08-12 15:08           ` Mike Stump
  1 sibling, 1 reply; 256+ messages in thread
From: Noel Yap @ 2002-08-10 20:36 UTC (permalink / raw)
  To: Nix; +Cc: Neil Booth, gcc

--- Nix <nix@esperi.demon.co.uk> wrote:
> [Cc: list trimmed]
> On Sat, 10 Aug 2002, Noel Yap spake:
> > I would agree if you're talking about complete builds
> > spanning only a few C/C++ files.  OTOH, when builds
> > span many hundreds of these files, build-time (not
> > just compile-time) starts getting bogged down on
> > (mostly) reopening and repreprocessing the same files
> > over and over.
> > 
> > Within our system, builds on Windows are orders of
> > magnitude faster since we're able to take advantage of
> > precompiled headers.
> 
> Are you sure that this isn't because GCC is having to
> parse the headers over and over again, while the
> precompiled system can avoid that overhead?

No, I'm not sure.  In any case, whether it's due to
elimination of reparsing or elimination of reopening,
would you agree that precompiled headers should speed
up builds?

> Especially for C++ header files (which tend to be large,
> complex, interdependent, and include a lot of code), the
> parsing and compilation time *vastly* dominates the
> preprocessing time.

What about for us lowly C programmers?

> Example, with GCC-3.1, with a `hello world' iostreams-using program...
> 
> The code:
> 
> #include <iostream>
> 
> int main (void)
>  {
>   std::cout << "Hello world";
>   return 0;
>  }
> 
> Time spent preprocessing (distorted by the slowness of cpp's output
> routines):
> 
> nix@loki 62 /tmp% time c++ -E -ftime-report hello.C >/dev/null
> 
> real    0m1.424s
> user    0m0.710s
> sys     0m0.100s
> 
> Time spent preprocessing and parsing (roughly; cpp's output routines are
> still slow; on the trunk much less time will be spent preprocessing
> because the integrated preprocessor doesn't have to do any output at all
> there, instead feeding a token stream to the rest of the compiler):
> 
> nix@loki 60 /tmp% c++ -ftime-report -fsyntax-only hello.C 
> 
> Execution times (seconds)
>  garbage collection    :   1.16 (12%) usr   0.08 ( 6%) sys   2.19 (13%) wall
>  preprocessing         :   1.04 (11%) usr   0.29 (20%) sys   2.10 (12%) wall
>  lexical analysis      :   0.99 (10%) usr   0.28 (20%) sys   1.87 (11%) wall
>  parser                :   6.12 (65%) usr   0.75 (53%) sys  10.85 (63%) wall
>  varconst              :   0.08 ( 1%) usr   0.00 ( 0%) sys   0.10 ( 1%) wall
>  TOTAL                 :   9.44             1.42            17.21
> 
> (oddly, preprocessing took *longer* than it did using -E, which I'd not
>  expected; but, still parsing vastly dominates preprocessing, and this isn't
>  going near e.g. the STL headers)

OK.  Now let's say that the preprocessing output can be
reused across several compiles.  Can you see how an entire
_build_ (comprising many compiles) could be sped up?

> Complete run, with optimization:
> 
> nix@loki 66 /tmp% c++ -O2 -ftime-report -o hello hello.C
> 
> Execution times (seconds)
>  garbage collection    :   1.10 (11%) usr   0.11 ( 9%) sys   1.74 (11%) wall
>  cfg cleanup           :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall
>  life analysis         :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall
>  preprocessing         :   1.12 (11%) usr   0.22 (18%) sys   2.04 (13%) wall
>  lexical analysis      :   0.98 (10%) usr   0.22 (18%) sys   1.93 (12%) wall
>  parser                :   6.46 (65%) usr   0.63 (53%) sys   9.98 (62%) wall
>  expand                :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall
>  varconst              :   0.08 ( 1%) usr   0.00 ( 0%) sys   0.12 ( 1%) wall
>  CSE                   :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall
>  CSE 2                 :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall
>  regmove               :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall
>  global alloc          :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall
>  flow 2                :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall
>  rename registers      :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall
>  scheduling 2          :   0.00 ( 0%) usr   0.01 ( 1%) sys   0.02 ( 0%) wall
>  final                 :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall
>  TOTAL                 :   9.96             1.20            16.16
> 
> Now obviously with a less toy example the time consumed optimizing would
> rise; but that doesn't affect my point, that the lion's share of time
> spent in C++ header files is parsing time, and that speeding up the
> preprocessor will have limited effect now (thanks to Zack and Neil
> speeding it up so much already :) ).

What kind of effect does it have for C?  Do you think
saving preprocessor output (of header files) can speed
up a build consisting of many, many compiles?

Thanks,
Noel

__________________________________________________
Do You Yahoo!?
HotJobs - Search Thousands of New Jobs
http://www.hotjobs.com

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-10 20:27         ` Noel Yap
@ 2002-08-11  0:11           ` Neil Booth
  2002-08-12 12:04             ` Devang Patel
  0 siblings, 1 reply; 256+ messages in thread
From: Neil Booth @ 2002-08-11  0:11 UTC (permalink / raw)
  To: Noel Yap; +Cc: Stan Shebs, Mike Stump, gcc

Noel Yap wrote:-

> I stand corrected.  (I'm assuming gcc doesn't do this
> in cases where the header guard might have side
> effects or if there's a matching #else for the
> #ifndef).

Correct.  Header guards with side effects hardly exist
I think.  We recognize #ifndef and #if !defined with
optional parentheses.  Comments and whitespace do not
affect the optimization.  Headers with #else, #elif
at the top level, and with anything outside the guards,
or with a header guard that comes from a macro expansion
are not optimized this way.
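
To illustrate (made-up macro names; only the shape of the guard
matters):

/* recognized as a guard */
#ifndef FOO_H
#define FOO_H
/* contents */
#endif

/* also recognized */
#if !defined (FOO_H)
#define FOO_H
/* contents */
#endif

/* not optimized: a token outside the guard defeats the detection */
#ifndef BAR_H
#define BAR_H
/* contents */
#endif
extern int bar;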

> Do you think precompiled headers would help build
> speed across several compiles, since they would be
> another way to eliminate repeated file opens?

I don't think repeated file opens are high on the list
of time eaters, particularly because of the optimization
I mentioned.  Tokenization and parsing probably take
much longer.

Neil.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-10 14:26           ` Cyrille Chepelov
  2002-08-10 17:33             ` Daniel Berlin
@ 2002-08-11  1:03             ` Florian Weimer
  1 sibling, 0 replies; 256+ messages in thread
From: Florian Weimer @ 2002-08-11  1:03 UTC (permalink / raw)
  To: Cyrille Chepelov; +Cc: gcc

Cyrille Chepelov <cyrille@chepelov.org> writes:

> What cachegrind doesn't show (yet ?) is if the access pattern kills
> opportunities for the memory interface to use burst transfers;

By the way:

IIRC, the author's web page carries a caveat that the cache
simulation might be incorrect.  Maybe someone should check this before
jumping to conclusions (I'm not familiar with processor cache
architectures, which is why I can't do it myself, sorry).

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-10 20:36           ` Noel Yap
@ 2002-08-11  4:30             ` Nix
  0 siblings, 0 replies; 256+ messages in thread
From: Nix @ 2002-08-11  4:30 UTC (permalink / raw)
  To: Noel Yap; +Cc: Neil Booth, gcc

[rewrapped my quoted text]
On Sat, 10 Aug 2002, Noel Yap stated:
> --- Nix <nix@esperi.demon.co.uk> wrote:
>> Are you sure that this isn't because GCC is having to parse the
>> headers over and over again, while the precompiled system can avoid
>> that overhead?
> 
> No, I'm not sure.  In any case, whether it's due to
> elimination of reparsing or elimination of reopening,
> would you agree that precompiled headers should speed
> up builds?

Yes, but mainly (IMHO) because the `precompilation' process includes
some parsing work. The preprocessing job (compilation phases 1--4)
should be quite fast.

So speeding up *parsing* is the point here; getting rid of bison should
help fix that :)

(Maybe I'm being too pedantic here.)

>> Especially for C++ header files (which tend to be large, complex,
>> interdependent, and include a lot of code), the parsing and
>> compilation time *vastly* dominates the preprocessing time.
> 
> What about for us lowly C programmers?

(oops, sorry, I thought you were using C++, because C++ users really
*notice* time spent in headers.)

The disparity there isn't anywhere near so extreme, but it's still there
(just).

I know that even with large bodies of C code I've never been able to
spot preprocessing time; even the old cccp was damned-near instantaneous
(well, except on very memory-constrained boxes where even ls(1) was a
hassle).

[snip]
>> Now obviously with a less toy example the time consumed optimizing
>> would rise; but that doesn't affect my point, that the lion's share
>> of time spent in C++ header files is parsing time, and that speeding
>> up the preprocessor will have limited effect now (thanks to Zack and
>> Neil speeding it up so much already :) ).
> 
> What kind of effect does it have for C?  Do you think

Hm...

... from my quick check (so primitive that I'm not even going to post it
here) preprocessing and parsing seem to consume roughly equal amounts of
time, and both are far exceeded by the amount of time taken to compile
the code itself.

So there's not much need for preprocessor optimization in C as far as I
can tell.

> saving preprocessor output (of header files) can speed
> up a build consisting of many, many compiles?

Preprocessor *output*? In its current state, the output phase is the
slowest part of the preprocessor, such that feeding token streams
straight into the compiler (as 3.3-to-be will) is faster than saving it
out to disk would be :)

And for C code in particular I imagine that the larger size of the
precompiled header lumps would cause extra disk I/O time that would
exceed the time taken to parse the headers in the first place... but 
this is a guess: some of the people who've actually been working on
precompiled headers can probably answer this better :)

-- 
`There's something satisfying about killing JWZ over and over again.'
                                        -- 1i, personal communication

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-10 17:48 ` Aaron Lehmann
@ 2002-08-12 10:36   ` Dale Johannesen
  0 siblings, 0 replies; 256+ messages in thread
From: Dale Johannesen @ 2002-08-12 10:36 UTC (permalink / raw)
  To: Aaron Lehmann; +Cc: Dale Johannesen, Mike Stump, gcc

On Saturday, August 10, 2002, at 05:48 PM, Aaron Lehmann wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-11  0:11           ` Neil Booth
@ 2002-08-12 12:04             ` Devang Patel
  0 siblings, 0 replies; 256+ messages in thread
From: Devang Patel @ 2002-08-12 12:04 UTC (permalink / raw)
  To: Noel Yap; +Cc: Neil Booth, Stan Shebs, Mike Stump, gcc

On Sunday, August 11, 2002, at 12:08  AM, Neil Booth wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-09 14:29 ` Neil Booth
  2002-08-09 15:02   ` Nathan Sidwell
@ 2002-08-12 12:11   ` Mike Stump
  2002-08-12 12:41     ` David Edelsohn
  2002-08-12 19:17     ` Mike Stump
  1 sibling, 2 replies; 256+ messages in thread
From: Mike Stump @ 2002-08-12 12:11 UTC (permalink / raw)
  To: Neil Booth; +Cc: gcc

On Friday, August 9, 2002, at 02:27 PM, Neil Booth wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-12 12:11   ` Mike Stump
@ 2002-08-12 12:41     ` David Edelsohn
  2002-08-12 12:47       ` Matt Austern
  2002-08-12 19:17     ` Mike Stump
  1 sibling, 1 reply; 256+ messages in thread
From: David Edelsohn @ 2002-08-12 12:41 UTC (permalink / raw)
  To: Mike Stump; +Cc: gcc

>>>>> Mike Stump writes:

Mike> Instead?  Well, I cannot promise instead, but I think it is reasonable 
Mike> to look at it in addition to all the other stuff.

	If Apple wants to tackle one or more of the fundamental GCC design
problems affecting compiler performance which have been mentioned during
this discussion, I think that Apple will have a lot of support and help
from GCC developers.  This means doing the analysis of the problem,
experimenting with possible approaches, designing a solution, and
implementing that solution with the entire GCC development community.

	Fiddling around the edges, disabling functionality to save
compilation time, is not likely to be effective for Apple or for the GCC
community.  The big gains are to be found in revising the design and
implementation of GCC's underlying infrastructure, not in lots of little
tweaks.

David

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-12 12:41     ` David Edelsohn
@ 2002-08-12 12:47       ` Matt Austern
  2002-08-12 12:56         ` David S. Miller
  0 siblings, 1 reply; 256+ messages in thread
From: Matt Austern @ 2002-08-12 12:47 UTC (permalink / raw)
  To: David Edelsohn; +Cc: Mike Stump, gcc

On Monday, August 12, 2002, at 12:40 PM, David Edelsohn wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-12 12:47       ` Matt Austern
@ 2002-08-12 12:56         ` David S. Miller
  2002-08-12 13:56           ` Matt Austern
  2002-08-12 14:28           ` Stan Shebs
  0 siblings, 2 replies; 256+ messages in thread
From: David S. Miller @ 2002-08-12 12:56 UTC (permalink / raw)
  To: austern; +Cc: dje, mrs, gcc

   From: Matt Austern <austern@apple.com>
   Date: Mon, 12 Aug 2002 12:47:30 -0700
   
   And yes, we're aware that many gains are possible only
   if we rewrite the parser or redesign the tree structure.  The
   only reason we haven't started on rewriting the parser is
   that someone else is already doing it.

So work on an attempt at RTL refcounting, the patch below is a place
to start.

Next you have to:

1) walk through the whole compiler and add all the
   proper {GET,PUT}_RTX calls.

2) find a solution for circular RTL

   I would suggest, as a first pass (i.e. to get some performance
   numbers), special-casing things like INSN_LISTs and just not
   refcounting the references to INSNs they generate.  Likewise
   for INSN dependency lists generated by the scheduler et al.

3) bring it at least to the point where you can get a successful
   build of some non-trivial source file.
   Perhaps gcc/reload.i.  Even if it requires some gross hacks
   to get it to pass through, post GC vs. refcounting performance
   numbers.

4) Almost certainly, in trying to refcount things correctly, you will
   spot real bugs in the compiler.  Please keep track of these so they
   can be fixed independently of whether the rtx refcounting is ever
   used or not.

5) If you are still bored at this point, add the machinery to use the
   RTX walking of the current garbage collector to verify the
   reference counts.  This will basically be required in order to
   build a final implementation and check its correctness sufficiently.

   It would be enabled by default, so that if any reference counts
   go wrong they will be spotted immediately.  This is part of the
   sociological aspect of these changes, namely getting people to
   think about proper resource tracking when working with RTL
   objects.  If the compiler explodes when they get it wrong, they
   will learn eventually :-)

Because if someone else doesn't do this, I will end up doing
so :-)

--- ./rtl.h.~1~	Sun Aug 11 19:04:35 2002
+++ ./rtl.h	Sun Aug 11 20:42:02 2002
@@ -130,6 +130,9 @@ struct rtx_def
   /* The kind of value the expression has.  */
   ENUM_BITFIELD(machine_mode) mode : 8;
 
+  /* Reference count.  */
+  unsigned int __count : 24;
+
   /* 1 in a MEM if we should keep the alias set for this mem unchanged
      when we access a component.
      1 in a CALL_INSN if it is a sibling call.
@@ -184,7 +187,7 @@ struct rtx_def
      1 in a REG means this reg refers to the return value
      of the current function.
      1 in a SYMBOL_REF if the symbol is weak.  */
-  unsigned integrated : 1;
+  unsigned int integrated : 1;
   /* 1 in an INSN or a SET if this rtx is related to the call frame,
      either changing how we compute the frame address or saving and
      restoring registers in the prologue and epilogue.
@@ -193,7 +196,7 @@ struct rtx_def
      1 in a REG if the register is a pointer.
      1 in a SYMBOL_REF if it addresses something in the per-function
      constant string pool.  */
-  unsigned frame_related : 1;
+  unsigned int frame_related : 1;
 
   /* The first element of the operands of this rtx.
      The number of operands and their types are controlled
@@ -211,12 +214,25 @@ struct rtx_def
 #define GET_MODE(RTX)	    ((enum machine_mode) (RTX)->mode)
 #define PUT_MODE(RTX, MODE) ((RTX)->mode = (ENUM_BITFIELD(machine_mode)) (MODE))
 
+/* Define macros to get/put references to RTL objects.  */
+
+#define GET_RTX(RTX)		(((RTX)->__count)++)
+#define PUT_RTX(RTX) \
+do \
+  { \
+    if (--((RTX)->__count) == 0) \
+      __put_rtx(RTX); \
+  } \
+while (0)
+
+
 /* RTL vector.  These appear inside RTX's when there is a need
    for a variable number of things.  The principle use is inside
    PARALLEL expressions.  */
 
 struct rtvec_def GTY(()) {
   int num_elem;		/* number of elements */
+  int __count;		/* reference count */
   rtx GTY ((length ("%h.num_elem"))) elem[1];
 };
 
@@ -225,6 +241,15 @@ struct rtvec_def GTY(()) {
 #define GET_NUM_ELEM(RTVEC)		((RTVEC)->num_elem)
 #define PUT_NUM_ELEM(RTVEC, NUM)	((RTVEC)->num_elem = (NUM))
 
+#define GET_RTVEC(RTVEC)	(((RTVEC)->__count)++)
+#define PUT_RTVEC(RTVEC) \
+do \
+  { \
+    if (--((RTVEC)->__count) == 0) \
+      __put_rtvec(RTVEC); \
+  } \
+while (0)
+
 /* Predicate yielding nonzero iff X is an rtl for a register.  */
 #define REG_P(X) (GET_CODE (X) == REG)
 
@@ -1347,6 +1372,8 @@ extern rtx emit_copy_of_insn_after	PARAM
 extern rtx rtx_alloc			PARAMS ((RTX_CODE));
 extern rtvec rtvec_alloc		PARAMS ((int));
 extern rtx copy_rtx			PARAMS ((rtx));
+extern void __put_rtx			PARAMS ((rtx));
+extern void __put_rtvec			PARAMS ((rtvec));
 
 /* In emit-rtl.c */
 extern rtx copy_rtx_if_shared		PARAMS ((rtx));
--- ./gengenrtl.c.~1~	Sun Aug 11 19:04:33 2002
+++ ./gengenrtl.c	Sun Aug 11 20:45:18 2002
@@ -278,11 +278,15 @@ gendef (format)
      the memory and initializes it.  */
   puts ("{");
   puts ("  rtx rt;");
-  printf ("  rt = ggc_alloc_rtx (%d);\n", (int) strlen (format));
+  puts ("  int n;");
+  printf ("  n = (sizeof (struct rtx_def) + ((%d - 1) * sizeof(rtunion)));\n",
+	  (int) strlen (format));
+  puts ("  rt = xmalloc (n);\n");
 
   puts ("  memset (rt, 0, sizeof (struct rtx_def) - sizeof (rtunion));\n");
   puts ("  PUT_CODE (rt, code);");
   puts ("  PUT_MODE (rt, mode);");
+  puts ("  rt->__count = 1;");
 
   for (p = format, i = j = 0; *p ; ++p, ++i)
     if (*p != '0')
--- ./rtl.c.~1~	Tue Jun  4 14:06:54 2002
+++ ./rtl.c	Sun Aug 11 20:53:11 2002
@@ -242,14 +242,34 @@ rtvec_alloc (n)
 {
   rtvec rt;
 
-  rt = ggc_alloc_rtvec (n);
+  /* Keep N as the element count; the byte size is computed
+     separately so PUT_NUM_ELEM and the memset below stay correct.  */
+  rt = xmalloc (sizeof (struct rtvec_def)
+		+ ((n - 1) * sizeof (rtx)));
+
+  PUT_NUM_ELEM (rt, n);
+  rt->__count = 1;
+
   /* clear out the vector */
   memset (&rt->elem[0], 0, n * sizeof (rtx));
 
-  PUT_NUM_ELEM (rt, n);
   return rt;
 }
 
+void
+__put_rtvec (rv)
+     rtvec rv;
+{
+  int i, len = GET_NUM_ELEM (rv);
+
+  for (i = 0; i < len; i++)
+    {
+      if (! rv->elem[i])
+	abort ();
+      PUT_RTX (rv->elem[i]);
+    }
+  xfree (rv);
+}
+
 /* Allocate an rtx of code CODE.  The CODE is stored in the rtx;
    all the rest is initialized to zero.  */
 
@@ -258,9 +278,11 @@ rtx_alloc (code)
   RTX_CODE code;
 {
   rtx rt;
-  int n = GET_RTX_LENGTH (code);
+  int n;
 
-  rt = ggc_alloc_rtx (n);
+  n = (sizeof (struct rtx_def)
+       + ((GET_RTX_LENGTH (code) - 1) * sizeof(rtunion)));
+  rt = xmalloc (n);
 
   /* We want to clear everything up to the FLD array.  Normally, this
      is one int, but we don't want to assume that and it isn't very
@@ -268,7 +290,58 @@ rtx_alloc (code)
 
   memset (rt, 0, sizeof (struct rtx_def) - sizeof (rtunion));
   PUT_CODE (rt, code);
+  rt->__count = 1;
   return rt;
+}
+
+void
+__put_rtx(rt)
+     rtx rt;
+{
+  char *fmt;
+  int i, j, len;
+
+  fmt = GET_RTX_FORMAT (GET_CODE (rt));
+  len = GET_RTX_LENGTH (GET_CODE (rt));
+  for (i = 0; i < len; i++) {
+    switch (fmt[i]) {
+	case 'e':
+	  if (! XEXP (rt, i))
+	    abort ();
+	  PUT_RTX (XEXP (rt, i));
+	  break;
+
+	case 'E':
+	case 'V':
+	  /* XXX How to handle vectors... XXX */
+	  if (XVEC (rt, i) != NULL)
+	    {
+	      for (j = 0; j < XVECLEN (rt, i); j++)
+		{
+		  if (! XVECEXP (rt, i, j))
+		    abort ();
+		  PUT_RTX (XVECEXP (rt, i, j));
+		}
+	    }
+	  break;
+
+	case 't':
+	case 'w':
+	case 'i':
+	case 's':
+	case 'S':
+	case 'T':
+	case 'u':
+	case 'B':
+	case '0':
+	  break;
+
+	default:
+	  abort ();
+    };
+  }
+
+  xfree(rt);
 }
 
 \f

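To make the intended discipline concrete: every pass that stashes an
rtx away takes a reference, and drops it when done.  A sketch, not
part of the patch (get_insns is an existing entry point; the rest is
illustrative):

  rtx insn = get_insns ();

  GET_RTX (insn);	/* take a reference before recording INSN */
  /* ... stash INSN somewhere and use it ... */
  PUT_RTX (insn);	/* drop it; __put_rtx runs when the count hits 0 */
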
^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-12 12:56         ` David S. Miller
@ 2002-08-12 13:56           ` Matt Austern
  2002-08-12 14:27             ` Daniel Berlin
                               ` (2 more replies)
  2002-08-12 14:28           ` Stan Shebs
  1 sibling, 3 replies; 256+ messages in thread
From: Matt Austern @ 2002-08-12 13:56 UTC (permalink / raw)
  To: David S. Miller; +Cc: dje, mrs, gcc

On Monday, August 12, 2002, at 12:43 PM, David S. Miller wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-12 13:56           ` Matt Austern
@ 2002-08-12 14:27             ` Daniel Berlin
  2002-08-12 15:26               ` David Edelsohn
  2002-08-12 14:59             ` David S. Miller
  2002-08-12 16:00             ` Geoff Keating
  2 siblings, 1 reply; 256+ messages in thread
From: Daniel Berlin @ 2002-08-12 14:27 UTC (permalink / raw)
  To: Matt Austern; +Cc: David S. Miller, dje, mrs, gcc

On Mon, 12 Aug 2002, Matt Austern wrote:

> On Monday, August 12, 2002, at 12:43 PM, David S. Miller wrote:
> 
> >    From: Matt Austern <austern@apple.com>
> >    Date: Mon, 12 Aug 2002 12:47:30 -0700
> >
> >    And yes, we're aware that many gains are possible only
> >    if we rewrite the parser or redesign the tree structure.  The
> >    only reason we haven't started on rewriting the parser is
> >    that someone else is already doing it.
> >
> > So work on an attempt at RTL refcounting, the patch below is a place
> > to start.
> 
> Thanks for the pointer, that's a useful starting point.
> 
> But, at the risk of sounding like a broken record...  Do
> we have benchmarks showing that RTL gc is one of
> the major causes of slow compile speed?
> 
> At the moment, we're spending a lot of time doing
> benchmarking and trying to figure out just where the
> time is going.  I realize this has its limitations, that
> poorly designed data structures may end up resulting
> in tiny bits of overhead everywhere even if they never
> show up in a profile.  But at least we can try to
> understand what kinds of programs are especially
> bad.  (One interesting fact, for example: one file that
> we care a lot about takes twice as long to compile with
> the C++ front end than with the C front end.)

Well, the tools for this stuff are much better on OS X than on Linux,
so you guys are probably ahead of others in figuring out whether GC is
really bad for us.

You can easily get numbers like data cache miss cycles, etc, and graph 
them nicely with MONster.
-Dan

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-12 12:56         ` David S. Miller
  2002-08-12 13:56           ` Matt Austern
@ 2002-08-12 14:28           ` Stan Shebs
  2002-08-12 15:05             ` David S. Miller
  1 sibling, 1 reply; 256+ messages in thread
From: Stan Shebs @ 2002-08-12 14:28 UTC (permalink / raw)
  To: David S. Miller; +Cc: austern, dje, mrs, gcc

David S. Miller wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-12 13:56           ` Matt Austern
  2002-08-12 14:27             ` Daniel Berlin
@ 2002-08-12 14:59             ` David S. Miller
  2002-08-12 16:00             ` Geoff Keating
  2 siblings, 0 replies; 256+ messages in thread
From: David S. Miller @ 2002-08-12 14:59 UTC (permalink / raw)
  To: austern; +Cc: dje, mrs, gcc

   From: Matt Austern <austern@apple.com>
   Date: Mon, 12 Aug 2002 13:56:32 -0700

   But, at the risk of sounding like a broken record...  Do
   we have benchmarks showing that RTL gc is one of
   the major causes of slow compile speed?
   
It's not the GC itself, it's the resulting data access patterns,
and that overhead won't show up in normal profiling since
it is simply spread all over the compiler.

That's the purpose of cobbling together a "hack" implementation
of refcounting: to get some performance comparisons.  You don't have
to do a "final" perfect implementation to produce a tree usable enough
for simple initial benchmarking.  Based upon those results, we can
decide to continue or not.

But hey, if people are going to be silly enough to require
pre-benchmarking before even laying a finger on the refcounting
bits, no problem; we'll just have to wait for me to work on it
then ;-)

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-12 14:28           ` Stan Shebs
@ 2002-08-12 15:05             ` David S. Miller
  0 siblings, 0 replies; 256+ messages in thread
From: David S. Miller @ 2002-08-12 15:05 UTC (permalink / raw)
  To: shebs; +Cc: austern, dje, mrs, gcc

   From: Stan Shebs <shebs@apple.com>
   Date: Mon, 12 Aug 2002 14:27:52 -0700

   So, uh, did I miss the part where refcounting is shown to be an improvement
   over the status quo?  It's plausible I suppose, but counting does have its
   overhead too.  We ought to have at least a back-of-the-envelope estimate
   before changing everything...
   
You can choose to do that, but I bet you can spend the same amount of
effort getting a benchmark'able refcounting tree together.

This is so frustrating that I just might stop everything else I'm
doing and put something together so I can just avoid all of this
ridiculous red tape people are putting up just to work on what
amounts to a frickin' technology demo!

Have you ever implemented something solely to figure out whether it
was worthwhile or not? :-)


^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-10 18:00         ` Nix
  2002-08-10 20:36           ` Noel Yap
@ 2002-08-12 15:08           ` Mike Stump
  1 sibling, 0 replies; 256+ messages in thread
From: Mike Stump @ 2002-08-12 15:08 UTC (permalink / raw)
  To: Nix; +Cc: Noel Yap, Neil Booth, gcc

On Saturday, August 10, 2002, at 05:49 PM, Nix wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-12 14:27             ` Daniel Berlin
@ 2002-08-12 15:26               ` David Edelsohn
  2002-08-13 10:49                 ` David Edelsohn
  0 siblings, 1 reply; 256+ messages in thread
From: David Edelsohn @ 2002-08-12 15:26 UTC (permalink / raw)
  To: Daniel Berlin, Matt Austern, David S. Miller; +Cc: gcc

	I have IBM's hpmcount tool installed on a Power4 AIX 5.1 system
which can use PMAPI to access the hardware performance counters on the
chip.  I would be happy to provide additional data for comparison with the
x86 cache statistics which have been mentioned.

	So that we're all on the same page, what source file is being
compiled with which GCC options?

	I can acquire information like the following for cc1 -O2 hello.c:

  PM_DTLB_MISS (Data TLB misses)               :            5538
  PM_ITLB_MISS (Instruction TLB misses)        :             819
  PM_LD_MISS_L1 (L1 D cache load misses)       :           43074
  PM_ST_MISS_L1 (L1 D cache store misses)      :          349240
  PM_ST_REF_L1 (L1 D cache store references)   :         1958037
  PM_LD_REF_L1 (L1 D cache load references)    :         3113549

  Utilization rate                           :          29.438 %
  % TLB misses per cycle                     :           0.038 %
  Avg number of loads per TLB miss           :         562.215
  Load and store operations                  :           5.072 M
  Instructions per load/store                :           2.899
  Avg number of loads per load miss          :          72.284
  Avg number of stores per store miss        :           5.607
  Avg number of load/stores per D1 miss      :          12.927
  L1 cache hit rate                          :          92.264 %


  PM_DATA_FROM_L3 (Data loaded from L3)                   :            1420
  PM_DATA_FROM_MEM (Data loaded from memory)              :             144
  PM_DATA_FROM_L35 (Data loaded from L3.5)                :              19
  PM_DATA_FROM_L2 (Data loaded from L2)                   :           36410
  PM_DATA_FROM_L25_SHR (Data loaded from L2.5 shared)     :               0
  PM_DATA_FROM_L275_SHR (Data loaded from L2.75 shared)   :               0
  PM_DATA_FROM_L275_MOD (Data loaded from L2.75 modified) :               0
  PM_DATA_FROM_L25_MOD (Data loaded from L2.5 modified)   :               0

  Memory traffic                             :           0.074 MBytes
  Memory bandwidth                           :           1.589 MBytes/sec
  Total loads from L3                        :           0.001 M
  L3 traffic                                 :           0.184 MBytes
  L3 bandwidth                               :           3.970 MBytes/sec
  L3 Load miss rate                          :           9.097 %
  Total loads from L2                        :           0.036 M
  L2 traffic                                 :           4.660 MBytes
  L2 bandwidth                               :         100.446 MBytes/sec
  L2 Load miss rate                          :           4.167 %


David

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-09 15:26   ` Geoff Keating
  2002-08-09 16:06     ` Stan Shebs
@ 2002-08-12 15:55     ` Mike Stump
  1 sibling, 0 replies; 256+ messages in thread
From: Mike Stump @ 2002-08-12 15:55 UTC (permalink / raw)
  To: Geoff Keating; +Cc: Stan Shebs, gcc

On Friday, August 9, 2002, at 03:26 PM, Geoff Keating wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-12 13:56           ` Matt Austern
  2002-08-12 14:27             ` Daniel Berlin
  2002-08-12 14:59             ` David S. Miller
@ 2002-08-12 16:00             ` Geoff Keating
  2002-08-13  2:58               ` Nick Ing-Simmons
  2002-08-13 10:47               ` Richard Henderson
  2 siblings, 2 replies; 256+ messages in thread
From: Geoff Keating @ 2002-08-12 16:00 UTC (permalink / raw)
  To: Matt Austern; +Cc: gcc

Matt Austern <austern@apple.com> writes:

> On Monday, August 12, 2002, at 12:43 PM, David S. Miller wrote:
> 
> >    From: Matt Austern <austern@apple.com>
> >    Date: Mon, 12 Aug 2002 12:47:30 -0700
> >
> >    And yes, we're aware that many gains are possible only
> >    if we rewrite the parser or redesign the tree structure.  The
> >    only reason we haven't started on rewriting the parser is
> >    that someone else is already doing it.
> >
> > So work on an attempt at RTL refcounting, the patch below is a place
> > to start.
> 
> Thanks for the pointer, that's a useful starting point.
> 
> But, at the risk of sounding like a broken record...  Do
> we have benchmarks showing that RTL gc is one of
> the major causes of slow compile speed?

We happen to know that GC as a whole is 10-13% of total compile time,
even at -O0, and my expectation is that the RTL part of that is
perhaps two-thirds, say 7%.  So the benefit you can get is 7% less any
overhead in tracking the reference counts and freeing
briefly-allocated RTL.  

My suggestion is to try shrinking RTL in other ways.  For instance,
once RTL is generated it should all match an insn or a splitter.  If
we could store RTL as the insn number (or a splitter number) plus the
operands, rather than the expanded form we have now, that should be
much easier to traverse.  For those operations that look at the form
of RTL, code could be generated to perform that operation knowing what
insns exist; for instance, on x86 the form of the 'add' instruction is:

(insn 15 13 17 (parallel[
            (set (reg:SI 61)
                (plus:SI (reg/v:SI 59)
                    (reg/v:SI 60)))
            (clobber (reg:CC 17 flags))
        ] ) -1 (nil)
    (nil))

we could represent this as

(packed_insn 15 13 17 207 {*addsi_1} [(reg:SI 61) (reg:SI 59) (reg:SI 60)])

which would save us, by my count, 50% of the RTL objects for this
case.  I'd expect that would then speed GC (on this object) by 50%,
speed up allocation by 50%, and hopefully would also speed up code
that uses these objects because (a) they'd better fit in cache and (b)
there would be fewer pointers to chase.
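
Concretely, the packed node could be a single variable-sized object
along these lines (a sketch only; the names are invented, this is
not existing GCC code):

struct packed_insn_def
{
  int uid;			/* 15 in the example above */
  int prev_uid, next_uid;	/* 13 and 17 */
  int insn_code;		/* 207, i.e. *addsi_1 */
  int n_operands;		/* 3 */
  rtx operands[1];		/* operands allocated inline at the end */
};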

To perform operations that are now done directly on the RTL, there'd be
a switch statement, for instance:

int reg_mentioned_p (reg, in)
     rtx reg, in;
{
...
  case PACKED_INSN:
    switch (PACKED_INSN_NUMBER (in)) {
...
      case 207: /* *addsi_1 */
	if (REGNO (reg) == 17)	/* deal with the clobbered register */
          return 1;
        /* deal with the operands */
        break;
    }
...
}

Even combine can be handled this way, by pregenerating rules based on
the insn numbers being combined.  Relatively few insns can actually be
combined, so it shouldn't require a huge amount of generated code.

On RISCy chips, you could take even further advantage of the fact that
often an operand is guaranteed to be a register, or a constant
integer or whatever, and so eliminate some tests.

I'm not sure how much work this is to implement.  I suspect what you'd
end up doing is trading off between generating too many new routines
and having to rewrite large chunks of old code to use routines that
already exist but currently go unused.

Now, if only I could think of something that would work like this on
trees...

-- 
- Geoffrey Keating <geoffk@geoffk.org> <geoffk@redhat.com>

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-09 16:00     ` Aldy Hernandez
  2002-08-09 16:26       ` Stan Shebs
@ 2002-08-12 16:05       ` Mike Stump
  1 sibling, 0 replies; 256+ messages in thread
From: Mike Stump @ 2002-08-12 16:05 UTC (permalink / raw)
  To: Aldy Hernandez; +Cc: gcc

On Friday, August 9, 2002, at 04:05 PM, Aldy Hernandez wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-09 16:26       ` Stan Shebs
  2002-08-09 16:31         ` Aldy Hernandez
  2002-08-09 17:36         ` Daniel Berlin
@ 2002-08-12 16:23         ` Mike Stump
  2 siblings, 0 replies; 256+ messages in thread
From: Mike Stump @ 2002-08-12 16:23 UTC (permalink / raw)
  To: Stan Shebs; +Cc: Aldy Hernandez, gcc

On Friday, August 9, 2002, at 04:25 PM, Stan Shebs wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-09 16:29       ` Phil Edwards
@ 2002-08-12 16:24         ` Mike Stump
  2002-08-12 18:38           ` Phil Edwards
  2002-08-13  5:27           ` Theodore Papadopoulo
  0 siblings, 2 replies; 256+ messages in thread
From: Mike Stump @ 2002-08-12 16:24 UTC (permalink / raw)
  To: Phil Edwards; +Cc: Stan Shebs, gcc

On Friday, August 9, 2002, at 04:29 PM, Phil Edwards wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-12 16:24         ` Mike Stump
@ 2002-08-12 18:38           ` Phil Edwards
  2002-08-13  5:27           ` Theodore Papadopoulo
  1 sibling, 0 replies; 256+ messages in thread
From: Phil Edwards @ 2002-08-12 18:38 UTC (permalink / raw)
  To: Mike Stump; +Cc: Stan Shebs, gcc

On Mon, Aug 12, 2002 at 04:24:46PM -0700, Mike Stump wrote:
> On Friday, August 9, 2002, at 04:29 PM, Phil Edwards wrote:
> > Personally, "fastest compile possible" usually just means 
> > -fsyntax-only.
> 
> -fsyntax-only isn't a compile.

My point, if we're nitpicking, is that almost every single time I hear a
user complain that "gcc is taking so long," it's immediately followed by
"all I want to do is check that I got the template specializations in the
right order," etc.  So they use -fsyntax-only while writing their code,
and then fire off a "real" build at -O5.2e7 and go home for the evening.


Phil

-- 
I would therefore like to posit that computing's central challenge, viz. "How
not to make a mess of it," has /not/ been met.
                                                 - Edsger Dijkstra, 1930-2002

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-12 12:11   ` Mike Stump
  2002-08-12 12:41     ` David Edelsohn
@ 2002-08-12 19:17     ` Mike Stump
  2002-08-12 23:28       ` Neil Booth
  1 sibling, 1 reply; 256+ messages in thread
From: Mike Stump @ 2002-08-12 19:17 UTC (permalink / raw)
  To: Mike Stump; +Cc: Neil Booth, gcc

On Monday, August 12, 2002, at 12:11 PM, Mike Stump wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-12 19:17     ` Mike Stump
@ 2002-08-12 23:28       ` Neil Booth
  0 siblings, 0 replies; 256+ messages in thread
From: Neil Booth @ 2002-08-12 23:28 UTC (permalink / raw)
  To: Mike Stump; +Cc: gcc

Mike Stump wrote:-

> Ok, I looked at it.  A straightforward check to see if it has been
> folded first, using an existing unused bit in the tree, speeds
> it up by 1.0003x, or not enough to bother with all the code and the use
> of the extra bit that someone else may find more valuable.  :-(

That's a shame.  8-(  Thanks for looking at it.

Neil.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-09 18:25             ` David S. Miller
@ 2002-08-13  0:50               ` Loren James Rittle
  2002-08-13 21:46                 ` Fergus Henderson
  0 siblings, 1 reply; 256+ messages in thread
From: Loren James Rittle @ 2002-08-13  0:50 UTC (permalink / raw)
  To: davem; +Cc: gcc

In article < 20020809.181251.63969530.davem@redhat.com > David S. Miller writes:

> For example, I'm convinced that teaching all the RTL code "how to
> count" and thus obviating garbage collection all together, would be
> the biggest win ever.  (I'm saying RTL should have reference counts,
> if someone didn't catch what I meant)

Hi David,

(This message is in the interest of brainstorming ways to improve
 compilation speed, even if we can't volunteer to implement, as Mike
 requested.)

In general, comparing RC-GC to scan-GC, I often thought along the
quoted lines as well.  However, I had no systematic data and my
opinion softened somewhat after reading Boehm's papers.  Then, for
non-modern hardware, I once did compare the performance of a
scan-GC-based system (using boehm-gc) versus that of an equivalent
explicit-free-based system (along with all the application-level RC
code).  I was truly surprised at how little overhead there was for
using the boehm-gc technique (off-hand, I think it was under 1% for my
system, but I do doubt this study applies to modern HW and/or gcc's
memory usage pattern) and, more importantly, how much code complexity
was reduced.  I believe that reduction in code complexity is what
drove gcc's switch to scan-GC RTL.  If you hand-coded RC back in, how
is that different from the complexity that was once removed with the
introduction of scan-GC?  If I recall correctly, subtle object
lifetime bugs came and went with the pre-scan-GC code due to
complexity (perhaps it was never formally RC'd, and if that is your
answer, I'd buy it ;-).

Now, if I understand it right, the scan-GC technique used in gcc is
not as elegant (some explicit marking is required) or high-performance
(gcc's implementation doesn't use hardware dirty bits, etc.) as that
used in boehm-gc.  Has anyone ever tested gcc with its own GC disabled
but boehm-gc enabled?  OK, this is a red herring question.  Even if
performance was greater, portability concerns are what caused the
decision to build a new custom scan-GC versus reusing boehm-gc...
Assuming your (application-level) RC-GC test pans out in terms of
speedup, perhaps adding explicit code to maintain counts is not the
best approach to keeping the reins on complexity.

This might be what you meant, but: Wouldn't it be neater if gcc itself
could generally reference count underlying memory which supports C
pointers (as a language extension)?  According to published papers,
the compiler for Inferno could do it (I read them years ago when
looking at the classic Java GC model verses other VM technology thus
no cite here; I think it is interesting that the latest Java JIT
compilers support RC-GC now).

Perhaps it is impossible to add generic RC support to C and expose it
to all users (for instance, there is the classic pointer escape/ABI
problem).  But it seems that we could mark structs whose pointers and
underlying memory representations are to be handled specially upon
pointer copy/invalidation (i.e. due to falling off the end of a scope)
and then rigorously check usage against whatever model we use to avoid
pointer escape.  GCC's use of pointers in this area is regular and I
see no reason the RC extension couldn't be modeled off the exact needs
of the RTL usage (just as scan-GC was not exposed to compiler users,
this RC-GC support could be tuned for compiler implementation).

How to handle bootstrap since we'd want to use the new technique to
replace gcc's current scan-GC?  The current GC is only slightly
intrusive and could be retained to build the stage1 compiler with
support for the new RC-pointer handler (and related support for struct
marking in source).  Current scan-GC would be disabled for stage2 and
3; the new RC-pointer handler would be enabled.

Regards,
Loren

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-12 16:00             ` Geoff Keating
@ 2002-08-13  2:58               ` Nick Ing-Simmons
  2002-08-13 10:47               ` Richard Henderson
  1 sibling, 0 replies; 256+ messages in thread
From: Nick Ing-Simmons @ 2002-08-13  2:58 UTC (permalink / raw)
  To: geoffk; +Cc: gcc, Matt Austern

Geoff Keating <geoffk@geoffk.org> writes:
>
>We happen to know that GC as a whole is 10-13% of total compile time,
>even at -O0, and my expectation is that the RTL part of that is
>perhaps two-thirds, say 7%.  So the benefit you can get is 7% less any
>overhead in tracking the reference counts and freeing
>briefly-allocated RTL.  

That does not take into account the cache/tlb locality effects that 
Linus explained are caused by delayed reclaimation.

-- 
Nick Ing-Simmons
http://www.ni-s.u-net.com/

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-12 16:24         ` Mike Stump
  2002-08-12 18:38           ` Phil Edwards
@ 2002-08-13  5:27           ` Theodore Papadopoulo
  2002-08-13 10:03             ` Mike Stump
  1 sibling, 1 reply; 256+ messages in thread
From: Theodore Papadopoulo @ 2002-08-13  5:27 UTC (permalink / raw)
  To: Mike Stump; +Cc: Phil Edwards, Stan Shebs, gcc

OK, since this is a brainstorming session about speeding up gcc, and
since silly ideas are at least discussed, let me try one.

Why not make incremental compilation a standard feature of gcc...

This would mean storing some information in the object files.
Things I can see are:

- Compilation flags (defines, optimization, code generation 
  and debugging flags at least).
- A signature (e.g. MD5 or another hash) for each data type/function/global
  (decl?) allowing a quick check for a change.  We may even differentiate
  between visible and invisible changes: e.g. if a function body changes
  but not its interface, there is no need to recompile the functions
  calling it (see the sketch after this list).  More generally, name
  changes could be detected as non-changes, but I suspect that this
  would mess up debugging information.
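
A crude sketch of the signature computation (illustrative only;
decl_signature is not an existing function, and a real implementation
would use MD5 or similar over the decl's printed interface and body
separately; the toy hash below is just for brevity):

unsigned long
decl_signature (printed)
     const char *printed;
{
  unsigned long h = 5381;	/* the classic djb2 string hash */

  while (*printed)
    h = h * 33 + (unsigned char) *printed++;
  return h;
}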

Then generate code only for the relevant symbols (i.e. the new ones, or
those changed directly or affected indirectly because they depend on a
function or variable that changed) and replace these in the .o file
(is there a gas option like --replace?).

In some way this is like PCH but pushed one step further.  I can
understand that making it work reliably is quite difficult, but the
prospect of having a fast incremental compiler is tempting...
Deciding what information to store is certainly one of the trickiest
parts, so a first step could be to add a flag stating "recompile only
this symbol and whatever depends on it".  Not very user friendly, but
maybe an interesting first step...

Is this a totally remote/stupid idea, or can it be done in the
not-too-distant future?

	Theo.

--------------------------------------------------------------------
Theodore Papadopoulo
Email: Theodore.Papadopoulo@sophia.inria.fr Tel: (33) 04 92 38 76 01
 --------------------------------------------------------------------


^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-13  5:27           ` Theodore Papadopoulo
@ 2002-08-13 10:03             ` Mike Stump
  0 siblings, 0 replies; 256+ messages in thread
From: Mike Stump @ 2002-08-13 10:03 UTC (permalink / raw)
  To: Theodore Papadopoulo; +Cc: Phil Edwards, Stan Shebs, gcc

On Tuesday, August 13, 2002, at 05:27 AM, Theodore Papadopoulo wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-12 16:00             ` Geoff Keating
  2002-08-13  2:58               ` Nick Ing-Simmons
@ 2002-08-13 10:47               ` Richard Henderson
  1 sibling, 0 replies; 256+ messages in thread
From: Richard Henderson @ 2002-08-13 10:47 UTC (permalink / raw)
  To: Geoff Keating; +Cc: Matt Austern, gcc

On Mon, Aug 12, 2002 at 04:00:08PM -0700, Geoff Keating wrote:
> My suggestion is to try shrinking RTL in other ways.  For instance,
> once RTL is generated it should all match an insn or a splitter.  If
> we could store RTL as the insn number (or a splitter number) plus the
> operands, rather than the expanded form we have now, that should be
> much easier to traverse.

I've thought about this in passing before.

> (packed_insn 15 13 17 207 {*addsi_1} [(reg:SI 61) (reg:SI 59) (reg:SI 60)])
> 
> which would save us, by my count, 50% of the RTL objects for this
> case.

A bit more than that if the packed_insn rtl is actually variable
sized so that the operands are directly at the end of the other
arguments.

> To perform operations that are now done directly on the RTL, there'd be
> a switch statement, for instance:

Another possible solution, particularly for bletcherous code like
combine, is to regenerate the full instruction on demand.  After
try_combine is done with an insn, we free it immediately so that
we don't accumulate garbage.

But I suspect that most passes don't need this.  They only need
to know which operands are inputs, sets, and clobbers.  They need
to know which predicates apply.  Information which is trivial to
generate off the md file.

This idea, I think, has real potential, and could actually be
implemented without disrupting the entire compiler.

> Even combine can be handled this way, by pregenerating rules based on
> the insn numbers being combined.  Relatively few insns can actually be
> combined, so it shouldn't require a huge amount of generated code.

Pre-generating the combinations would be really cool, and would probably
save quite a bit of time, but I don't really believe in that for even
the medium term.  The number of possibilities is really quite large.

> Now, if only I could think of something that would work like this on
> trees...

Having stronger typing instead of the union-of-everything would do.


r~

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-12 15:26               ` David Edelsohn
@ 2002-08-13 10:49                 ` David Edelsohn
  2002-08-13 10:52                   ` David S. Miller
                                     ` (2 more replies)
  0 siblings, 3 replies; 256+ messages in thread
From: David Edelsohn @ 2002-08-13 10:49 UTC (permalink / raw)
  To: Daniel Berlin, Matt Austern, David S. Miller; +Cc: gcc

Source file	Insns / L1 D$ Miss
-----------	------------------
reload.c		22
reload1.c		25
insn-recog.c		29

GCC 3.3 20020812 (experimental)
powerpc-ibm-aix5.1.0.0
Power4 processor

	As one of my colleagues commented, this is the cache behavior one
would see with database transaction processing.  In other words, this is
*really bad*.

David

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-13 10:49                 ` David Edelsohn
@ 2002-08-13 10:52                   ` David S. Miller
  2002-08-13 14:03                   ` David Edelsohn
  2002-08-13 15:32                   ` Daniel Berlin
  2 siblings, 0 replies; 256+ messages in thread
From: David S. Miller @ 2002-08-13 10:52 UTC (permalink / raw)
  To: dje; +Cc: dan, austern, gcc

   From: David Edelsohn <dje@watson.ibm.com>
   Date: Tue, 13 Aug 2002 13:49:18 -0400

   As one of my colleagues commented, this is the cache behavior one
   would see with database transaction processing.  In other words, this is
   *really bad*.

Thanks for doing these tests David.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-13 10:49                 ` David Edelsohn
  2002-08-13 10:52                   ` David S. Miller
@ 2002-08-13 14:03                   ` David Edelsohn
  2002-08-13 14:46                     ` Geoff Keating
                                       ` (2 more replies)
  2002-08-13 15:32                   ` Daniel Berlin
  2 siblings, 3 replies; 256+ messages in thread
From: David Edelsohn @ 2002-08-13 14:03 UTC (permalink / raw)
  To: David S. Miller; +Cc: dan, austern, gcc

	Here's an interesting (aka depressing) data point.  My previous
cache miss statistics were for GCC -O2.  At -O0, GCC's cache miss
statistics stay the same or get up to 20% *worse*.  In comparison, the
cache statistics for IBM's compiler without optimization enabled *improve*
to as much as 50 insns/miss for the same reload.c and insn-recog.c input
files compared to optimized.

	GCC has some sort of overhead, maybe the tree->RTL conversion as
Dan mentioned, which really hurts re-use at -O0.

David

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-13 14:03                   ` David Edelsohn
@ 2002-08-13 14:46                     ` Geoff Keating
  2002-08-13 15:10                       ` David Edelsohn
  2002-08-14  9:25                     ` Kevin Handy
  2002-08-18 12:58                     ` Jeff Sturm
  2 siblings, 1 reply; 256+ messages in thread
From: Geoff Keating @ 2002-08-13 14:46 UTC (permalink / raw)
  To: David Edelsohn; +Cc: gcc

David Edelsohn <dje@watson.ibm.com> writes:

> 	Here's an interesting (aka depressing) data point.  My previous
> cache miss statistics were for GCC -O2.  At -O0, GCC's cache miss
> statistics stay the same or get up to 20% *worse*.  In comparison, the
> cache statistics for IBM's compiler without optimization enabled *improve*
> up to 50 for the same reload.c and insn-recog.c input files compared to
> optimized.
> 
> 	GCC has some sort of overhead, maybe the tree->RTL conversion as
> Dan mentioned, which really hurts re-use at -O0.

Could you try with -fsyntax-only?

-- 
- Geoffrey Keating <geoffk@geoffk.org> <geoffk@redhat.com>

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-13 14:46                     ` Geoff Keating
@ 2002-08-13 15:10                       ` David Edelsohn
  2002-08-13 15:26                         ` Neil Booth
  0 siblings, 1 reply; 256+ messages in thread
From: David Edelsohn @ 2002-08-13 15:10 UTC (permalink / raw)
  To: Geoff Keating; +Cc: gcc

>>>>> Geoff Keating writes:

Geoff> Could you try with -fsyntax-only?

Source		I/D$ miss -O2	I/D$ miss -O0	I/D$ miss -fsyntax-only
------------	-------------	-------------	-----------------------
reload.c		22		22		23
reload1.c		25		22		23
insn-recog.c		29		23		26

David

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-13 15:10                       ` David Edelsohn
@ 2002-08-13 15:26                         ` Neil Booth
  0 siblings, 0 replies; 256+ messages in thread
From: Neil Booth @ 2002-08-13 15:26 UTC (permalink / raw)
  To: David Edelsohn; +Cc: Geoff Keating, gcc

David Edelsohn wrote:-

> >>>>> Geoff Keating writes:
> 
> Geoff> Could you try with -fsyntax-only?
> 
> Source		I/D$ miss -O2	I/D$ miss -O0	I/D$ miss -fsyntax-only
> ------------	-------------	-------------	-----------------------
> reload.c		22		22		23
> reload1.c		25		22		23
> insn-recog.c		29		23		26

And -E 8-)  I'd actually be quite curious if you have time.

Neil.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-13 10:49                 ` David Edelsohn
  2002-08-13 10:52                   ` David S. Miller
  2002-08-13 14:03                   ` David Edelsohn
@ 2002-08-13 15:32                   ` Daniel Berlin
  2002-08-13 15:58                     ` David Edelsohn
  2 siblings, 1 reply; 256+ messages in thread
From: Daniel Berlin @ 2002-08-13 15:32 UTC (permalink / raw)
  To: David Edelsohn; +Cc: Daniel Berlin, Matt Austern, David S. Miller, gcc

On Tue, 13 Aug 2002, David Edelsohn wrote:

> Source file	Insns / L1 D$ Miss
> -----------	------------------
> reload.c		22
> reload1.c		25
> insn-recog.c		29
> 
> GCC 3.3 20020812 (experimental)
> powerpc-ibm-aix5.1.0.0
> Power4 processor
> 
> 	As one of my colleagues commented, this is the cache behavior one
> would see with database transaction processing.  In other words, this is
> *really bad*.

Yup.


^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-13 15:32                   ` Daniel Berlin
@ 2002-08-13 15:58                     ` David Edelsohn
  2002-08-13 16:49                       ` David S. Miller
  0 siblings, 1 reply; 256+ messages in thread
From: David Edelsohn @ 2002-08-13 15:58 UTC (permalink / raw)
  To: dberlin; +Cc: Daniel Berlin, Matt Austern, David S. Miller, gcc

>>>>> Daniel Berlin writes:

>> As one of my colleagues commented, this is the cache behavior one
>> would see with database transaction processing.  In other words, this is
>> *really bad*.

Daniel> Yup.

	The problem isn't that the number is low at optimization.  29 I/M
is not horrible.  Low 20's is bad.  Scientific code will have a value in
the low hundreds, but compilation is not that regular a computation.

	The problem is that the number stays the same or gets worse
without optimization.  Most commercial compilers will be in the same
ballpark when optimizing, but use a lot fewer instructions and a lot fewer
cache misses to produce minimally optimized, debuggable code.

David

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-13 15:58                     ` David Edelsohn
@ 2002-08-13 16:49                       ` David S. Miller
  0 siblings, 0 replies; 256+ messages in thread
From: David S. Miller @ 2002-08-13 16:49 UTC (permalink / raw)
  To: dje; +Cc: dberlin, dan, austern, gcc

   From: David Edelsohn <dje@watson.ibm.com>
   Date: Tue, 13 Aug 2002 18:58:25 -0400
   
   	The problem isn't that the number is low at optimization.

Can you control when the performance counters start/stop monitoring?
If so, then you can figure out more precisely whether it is mostly
during:

1) Front end tree or tree->rtl conversion

2) rest_of_compilation() onward

3) Both #1 and #2 about evenly, because all of our core data
   structures come out of GC and so the whole compiler has bad
   spatial and temporal locality

My money is on #3 :-)

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-13  0:50               ` Loren James Rittle
@ 2002-08-13 21:46                 ` Fergus Henderson
  2002-08-13 22:40                   ` David S. Miller
  2002-08-14  7:36                   ` Jeff Sturm
  0 siblings, 2 replies; 256+ messages in thread
From: Fergus Henderson @ 2002-08-13 21:46 UTC (permalink / raw)
  To: Loren James Rittle; +Cc: davem, gcc

On 13-Aug-2002, Loren James Rittle <rittle@latour.rsch.comm.mot.com> wrote:
> Has anyone ever tested gcc with its own GC disabled
> but boehm-gc enabled?  OK, this is a red herring question.  Even if
> performance was greater, portability concerns are what caused the
> decision to build a new custom scan-GC versus reusing boehm-gc...

Yes, but GCC could use the Boehm GC on systems which supported it,
if the Boehm GC was faster...

I think this would be a very interesting experiment.

-- 
Fergus Henderson <fjh@cs.mu.oz.au>  |  "I have always known that the pursuit
The University of Melbourne         |  of excellence is a lethal habit"
WWW: < http://www.cs.mu.oz.au/~fjh >  |     -- the last words of T. S. Garp.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-13 21:46                 ` Fergus Henderson
@ 2002-08-13 22:40                   ` David S. Miller
  2002-08-13 23:44                     ` Fergus Henderson
                                       ` (2 more replies)
  2002-08-14  7:36                   ` Jeff Sturm
  1 sibling, 3 replies; 256+ messages in thread
From: David S. Miller @ 2002-08-13 22:40 UTC (permalink / raw)
  To: fjh; +Cc: rittle, gcc

   From: Fergus Henderson <fjh@cs.mu.OZ.AU>
   Date: Wed, 14 Aug 2002 14:46:37 +1000
   
   Yes, but GCC could use the Boehm GC on systems which supported it,
   if the Boehm GC was faster...
   
   I think this would be a very interesting experiment.

Feel free to even try it with an infinitely fast GC, even
one that executed in zero time.

Because for the millionth time, it's not the performance of GC itself.
It's the temporal and spatial locality problems of data accesses, which
are a fundamental result of using GC for memory allocation.

It is not an issue of "how fast" the GC is.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-13 22:40                   ` David S. Miller
@ 2002-08-13 23:44                     ` Fergus Henderson
  2002-08-14  7:58                     ` Jeff Sturm
  2002-08-14  9:52                     ` Richard Henderson
  2 siblings, 0 replies; 256+ messages in thread
From: Fergus Henderson @ 2002-08-13 23:44 UTC (permalink / raw)
  To: David S. Miller; +Cc: rittle, gcc

On 13-Aug-2002, David S. Miller <davem@redhat.com> wrote:
>    From: Fergus Henderson <fjh@cs.mu.OZ.AU>
>    Date: Wed, 14 Aug 2002 14:46:37 +1000
>    
>    Yes, but GCC could use the Boehm GC on systems which supported it,
>    if the Boehm GC was faster...
>    
>    I think this would be a very interesting experiment.
> 
> Feel free to even try it with an infinitely fast GC, even
> one that executed in zero time.
> 
> Because for the millionth time, it's not the performance of GC itself.
> It's the temporal and spatial locality problems of data accesses, which
> are a fundamental result of using GC for memory allocation.
> 
> It is not an issue of "how fast" the GC is.

Look, there are a number of possible memory management strategies and
implementations.  GC using GCC's current GC implementation is
one.  Conservative GC using the Boehm collector is another.  Reference
counting is another.  Reference counting has its own set of drawbacks
for locality, so it's not clear it would be a win; doing the experiment
would be a *lot* of work.  If someone really feels strongly about RC,
and has lots of time, by all means, go for it.

Using the Boehm collector is less likely to be a huge win, but it might
well be a significant win, and it would be much easier to carry out
that experiment.

-- 
Fergus Henderson <fjh@cs.mu.oz.au>  |  "I have always known that the pursuit
The University of Melbourne         |  of excellence is a lethal habit"
WWW: < http://www.cs.mu.oz.au/~fjh >  |     -- the last words of T. S. Garp.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-13 21:46                 ` Fergus Henderson
  2002-08-13 22:40                   ` David S. Miller
@ 2002-08-14  7:36                   ` Jeff Sturm
  1 sibling, 0 replies; 256+ messages in thread
From: Jeff Sturm @ 2002-08-14  7:36 UTC (permalink / raw)
  To: Fergus Henderson; +Cc: Loren James Rittle, davem, gcc

On Wed, 14 Aug 2002, Fergus Henderson wrote:
> On 13-Aug-2002, Loren James Rittle <rittle@latour.rsch.comm.mot.com> wrote:
> > Has anyone ever tested gcc with its own GC disabled
> > but boehm-gc enabled?  OK, this is a red herring question.  Even if
> > performance was greater, portability concerns are what caused the
> > decision to build a new custom scan-GC versus reusing boehm-gc...
>
> Yes, but GCC could use the Boehm GC on systems which supported it,
> if the Boehm GC was faster...
>
> I think this would be a very interesting experiment.

I tried it a year or so ago on the 3.0 sources.  Had a ggc-boehm.c
operating mostly conservatively.  Using ggc's marking infrastructure may
be possible, but seemed difficult to interface with boehm-gc.

One of the difficult problems is that boehm-gc doesn't want to follow
pointers through ordinary (malloc'ed) heap sections.  So I overrode
malloc/free to use the GC methods.

I made ggc_collect() a no-op, since boehm-gc knows when it needs to
collect, and overriding its heuristics doesn't really help matters anyway.
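
Roughly from memory, the shim amounted to little more than this
(treat it as a sketch; GC_MALLOC and GC_FREE are the standard
boehm-gc entry points, and the wrappers shadow the ones GCC normally
gets from libiberty):

#include <stddef.h>
#include <gc.h>

void *
xmalloc (size)
     size_t size;
{
  return GC_MALLOC (size);	/* every allocation is GC-visible */
}

void
xfree (ptr)
     void *ptr;
{
  GC_FREE (ptr);		/* explicit frees become GC frees */
}

void
ggc_collect ()
{
  /* No-op: boehm-gc collects when its own heuristics say so.  */
}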

Overall it seemed to shave a few minutes off the bootstrap time, but also
increased memory usage considerably.  I expected this.  Tuning frequency
of collection typically amounts to a size/speed tradeoff.  I don't think
conservativeness was an important factor in heap size.

It could've been interesting to try incremental/generational collection.
I didn't do that.

My impression based partly on that experiment is that
allocation & collection overhead in GCC is not all that substantial, and
the real gains are going to be elsewhere, i.e. improving temporal locality
as has been discussed lately.  That isn't a problem that any GC is going
to fix.  (I also don't think it's a necessary evil of GC, rather it's how
you use the allocator... e.g. creating too many short-lived objects is a
bad thing.)

Jeff

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-13 22:40                   ` David S. Miller
  2002-08-13 23:44                     ` Fergus Henderson
@ 2002-08-14  7:58                     ` Jeff Sturm
  2002-08-14  9:52                     ` Richard Henderson
  2 siblings, 0 replies; 256+ messages in thread
From: Jeff Sturm @ 2002-08-14  7:58 UTC (permalink / raw)
  To: David S. Miller; +Cc: fjh, rittle, gcc

On Tue, 13 Aug 2002, David S. Miller wrote:
>    I think this would be a very interesting experiment.
>
> Feel free to even try it with an infinitely fast GC, even
> one that executed in zero time.
>
> Because for the millionth time, it's not the performance of GC itself.
> It's the temporal and spatial locality problems of data accesses which
> is a fundamental result of using GC for memory allocation.

Relax.  Earlier in this thread I seem to remember you were advocating
certain experiments in spite of the skeptics.  So give the GC experts a
chance.

As I understand it, generational collection ought to improve locality,
since the youngest generation can be collected frequently, and may even be
small enough to fit mostly in cache.

(I've never observed it to work in practice, but don't let that
discourage anyone :-)
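
As a toy illustration of that argument (not a real collector; the size
and the reset-instead-of-trace are purely illustrative):

#include <stddef.h>

#define NURSERY_SIZE (64 * 1024)  /* small enough to stay cache-resident */

static char nursery[NURSERY_SIZE];
static size_t nursery_top;

static void *
nursery_alloc (size_t size)
{
  void *p;

  if (nursery_top + size > NURSERY_SIZE)
    /* "Minor collection": a real generational GC would evacuate live
       objects to an older generation here; the toy just reuses the
       buffer.  */
    nursery_top = 0;

  p = nursery + nursery_top;
  nursery_top += size;
  return p;
}

Short-lived allocations keep landing in the same small buffer, so the
hot allocation region stays in cache.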

Jeff

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-13 14:03                   ` David Edelsohn
  2002-08-13 14:46                     ` Geoff Keating
@ 2002-08-14  9:25                     ` Kevin Handy
  2002-08-18 12:58                     ` Jeff Sturm
  2 siblings, 0 replies; 256+ messages in thread
From: Kevin Handy @ 2002-08-14  9:25 UTC (permalink / raw)
  To: gcc

David Edelsohn wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-13 22:40                   ` David S. Miller
  2002-08-13 23:44                     ` Fergus Henderson
  2002-08-14  7:58                     ` Jeff Sturm
@ 2002-08-14  9:52                     ` Richard Henderson
  2002-08-14 10:00                       ` David Edelsohn
  2002-08-14 10:15                       ` Faster compilation speed David Edelsohn
  2 siblings, 2 replies; 256+ messages in thread
From: Richard Henderson @ 2002-08-14  9:52 UTC (permalink / raw)
  To: David S. Miller; +Cc: fjh, rittle, gcc

On Tue, Aug 13, 2002 at 10:26:41PM -0700, David S. Miller wrote:
> Because for the millionth time, it's not the performance of GC itself.
> It's the temporal and spatial locality problems of data accesses which
> is a fundamental result of using GC for memory allocation.

You haven't shown (or even provided guesstimates of) how much temporal
or spatial locality could be had by moving away from GC.  Exactly
how much garbage is created during compilation of a function, Dave?

Suppose we did do manual memory allocation and never created any
garbage whatsoever.  Suppose perfect temporal locality.  How much
spatial locality do we have, considering the pointer-chasing structure
of our IL?  My guess is not much.

The folks that are doing cache-miss studies and concluding anything
should also go back and measure gcc 2.95, before we used GC at all.
That's perhaps not ideal, since it's obstacks instead of reference
counting, but it's not a worthless data point.

The conclusion that RC will solve all our problems is not foregone.
I think we're better served trying to adjust the form of the IL so
that we do less pointer chasing, as Geoff suggested elsewhere in 
this thread.


r~

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-14  9:52                     ` Richard Henderson
@ 2002-08-14 10:00                       ` David Edelsohn
  2002-08-14 12:01                         ` Andreas Schwab
  2002-08-14 10:15                       ` Faster compilation speed David Edelsohn
  1 sibling, 1 reply; 256+ messages in thread
From: David Edelsohn @ 2002-08-14 10:00 UTC (permalink / raw)
  To: Richard Henderson, David S. Miller; +Cc: gcc

>>>>> Richard Henderson writes:

Richard> You haven't shown (or even provided guesstimates of) how much temporal
Richard> or spatial locality could be had by moving away from GC.  Exactly
Richard> how much garbage is created during compilation of a function, Dave?

Richard> Suppose we did do manual memory allocation and never created any
Richard> garbage whatsoever.  Suppose perfect temporal locality.  How much
Richard> spatial locality do we have, considering the pointer-chasing structure
Richard> of our IL?  My guess is not much.

	One place where GCC could benefit from spatial locality is in
allocating the instruction list and pseudo registers from a large, static
virtual memory array instead of allocating individual objects dynamically.
I am *not* suggesting removing the linked list pointers or the pointers to
the actual RTL.  GCC often scans or walks through the instructions
linearly.  Pseudo registers are allocated consecutively.  Allocating those
linearly-accessed objects in contiguous memory would improve cache
locality.
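
As a minimal sketch of the shape (hypothetical types and a fixed pool
size, just for illustration):

#include <stddef.h>

struct insn { struct insn *next; int code; /* ... */ };

#define INSN_POOL_MAX 65536

/* One big, statically allocated array: consecutive allocations are
   adjacent in memory, so a linear walk over the insn chain touches
   consecutive cache lines.  */
static struct insn insn_pool[INSN_POOL_MAX];
static size_t insn_pool_used;

static struct insn *
alloc_insn (void)
{
  /* A real implementation would grow or chain pools instead of
     failing when the array fills up.  */
  if (insn_pool_used >= INSN_POOL_MAX)
    return NULL;
  return &insn_pool[insn_pool_used++];
}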

David

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-14  9:52                     ` Richard Henderson
  2002-08-14 10:00                       ` David Edelsohn
@ 2002-08-14 10:15                       ` David Edelsohn
  2002-08-14 16:35                         ` Richard Henderson
  2002-08-20  4:15                         ` Richard Earnshaw
  1 sibling, 2 replies; 256+ messages in thread
From: David Edelsohn @ 2002-08-14 10:15 UTC (permalink / raw)
  To: Richard Henderson, David S. Miller; +Cc: gcc

>>>>> Richard Henderson writes:

Richard> The folks that are doing cache-miss studies and concluding anything
Richard> should also go back and measure gcc 2.95, before we used GC at all.
Richard> That's perhaps not ideal, since it's obstacks instead of reference
Richard> counting, but it's not a worthless data point.

	Thanks for the suggestion.  I think the results I got are pretty
damning: 

gcc-2.95.3 20010315 (release)

Source		insns per D$ miss, -O2	insns per D$ miss, -O0
------		---------------------	---------------------
reload.c		28			36
insn-recog.c		48			36


	For comparison, GCC 3.3 has values in the low 20's, especially at
no optimization.

David

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-14 10:00                       ` David Edelsohn
@ 2002-08-14 12:01                         ` Andreas Schwab
  2002-08-14 12:07                           ` David Edelsohn
  0 siblings, 1 reply; 256+ messages in thread
From: Andreas Schwab @ 2002-08-14 12:01 UTC (permalink / raw)
  To: David Edelsohn; +Cc: Richard Henderson, David S. Miller, gcc

David Edelsohn <dje@watson.ibm.com> writes:

|> >>>>> Richard Henderson writes:
|> 
|> Richard> You haven't shown (or even provided guesstimates of) how much temporal
|> Richard> or spatial locality could be had by moving away from GC.  Exactly
|> Richard> how much garbage is created during compilation of a function, Dave?
|> 
|> Richard> Suppose we did do manual memory allocation and never created any
|> Richard> garbage whatsoever.  Suppose perfect temporal locality.  How much
|> Richard> spatial locality do we have, considering the pointer-chasing structure
|> Richard> of our IL?  My guess is not much.
|> 
|> 	One place where GCC could benefit from spatial locality is in
|> allocating the instruction list and pseudo registers from a large, static
|> virtual memory array instead of allocating individual objects dynamically.

Obstacks?

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux AG, Deutschherrnstr. 15-19, D-90429 Nürnberg
Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-14 12:01                         ` Andreas Schwab
@ 2002-08-14 12:07                           ` David Edelsohn
  2002-08-14 13:20                             ` Jamie Lokier
  2002-08-14 13:20                             ` Michael Matz
  0 siblings, 2 replies; 256+ messages in thread
From: David Edelsohn @ 2002-08-14 12:07 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Richard Henderson, David S. Miller, gcc

>>>>> Andreas Schwab writes:

|> 	One place where GCC could benefit from spatial locality is in
|> allocating the instruction list and pseudo registers from a large, static
|> virtual memory array instead of allocating individual objects dynamically.

Andreas> Obstacks?

	I thought that obstacks are created dynamically, not statically.
One does not want to ever copy or grow the array.

	Statically allocating some of the large, persistent, sequential
collections of objects would help locality.

David

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-14 12:07                           ` David Edelsohn
@ 2002-08-14 13:20                             ` Jamie Lokier
  2002-08-14 16:01                               ` Nix
  2002-08-14 13:20                             ` Michael Matz
  1 sibling, 1 reply; 256+ messages in thread
From: Jamie Lokier @ 2002-08-14 13:20 UTC (permalink / raw)
  To: David Edelsohn; +Cc: Andreas Schwab, Richard Henderson, David S. Miller, gcc

David Edelsohn wrote:
> 	I thought that obstacks are created dynamically, not statically.
> One does not want to ever copy or grow the array.

Obstacks use chunks of memory to hold many contiguous objects, so they
offer fairly good spatial locality.  But then, so do many decent GC
allocators (not ones using free lists, though).

> 	Statically allocating some of the large, persistent, sequential
> collections of objects would help locality.

Linus and David are suggesting that temporal locality of short-lived
objects is important -- i.e. reuse of memory from freed objects.
Who knows.

-- Jamie

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-14 12:07                           ` David Edelsohn
  2002-08-14 13:20                             ` Jamie Lokier
@ 2002-08-14 13:20                             ` Michael Matz
  2002-08-14 16:31                               ` Faster compilation speed [zone allocation] Per Bothner
  1 sibling, 1 reply; 256+ messages in thread
From: Michael Matz @ 2002-08-14 13:20 UTC (permalink / raw)
  To: David Edelsohn; +Cc: gcc

Hi,

On Wed, 14 Aug 2002, David Edelsohn wrote:

> |> 	One place where GCC could benefit from spatial locality is in
> |> allocating the instruction list and pseudo registers from a large, static
> |> virtual memory array instead of allocating individual objects dynamically.
>
> Andreas> Obstacks?
>
> 	I thought that obstacks are created dynamically, not statically.

Sort of.  Obstacks have the ability to grow an object which isn't yet
finalized, and in that process there might be some copying (the canonical
example is a string, which is created character by character).  After
finalization it doesn't change its address anymore, but it is still part
of that obstack.

One would not use that functionality, but simply use obstacks as
convenient containers for small objects, which are allocated already
finalized.  An obstack allocates memory in blocks, and then gives out part
of the current block as long as enough is free in it and the request is
not larger than a certain size (in which case it gets its own block).
This makes for extremely fast allocation (just a pointer increment in the
general case).  One can't deallocate individual objects in an obstack (or
rather, only all objects allocated after a certain one).  And it creates
good spatial locality, and needs less memory than a general allocator
like malloc (when many small objects are allocated).

But the fact that one can't free individual objects is a quite severe
limitation (I wrote an allocator for KDE in which you can free objects,
but it has certain restrictions).  Still, it's usable.  E.g. I use an
obstack in the new register allocator to allocate most of my small
objects (nodes and edges of the graph), and then simply free the whole
thing once at the end of that phase.  But that's not possible with, e.g.,
the current RTL of the function; there you really don't want to use an
obstack.
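
For reference, the allocate-everything-then-free-once idiom looks
roughly like this (a toy node type; obstack_chunk_alloc/free must be
supplied by the user):

#include <obstack.h>
#include <stdlib.h>

/* obstack gets its underlying memory through these two macros.  */
#define obstack_chunk_alloc malloc
#define obstack_chunk_free  free

struct node { int id; struct node *next; };

void
build_and_discard_graph (void)
{
  struct obstack ob;
  struct node *list = NULL;
  int i;

  obstack_init (&ob);

  /* Each allocation is normally just a pointer bump within the
     current chunk, so the nodes end up contiguous in memory.  */
  for (i = 0; i < 1000; i++)
    {
      struct node *n = obstack_alloc (&ob, sizeof *n);
      n->id = i;
      n->next = list;
      list = n;
    }

  /* ... use the graph for the duration of the phase ... */

  /* No per-object free: releasing the whole obstack frees every
     object and all chunks at once.  */
  obstack_free (&ob, NULL);
}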

> One does not want to ever copy or grow the array.

As explained, this doesn't happen if one uses the obstack without growing
objects.

> Statically allocating some of the large, persistent, sequential
> collections of objects would help locality.

This would lead to the idea of one obstack (without object growing) per
data structure type, IOW to a zone allocator, which is not a bad thing.


Ciao,
Michael.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-14 13:20                             ` Jamie Lokier
@ 2002-08-14 16:01                               ` Nix
  0 siblings, 0 replies; 256+ messages in thread
From: Nix @ 2002-08-14 16:01 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: David Edelsohn, Andreas Schwab, Richard Henderson, David S. Miller, gcc

On Wed, 14 Aug 2002, Jamie Lokier muttered drunkenly:
> David Edelsohn wrote:
>> 	I thought that obstacks are created dynamically, not statically.
>> One does not want to ever copy or grow the array.
> 
> Obstacks use chunks of memory to hold many contiguous objects, so they
> offer fairly good spatial locality.  But then, so do many decent GC
> allocators (not ones using free lists, though).

Also, surely one does not *often* want to grow or copy the array: the
occasional copy isn't a problem (but you initialize it quite large
so the resizing isn't required often).

-- 
`Mips are real and bitrate earnest, shifting spam is not our goal;
 silicon to sand returnest, was not spoken of the soul.'
   --- _Eventful History: Version 1.x_, John M. Ford

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed [zone allocation]
  2002-08-14 13:20                             ` Michael Matz
@ 2002-08-14 16:31                               ` Per Bothner
  2002-08-15 11:34                                 ` Aldy Hernandez
  0 siblings, 1 reply; 256+ messages in thread
From: Per Bothner @ 2002-08-14 16:31 UTC (permalink / raw)
  To: Michael Matz; +Cc: gcc

Michael Matz wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-14 10:15                       ` Faster compilation speed David Edelsohn
@ 2002-08-14 16:35                         ` Richard Henderson
  2002-08-14 17:02                           ` David Edelsohn
  2002-08-20  4:15                         ` Richard Earnshaw
  1 sibling, 1 reply; 256+ messages in thread
From: Richard Henderson @ 2002-08-14 16:35 UTC (permalink / raw)
  To: David Edelsohn; +Cc: David S. Miller, gcc

On Wed, Aug 14, 2002 at 01:14:53PM -0400, David Edelsohn wrote:
> Thanks for the suggestion.  I think the results I got are pretty damning...

Try the following.  Appears to cut 30 seconds (3.5%) off of an -O2 -g
build of reload.c, and a small fraction of a second (3.1%) at -O0 -g.
This on an 800MHz Pentium III (Coppermine).

If I have rest_of_compilation dump out insn addresses before
optimization (the only time we could even hope for relatively
sequential nodes), INSN nodes are indeed largely coherent
(even without this patch).  But NOTE nodes are smaller, and
get put in a different size bucket, and so are allocated from
different pages.  Padding out the size of NOTEs and BARRIERs
make them allocated from the same pages, and the resulting
initial addresses are about as sequential as one could hope.

The remaining main source of non-sequentiality in the initial rtl is

	label = gen_label_rtx ();
	/* emit code */
	emit_label (label);

and there's really no helping that.

The other change is to add allocation buckets for two 
important rtx sizes.  On 32-bit systems, two-operand rtxs
(including REG, MEM, PLUS, etc) are 12 bytes, but we were
allocating 16 bytes.  Similarly an INSN (9 operand) and
CALL_INSN (10 operand) are 40 and 44 bytes respectively
but we were allocating 64.  I choose to put the bucket
at 10 operand so that CALL_INSNs and JUMP_INSNs can fit.

I haven't measured the overall real-life memory savings,
but this is 25% for REGs and 30% for INSNs.
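
For the record, here is the size arithmetic on a 32-bit host (assuming
the fixed part of struct rtx_def is 8 bytes and already contains the
first 4-byte rtunion slot, which is what the RTL_SIZE macro below
encodes):

/* RTL_SIZE (n) == 8 + (n - 1) * 4 on such a host, so:
   RTL_SIZE (2)  == 12   REG, MEM, PLUS...: previously rounded up to 16
   RTL_SIZE (9)  == 40   INSN:      previously in the 64-byte bucket
   RTL_SIZE (10) == 44   CALL_INSN: fits the new 10-slot bucket
   which is where the 25% (REG) and ~30% (INSN) figures come from.  */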


r~


	* ggc-page.c (RTL_SIZE): New.
	(extra_order_size_table): Add specializations for 2 and 10 rtl slots.
	* rtl.def (BARRIER, NOTE): Pad to 9 slots.

Index: ggc-page.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/ggc-page.c,v
retrieving revision 1.51
diff -c -p -d -r1.51 ggc-page.c
*** ggc-page.c	4 Jun 2002 11:30:36 -0000	1.51
--- ggc-page.c	14 Aug 2002 22:38:57 -0000
*************** Software Foundation, 59 Temple Place - S
*** 163,175 ****
  
  #define NUM_EXTRA_ORDERS ARRAY_SIZE (extra_order_size_table)
  
  /* The Ith entry is the maximum size of an object to be stored in the
     Ith extra order.  Adding a new entry to this array is the *only*
     thing you need to do to add a new special allocation size.  */
  
  static const size_t extra_order_size_table[] = {
    sizeof (struct tree_decl),
!   sizeof (struct tree_list)
  };
  
  /* The total number of orders.  */
--- 163,180 ----
  
  #define NUM_EXTRA_ORDERS ARRAY_SIZE (extra_order_size_table)
  
+ #define RTL_SIZE(NSLOTS) \
+   (sizeof (struct rtx_def) + ((NSLOTS) - 1) * sizeof (rtunion))
+ 
  /* The Ith entry is the maximum size of an object to be stored in the
     Ith extra order.  Adding a new entry to this array is the *only*
     thing you need to do to add a new special allocation size.  */
  
  static const size_t extra_order_size_table[] = {
    sizeof (struct tree_decl),
!   sizeof (struct tree_list),
!   RTL_SIZE (2),			/* REG, MEM, PLUS, etc.  */
!   RTL_SIZE (10),		/* INSN, CALL_INSN, JUMP_INSN */
  };
  
  /* The total number of orders.  */
Index: rtl.def
===================================================================
RCS file: /cvs/gcc/gcc/gcc/rtl.def,v
retrieving revision 1.58
diff -c -p -d -r1.58 rtl.def
*** rtl.def	19 Jul 2002 23:11:18 -0000	1.58
--- rtl.def	14 Aug 2002 22:38:57 -0000
*************** DEF_RTL_EXPR(JUMP_INSN, "jump_insn", "iu
*** 566,587 ****
  DEF_RTL_EXPR(CALL_INSN, "call_insn", "iuuBteieee", 'i')
  
  /* A marker that indicates that control will not flow through.  */
! DEF_RTL_EXPR(BARRIER, "barrier", "iuu", 'x')
  
  /* Holds a label that is followed by instructions.
     Operand:
!    4: is used in jump.c for the use-count of the label.
!    5: is used in flow.c to point to the chain of label_ref's to this label.
!    6: is a number that is unique in the entire compilation.
!    7: is the user-given name of the label, if any.  */
  DEF_RTL_EXPR(CODE_LABEL, "code_label", "iuuB00is", 'x')
  
  /* Say where in the code a source line starts, for symbol table's sake.
     Operand:
!    4: filename, if line number > 0, note-specific data otherwise.
!    5: line number if > 0, enum note_insn otherwise.
!    6: unique number if line number == note_insn_deleted_label.  */
! DEF_RTL_EXPR(NOTE, "note", "iuuB0ni", 'x')
  
  /* ----------------------------------------------------------------------
     Top level constituents of INSN, JUMP_INSN and CALL_INSN.
--- 566,589 ----
  DEF_RTL_EXPR(CALL_INSN, "call_insn", "iuuBteieee", 'i')
  
  /* A marker that indicates that control will not flow through.  */
! DEF_RTL_EXPR(BARRIER, "barrier", "iuu000000", 'x')
  
  /* Holds a label that is followed by instructions.
     Operand:
!    5: is used in jump.c for the use-count of the label.
!    6: is used in flow.c to point to the chain of label_ref's to this label.
!    7: is a number that is unique in the entire compilation.
!    8: is the user-given name of the label, if any.  */
  DEF_RTL_EXPR(CODE_LABEL, "code_label", "iuuB00is", 'x')
  
  /* Say where in the code a source line starts, for symbol table's sake.
     Operand:
!    5: filename, if line number > 0, note-specific data otherwise.
!    6: line number if > 0, enum note_insn otherwise.
!    7: unique number if line number == note_insn_deleted_label.
!    8-9: padding so that notes and insns are the same size, and thus
!          allocated from the same page ordering.  */
! DEF_RTL_EXPR(NOTE, "note", "iuuB0ni00", 'x')
  
  /* ----------------------------------------------------------------------
     Top level constituents of INSN, JUMP_INSN and CALL_INSN.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-14 16:35                         ` Richard Henderson
@ 2002-08-14 17:02                           ` David Edelsohn
  0 siblings, 0 replies; 256+ messages in thread
From: David Edelsohn @ 2002-08-14 17:02 UTC (permalink / raw)
  To: Richard Henderson, David S. Miller; +Cc: gcc

	The patch does improve the cache behavior:

Source		insns per D$ miss, -O2	insns per D$ miss, -O0
------		---------------------	---------------------
reload.c	22 -> 23.4		22 -> 23.9
insn-recog.c	29 -> 30.3		23 -> 24.6

David

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed [zone allocation]
  2002-08-14 16:31                               ` Faster compilation speed [zone allocation] Per Bothner
@ 2002-08-15 11:34                                 ` Aldy Hernandez
  2002-08-15 11:39                                   ` David Edelsohn
                                                     ` (3 more replies)
  0 siblings, 4 replies; 256+ messages in thread
From: Aldy Hernandez @ 2002-08-15 11:34 UTC (permalink / raw)
  To: Per Bothner; +Cc: Michael Matz, gcc

>>>>> "Per" == Per Bothner <per@bothner.com> writes:

This is just an idea, why doesn't someone hack the GC to never
collect, and then we can really find out how much is to be gained by a
refcounter, or no GC at all, etc.

Why go down this path, if we're not even sure it'll improve anything
(well, that much anyhow).

Aldy

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed [zone allocation]
  2002-08-15 11:34                                 ` Aldy Hernandez
@ 2002-08-15 11:39                                   ` David Edelsohn
  2002-08-15 12:01                                     ` Lynn Winebarger
  2002-08-15 11:41                                   ` Michael Matz
                                                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 256+ messages in thread
From: David Edelsohn @ 2002-08-15 11:39 UTC (permalink / raw)
  To: Aldy Hernandez; +Cc: Per Bothner, Michael Matz, gcc

>>>>> Aldy Hernandez writes:

Aldy> This is just an idea, why doesn't someone hack the GC to never
Aldy> collect, and then we can really find out how much is to be gained by a
Aldy> refcounter, or no GC at all, etc.

Aldy> Why go down this path, if we're not even sure it'll improve anything
Aldy> (well, that much anyhow).

	Because the problem is not the garbage collection, it's the
allocation pattern.  The proposal to use reference counting allows GCC to
switch to an allocator with better locality -- it's a requirement for the
underlying improvement, not a fix unto itself.

David

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed [zone allocation]
  2002-08-15 11:34                                 ` Aldy Hernandez
  2002-08-15 11:39                                   ` David Edelsohn
@ 2002-08-15 11:41                                   ` Michael Matz
  2002-08-16  8:44                                     ` Kai Henningsen
  2002-08-15 11:43                                   ` Per Bothner
  2002-08-15 11:57                                   ` Kevin Handy
  3 siblings, 1 reply; 256+ messages in thread
From: Michael Matz @ 2002-08-15 11:41 UTC (permalink / raw)
  To: Aldy Hernandez; +Cc: Per Bothner, gcc

Hi,

On 15 Aug 2002, Aldy Hernandez wrote:

> This is just an idea, why doesn't someone hack the GC to never
> collect, and then we can really find out how much is to be gained by a
> refcounter, or no GC at all, etc.

Switching off GC doesn't necessarily buy anything, except that GC isn't
done.  But the allocated memory still has the same locality as before
(i.e. if that's the reason for bad performance now, it will still be the
case with GC switched off).  I.e. it wouldn't prove anything.


Ciao,
Michael.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed [zone allocation]
  2002-08-15 11:34                                 ` Aldy Hernandez
  2002-08-15 11:39                                   ` David Edelsohn
  2002-08-15 11:41                                   ` Michael Matz
@ 2002-08-15 11:43                                   ` Per Bothner
  2002-08-15 11:57                                   ` Kevin Handy
  3 siblings, 0 replies; 256+ messages in thread
From: Per Bothner @ 2002-08-15 11:43 UTC (permalink / raw)
  To: Aldy Hernandez; +Cc: Michael Matz, gcc

Aldy Hernandez wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed [zone allocation]
  2002-08-15 11:34                                 ` Aldy Hernandez
                                                     ` (2 preceding siblings ...)
  2002-08-15 11:43                                   ` Per Bothner
@ 2002-08-15 11:57                                   ` Kevin Handy
  3 siblings, 0 replies; 256+ messages in thread
From: Kevin Handy @ 2002-08-15 11:57 UTC (permalink / raw)
  To: gcc

Aldy Hernandez wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed [zone allocation]
  2002-08-15 11:39                                   ` David Edelsohn
@ 2002-08-15 12:01                                     ` Lynn Winebarger
  2002-08-15 12:11                                       ` David Edelsohn
  0 siblings, 1 reply; 256+ messages in thread
From: Lynn Winebarger @ 2002-08-15 12:01 UTC (permalink / raw)
  To: David Edelsohn, Aldy Hernandez; +Cc: Per Bothner, Michael Matz, gcc

On Thursday 15 August 2002 13:39, David Edelsohn wrote:
> >>>>> Aldy Hernandez writes:
> 
> 	Because the problem is not the garbage collection, it's the
> allocation pattern.  The proposal to use reference counting allows GCC to
> switch to an allocator with better locality -- it's a requirement for the
> underlying improvement, not a fix unto itself.
> 
   That GCC's GC promotes poor locality of reference is not proof that
reference counting is the only way to improve that locality.  It doesn't
matter what allocation/reclamation scheme you switch to; if it's not used
in a way consistent with the cases it optimizes for, it's going to stink.
There's just as much reason to believe there's a generational GC that
will do what you need as to believe reference counting will be some kind
of magic bullet (without the brittleness).
   
Lynn

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed [zone allocation]
  2002-08-15 12:01                                     ` Lynn Winebarger
@ 2002-08-15 12:11                                       ` David Edelsohn
  0 siblings, 0 replies; 256+ messages in thread
From: David Edelsohn @ 2002-08-15 12:11 UTC (permalink / raw)
  To: Lynn Winebarger; +Cc: Aldy Hernandez, Per Bothner, Michael Matz, gcc

>>>>> Lynn Winebarger writes:

Lynn> That GCC's GC promotes poor locality of reference is not proof that
Lynn> reference counting is the only way to improve that locality.  It doesn't
Lynn> matter what allocation/reclamation scheme you switch to; if it's not used
Lynn> in a way consistent with the cases it optimizes for, it's going to stink.
Lynn> There's just as much reason to believe there's a generational GC that
Lynn> will do what you need as to believe reference counting will be some kind
Lynn> of magic bullet (without the brittleness).
   
	Let me correct my sloppy wording.  What I meant by "it's a
requirement for the underlying improvement" is that it is a dependency for
that particular proposal -- RC is a means to an end, not an end unto
itself.  There are many ways to address the locality problem.

	I am trying to encourage people participating in this discussion
to stop fixating on the garbage collector itself.  Somehow when GC is
mentioned, people obsess over the garbage collection process without reading
the entire discussion.  If there is interest in discussing garbage
collectors, there are other mailing lists on that specific topic where the
pros and cons of various styles with and without hardware assistance are
debated.

David

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed [zone allocation]
  2002-08-15 11:41                                   ` Michael Matz
@ 2002-08-16  8:44                                     ` Kai Henningsen
  0 siblings, 0 replies; 256+ messages in thread
From: Kai Henningsen @ 2002-08-16  8:44 UTC (permalink / raw)
  To: gcc

matz@suse.de (Michael Matz)  wrote on 15.08.02 in < Pine.LNX.4.33.0208152037200.13269-100000@wotan.suse.de >:

> On 15 Aug 2002, Aldy Hernandez wrote:
>
> > This is just an idea, why doesn't someone hack the GC to never
> > collect, and then we can really find out how much is to be gained by a
> > refcounter, or no GC at all, etc.
>
> Switching off GC doesn't necessarily buy anything, except that GC isn't
> done.  But the allocated memory still has the same locality as before
> (i.e. if that's the reason for bad performance now, it will still be the
> case with GC switched off).  I.e. it wouldn't prove anything.

Well, it might prove that the bad locality isn't *caused* by running the  
collector. (Or that it is, of course.)

MfG Kai

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Problem with PFE approach [Was: Faster compilation speed]
  2002-08-09 14:59 ` Timothy J. Wood
@ 2002-08-16 13:31   ` Timothy J. Wood
  2002-08-16 13:44     ` Devang Patel
  2002-08-16 13:54     ` Devang Patel
  0 siblings, 2 replies; 256+ messages in thread
From: Timothy J. Wood @ 2002-08-16 13:31 UTC (permalink / raw)
  To: Mike Stump; +Cc: gcc

  So, another point in favor of discarding the concept of 'static 
precompilation' based on a problem I just ran into with PFE under 
10.2...

  I'm emulating some of the Win32 API for porting games to Mac OS X.  
Win32 has a macro like this:

#ifndef INITGUID
#define DEFINE_GUID(name, l, w1, w2, b1, b2, b3, b4, b5, b6, b7, b8) \
    EXTERN_C const GUID FAR name
#else

#define DEFINE_GUID(name, l, w1, w2, b1, b2, b3, b4, b5, b6, b7, b8) \
        EXTERN_C const GUID name \
                = { l, w1, w2, { b1, b2,  b3,  b4,  b5,  b6,  b7,  b8 } }
#endif // INITGUID

  If this gets stuck in a PFE and the PFE is applied as a prefix header 
(the only way it can be done right now), then the file being compiled 
cannot make its own decision about whether INITGUID should be defined 
or not.

  Clearly there are ways around this, but the current approach makes 
the compiler produce different output based on whether PFE is on or 
not.  I consider this a bug.

  This would not be a problem with an automatic precompiler that 
remembered facts and didn't use the prefix header hack.

  Are there problems with what I describe below or are people just 
avoiding commenting on this since it is too hard to implement? :)

-tim




On Friday, August 9, 2002, at 02:58  PM, Timothy J. Wood wrote:
2) This one is rather crazy and would involve huge amounts of work 
probably....

  a) Toss some or all of your PFE code in the bin (yikes!)
  b) Build a precompile server that the compiler can attach to and 
request precompiled headers (give a path and set of -D flags or 
whatever other state is needed to uniquely identify the precompile 
output; see the sketch after this list).  Requests would be satisfied 
via shared memory (yes, non-portable, so this whole mechanism will 
only work on modern machines).
  c) Inside the server, keep parsed representations of all headers 
that have been imported and the -D state used when parsing the 
headers.  As new headers are parsed, they should be able to **layer** 
on top of existing parsed headers (so there should only be one parsed 
version of std::string).  This avoids the confining requirement that 
you have one big master precompiled header.
 d) Details about concurrency, security, locating the server, and so 
on left as an exercise for the reader.
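
  To make (b) and (c) concrete, the server's lookup key and entries 
might look something like this (all names hypothetical):

/* A parsed header is identified by its path plus the macro state it
   was parsed under; entries can layer on previously parsed ones.  */
struct pch_key
{
  const char *header_path;       /* e.g. "/usr/include/string.h" */
  const char *macro_state;       /* canonicalized -D/-U flags, etc. */
};

struct pch_entry
{
  struct pch_key key;
  const void *parsed;            /* shared-memory handle to parsed trees */
  struct pch_entry *layered_on;  /* base entry this one extends */
};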

  The main advantage here is that people would get fast compiles 
WITHOUT having to tune their single PFE header.  Additionally, more 
headers would get precompiled than would otherwise, yielding faster 
builds.  If they layering is done correctly, the memory usage of the 
entire system could be lower (since if you have two projects to build, 
both of which import STL, there would be only one precompiled version 
of STL).

  At the start of a build, a special 'check filesystem' command could 
be sent to the server to have it do a one-time check of timestamps of 
headers files.  Assuming the timestamps haven't changed, the 
precompiled headers could be kept across builds!

  Naturally doing a 'clean' build from the IDE option would need to be 
able to flush and probably shut down the server since it is inevitable 
that there will be bugs that will corrupt the precomp database :(


  #2 could really take many forms.  The key idea is that having a 
single PFE file is non-optimal.  Developers should not have to spend 
time tuning such a file to get the best compile time.  The compiler 
and IDE should handle all these details by default.  Having the 
developer involved here just leads to extra (ongoing!) work for the 
developer and a sub-optimal set of precompiled headers.

  Your goal should be to have the developer open their project and 
have it build 6x faster (instead of requiring the developer to do a 
several hours of tweaking on their PFE file to get the best 
performance -- and then having to keep it up to date over the life of 
their project).

3) This is possibly even harder...  Keep track of what facts in a 
header each source file cared about (macro values defined or 
undefined, structure layout, function signature, etc, etc, etc).  If a 
header changes, have the precompile server keep track of the facts 
that have changed and then only rebuild source files that care about 
those changes (assuming the source file itself hasn't changed).  This 
could get really ugly since you'd potentially keep track of multiple 
fact timestamps (consider if a build fails or is aborted so some files 
got updated for the current state of a header and some didn't).

  Extra bonus points for doing this on a lower granularity basis 
(i.e., don't recompile a function if it wouldn't produce different 
output).  This would clearly be very hard and a large departure from 
the current state of affairs :)
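
  As a toy model of that fact tracking (names and facts hypothetical):

#include <stdio.h>
#include <string.h>
#include <stddef.h>

struct fact { const char *name; const char *value; };

/* Facts foo.c consumed during its last compile.  */
static const struct fact foo_used[] = {
  { "VERSION", "2" },
};

/* Facts after the header edit.  */
static const struct fact current[] = {
  { "VERSION", "3" },            /* changed, and foo.c used it */
  { "UNRELATED_MACRO", "1" },    /* new, but foo.c never used it */
};

static int
needs_rebuild (const struct fact *used, size_t n)
{
  size_t i, j;

  for (i = 0; i < n; i++)
    for (j = 0; j < sizeof current / sizeof current[0]; j++)
      if (strcmp (used[i].name, current[j].name) == 0
          && strcmp (used[i].value, current[j].value) != 0)
        return 1;   /* a fact this file relied on changed */
  return 0;
}

int
main (void)
{
  printf ("rebuild foo.c? %s\n",
          needs_rebuild (foo_used, 1) ? "yes" : "no");
  return 0;
}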

  Anyway, I think the biggest improvements lie in moving away from the 
current batch compile philosophy mandated by the command line tools.  
Instead, the command line tools should be a front end onto a much more 
powerful persistent compile server.


  (Hey, you asked for ideas and said it was OK if they were hard :)

-tim

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Problem with PFE approach [Was: Faster compilation speed]
  2002-08-16 13:31   ` Problem with PFE approach [Was: Faster compilation speed] Timothy J. Wood
@ 2002-08-16 13:44     ` Devang Patel
  2002-08-16 14:31       ` Timothy J. Wood
  2002-08-16 13:54     ` Devang Patel
  1 sibling, 1 reply; 256+ messages in thread
From: Devang Patel @ 2002-08-16 13:44 UTC (permalink / raw)
  To: Timothy J. Wood; +Cc: Mike Stump, gcc

On Friday, August 16, 2002, at 01:31 PM, Timothy J. Wood wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Problem with PFE approach [Was: Faster compilation speed]
  2002-08-16 13:31   ` Problem with PFE approach [Was: Faster compilation speed] Timothy J. Wood
  2002-08-16 13:44     ` Devang Patel
@ 2002-08-16 13:54     ` Devang Patel
  2002-08-16 14:42       ` Neil Booth
  2002-08-16 14:45       ` Timothy J. Wood
  1 sibling, 2 replies; 256+ messages in thread
From: Devang Patel @ 2002-08-16 13:54 UTC (permalink / raw)
  To: Timothy J. Wood; +Cc: Mike Stump, gcc

On Friday, August 16, 2002, at 01:31 PM, Timothy J. Wood wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Problem with PFE approach [Was: Faster compilation speed]
  2002-08-16 13:44     ` Devang Patel
@ 2002-08-16 14:31       ` Timothy J. Wood
  2002-08-16 14:39         ` Neil Booth
  2002-08-16 14:46         ` Devang Patel
  0 siblings, 2 replies; 256+ messages in thread
From: Timothy J. Wood @ 2002-08-16 14:31 UTC (permalink / raw)
  To: Devang Patel; +Cc: Mike Stump, gcc

On Friday, August 16, 2002, at 01:43  PM, Devang Patel wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Problem with PFE approach [Was: Faster compilation speed]
  2002-08-16 14:31       ` Timothy J. Wood
@ 2002-08-16 14:39         ` Neil Booth
  2002-08-16 14:46         ` Devang Patel
  1 sibling, 0 replies; 256+ messages in thread
From: Neil Booth @ 2002-08-16 14:39 UTC (permalink / raw)
  To: Timothy J. Wood; +Cc: Devang Patel, Mike Stump, gcc

Timothy J. Wood wrote:-

>   The fact that you have to build this massive single header that acts 
> as a prefix header is the broken part -- implementation details like 
> this should not be exposed to the user.  Just like Apple doesn't make 
> users manually configure their Apache server for personal web sharing, 
> Apple shouldn't make their developers do a bunch of work to get decent 
> compile speeds.  It should "Just Work (TM)".

I agree.  Borland, MS and KAI managed this, so we should too.

Neil.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Problem with PFE approach [Was: Faster compilation speed]
  2002-08-16 13:54     ` Devang Patel
@ 2002-08-16 14:42       ` Neil Booth
  2002-08-16 14:57         ` Devang Patel
  2002-08-16 14:45       ` Timothy J. Wood
  1 sibling, 1 reply; 256+ messages in thread
From: Neil Booth @ 2002-08-16 14:42 UTC (permalink / raw)
  To: Devang Patel; +Cc: Timothy J. Wood, Mike Stump, gcc

Devang Patel wrote:-

> In your previous two queries, what you want from PFE is to discard a few
> things based on macros from precompiled headers.  But when PFE restores
> trees, it has gone too far as far as macros are concerned.

The implementation should know what its assumptions are, and if they're
broken recover somehow.  Have you seen KAI's documentation (online)
for their PCH implementation?  It seems like a good solution to me.

Neil.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Problem with PFE approach [Was: Faster compilation speed]
  2002-08-16 13:54     ` Devang Patel
  2002-08-16 14:42       ` Neil Booth
@ 2002-08-16 14:45       ` Timothy J. Wood
  1 sibling, 0 replies; 256+ messages in thread
From: Timothy J. Wood @ 2002-08-16 14:45 UTC (permalink / raw)
  To: Devang Patel; +Cc: Mike Stump, gcc

On Friday, August 16, 2002, at 01:54  PM, Devang Patel wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Problem with PFE approach [Was: Faster compilation speed]
  2002-08-16 14:31       ` Timothy J. Wood
  2002-08-16 14:39         ` Neil Booth
@ 2002-08-16 14:46         ` Devang Patel
  1 sibling, 0 replies; 256+ messages in thread
From: Devang Patel @ 2002-08-16 14:46 UTC (permalink / raw)
  To: Timothy J. Wood; +Cc: Mike Stump, gcc

On Friday, August 16, 2002, at 02:31 PM, Timothy J. Wood wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Problem with PFE approach [Was: Faster compilation speed]
  2002-08-16 14:42       ` Neil Booth
@ 2002-08-16 14:57         ` Devang Patel
  2002-08-17 15:31           ` Timothy J. Wood
  0 siblings, 1 reply; 256+ messages in thread
From: Devang Patel @ 2002-08-16 14:57 UTC (permalink / raw)
  To: Neil Booth; +Cc: Timothy J. Wood, Mike Stump, gcc

On Friday, August 16, 2002, at 02:41 PM, Neil Booth wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Problem with PFE approach [Was: Faster compilation speed]
  2002-08-16 14:57         ` Devang Patel
@ 2002-08-17 15:31           ` Timothy J. Wood
  2002-08-17 20:04             ` Daniel Berlin
                               ` (2 more replies)
  0 siblings, 3 replies; 256+ messages in thread
From: Timothy J. Wood @ 2002-08-17 15:31 UTC (permalink / raw)
  To: Devang Patel; +Cc: Mike Stump, gcc

  So, another problem with PFE that I've noticed after working with it 
for a while...

  If you put all your commonly used headers in a PFE, then changing any 
of these headers causes the PFE header to be considered changed.  And, 
since this header is imported into every single file in your project, 
you end up in a situation where changing any header causes the entire 
project to be rebuilt.  This is clearly not good for day to day 
development.

  A PCH approach that was automatic and didn't have a single monolithic 
file would avoid the artificial tying together of all the headers in 
the world and would thus lead to faster incremental builds due to fewer 
files being rebuilt.

  Another approach that would work with a monolithic file would be some 
sort of fact database that would allow the build system to decide early 
on that the change in question didn't affect some subset of files.

-tim


^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Problem with PFE approach [Was: Faster compilation speed]
  2002-08-17 15:31           ` Timothy J. Wood
@ 2002-08-17 20:04             ` Daniel Berlin
  2002-08-17 20:07               ` Andrew Pinski
  2002-08-17 20:14               ` Timothy J. Wood
  2002-08-17 20:15             ` Daniel Berlin
  2002-08-19  7:07             ` Stan Shebs
  2 siblings, 2 replies; 256+ messages in thread
From: Daniel Berlin @ 2002-08-17 20:04 UTC (permalink / raw)
  To: Timothy J. Wood; +Cc: Devang Patel, Mike Stump, gcc

On Sat, 17 Aug 2002, Timothy J. Wood wrote:

> 
>    So, another problem with PFE that I've noticed after working with it 
> for a while...
> 
>    If you put all your commonly used headers in a PFE, then changing any 
> of these headers causes the PFE header to be considered changed.  And, 
> since this header is imported into every single file in your project, 
> you end up in a situation where changing any header causes the entire 
> project to be rebuilt. 

Um, this header should *not* be explicitly included in the files.
It's a *prefix* header.

The only thing that would need to be rebuilt in this case is the prefix 
header.
Everything else that would normally not be rebuilt will not be rebuilt.

I.e. the only thing extra that gets rebuilt is the prefix header.

 
> This is clearly not good for day to day 
> development.
> 
>    A PCH approach that was automatic and didn't have a single monolithic 
> file would avoid the artificial tying together of all the headers in 
> the world and would thus lead to faster incremental builds due to fewer 
> files being rebuilt.
> 
>    Another approach that would work with a monolithic file would be some 
> sort of fact database that would allow the build system to decide early 
> on that the change in question didn't affect some subset of files.
> 
> -tim
> 
> 
> 

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Problem with PFE approach [Was: Faster compilation speed]
  2002-08-17 20:04             ` Daniel Berlin
@ 2002-08-17 20:07               ` Andrew Pinski
  2002-08-17 20:14               ` Timothy J. Wood
  1 sibling, 0 replies; 256+ messages in thread
From: Andrew Pinski @ 2002-08-17 20:07 UTC (permalink / raw)
  To: dberlin; +Cc: Timothy J. Wood, Devang Patel, Mike Stump, gcc

PFE is like the precompiled headers in CodeWarrior.

Thanks,
Andrew Pinski

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Problem with PFE approach [Was: Faster compilation speed]
  2002-08-17 20:04             ` Daniel Berlin
  2002-08-17 20:07               ` Andrew Pinski
@ 2002-08-17 20:14               ` Timothy J. Wood
  2002-08-17 20:21                 ` Daniel Berlin
  2002-08-19 11:59                 ` Devang Patel
  1 sibling, 2 replies; 256+ messages in thread
From: Timothy J. Wood @ 2002-08-17 20:14 UTC (permalink / raw)
  To: dberlin; +Cc: Devang Patel, Mike Stump, gcc

On Saturday, August 17, 2002, at 08:04  PM, Daniel Berlin wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Problem with PFE approach [Was: Faster compilation speed]
  2002-08-17 15:31           ` Timothy J. Wood
  2002-08-17 20:04             ` Daniel Berlin
@ 2002-08-17 20:15             ` Daniel Berlin
  2002-08-19  7:07             ` Stan Shebs
  2 siblings, 0 replies; 256+ messages in thread
From: Daniel Berlin @ 2002-08-17 20:15 UTC (permalink / raw)
  To: Timothy J. Wood; +Cc: Devang Patel, Mike Stump, gcc

On Sat, 17 Aug 2002, Timothy J. Wood wrote:

> 
>    So, another problem with PFE that I've noticed after working with it 
> for a while...
> 
>    If you put all your commonly used headers in a PFE, then changing any 
> of these headers causes the PFE header to be considered changed.  And, 
> since this header is imported into every single file in your project, 
> you end up in a situation where changing any header causes the entire 
> project to be rebuilt.  This is clearly not good for day to day 
> development.
> 
>    A PCH approach that was automatic and didn't have a single monolithic 
> file would avoid the artificial tying together of all the headers in 
> the world and would thus lead to faster incremental builds due to fewer 
> files being rebuilt.
> 
>    Another approach that would work with a monolithic file would be some 
> sort of fact database that would allow the build system to decide early 
> on that the change in question didn't affect some subset of files.
> 

Also, while constructive criticism is good and all, at some point, it 
becomes "put up or shut up". It's one thing to say how great something 
would be, another thing to implement it.  We have heard your idea, we know 
how to implement it.  Everyone is aware of it.  At this point, I'd 
rather you tell me how good it is when you've got code to do it, rather 
than keep pointing out what you perceive to be flaws in something that is 
a large improvement over what exists now.

One of the things that slows down gcc development is criticism of patches 
that are large improvements over what exists now, in favor of some 
"better" approach, which nobody has yet implemented.  Then this large 
improvement never gets accepted, and nobody ever implements the "better 
approach". The perfect is the enemy of the good.

 --Dan


^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Problem with PFE approach [Was: Faster compilation speed]
  2002-08-17 20:14               ` Timothy J. Wood
@ 2002-08-17 20:21                 ` Daniel Berlin
  2002-08-18  3:17                   ` Kai Henningsen
  2002-08-19 11:59                 ` Devang Patel
  1 sibling, 1 reply; 256+ messages in thread
From: Daniel Berlin @ 2002-08-17 20:21 UTC (permalink / raw)
  To: Timothy J. Wood; +Cc: Devang Patel, Mike Stump, gcc

On Sat, 17 Aug 2002, Timothy J. Wood wrote:

> 
> On Saturday, August 17, 2002, at 08:04  PM, Daniel Berlin wrote:
> 
> > On Sat, 17 Aug 2002, Timothy J. Wood wrote:
> >
> >>
> >>    So, another problem with PFE that I've noticed after working with 
> >> it
> >> for a while...
> >>
> >>    If you put all your commonly used headers in a PFE, then changing 
> >> any
> >> of these headers causes the PFE header to be considered changed.  And,
> >> since this header is imported into every single file in your project,
> >> you end up in a situation where changing any header causes the entire
> >> project to be rebuilt.
> >
> > Um, this header should *not* be explicitly included in the files.
> > It's a *prefix* header.
> 
>    I'm not saying that I'm #including it in my sources.  What I'm saying 
> is that the IDE knows that all my files depend upon it (they all end up 
> including it due to it being the prefix header, regardless of whether 
> it is listed or not).  This means that they may have dependencies on 
> its contents and must be rebuilt if it or any header it includes 
> changes.

No, they shouldn't have any dependencies on its contents.  They should 
include what they normally include.  The fact that the prefix header stores the 
compiler state should prevent these includes from doing anything (since 
it'll know it's already processed that header) when it is present.
Any build system that makes the files depend on the prefix header is 
broken, and needs to be fixed.

Prefix headers need to be rebuilt when compilation options change, or the 
headers they include change. 
Files only need to be rebuilt when some normal header they depend on changes.
*Not* when the prefix header changes.

 > 
>    The way I think about this is that the prefix header mess is just a 
> hack to avoid having a #include at the top of each file.  There should 
> be nothing else special about the header -- it is just assumed that 
> there is a #include at the top of your file.
> 
> > The only thing that would need to be rebuilt in this case is the 
> > prefix header.
> > Everything else that would normally not be rebuilt will not be rebuilt.
> 
>    Nope... everything needs to be rebuilt.  The problem is that the 
> prefix header might satisfy some symbol or macro that a source file 
> needs (assume that the source file doesn't explicitly include headers 
> it needs). 

Don't assume that.
It should always do so.
If not, the source code is wrong.
Period.
It's not a usability issue that users must have the proper includes.

--Dan

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Problem with PFE approach [Was: Faster compilation speed]
  2002-08-17 20:21                 ` Daniel Berlin
@ 2002-08-18  3:17                   ` Kai Henningsen
  2002-08-18  7:36                     ` Daniel Berlin
  0 siblings, 1 reply; 256+ messages in thread
From: Kai Henningsen @ 2002-08-18  3:17 UTC (permalink / raw)
  To: gcc

dberlin@dberlin.org (Daniel Berlin)  wrote on 17.08.02 in < Pine.LNX.4.44.0208172315090.29572-100000@dberlin.org >:

> On Sat, 17 Aug 2002, Timothy J. Wood wrote:
>
> >
> > On Saturday, August 17, 2002, at 08:04  PM, Daniel Berlin wrote:
> >
> > > On Sat, 17 Aug 2002, Timothy J. Wood wrote:
> > >
> > >>
> > >>    So, another problem with PFE that I've noticed after working with
> > >> it
> > >> for a while...
> > >>
> > >>    If you put all your commonly used headers in a PFE, then changing
> > >> any
> > >> of these headers causes the PFE header to be considered changed.  And,
> > >> since this header is imported into every single file in your project,
> > >> you end up in a situation where changing any header causes the entire
> > >> project to be rebuilt.
> > >
> > > Um, this header should *not* be explicitly included in the files.
> > > It's a *prefix* header.
> >
> >    I'm not saying that I'm #including it in my sources.  What I'm saying
> > is that the IDE knows that all my files depend upon it (they all end up
> > including it due to it being the prefix header, regardless of whether
> > it is listed or not).  This means that they may have dependencies on
> > its contents and must be rebuilt if it or any header it includes
> > changes.
>
> No, they shouldn't have any dependencies on its contents.  They should

That would be seriously broken ...

> include what they normally include.  The fact that the prefix header stores
> the compiler state should prevent these includes from doing anything (since
> it'll know it's already processed that header) when it is present.
> Any build system that makes the files depend on the prefix header is
> broken, and needs to be fixed.

... unless you have some mechanism to prevent them from being influenced  
by any change in any header which is used in the prefix header but which  
they do not include normally.

What mechanism would that be?

The dependency chain is *exactly* the same as if the prefix header was  
normally included at the start of every source file.

MfG Kai

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Problem with PFE approach [Was: Faster compilation speed]
  2002-08-18  3:17                   ` Kai Henningsen
@ 2002-08-18  7:36                     ` Daniel Berlin
  2002-08-18 11:20                       ` jepler
  0 siblings, 1 reply; 256+ messages in thread
From: Daniel Berlin @ 2002-08-18  7:36 UTC (permalink / raw)
  To: Kai Henningsen; +Cc: gcc

On 18 Aug 2002, Kai Henningsen wrote:

> dberlin@dberlin.org (Daniel Berlin)  wrote on 17.08.02 in < Pine.LNX.4.44.0208172315090.29572-100000@dberlin.org >:
> 
> > On Sat, 17 Aug 2002, Timothy J. Wood wrote:
> >
> > >
> > > On Saturday, August 17, 2002, at 08:04  PM, Daniel Berlin wrote:
> > >
> > > > On Sat, 17 Aug 2002, Timothy J. Wood wrote:
> > > >
> > > >>
> > > >>    So, another problem with PFE that I've noticed after working with
> > > >> it
> > > >> for a while...
> > > >>
> > > >>    If you put all your commonly used headers in a PFE, then changing
> > > >> any
> > > >> of these headers causes the PFE header to be considered changed.  And,
> > > >> since this header is imported into every single file in your project,
> > > >> you end up in a situation where changing any header causes the entire
> > > >> project to be rebuilt.
> > > >
> > > > Um, this header should *not* be explicitly included in the files.
> > > > It's a *prefix* header.
> > >
> > >    I'm not saying that I'm #including it in my sources.  What I'm saying
> > > is that the IDE knows that all my files depend upon it (they all end up
> > > including it due to it being the prefix header, regardless of whether
> > > it is listed or not).  This means that they may have dependencies on
> > > its contents and must be rebuilt if it or any header it includes
> > > changes.
> >
> > No, they shouldn't have any dependencies on its contents.  They should
> 
> That would be seriously broken ...
> 
> > include what they normally include.  The fact that the prefix header stores
> > the compiler state should prevent these includes from doing anything (since
> > it'll know it's already processed that header) when it is present.
> > Any build system that makes the files depend on the prefix header is
> > broken, and needs to be fixed.
> 
> ... unless you have some mechanism to prevent them from being influenced  
> by any change in any header which is used in the prefix header but which  
> they do not include normally.

Why would they be influenced by a change to something they would not 
normally include?
Unless they don't include what they normally should.
> 
> What mechanism would that be?

Reality?
> 
> The dependency chain is *exactly* the same as if the prefix header was  
> normally included at the start of every source file.

This is wrong, and leads exactly to the problem Tim describes.
The dependency chain should *not* include the prefix header.

The fact that the prefix header exists is not something the build system 
should know about, except insofar that it rebuild the prefix header when 
the headers it includes changes.

That's *it*.


> 
> MfG Kai
> 
> 

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Problem with PFE approach [Was: Faster compilation speed]
  2002-08-18  7:36                     ` Daniel Berlin
@ 2002-08-18 11:20                       ` jepler
  2002-08-18 13:20                         ` Daniel Berlin
  0 siblings, 1 reply; 256+ messages in thread
From: jepler @ 2002-08-18 11:20 UTC (permalink / raw)
  To: Daniel Berlin; +Cc: Kai Henningsen, gcc

Let me see if I understand what people are talking about.

a.h:
    /* Include header guard if appropriate */
    #define X 1

b.h:
    /* Include header guard if appropriate */
    #define Y 1

m.c:
    #include "a.h"
    int main(void) { return Y; }   /* Y comes from b.h, never #included here */

If m.c is compiled using PFE, and the PFE header contains both a.h and b.h,
will the compilation complete successfully?

If yes, and b.h is later modified to remove the Y definition, will a build
system where m.c does not depend on the PFE header actually rebuild m.c,
since the output of m.c depends (erroneously) on an item in b.h through
the PFE header?

My understanding of the PFE scheme implies that m.c would see a definition
from b.h even though b.h was not the target of a #include directive.  This
means that programmers will accidentally depend on symbols from b.h even
when it's not included.  And if they do, and the build system does not
consider the PFE header a dependency of each source file, then the
definitions will not only be visible when they should not be, but the build
will also be wrong, since the new contents of these accidentally referenced
header files will not actually cause a rebuild.

Jeff

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-13 14:03                   ` David Edelsohn
  2002-08-13 14:46                     ` Geoff Keating
  2002-08-14  9:25                     ` Kevin Handy
@ 2002-08-18 12:58                     ` Jeff Sturm
  2002-08-19 12:55                       ` Mike Stump
  2002-08-20 11:22                       ` Will Cohen
  2 siblings, 2 replies; 256+ messages in thread
From: Jeff Sturm @ 2002-08-18 12:58 UTC (permalink / raw)
  To: David Edelsohn; +Cc: David S. Miller, dan, austern, gcc

On Tue, 13 Aug 2002, David Edelsohn wrote:
> 	Here's an interesting (aka depressing) data point.  My previous
> cache miss statistics were for GCC -O2.  At -O0, GCC's cache miss
> statistics stay the same or get up to 20% *worse*.  In comparison, the
> cache statistics for IBM's compiler without optimization enabled *improve*
> up to 50 for the same reload.c and insn-recog.c input files compared to
> optimized.

Here's a data point on alpha-linux:

cc1 -quiet -O2 reload.i
issues/cycles = 0.51  issues/dcache_miss = 26.93

Without optimization:

cc1 -quiet  reload.i
issues/cycles = 0.52  issues/dcache_miss = 31.29

This is on a ev56 with a direct-mapped cache.  To get some idea where the
misses are taking place, I experimented with iprobe's sampling mode.
Omitting results below the 1% sample threshold, I get:

function                    | issues | access | misses | i/m |  a/m
----------------------------+--------+--------+--------+-----+-----
yyparse                     |   2924 |    848 |    148 |  20 |  5.7
gt_ggc_mx_lang_tree_node    |   1336 |    612 |     74 |  18 |  8.2
verify_flow_info            |   1388 |    408 |    129 |  11 |  3.1
copy_rtx_if_shared          |   2120 |   1012 |     53 |  40 | 19.0
propagate_one_insn          |   3636 |    504 |     52 |  70 |  9.6
find_temp_slot_from_address |    728 |    232 |    126 |   6 |  1.8
ggc_mark_rtx_children_1     |   1580 |    316 |     40 |  40 |  7.9
extract_insn                |   1576 |    476 |     52 |  30 |  9.1
record_reg_classes          |   3848 |    944 |     65 |  59 | 14.5
reg_scan_mark_refs          |   1472 |    632 |     66 |  22 |  9.5
find_reloads                |   7680 |   3104 |    148 |  52 | 20.9
subst_reloads               |   4772 |   2736 |    169 |  28 | 16.1
side_effects_p              |   1344 |    564 |     43 |  31 | 13.1
for_each_rtx                |   4924 |   1464 |     75 |  66 | 19.5
ggc_alloc                   |   2424 |    728 |    111 |  22 |  6.5
ggc_set_mark                |   3392 |    976 |    107 |  32 |  9.1

(Each sample reported is 2^14 events.)

yyparse performs badly (as would any table-driven parser), but how about
verify_flow_info and find_temp_slot_from_address?  Both are reporting
awful cache behavior.

Jeff

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Problem with PFE approach [Was: Faster compilation speed]
  2002-08-18 11:20                       ` jepler
@ 2002-08-18 13:20                         ` Daniel Berlin
  2002-08-18 14:31                           ` Timothy J. Wood
  0 siblings, 1 reply; 256+ messages in thread
From: Daniel Berlin @ 2002-08-18 13:20 UTC (permalink / raw)
  To: jepler; +Cc: Kai Henningsen, gcc

On Sun, 18 Aug 2002 jepler@unpythonic.net wrote:

> Let me see if I understand what people are talking about.
> 
> a.h:
>     /* Include header guard if appropriate */
>     #define X 1
> 
> b.h:
>     /* Include header guard if appropriate */
>     #define Y 1
> 
> m.c:
>     #include "a.h"
>     int main(void) { return Y; }
> 
> If m.c is compiled using PFE, and the PFE header contains both a.h and b.h,
> will the compilation complete successfully?
> 
> If yes, and b.h is later modified to remove the Y definition will a build
> system where m.c does not depend on the PFE header actually rebuild m.c,
> since the output of m.c depends (erroneously) on an item in b.h through
> the PFE header?

A build system where m.c does not depend on the prefix header should 
*not* rebuild if b.h is modified.
That's my point.

> 
> My understanding of the PFE symbol implies that m.c would see a definition
> from b.h even though b.h was not the target of a #include directive.
Yes, they would be visible, but this is user error.
They should always include the right things. 
In other words, you should make sure it works without a PFE header 
before you try it *with* one.
It's only when you *count* on the fact that the PFE header is there that 
you run into dependency problems.
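
For instance, a minimal sketch reusing the a.h/b.h example above:

/* m.c -- names everything it uses, so the PFE header is a pure cache */
#include "a.h"
#include "b.h"
int main(void) { return Y; }

Compiled with or without the PFE header, this means the same thing, and a
build system that tracks only the explicit includes stays correct.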

--Dan

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Problem with PFE approach [Was: Faster compilation speed]
  2002-08-18 13:20                         ` Daniel Berlin
@ 2002-08-18 14:31                           ` Timothy J. Wood
  2002-08-18 14:35                             ` Andrew Pinski
  2002-08-19  2:41                             ` Michael Matz
  0 siblings, 2 replies; 256+ messages in thread
From: Timothy J. Wood @ 2002-08-18 14:31 UTC (permalink / raw)
  To: dberlin; +Cc: jepler, Kai Henningsen, gcc

On Sunday, August 18, 2002, at 01:20  PM, Daniel Berlin wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Problem with PFE approach [Was: Faster compilation speed]
  2002-08-18 14:31                           ` Timothy J. Wood
@ 2002-08-18 14:35                             ` Andrew Pinski
  2002-08-18 14:55                               ` Timothy J. Wood
  2002-08-19  2:41                             ` Michael Matz
  1 sibling, 1 reply; 256+ messages in thread
From: Andrew Pinski @ 2002-08-18 14:35 UTC (permalink / raw)
  To: Timothy J. Wood; +Cc: dberlin, jepler, Kai Henningsen, gcc

PFE is good for headers that hardly change, like system headers.
It is not good for headers that change in development.

Thanks,
Andrew Pinski

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Problem with PFE approach [Was: Faster compilation speed]
  2002-08-18 14:35                             ` Andrew Pinski
@ 2002-08-18 14:55                               ` Timothy J. Wood
  0 siblings, 0 replies; 256+ messages in thread
From: Timothy J. Wood @ 2002-08-18 14:55 UTC (permalink / raw)
  To: Andrew Pinski; +Cc: gcc

On Sunday, August 18, 2002, at 02:36  PM, Andrew Pinski wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Problem with PFE approach [Was: Faster compilation speed]
  2002-08-18 14:31                           ` Timothy J. Wood
  2002-08-18 14:35                             ` Andrew Pinski
@ 2002-08-19  2:41                             ` Michael Matz
  2002-08-19  6:26                               ` jepler
  2002-08-19 11:53                               ` Devang Patel
  1 sibling, 2 replies; 256+ messages in thread
From: Michael Matz @ 2002-08-19  2:41 UTC (permalink / raw)
  To: Timothy J. Wood; +Cc: dberlin, jepler, Kai Henningsen, gcc

Hi,

On Sun, 18 Aug 2002, Timothy J. Wood wrote:

>    Thus, if you are going to implicitly include the header, you damn
> well better include it in dependency analysis.

No, because the existence of that header shouldn't influence the outcome
of the compiler in any way.

>    I can accept an argument of "this is too hard to do correctly right
> now", but not "the user screwed up".  The user didn't screw up -- the
> compiler just isn't smart enough to do it correctly yet.

If the source doesn't compile without the prefix header the user did
something wrong, IOW he's screwed if he doesn't want to fix it.  Period.


Ciao,
Michael.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Problem with PFE approach [Was: Faster compilation speed]
  2002-08-19  2:41                             ` Michael Matz
@ 2002-08-19  6:26                               ` jepler
  2002-08-19  6:40                                 ` Daniel Berlin
  2002-08-19 11:50                                 ` Devang Patel
  2002-08-19 11:53                               ` Devang Patel
  1 sibling, 2 replies; 256+ messages in thread
From: jepler @ 2002-08-19  6:26 UTC (permalink / raw)
  To: Michael Matz; +Cc: Timothy J. Wood, dberlin, Kai Henningsen, gcc

> On Sun, 18 Aug 2002, Timothy J. Wood wrote:
> >    I can accept an argument of "this is too hard to do correctly right
> > now", but not "the user screwed up".  The user didn't screw up -- the
> > compiler just isn't smart enough to do it correctly yet.

On Mon, Aug 19, 2002 at 11:21:28AM +0200, Michael Matz wrote:
> If the source doesn't compile without the prefix header the user did
> something wrong, IOW he's screwed if he doesn't want to fix it.  Period.

PFE makes it too easy for the programmer to accidentally give his program
different meaning with or without the prefix header.  I can do without one
more way to screw up my program.

The following set of files will compile a program with or without PFE, but
using a PFE that contains both a.h and b.h, the behavior will change.  So
the suggestion that files should be checked that they compile without PFE
is not enough to ensure that there aren't unintended changes in program
meaning in the presence of PFE.

// a.h
#define DEFA

// b.h
#define DEFB

// m.c
#include "a.h"
int main(void) {
#ifdef DEFB
	return 1;
#else
	return 0;
#endif
}

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Problem with PFE approach [Was: Faster compilation speed]
  2002-08-19  6:26                               ` jepler
@ 2002-08-19  6:40                                 ` Daniel Berlin
  2002-08-19 11:50                                 ` Devang Patel
  1 sibling, 0 replies; 256+ messages in thread
From: Daniel Berlin @ 2002-08-19  6:40 UTC (permalink / raw)
  To: jepler; +Cc: Michael Matz, Timothy J. Wood, Kai Henningsen, gcc

On Mon, 19 Aug 2002 jepler@unpythonic.net wrote:

> > On Sun, 18 Aug 2002, Timothy J. Wood wrote:
> > >    I can accept an argument of "this is too hard to do correctly right
> > > now", but not "the user screwed up".  The user didn't screw up -- the
> > > compiler just isn't smart enough to do it correctly yet.
> 
> On Mon, Aug 19, 2002 at 11:21:28AM +0200, Michael Matz wrote:
> > If the source doesn't compile without the prefix header the user did
> > something wrong, IOW he's screwed if he doesn't want to fix it.  Period.
> 
> PFE makes it too easy for the programmer to accidentally give his program
> different meaning with or without the prefix header.  I can do without one
> more way to screw up my program.
> 
> The following set of files will compile a program with or without PFE, but
> using a PFE that contains both a.h and b.h, the behavior will change. 

This is an implementation problem, and one that should be fixed.
As is making symbols visible without the explicit includes (Though this is 
slightly harder to solve, but still possible through various means).


^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Problem with PFE approach [Was: Faster compilation speed]
  2002-08-17 15:31           ` Timothy J. Wood
  2002-08-17 20:04             ` Daniel Berlin
  2002-08-17 20:15             ` Daniel Berlin
@ 2002-08-19  7:07             ` Stan Shebs
  2002-08-19  8:52               ` Timothy J. Wood
  2 siblings, 1 reply; 256+ messages in thread
From: Stan Shebs @ 2002-08-19  7:07 UTC (permalink / raw)
  To: Timothy J. Wood; +Cc: Devang Patel, Mike Stump, gcc

Timothy J. Wood wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Problem with PFE approach [Was: Faster compilation speed]
  2002-08-19  7:07             ` Stan Shebs
@ 2002-08-19  8:52               ` Timothy J. Wood
  0 siblings, 0 replies; 256+ messages in thread
From: Timothy J. Wood @ 2002-08-19  8:52 UTC (permalink / raw)
  To: Stan Shebs; +Cc: Devang Patel, Mike Stump, gcc

On Monday, August 19, 2002, at 07:05  AM, Stan Shebs wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Problem with PFE approach [Was: Faster compilation speed]
  2002-08-19  6:26                               ` jepler
  2002-08-19  6:40                                 ` Daniel Berlin
@ 2002-08-19 11:50                                 ` Devang Patel
  2002-08-19 12:55                                   ` Jeff Epler
  1 sibling, 1 reply; 256+ messages in thread
From: Devang Patel @ 2002-08-19 11:50 UTC (permalink / raw)
  To: jepler; +Cc: dberlin, gcc

On Monday, August 19, 2002, at 06:26  AM, jepler@unpythonic.net wrote:

> The following set of files will compile a program with or without PFE, but
> using a PFE that contains both a.h and b.h, the behavior will change.


This is not an implementation problem or a PFE model problem.  Including
a.h and b.h in the PFE means that what you're asking the compiler to do
is to compile the following source:


/// m.c
#include "a.h"
#include "b.h"
int main(void) {
#ifdef DEFB
	return 1;
#else
	return 0;
#endif
}

And, no doubt, it can have different behavior than the following original source:

// m.c
#include "a.h"
int main(void) {
#ifdef DEFB
	return 1;
#else
	return 0;
#endif
}

-Devang


^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Problem with PFE approach [Was: Faster compilation speed]
  2002-08-19  2:41                             ` Michael Matz
  2002-08-19  6:26                               ` jepler
@ 2002-08-19 11:53                               ` Devang Patel
  1 sibling, 0 replies; 256+ messages in thread
From: Devang Patel @ 2002-08-19 11:53 UTC (permalink / raw)
  To: Michael Matz; +Cc: Timothy J. Wood, dberlin, jepler, Kai Henningsen, gcc

On Monday, August 19, 2002, at 02:21  AM, Michael Matz wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Problem with PFE approach [Was: Faster compilation speed]
  2002-08-17 20:14               ` Timothy J. Wood
  2002-08-17 20:21                 ` Daniel Berlin
@ 2002-08-19 11:59                 ` Devang Patel
  1 sibling, 0 replies; 256+ messages in thread
From: Devang Patel @ 2002-08-19 11:59 UTC (permalink / raw)
  To: Timothy J. Wood; +Cc: dberlin, Mike Stump, gcc

On Saturday, August 17, 2002, at 08:14  PM, Timothy J. Wood wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Problem with PFE approach [Was: Faster compilation speed]
  2002-08-19 11:50                                 ` Devang Patel
@ 2002-08-19 12:55                                   ` Jeff Epler
  2002-08-19 13:03                                     ` Ziemowit Laski
  0 siblings, 1 reply; 256+ messages in thread
From: Jeff Epler @ 2002-08-19 12:55 UTC (permalink / raw)
  To: Devang Patel; +Cc: dberlin, gcc

On Mon, Aug 19, 2002 at 11:50:24AM -0700, Devang Patel wrote:
>  
> On Monday, August 19, 2002, at 06:26  AM, jepler@unpythonic.net wrote: 
> 
> 
> > The following set of files will compile a program with or without 
> > PFE, but 
> > using a PFE that contains both a.h and b.h, the behavior will 
> > change. 
> > 
>  
> 
> This is not an implementation problem or a PFE model problem.  Including
> a.h and b.h in the PFE means that what you're asking the compiler to do
> is to compile the following source:
> 
> 
> /// m.c 
> #include "a.h" 
> #include "b.h" 
> int main(void) { 
> #ifdef DEFB 
> 	return 1; 
> #else 
> 	return 0; 
> #endif 
> } 

... then the build system must treat m.c as depending on the PFE, which
in turn depends on all headers it contains.  But that's where this
discussion started, with the PFE cure being worse than the illness since
it makes your whole project recompile when you touch a header file.

Jeff

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-18 12:58                     ` Jeff Sturm
@ 2002-08-19 12:55                       ` Mike Stump
  2002-08-20 11:22                       ` Will Cohen
  1 sibling, 0 replies; 256+ messages in thread
From: Mike Stump @ 2002-08-19 12:55 UTC (permalink / raw)
  To: Jeff Sturm; +Cc: David Edelsohn, David S. Miller, dan, austern, gcc

On Sunday, August 18, 2002, at 12:57 PM, Jeff Sturm wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Problem with PFE approach [Was: Faster compilation speed]
  2002-08-19 12:55                                   ` Jeff Epler
@ 2002-08-19 13:03                                     ` Ziemowit Laski
  0 siblings, 0 replies; 256+ messages in thread
From: Ziemowit Laski @ 2002-08-19 13:03 UTC (permalink / raw)
  To: Jeff Epler; +Cc: Devang Patel, dberlin, gcc

On Monday, Aug 19, 2002, at 12:54 US/Pacific, Jeff Epler wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-14 10:15                       ` Faster compilation speed David Edelsohn
  2002-08-14 16:35                         ` Richard Henderson
@ 2002-08-20  4:15                         ` Richard Earnshaw
  2002-08-20  5:38                           ` Jeff Sturm
  2002-08-20  8:00                           ` David Edelsohn
  1 sibling, 2 replies; 256+ messages in thread
From: Richard Earnshaw @ 2002-08-20  4:15 UTC (permalink / raw)
  To: David Edelsohn; +Cc: Richard Henderson, David S. Miller, gcc, Richard.Earnshaw

> >>>>> Richard Henderson writes:
> 
> Richard> The folks that are doing cache-miss studies and concluding anything
> Richard> should also go back and measure gcc 2.95, before we used GC at all.
> Richard> That's perhaps not ideal, since it's obstacks instead of reference
> Richard> counting, but it's not a worthless data point.
> 
> 	Thanks for the suggestion.  I think the results I got are pretty
> damning: 
> 
> gcc-2.95.3 20010315 (release)
> 
> Source		I/D$ miss -O2		I/D$ miss -O0
> ------		-------------		-------------
> reload.c		28			36
> insn-recog.c		48			36
> 
> 
> 	For comparison, GCC 3.3 has values in the low 20's, especially at
> no optimization.
> 
> David
> 

Do you have/can you get data for TLB misses?

R.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-20  4:15                         ` Richard Earnshaw
@ 2002-08-20  5:38                           ` Jeff Sturm
  2002-08-20  5:53                             ` Richard Earnshaw
  2002-08-20  8:00                           ` David Edelsohn
  1 sibling, 1 reply; 256+ messages in thread
From: Jeff Sturm @ 2002-08-20  5:38 UTC (permalink / raw)
  To: Richard.Earnshaw; +Cc: David Edelsohn, Richard Henderson, David S. Miller, gcc

On Tue, 20 Aug 2002, Richard Earnshaw wrote:
> > gcc-2.95.3 20010315 (release)
> >
> > Source		I/D$ miss -O2		I/D$ miss -O0
> > ------		-------------		-------------
> > reload.c		28			36
> > insn-recog.c		48			36
>
> Do you have/can you get data for TLB misses?

I had done that on alpha, but didn't initially report the figures.  Would
a comparison to 2.95 also be useful?

gcc version 3.3 20020802 (experimental)

---------------------------------------------------------------------------
cc1 -O2 reload.i

issues/cycles = 0.51  issues/dcache_miss = 26.93  issues/dtb_miss = 1214.36

---------------------------------------------------------------------------
cc1 reload.i

issues/cycles = 0.52  issues/dcache_miss = 31.29  issues/dtb_miss = 1854.16

Jeff

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-20  5:38                           ` Jeff Sturm
@ 2002-08-20  5:53                             ` Richard Earnshaw
  2002-08-20 13:42                               ` Jeff Sturm
  0 siblings, 1 reply; 256+ messages in thread
From: Richard Earnshaw @ 2002-08-20  5:53 UTC (permalink / raw)
  To: Jeff Sturm
  Cc: Richard.Earnshaw, David Edelsohn, Richard Henderson,
	David S. Miller, gcc

> > Do you have/can you get data for TLB misses?
> 
> I had done that on alpha, but didn't initially report the figures.  Would
> a comparison to 2.95 also be useful?

Certainly -- the numbers don't really mean anything unless we have 
something to compare them against.  Remember, gcc-2.95 bootstrap times 
were about half those that we have now (*after* taking into account new 
languages and libraries etc).

R.

> 
> gcc version 3.3 20020802 (experimental)
> 
> ---------------------------------------------------------------------------
> cc1 -O2 reload.i
> 
> issues/cycles = 0.51  issues/dcache_miss = 26.93  issues/dtb_miss = 1214.36

So if I understand these figures correctly, then 

dcache_miss/dtb_miss ~= 45

That is, one in 45 dcache misses also requires a TLB walk.  How many dtb
entries does an Alpha have?

> 
> ---------------------------------------------------------------------------
> cc1 reload.i
> 
> issues/cycles = 0.52  issues/dcache_miss = 31.29  issues/dtb_miss = 1854.16
> 

giving
dcache_miss/dtb_miss ~= 60




^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-20  4:15                         ` Richard Earnshaw
  2002-08-20  5:38                           ` Jeff Sturm
@ 2002-08-20  8:00                           ` David Edelsohn
  1 sibling, 0 replies; 256+ messages in thread
From: David Edelsohn @ 2002-08-20  8:00 UTC (permalink / raw)
  To: Richard.Earnshaw; +Cc: Richard Henderson, David S. Miller, gcc

>>>>> Richard Earnshaw writes:

Richard> Do you have/can you get data for TLB misses?

	Yes.  I didn't comment on TLB statistics because they did not vary
much with optimization level or GCC versions.  GCC 2.95 is a little
better, but overlaps with GCC 3.3 TLB statistics.  Both GCC 2.95 and GCC
3.3 statistics follow the source file size.

David

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-18 12:58                     ` Jeff Sturm
  2002-08-19 12:55                       ` Mike Stump
@ 2002-08-20 11:22                       ` Will Cohen
  1 sibling, 0 replies; 256+ messages in thread
From: Will Cohen @ 2002-08-20 11:22 UTC (permalink / raw)
  To: Jeff Sturm; +Cc: David Edelsohn, David S. Miller, dan, austern, gcc

How about reordering the rows and columns in the table used by yyparse 
to improve locality?  Have a instrumented version of the yyparse to 
record the number of times each transition is taken and use the data to 
interchange rows and columns to attempt to get frequent transitions in 
the same cache line (or at least not conflicting memory locations). It 
would be a kind of feedback-directed optimization 
(-fprofile-arcs/-fbranch-probabilities) for bison.
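
A hypothetical sketch of the instrumentation (nothing like this exists in
bison's output today; the names are invented):

/* Count how often each (state, token) transition fires; dump the table
   at exit and use it offline to permute the parser tables so that hot
   transitions share cache lines.  */
#define YY_NSTATES 1024		/* sized to the grammar; illustrative */
#define YY_NTOKENS  256
static unsigned long yy_trans_count[YY_NSTATES][YY_NTOKENS];
#define YY_RECORD_TRANSITION(STATE, TOK) \
  (yy_trans_count[(STATE)][(TOK)]++)
/* ... YY_RECORD_TRANSITION placed in yyparse's shift loop ... */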

-Will

Jeff Sturm wrote:
On Tue, 13 Aug 2002, David Edelsohn wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-20  5:53                             ` Richard Earnshaw
@ 2002-08-20 13:42                               ` Jeff Sturm
  2002-08-22  1:55                                 ` Richard Earnshaw
  0 siblings, 1 reply; 256+ messages in thread
From: Jeff Sturm @ 2002-08-20 13:42 UTC (permalink / raw)
  To: Richard.Earnshaw; +Cc: David Edelsohn, Richard Henderson, David S. Miller, gcc

On Tue, 20 Aug 2002, Richard Earnshaw wrote:
> > I had done that on alpha, but didn't initially report the figures.  Would
> > a comparison to 2.95 also be useful?
>
> Certainly -- the numbers don't really mean anything unless we have
> something to compare them against.

I figured so.  (Wow, I hadn't built a 2.95 toolchain in a long time.)

> > gcc version 3.3 20020802 (experimental)
> >
> > ---------------------------------------------------------------------------
> > cc1 -O2 reload.i
> >
> > issues/cycles = 0.51  issues/dcache_miss = 26.93  issues/dtb_miss = 1214.36

gcc version 2.95.3 20010315 (release)

cc1 -O2 reload.i
issues/cycles = 0.54  issues/dcache_miss = 26.31  issues/dtb_miss = 2488.

cc1 reload.i
issues/cycles = 0.52  issues/dcache_miss = 26.30  issues/dtb_miss = 3306.

Now that's interesting.  No real change in L1 cache performance, but TLB
misses nearly cut in half vs. 3.3.

Trying L3 misses (both with -O0):

3.3: issues/bcache_miss = 370
2.95.3: issues/bcache_miss = 437

Wall-clock time is nearly 2/1 for these tests, as are TLB misses, while
other stats are close.  Hmm.

> So if I understand these figures correctly, then
>
> dcache_miss/dtb_miss ~= 45
>
> That is, one in 45 dcache fetches also requires a tlb walk.

That's how I see it.

> How many dtb entries does an Alpha have?

No idea.  This is an ev56.  I could try grabbing the specs from Digital's
site, if I can still find it...

How expensive is a TLB miss, anyway?  I hadn't expected it would occur
often enough in gcc to be significant.  Note the IPC ratio stays constant,
but as I understand it, TLB is handled in software, so maybe those cycles
are counted by iprobe?

Jeff

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-20 13:42                               ` Jeff Sturm
@ 2002-08-22  1:55                                 ` Richard Earnshaw
  2002-08-22  2:03                                   ` David S. Miller
  2002-08-23 15:39                                   ` Jeff Sturm
  0 siblings, 2 replies; 256+ messages in thread
From: Richard Earnshaw @ 2002-08-22  1:55 UTC (permalink / raw)
  To: Jeff Sturm
  Cc: Richard.Earnshaw, David Edelsohn, Richard Henderson,
	David S. Miller, gcc

> On Tue, 20 Aug 2002, Richard Earnshaw wrote:
> > > I had done that on alpha, but didn't initially report the figures.  Would
> > > a comparison to 2.95 also be useful?
> >
> > Certainly -- the numbers don't really mean anything unless we have
> > something to compare them against.
> 
> I figured so.  (Wow, I hadn't built a 2.95 toolchain in a long time.)
> 
> > > gcc version 3.3 20020802 (experimental)
> > >
> > > ---------------------------------------------------------------------------
> > > cc1 -O2 reload.i
> > >
> > > issues/cycles = 0.51  issues/dcache_miss = 26.93  issues/dtb_miss = 1214.36
> 
> gcc version 2.95.3 20010315 (release)
> 
> cc1 -O2 reload.i
> issues/cycles = 0.54  issues/dcache_miss = 26.31  issues/dtb_miss = 2488.
> 
> cc1 reload.i
> issues/cycles = 0.52  issues/dcache_miss = 26.30  issues/dtb_miss = 3306.
> 
> Now that's interesting.  No real change in L1 cache performance, but TLB
> misses nearly cut in half vs. 3.3.
> 
> Trying L3 misses (both with -O0):
> 
> 3.3: issues/bcache_miss = 370
> 2.95.3: issues/bcache_miss = 437
> 
> Wall-clock time is nearly 2/1 for these tests, as are TLB misses, while
> other stats are close.  Hmm.
> 
> > So if I understand these figures correctly, then
> >
> > dcache_miss/dtb_miss ~= 45
> >
> > That is, one in 45 dcache fetches also requires a tlb walk.
> 
> That's how I see it.

OK, now consider it this way.  Each cache line miss will cause N bytes to 
be fetched from memory -- I don't know the details, but let's assume that's 
32 bytes, a typical value.  Each tlb entry will address one page -- again 
I don't know the details but 4K is common on many machines.

So, with gcc 2.95.3 we have

-O2 dcache_miss/tlb_miss = 2488 / 26.31 ~= 95
-O0 dcache_miss/tlb_miss = 3306 / 26.30 ~= 127

Since each dcache miss represents 32 bytes of memory, we have 3040 (95 * 
32) and 4064 (127 * 32) bytes fetched per TLB miss; that is, very nearly 
75% and 100% of each 4K page is accessed for each miss (it will be lower 
than this in practice, since some lines in a page will probably be fetched 
more than once and others not at all).

However, for gcc 3 we have 1440 (45 * 32) and 1920 (60 * 32) bytes; that 
is, we *at best* access less than half the memory in each page we touch.

> How expensive is a TLB miss, anyway?  I hadn't expected it would occur
> often enough in gcc to be significant.  Note the IPC ratio stays constant,
> but as I understand it, TLB is handled in software, so maybe those cycles
> are counted by iprobe?

A cache miss probably takes about twice as long if we also miss in the 
TLB, assuming tlb walking is done in hardware -- if you have a soft-loaded 
TLB, then it could take significantly longer.

R.



^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-22  1:55                                 ` Richard Earnshaw
@ 2002-08-22  2:03                                   ` David S. Miller
  2002-08-23 15:39                                   ` Jeff Sturm
  1 sibling, 0 replies; 256+ messages in thread
From: David S. Miller @ 2002-08-22  2:03 UTC (permalink / raw)
  To: Richard.Earnshaw, rearnsha; +Cc: jsturm, dje, rth, gcc

   From: Richard Earnshaw <rearnsha@arm.com>
   Date: Thu, 22 Aug 2002 09:53:19 +0100
   
   > How expensive is a TLB miss, anyway?  I hadn't expected it would occur
   > often enough in gcc to be significant.  Note the IPC ratio stays constant,
   > but as I understand it, TLB is handled in software, so maybe those cycles
   > are counted by iprobe?
   
   A cache miss probably takes about twice as long if we also miss in the 
   TLB, assuming tlb walking is done in hardware -- if you have a soft-loaded 
   TLB, then it could take significantly longer.
   
A soft-loaded TLB miss on UltraSPARC can be serviced in ~38 processor
cycles.  At least this is how fast the Linux software TLB miss handler
is.  This includes all of the overhead associated with entering and
leaving the trap.  It also assumes that the TLB miss handler hits the
L2 cache for the page table entry load (there is only one memory
access necessary to service a TLB miss, bonus points to those who know
how this is accomplished without looking at the sources :-).

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-22  1:55                                 ` Richard Earnshaw
  2002-08-22  2:03                                   ` David S. Miller
@ 2002-08-23 15:39                                   ` Jeff Sturm
  1 sibling, 0 replies; 256+ messages in thread
From: Jeff Sturm @ 2002-08-23 15:39 UTC (permalink / raw)
  To: Richard.Earnshaw; +Cc: David Edelsohn, Richard Henderson, David S. Miller, gcc

On Thu, 22 Aug 2002, Richard Earnshaw wrote:
> OK, now consider it this way.  Each cache line miss will cause N bytes to
> be fetched from memory -- I don't know the details, but lets assume that's
> 32 bytes, a typical value.  Each tlb entry will address one page -- again
> I don't know the details but 4K is common on many machines.
>
> So, with gcc 2.95.3 we have
>
> -O2 dcache_miss/tlb_miss = 2488 / 26.31 ~= 95
> -O0 dcache_miss/tlb_miss = 3306 / 26.30 ~= 127
>
> Since each dcache miss represents 32 bytes of memory, we have 3040 (95 *
> 32) and 4064 (127 * 32) bytes fetched per TLB miss; that is, very nearly
> 75% and 100% of each 4K page is accessed for each miss (it will be lower
> than this in practice, since some lines in a page will probably be fetched
> more than once and others not at all).
>
> However, for gcc 3 we have 1440 (45 * 32) and 1920 (60 * 32) bytes; that
> is, we *at best* access less than half the memory in each page we touch.

Interesting analysis; thanks.  It's actually worse than you say since
Alpha has 8k pages.

I looked up the ev56 specs to find out there are just 64 TLB entries, so
for any working set larger than 512k some thrashing would be expected.

For another experiment I installed one of the superpage patches available
for Linux; this enables the granularity hint bits for Alpha to support
pages up to 4MB.  Then I modified ggc-page.c to allocate 4MB chunks by
anonymous mmap.
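
A minimal sketch of the allocation change (not the actual patch; the real
code in ggc-page.c is more involved):

#include <sys/mman.h>

#define GGC_CHUNK_SIZE (4 * 1024 * 1024)	/* one superpage */

/* Grab a 4MB chunk by anonymous mmap so the kernel can back it
   with a single superpage.  */
static void *
alloc_chunk (void)
{
  void *p = mmap (0, GGC_CHUNK_SIZE, PROT_READ | PROT_WRITE,
		  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  return p == MAP_FAILED ? 0 : p;
}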

I then measured 70% fewer dtb misses for cc1, although wall clock time is
reduced by only ~5%.  So it would appear that TLB misses are indeed
important but not the overwhelming concern in gcc's performance.

Jeff

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
@ 2002-08-21 15:35 Tim Josling
  0 siblings, 0 replies; 256+ messages in thread
From: Tim Josling @ 2002-08-21 15:35 UTC (permalink / raw)
  To: gcc

"Tim Josling wrote:

>This is consistent with my tests; I found that a simplistic allocation which
>put everything on the same page, but which never freed anything, actually
>bootstrapped GCC faster than the standard GC.
>
Not too surprising actually; GCC's own sources aren't the hard cases for GC.

>
>The GC was never supposed to make GCC faster; it was supposed to reduce
>workload by getting rid of memory problems. But I doubt it achieves that
>objective. Certainly, keeping track of all the attempts to 'fix' GC has burned
>a lot of my time.
>
The original rationale that I remember was to deal with hairy C++ code
where the compiler would literally exhaust available VM when doing
function-at-a-time compilation.  If that's still the case, then memory
reclamation is a correctness issue.  But it's worth tinkering with the
heuristics; we got a little improvement on Darwin by bumping
GGC_MIN_EXPAND_FOR_GC from 1.3 to 2.0 (it was a while back, don't
have the comparative numbers).

Stan"

Much of the overhead of GC is not the collection as such, but the allocation
process and its side-effects. In fact, if you allocate using the GC code, the
build runs faster if you do the GC, though tweaking the threshold can help.
However, for many programs you are better off allocating very simply and
not doing GC at all.

The GC changes have, in my opinion, made a small number of programs better at
the expense of making most compiles slower. We should not be using GC for most
compiles at all.

This - an optimisation that actually makes things worse overall - is
unfortunately a common situation with 'improvements' to GCC.

Tim Josling

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-21  6:59 Richard Kenner
@ 2002-08-21 15:04 ` David S. Miller
  0 siblings, 0 replies; 256+ messages in thread
From: David S. Miller @ 2002-08-21 15:04 UTC (permalink / raw)
  To: kenner; +Cc: gcc

   From: kenner@vlsi1.ultra.nyu.edu (Richard Kenner)
   Date: Wed, 21 Aug 02 09:58:57 EDT
   
   True if you only walk *one* SET, but normally you walk a whole bunch,
   each of which have MEM and REG objects.  So I disagree this adds to
   the working set size.
   
In the obstack days, it was not uncommon to walk several full
consecutive INSNs (including traversing into their contents) and stay
on the same system page.  Today this never happens; it is in fact
guaranteed NOT to happen.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
@ 2002-08-21  6:59 Richard Kenner
  2002-08-21 15:04 ` David S. Miller
  0 siblings, 1 reply; 256+ messages in thread
From: Richard Kenner @ 2002-08-21  6:59 UTC (permalink / raw)
  To: davem; +Cc: gcc

    This is one of the huge (of many) problems with GC as it currently
    is implemented.  Different tree and RTL types land on different pages
    so when you walk a "SET" for example, the MEM and REG objects
    contained within will be on different pages and this costs a lot
    especially on modern processors.  Our page working set is huge as a
    result of this.

True if you only walk *one* SET, but normally you walk a whole bunch,
each of which have MEM and REG objects.  So I disagree this adds to
the working set size.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-20 14:11 Tim Josling
  2002-08-20 14:13 ` David S. Miller
@ 2002-08-20 14:43 ` Stan Shebs
  1 sibling, 0 replies; 256+ messages in thread
From: Stan Shebs @ 2002-08-20 14:43 UTC (permalink / raw)
  To: Tim Josling; +Cc: gcc

Tim Josling wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-20 14:11 Tim Josling
@ 2002-08-20 14:13 ` David S. Miller
  2002-08-20 14:43 ` Stan Shebs
  1 sibling, 0 replies; 256+ messages in thread
From: David S. Miller @ 2002-08-20 14:13 UTC (permalink / raw)
  To: tej; +Cc: gcc

   From: Tim Josling <tej@melbpc.org.au>
   Date: Wed, 21 Aug 2002 07:10:02 +1000
   
   This is consistent with my tests; I found that a simplistic allocation which
   put everything on the same page, but which never freed anything, actually
   bootstrapped GCC faster than the standard GC.

ROFL, thanks for doing such a test. :-)

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
@ 2002-08-20 14:11 Tim Josling
  2002-08-20 14:13 ` David S. Miller
  2002-08-20 14:43 ` Stan Shebs
  0 siblings, 2 replies; 256+ messages in thread
From: Tim Josling @ 2002-08-20 14:11 UTC (permalink / raw)
  To: gcc

>   From: "David S. Miller" <davem@redhat.com>
> 
>    From: Richard Henderson <rth@redhat.com>
>    Date: Mon, 19 Aug 2002 10:29:09 -0700
> 
>    Well, no, since SET, MEM, REG, PLUS all have two arguments.
>    And thus are all allocated from the same page.
> 
> Ok, how about walking from INSN down to the SET?  The problem
> does indeed exist there.
> 
> Next, we have the fragmentation issue.  Look at the RTL you
> have right before reload runs on any non-trivial compilation,
> and see where the pointers are.
> 
> So the problem is there.

This is consistent with my tests; I found that a simplistic allocation which
put everything on the same page, but which never freed anything, actually
bootstrapped GCC faster than the standard GC.

The GC was never supposed to make GCC faster; it was supposed to reduce
workload by getting rid of memory problems. But I doubt it achieves that
objective. Certainly, keeping track of all the attempts to 'fix' GC has burned
a lot of my time.

Tim Josling

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-19 10:29         ` Richard Henderson
@ 2002-08-19 11:33           ` David S. Miller
  0 siblings, 0 replies; 256+ messages in thread
From: David S. Miller @ 2002-08-19 11:33 UTC (permalink / raw)
  To: rth; +Cc: nick.ing-simmons, gcc, dalej

   From: Richard Henderson <rth@redhat.com>
   Date: Mon, 19 Aug 2002 10:29:09 -0700

   Well, no, since SET, MEM, REG, PLUS all have two arguments.
   And thus are all allocated from the same page.

Ok, how about walking from INSN down to the SET?  The problem
does indeed exist there.

Next, we have the fragmentation issue.  Look at the RTL you
have right before reload runs on any non-trivial compilation,
and see where the pointers are.

So the problem is there.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-19  7:06       ` David S. Miller
@ 2002-08-19 10:29         ` Richard Henderson
  2002-08-19 11:33           ` David S. Miller
  0 siblings, 1 reply; 256+ messages in thread
From: Richard Henderson @ 2002-08-19 10:29 UTC (permalink / raw)
  To: David S. Miller; +Cc: nick.ing-simmons, gcc, dalej

On Mon, Aug 19, 2002 at 06:51:39AM -0700, David S. Miller wrote:
> This is one of the huge (of many) problems with GC as it currently
> is implemented.  Different tree and RTL types land on different pages
> so when you walk a "SET" for example, the MEM and REG objects
> contained within will be on different pages and this costs a lot
> especially on modern processors.  Our page working set is huge as a
> result of this.

Well, no, since SET, MEM, REG, PLUS all have two arguments.
And thus are all allocated from the same page.



r~

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-19  5:15     ` Nick Ing-Simmons
  2002-08-19  7:06       ` David S. Miller
@ 2002-08-19  9:20       ` Daniel Egger
  1 sibling, 0 replies; 256+ messages in thread
From: Daniel Egger @ 2002-08-19  9:20 UTC (permalink / raw)
  To: Nick Ing-Simmons; +Cc: GCC Developer Mailinglist

On Mon, 2002-08-19 at 14:15, Nick Ing-Simmons wrote:

> Yet another speed/space trade-off - most architectures are going to take
> significantly longer to inc/dec a bitfield than they will doing 
> an int. 

However, a few extra instructions (after all, the overhead is not really
much) are nothing compared to a cache miss or even VM activity.  And
bitfields can help dramatically to shrink object size, especially for a
pile of boolean states and the like.
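
For example (sizes illustrative, assuming a 32-bit int):

struct packed { unsigned a:1, b:1, c:1, d:1; };	/*  4 bytes */
struct plain  { int a, b, c, d; };		/* 16 bytes */

Four flags cost one word instead of four, at the price of a couple of
extra mask/shift instructions per access.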
 

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-19  5:15     ` Nick Ing-Simmons
@ 2002-08-19  7:06       ` David S. Miller
  2002-08-19 10:29         ` Richard Henderson
  2002-08-19  9:20       ` Daniel Egger
  1 sibling, 1 reply; 256+ messages in thread
From: David S. Miller @ 2002-08-19  7:06 UTC (permalink / raw)
  To: nick.ing-simmons; +Cc: gcc, dalej

   From: Nick Ing-Simmons <nick.ing-simmons@elixent.com>
   Date: Mon, 19 Aug 2002 13:15:27 +0100

   Yet another speed/space trade-off - most architectures are going to take
   significantly longer to inc/dec a bitfield than they will doing 
   an int. 
   
A dumb one too; I believed 24 bits were free, but they certainly were
not.  It should indeed be an int.

   Which reminds me that one of the advantages of the "obstack" scheme 
   was that it tended to act as a "slab allocator" with relatively few 
   chunks with lots of little things inside each chunk.
   
Actually one of the core things that Richard Henderson and others
continually ignore is that obstack put independent object types on
the same page.

This is one of the huge (of many) problems with GC as it currently
is implemented.  Different tree and RTL types land on different pages
so when you walk a "SET" for example, the MEM and REG objects
contained within will be on different pages and this costs a lot
especially on modern processors.  Our page working set is huge as a
result of this.

In the obstack days, walking such a SET expression could very well
stay on the same page, even the same set of cachelines.

Tweaking GC stuff like making some new size classes as Richard has
done is going to solve none of these problems.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-14 13:13       ` Jamie Lokier
@ 2002-08-19  5:20         ` Nick Ing-Simmons
  0 siblings, 0 replies; 256+ messages in thread
From: Nick Ing-Simmons @ 2002-08-19  5:20 UTC (permalink / raw)
  To: egcs; +Cc: gcc, tjw, dalej, David S. Miller

Jamie Lokier <egcs@tantalophile.demon.co.uk> writes:
>David S. Miller wrote:
>>       You can easily use much less than a full word.  Foundation on 
>>    OpenStep/Mach 4.2 started storing partial ref counts in whatever spare 
>>    bits were available in each object.
>> 
>> We don't have any spare bits, we have 32 bits used for RTX state then
>> the next object is a pointer.  So whatever size counter we use will
>> eat at a minimum a word's worth of space to get the pointers aligned
>> up properly.
>
>Did I see "pointers" and "aligned" there?
>
>You have 2 spare bits per pointer, man, open your eyes! :-)

Note the :-).  Given the amount of
pointer use, having to mask every pointer to clear out "spare bits"
would be a disaster.

-- 
Nick Ing-Simmons
http://www.ni-s.u-net.com/

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-14 10:56   ` David S. Miller
  2002-08-14 11:04     ` Dale Johannesen
@ 2002-08-19  5:15     ` Nick Ing-Simmons
  2002-08-19  7:06       ` David S. Miller
  2002-08-19  9:20       ` Daniel Egger
  1 sibling, 2 replies; 256+ messages in thread
From: Nick Ing-Simmons @ 2002-08-19  5:15 UTC (permalink / raw)
  To: davem; +Cc: gcc, dalej

David S. Miller <davem@redhat.com> writes:
>   From: Dale Johannesen <dalej@apple.com>
>   Date: Wed, 14 Aug 2002 10:17:46 -0700
>   
>   And I know this is blindingly obvious, but RC takes an extra field (word,
>   probably) in each node.  I suspect this is going to eat up a lot of
>   whatever gain there might be.
>   
>My implementation (I posted the hacked up infrastructure patch the
>other day) used space which is currently empty alongside a bitfield
>in the rtx.

Yet another speed/space trade-off - most architectures are going to take
significantly longer to inc/dec a bitfield than they will doing 
an int. 

RC is obviously less expensive in space when the "things" are larger:
an extra int is horribly expensive when there are only a few words
in the struct, but harmless when there are thousands.
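
To put rough numbers on it (illustrative only, assuming 32-bit ints and
pointers):

struct small    { void *car, *cdr; };		/*    8 bytes */
struct small_rc { int rc; void *car, *cdr; };	/*   12 bytes: +50%  */
struct big      { int payload[1024]; };		/* 4096 bytes        */
struct big_rc   { int rc; int payload[1024]; };	/* 4100 bytes: +0.1% */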

Which reminds me that one of the advantages of the "obstack" scheme 
was that it tended to act as a "slab allocator" with relatively few
chunks, with lots of little things inside each chunk.


-- 
Nick Ing-Simmons
http://www.ni-s.u-net.com/

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-13 21:44   ` Fergus Henderson
  2002-08-14  4:00     ` Noel Yap
@ 2002-08-19  4:58     ` Nick Ing-Simmons
  1 sibling, 0 replies; 256+ messages in thread
From: Nick Ing-Simmons @ 2002-08-19  4:58 UTC (permalink / raw)
  To: fjh; +Cc: gcc, Robert Dewar, Theodore Papadopoulo, mrs, shebs

Fergus Henderson <fjh@cs.mu.OZ.AU> writes:
>On 13-Aug-2002, Theodore Papadopoulo <Theodore.Papadopoulo@sophia.inria.fr> wrote:
>> 
>> the "average 
>> source code" (to be defined by someone ;-) ) is also probably growing
>> in size and complexity...
>
>Indeed.  Also note that more people are using higher-level languages,
>a fair number of which work by compiling to C -- generally with a
>significant expansion in the number of lines of code.
>For example, the Mercury compiler is about 270,000 lines of
>Mercury code, which compiles to about 4.5 million lines of C code.
>This takes gcc a long time to compile...

There has been a perl -> C converter for a while, but nobody has 
been working on it because it takes so long to compile the result.

-- 
Nick Ing-Simmons
http://www.ni-s.u-net.com/

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-17 22:23 ` Mat Hounsell
@ 2002-08-18  6:27   ` Michael S. Zick
  0 siblings, 0 replies; 256+ messages in thread
From: Michael S. Zick @ 2002-08-18  6:27 UTC (permalink / raw)
  To: Mat Hounsell, gcc

On Sunday 18 August 2002 12:23 am, Mat Hounsell wrote:
> > "From the projects I've worked on most software is devided into modules
> > and
>
> each module has common headers and options and generally several files. As
>
> A better solution would be to develop a GCC language frontend that took
> compiler commands and ran them, while maintain the information from the
> last command.
>
That front-end already exists -- or at least a substitute: it's called
"make" on most systems.

Mike

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
       [not found] <1029519609.8400.ezmlm@gcc.gnu.org>
@ 2002-08-17 22:23 ` Mat Hounsell
  2002-08-18  6:27   ` Michael S. Zick
  0 siblings, 1 reply; 256+ messages in thread
From: Mat Hounsell @ 2002-08-17 22:23 UTC (permalink / raw)
  To: gcc

> "From the projects I've worked on most software is devided into modules and
each module has common headers and options and generally several files. As such
all the headers are pre-processed and parsed for every file that needs to be
compiled. Pre-Compiled Headers allow the compiler to load the parsed code."

> "But why load and unload the compiler and the headers for every file in a
module. It would be far more efficient to adapt the build process and start gcc
for the module and then to tell it to compile each file that needs to be
re-compiled. Add pre-compiled header support and it wouldn't even need to
compile the headers once."

When developing GCC _please_ remember not everyone has a dedicated server farm
for compiling.

I developed a better solution than my previous one.  I chose pragmas
because GCC will accept and compile code from the standard input, so I
thought you would need unreal code.

$> gcc -o a.o -c b.c c.c d.c 
This is a good first start. The problem is this will build every file, which
defeats the purpose of having a build system. You have to improve your build
system for this to work effectively.

A better solution would be to develop a GCC language frontend that took
compiler commands and ran them, while maintaining the information from the last
command.

The complexity is then constrained to a front end and to maintaining the
state.  I would like to help develop this, but I have no idea how to
maintain state properly.



^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-16  5:08 Joe Wilson
  2002-08-16  5:51 ` Noel Yap
@ 2002-08-16 11:04 ` Mike Stump
  1 sibling, 0 replies; 256+ messages in thread
From: Mike Stump @ 2002-08-16 11:04 UTC (permalink / raw)
  To: Joe Wilson; +Cc: gcc

On Friday, August 16, 2002, at 05:08 AM, Joe Wilson wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-16  5:08 Joe Wilson
@ 2002-08-16  5:51 ` Noel Yap
  2002-08-16 11:04 ` Mike Stump
  1 sibling, 0 replies; 256+ messages in thread
From: Noel Yap @ 2002-08-16  5:51 UTC (permalink / raw)
  To: Joe Wilson, gcc

--- Joe Wilson <developir@yahoo.com> wrote:
> I was thinking the same thing, except without
> introducing new pragmas.
> You could do the common (header) code precompiling
> only for modules listed 
> on the commandline without having to save state to a
> file-based code 
> respository.  i.e.:
> 
>  g++ -c [flags] module1.cpp module2.cpp module3.cpp

Me, too, but not to the same extent as you guys :-)

> But compiling groups of modules at one time is
> contrary to the way most 
> makefiles work, so it might not be practical.

I think many build systems are starting to have to
deal with this issue due to Java.  IMHO, although I
love make, new build requirements (e.g. multiple targets
from one or more dependencies, target names chosen by
the compiler, ...) are making it obsolete.

Noel


^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
@ 2002-08-16  5:08 Joe Wilson
  2002-08-16  5:51 ` Noel Yap
  2002-08-16 11:04 ` Mike Stump
  0 siblings, 2 replies; 256+ messages in thread
From: Joe Wilson @ 2002-08-16  5:08 UTC (permalink / raw)
  To: gcc

Mat Hounsell wrote:
>But why load and unload the compiler and the headers for every file in a
>module. It would be far more efficient to adapt the build process and start gcc
>for the module and then to tell it to compile each file that needs to be
>re-compiled. Add pre-compiled header support and it wouldn't even need to
>compile the headers once.

I was thinking the same thing, except without introducing new pragmas.
You could do the common (header) code precompiling only for modules listed 
on the commandline without having to save state to a file-based code 
repository.  i.e.:

 g++ -c [flags] module1.cpp module2.cpp module3.cpp

But compiling groups of modules at one time is contrary to the way most 
makefiles work, so it might not be practical.

Perhaps GCC already economizes the evaluation of common code in such 
"group" builds.  Can anyone comment on whether it does or not?



^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
       [not found] <1029475232.9572.ezmlm@gcc.gnu.org>
@ 2002-08-16  1:28 ` Mat Hounsell
  0 siblings, 0 replies; 256+ messages in thread
From: Mat Hounsell @ 2002-08-16  1:28 UTC (permalink / raw)
  To: gcc

After I first built GCC, G++ & GCJ etc. (3.0) on my x86 300MHz, 32MB
machine I began to seriously consider how to increase compilation speed.
The full build would have taken at least nine hours.  This is why I only
built GCC and G++ for 3.1.


From the projects I've worked on most software is divided into modules and each
module has common headers and options and generally several files. As such all
the headers are pre-processed and parsed for every file that needs to be
compiled. Pre-Compiled Headers allow the compiler to load the parsed code.


But why load and unload the compiler and the headers for every file in a
module. It would be far more efficient to adapt the build process and start gcc
for the module and then to tell it to compile each file that needs to be
re-compiled. Add pre-compiled header support and it wouldn't even need to
compile the headers once.

This is how you might implement it in Unix ...

Add a compiler directive to the front end that causes the compiler to compile a
file.
e.g. #pragma gcc reset
e.g. #pragma gcc compile -o file.o -c file.c
e.g. #pragma gcc quit
Add a make system to start gcc reading from a named pipe. 
For each module issue a 'reset' directive to the pipe.
  Then issue a directive to set the compiler options.
  For each file in the module issue a compile directive to the pipe.
When you're finished, issue a quit directive.
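
A toy sketch of the loop such a server might run (purely hypothetical --
these pragmas do not exist in gcc, and the two helpers are stand-ins):

#include <stdio.h>
#include <string.h>

static void reset_state (void)		    { /* drop per-module state */ }
static void compile_one (const char *args) { /* run one compilation */ }

int
main (void)
{
  char line[1024];

  /* stdin would be the named pipe the build system writes to.  */
  while (fgets (line, sizeof line, stdin))
    {
      if (strncmp (line, "#pragma gcc quit", 16) == 0)
	break;
      else if (strncmp (line, "#pragma gcc reset", 17) == 0)
	reset_state ();
      else if (strncmp (line, "#pragma gcc compile", 19) == 0)
	compile_one (line + 19);
    }
  return 0;
}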

If only gcc uses this capability it will be good, as it would allow a much
faster bootstrap in the later, more complex stages, making use of and work
on GCC practical for more people.



^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
@ 2002-08-14 19:11 Tim Josling
  0 siblings, 0 replies; 256+ messages in thread
From: Tim Josling @ 2002-08-14 19:11 UTC (permalink / raw)
  To: gcc

> It could've been interesting to try incremental/generational collection.
> I didn't do that.

There may be quite a few ways to improve locality if that is the problem
(maybe the problem could just be that GC causes a bigger footprint and thereby
affects the hit rates).

Examples: 

Subpools (give allocations a name and put like named allocations together),
for example "Front End" and "Back End".

Hints about allocations that are likely to be long and short lived. Put them
in different places.

"Allocate Near" models where you give a pointer that gives a hint where you
want the next thing allocated.

Some big functions should only be optimised in chunks perhaps. This could
avoid walking long lists that are bigger than cache, and reduce the damage of
various non-linear algorithms:

big.c:999999: Warning: "This function is too big to optimise in a reasonable
time"

Compaction of allocations after freeing memory, perhaps combined with other
options. This requires knowing about all the users of that memory so pointers
can be updated. Indirect pointers perhaps?

Allocate all sizes together. This would make reuse of storage harder of course
but could improve locality.

Explicitly freeing stuff when you know you are the only user.

Allocating certain things in 'never to be freed' mode, thus avoiding having to
GC it all the time. These could be all put together with no need for bitmaps,
holes in allocated memory etc etc.

Maybe some things should be allowed to migrate out of cache and never return.
Maybe freeing them is worse than leaving them alone.
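
To make the first idea concrete, a toy subpool could be as simple as a
named bump allocator (invented interface, illustrative only):

#include <stdlib.h>
#include <string.h>

/* Like-named allocations share an arena, so they stay on the same pages. */
struct subpool { char name[32]; char *base, *ptr, *end; };

static struct subpool *
subpool_new (const char *name, size_t size)
{
  struct subpool *p = malloc (sizeof *p);
  strncpy (p->name, name, sizeof p->name - 1);
  p->name[sizeof p->name - 1] = '\0';
  p->base = p->ptr = malloc (size);
  p->end = p->base + size;
  return p;
}

static void *
subpool_alloc (struct subpool *p, size_t n)
{
  char *r = p->ptr;
  n = (n + 7) & ~(size_t) 7;	/* 8-byte align */
  if (r + n > p->end)
    return 0;			/* toy: no growth */
  p->ptr = r + n;
  return r;
}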

----

One problem is that GCC is so complex and large it is difficult to try
theories. 

It is pretty easy to effectively turn off GC: just increase the size below
which GC does nothing, hardcoded in ggc-page.c (GGC_MIN_LAST_ALLOCATED);
the default is 4MB.
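
The knob looks something like this (a sketch; check ggc-page.c for the
exact definition):

/* Stock: don't collect until at least 4MB has been allocated.  */
#define GGC_MIN_LAST_ALLOCATED (4 * 1024 * 1024)

/* Raise it far beyond any real compile, e.g.
   #define GGC_MIN_LAST_ALLOCATED (256 * 1024 * 1024)
   and collection effectively never runs.  */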

Tim Josling

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-14 13:43           ` Tim Hollebeek
@ 2002-08-14 13:57             ` Jamie Lokier
  0 siblings, 0 replies; 256+ messages in thread
From: Jamie Lokier @ 2002-08-14 13:57 UTC (permalink / raw)
  To: Tim Hollebeek; +Cc: Timothy J. Wood, Dale Johannesen, gcc

Tim Hollebeek wrote:
> A simpler strategy is to just make every object with RC >= 2^n immortal.
> Something like:
> 
> add: if (rc) rc++;
> subtract: if (rc) if (!--rc) delete;

If the goal is simply to encourage freeing of objects for rapid cache
reuse, then I suspect even 1 or 2 bits of reference count would be
enough.  GGC can sort out the rest.

You still have to add ref/unref code all over the compiler though.
Perhaps that could be automated, by the compiler itself tracing pointers
and adding ref/unref code?  (That would be useful for a lot of programs,
I suspect).  If we see ref/unref as only an optimisation hint for GGC,
then it doesn't matter that the bootstrap compiler won't insert them.

-- Jamie

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-14 13:35         ` Jamie Lokier
@ 2002-08-14 13:43           ` Tim Hollebeek
  2002-08-14 13:57             ` Jamie Lokier
  0 siblings, 1 reply; 256+ messages in thread
From: Tim Hollebeek @ 2002-08-14 13:43 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Timothy J. Wood, Dale Johannesen, gcc

On Wed, Aug 14, 2002 at 09:35:07PM +0100, Jamie Lokier wrote:
> Timothy J. Wood wrote:
> >    I'm not sure what you mean.  The external RC table only gets modified 
> > once every 2^N operations already (and never if the internal RC never 
> > overflows).  Stated another way, the internal RC holds 
> > the low order bits and the external RC holds the high order bits.
> 
> Ah, right, I misunderstood the code.  Your way is a fine way.
> Presumably if it only gets called every 2^N operations at most, then
> it's not _strictly_ the high order bits in the hash table.  I.e. I'm
> thinking of the reference pattern 127 -> 128 -> 127 -> 128...

A simpler strategy is to just make every object with RC >= 2^n immortal.
Something like:

add: if (rc) rc++;

subtract: if (rc) if (!--rc) delete;

For a program like gcc that doesn't have to absolutely guarantee it
leaks no memory, this can be an acceptable tradeoff.

(for belt and suspenders people, run an infrequent lazy gc to clean up
the scraps)
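
Spelled out in C, a minimal sketch of that saturating scheme, assuming an
8-bit count and a hypothetical free_obj destructor (objects start with
rc = 1; once the count wraps to 0 on increment, both guards turn the
operations into no-ops and the object is immortal):

struct obj
{
  unsigned int rc : 8;       /* 0 means "saturated, immortal" */
  /* ... payload ... */
};

extern void free_obj (struct obj *);   /* hypothetical destructor */

static void
obj_ref (struct obj *o)
{
  if (o->rc)
    o->rc++;                 /* 255 + 1 wraps to 0: now immortal */
}

static void
obj_unref (struct obj *o)
{
  if (o->rc && !--o->rc)
    free_obj (o);            /* a live count reaching 0 really frees */
}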

-Tim

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-14 13:29       ` Timothy J. Wood
@ 2002-08-14 13:35         ` Jamie Lokier
  2002-08-14 13:43           ` Tim Hollebeek
  0 siblings, 1 reply; 256+ messages in thread
From: Jamie Lokier @ 2002-08-14 13:35 UTC (permalink / raw)
  To: Timothy J. Wood; +Cc: Dale Johannesen, gcc

Timothy J. Wood wrote:
>    I'm not sure what you mean.  The external RC table only gets modified 
> once every 2^N operations already (and never if the internal RC never 
> overflows).  Stated another way, the internal RC holds 
> the low order bits and the external RC holds the high order bits.

Ah, right, I misunderstood the code.  Your way is a fine way.
Presumably if it only gets called every 2^N operations at most, then
it's not _strictly_ the high order bits in the hash table.  I.e. I'm
thinking of the reference pattern 127 -> 128 -> 127 -> 128...

-- Jamie

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-14 13:16     ` Jamie Lokier
@ 2002-08-14 13:29       ` Timothy J. Wood
  2002-08-14 13:35         ` Jamie Lokier
  0 siblings, 1 reply; 256+ messages in thread
From: Timothy J. Wood @ 2002-08-14 13:29 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Dale Johannesen, gcc

On Wednesday, August 14, 2002, at 01:16  PM, Jamie Lokier wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-14 11:27   ` Timothy J. Wood
  2002-08-14 11:42     ` David S. Miller
@ 2002-08-14 13:16     ` Jamie Lokier
  2002-08-14 13:29       ` Timothy J. Wood
  1 sibling, 1 reply; 256+ messages in thread
From: Jamie Lokier @ 2002-08-14 13:16 UTC (permalink / raw)
  To: Timothy J. Wood; +Cc: Dale Johannesen, gcc

Timothy J. Wood wrote:
> 		// This function adds an entry of 1 to the external hash table if there
> 		// was no entry, otherwise it increments the existing entry.
> 		IncrementExternalRC(this);

Nice idea, although you might be able to amortise the hash table
accesses some more by storing multiples of 64 in the external hash
table.  When the internal count overflows, transfer 64 counts to the
hash table, and vice versa when it reaches zero.
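
A sketch of that amortisation in C, with a 6-bit internal counter; the
external table accessors and free_obj are assumed helpers, not real GCC
code:

#include <stdlib.h>

struct obj
{
  unsigned int rc : 6;       /* low-order count, 0..63 */
  /* ... payload ... */
};

/* Hypothetical side table keyed by object address, counting
   references in units of 64.  */
extern unsigned long ext_rc_get (struct obj *);
extern void ext_rc_set (struct obj *, unsigned long);
extern void free_obj (struct obj *);

static void
obj_ref (struct obj *o)
{
  if (o->rc == 63)
    {
      /* Carry 64 counts out: the table is touched once per 64 refs.  */
      ext_rc_set (o, ext_rc_get (o) + 1);
      o->rc = 0;
    }
  else
    o->rc++;
}

static void
obj_unref (struct obj *o)
{
  if (o->rc == 0)
    {
      /* Borrow 64 counts back from the table.  */
      unsigned long ext = ext_rc_get (o);
      if (ext == 0)
        abort ();            /* underflow: a refcounting bug */
      ext_rc_set (o, ext - 1);
      o->rc = 63;            /* 64 borrowed, minus this unref */
    }
  else if (--o->rc == 0 && ext_rc_get (o) == 0)
    free_obj (o);
}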

-- Jamie

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-14 11:42     ` David S. Miller
@ 2002-08-14 13:13       ` Jamie Lokier
  2002-08-19  5:20         ` Nick Ing-Simmons
  0 siblings, 1 reply; 256+ messages in thread
From: Jamie Lokier @ 2002-08-14 13:13 UTC (permalink / raw)
  To: David S. Miller; +Cc: tjw, dalej, gcc

David S. Miller wrote:
>       You can easily use much less than a full word.  Foundation on 
>    OpenStep/Mach 4.2 started storing partial ref counts in whatever spare 
>    bits were available in each object.
> 
> We don't have any spare bits, we have 32 bits used for RTX state then
> the next object is a pointer.  So whatever size counter we use will
> eat at a minimum a word's worth of space to get the pointers aligned
> up properly.

Did I see "pointers" and "aligned" there?

You have 2 spare bits per pointer, man, open your eyes! :-)
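
For instance (illustrative only, and note the two bits travel with each
*pointer* rather than with the object, so this shows just the tagging
mechanics):

#include <stdint.h>

/* Objects at least 4-byte aligned leave the two low bits of every
   pointer to them zero, so those bits can carry a small tag.  */
#define TAG_BITS ((uintptr_t) 3)

static void *
ptr_strip (void *p)                    /* recover the real address */
{
  return (void *) ((uintptr_t) p & ~TAG_BITS);
}

static unsigned
ptr_tag (void *p)                      /* read the 2-bit tag */
{
  return (unsigned) ((uintptr_t) p & TAG_BITS);
}

static void *
ptr_with_tag (void *p, unsigned tag)   /* store a 2-bit tag */
{
  return (void *) (((uintptr_t) p & ~TAG_BITS) | (tag & TAG_BITS));
}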

-- Jamie

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-14 11:27   ` Timothy J. Wood
@ 2002-08-14 11:42     ` David S. Miller
  2002-08-14 13:13       ` Jamie Lokier
  2002-08-14 13:16     ` Jamie Lokier
  1 sibling, 1 reply; 256+ messages in thread
From: David S. Miller @ 2002-08-14 11:42 UTC (permalink / raw)
  To: tjw; +Cc: dalej, gcc

   From: "Timothy J. Wood" <tjw@omnigroup.com>
   Date: Wed, 14 Aug 2002 11:27:07 -0700
   
      You can easily use much less than a full word.  Foundation on 
   OpenStep/Mach 4.2 started storing partial ref counts in whatever spare 
   bits were available in each object.

We don't have any spare bits, we have 32 bits used for RTX state then
the next object is a pointer.  So whatever size counter we use will
eat at a minimum a word's worth of space to get the pointers aligned
up properly.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-14 10:17 ` Dale Johannesen
  2002-08-14 10:56   ` David S. Miller
@ 2002-08-14 11:27   ` Timothy J. Wood
  2002-08-14 11:42     ` David S. Miller
  2002-08-14 13:16     ` Jamie Lokier
  1 sibling, 2 replies; 256+ messages in thread
From: Timothy J. Wood @ 2002-08-14 11:27 UTC (permalink / raw)
  To: Dale Johannesen; +Cc: gcc

On Wednesday, August 14, 2002, at 10:17  AM, Dale Johannesen wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-14 11:04     ` Dale Johannesen
@ 2002-08-14 11:08       ` David S. Miller
  0 siblings, 0 replies; 256+ messages in thread
From: David S. Miller @ 2002-08-14 11:08 UTC (permalink / raw)
  To: dalej; +Cc: gcc

   From: Dale Johannesen <dalej@apple.com>
   Date: Wed, 14 Aug 2002 11:04:46 -0700

   On Wednesday, August 14, 2002, at 10:42 AM, David S. Miller wrote:
   
   > My implementation (I posted the hacked up infrastructure patch the
   > other day) used space which is currently empty alongside a bitfield
   > in the rtx.
   
   Looks to me like you increased the total bitfield usage in rtx_def
   from 32 to 56 bits.  That's free on a machine with 64-bit words I
   suppose, is that what you're talking about?  It's certainly not
   empty space on a 32-bit machine.
   
Ignore me, I can't count.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-14 10:56   ` David S. Miller
@ 2002-08-14 11:04     ` Dale Johannesen
  2002-08-14 11:08       ` David S. Miller
  2002-08-19  5:15     ` Nick Ing-Simmons
  1 sibling, 1 reply; 256+ messages in thread
From: Dale Johannesen @ 2002-08-14 11:04 UTC (permalink / raw)
  To: David S. Miller; +Cc: Dale Johannesen, gcc

On Wednesday, August 14, 2002, at 10:42 AM, David S. Miller wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-14 10:17 ` Dale Johannesen
@ 2002-08-14 10:56   ` David S. Miller
  2002-08-14 11:04     ` Dale Johannesen
  2002-08-19  5:15     ` Nick Ing-Simmons
  2002-08-14 11:27   ` Timothy J. Wood
  1 sibling, 2 replies; 256+ messages in thread
From: David S. Miller @ 2002-08-14 10:56 UTC (permalink / raw)
  To: dalej; +Cc: gcc

   From: Dale Johannesen <dalej@apple.com>
   Date: Wed, 14 Aug 2002 10:17:46 -0700
   
   And I know this is blindingly obvious, but RC takes an extra field (word,
   probably) in each node.  I suspect this is going to eat up a lot of
   whatever gain there might be.
   
My implementation (I posted the hacked up infrastructure patch the
other day) used space which is currently empty alongside a bitfield
in the rtx.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-13 12:49 Robert Dewar
@ 2002-08-14 10:17 ` Dale Johannesen
  2002-08-14 10:56   ` David S. Miller
  2002-08-14 11:27   ` Timothy J. Wood
  0 siblings, 2 replies; 256+ messages in thread
From: Dale Johannesen @ 2002-08-14 10:17 UTC (permalink / raw)
  To: gcc; +Cc: Dale Johannesen

Two points on reference counts.

The compiler I worked on before gcc used RC, and was neither blindingly
fast, nor so slow that anybody complained about it.  Compile speed wasn't
an issue so we never ran any numbers, but my impression was that it was
roughly the same as the gcc in SPEC (2.7 IIRC).  OTOH, we did have quite a
lot of RC bugs, and I think you can expect to also.

And I know this is blindingly obvious, but RC takes an extra field (word,
probably) in each node.  I suspect this is going to eat up a lot of
whatever gain there might be.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-14  4:45         ` Noel Yap
@ 2002-08-14 10:06           ` Janis Johnson
  0 siblings, 0 replies; 256+ messages in thread
From: Janis Johnson @ 2002-08-14 10:06 UTC (permalink / raw)
  To: Noel Yap; +Cc: Michael Matz, gcc

On Wed, Aug 14, 2002 at 04:45:14AM -0700, Noel Yap wrote:
>
>  How is compilation speed being tested?

I would also like to know the preferred method of testing compilation
speed.  Possibilities include using the time utility, or GCC's
-ftime-report, which prints information about the time used by each
pass.  People with access to platform-dependent tools are also using
those to gather specific kinds of information, as David Edelsohn did
with cache misses on AIX.
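
For instance, both simple methods on a single compilation (the file name is
only illustrative):

time gcc -c -O2 expr.c             # overall wall/user/sys from the time utility
gcc -c -O2 -ftime-report expr.c    # per-pass breakdown printed by GCC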

The release criteria include testing compile-time performance and peak
memory usage; see http://gcc.gnu.org/gcc-3.1/criteria.html .  The only
source code listed is a GCC source file.  As people do compile-time
testing they should consider proposing more tests to include in the
release criteria, along with specific methods of measurement.

Janis

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-14  4:36       ` Michael Matz
@ 2002-08-14  4:45         ` Noel Yap
  2002-08-14 10:06           ` Janis Johnson
  0 siblings, 1 reply; 256+ messages in thread
From: Noel Yap @ 2002-08-14  4:45 UTC (permalink / raw)
  To: Michael Matz; +Cc: gcc

--- Michael Matz <matz@suse.de> wrote:
> > From what I've heard on this thread, I was under the impression that
> > the talk is of improving the C++ front end, but not C.
> 
> Simply look at the subject of this thread.  The topic is faster
> compilation, no matter if C, C++ or whatever else.

If compilation speed is being tested purely on C++ code, it's possible
that only C++ compilation speed will be improved.  How is compilation
speed being tested?  How much investigation is being done on the back
end as compared to the front end?

I would really like to understand what's being done, so I apologize for
these questions.

Thanks,
Noel


^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-14  4:00     ` Noel Yap
@ 2002-08-14  4:36       ` Michael Matz
  2002-08-14  4:45         ` Noel Yap
  0 siblings, 1 reply; 256+ messages in thread
From: Michael Matz @ 2002-08-14  4:36 UTC (permalink / raw)
  To: Noel Yap; +Cc: gcc

Hi,

On Wed, 14 Aug 2002, Noel Yap wrote:

> From what I've heard on this thread, I was under the impression that
> the talk is of improving the C++ front end, but not C.

Simply look at the subject of this thread.  The topic is faster
compilation, no matter if C, C++ or whatever else.


Ciao,
Michael.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-13 21:44   ` Fergus Henderson
@ 2002-08-14  4:00     ` Noel Yap
  2002-08-14  4:36       ` Michael Matz
  2002-08-19  4:58     ` Nick Ing-Simmons
  1 sibling, 1 reply; 256+ messages in thread
From: Noel Yap @ 2002-08-14  4:00 UTC (permalink / raw)
  To: Fergus Henderson, Theodore Papadopoulo; +Cc: Robert Dewar, gcc, shebs, mrs

--- Fergus Henderson <fjh@cs.mu.OZ.AU> wrote:
> On 13-Aug-2002, Theodore Papadopoulo
> <Theodore.Papadopoulo@sophia.inria.fr> wrote:
> > 
> > the "average source code" (to be defined by someone ;-) ) is also
> > probably growing in size and complexity...
> 
> Indeed.  Also note that more people are using higher-level languages,
> a fair number of which work by compiling to C -- generally with a
> significant expansion in the number of lines of code.
> For example, the Mercury compiler is about 270,000 lines of Mercury
> code, which compiles to about 4.5 million lines of C code.
> This takes gcc a long time to compile...
> 
> So I am very supportive of any work that can be done to improve the
> speed of gcc.

From what I've heard on this thread, I was under the impression that the
talk is of improving the C++ front end, but not C.  I'm probably confused
about this (possibly since this thread is partially a brainstorming
session), so can someone put me straight, please?  If there's a large
enough consensus on the top candidates to investigate, what are they?

Thanks,
Noel


^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-13 12:02 Robert Dewar
  2002-08-13 12:32 ` Robert Lipe
@ 2002-08-14  2:55 ` Daniel Egger
  1 sibling, 0 replies; 256+ messages in thread
From: Daniel Egger @ 2002-08-14  2:55 UTC (permalink / raw)
  To: Robert Dewar; +Cc: GCC Developer Mailinglist

Am Die, 2002-08-13 um 21.02 schrieb Robert Dewar:

> and compilers are indeed slower. 10 hours should certainly be enough to build
> any project at this stage.

Speculation. I haven't recently tried to build OpenOffice, but build 632
took around 24h on a dual P-III with lots of RAM and a fast RAID one
year ago with gcc 2.95.x. I bet that with the code increase and the
general slowdown of gcc 3.x you'll hardly find any mainstream machine
which will compile it in under 10 hours.
 
-- 
Servus,
       Daniel

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-13 10:20 ` Theodore Papadopoulo
@ 2002-08-13 21:44   ` Fergus Henderson
  2002-08-14  4:00     ` Noel Yap
  2002-08-19  4:58     ` Nick Ing-Simmons
  0 siblings, 2 replies; 256+ messages in thread
From: Fergus Henderson @ 2002-08-13 21:44 UTC (permalink / raw)
  To: Theodore Papadopoulo; +Cc: Robert Dewar, gcc, shebs, mrs

On 13-Aug-2002, Theodore Papadopoulo <Theodore.Papadopoulo@sophia.inria.fr> wrote:
> 
> the "average 
> source code" (to be defined by someone ;-) ) is also probably growing
> in size and complexity...

Indeed.  Also note that more people are using higher-level languages,
a fair number of which work by compiling to C -- generally with a
significant expansion in the number of lines of code.
For example, the Mercury compiler is about 270,000 lines of
Mercury code, which compiles to about 4.5 million lines of C code.
This takes gcc a long time to compile...

So I am very supportive of any work that can be done to improve the
speed of gcc.

-- 
Fergus Henderson <fjh@cs.mu.oz.au>  |  "I have always known that the pursuit
The University of Melbourne         |  of excellence is a lethal habit"
WWW: < http://www.cs.mu.oz.au/~fjh >  |     -- the last words of T. S. Garp.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-13 16:53 ` Joe Buck
@ 2002-08-13 17:24   ` Paul Koning
  0 siblings, 0 replies; 256+ messages in thread
From: Paul Koning @ 2002-08-13 17:24 UTC (permalink / raw)
  To: Joe.Buck; +Cc: dewar, gcc

>>>>> "Joe" == Joe Buck <Joe.Buck@synopsys.com> writes:

 Joe> Robert Dewar writes:
 >> But remember that work you put in on speeding up the compiler is
 >> work that you do not put in on improving the compiler. As time
 >> goes on, quality of generated code continues to be critical,
 >> compiler speed is less critical.

 Joe> Um, possibly you forget that in order to get a change accepted,
 Joe> the contributor has to do a three-stage bootstrap with the
 Joe> compiler and run the regressions.  If that process ran three
 Joe> times faster, contributors could try three times as many
 Joe> versions of possible patches in the same time period.

Yes indeed.  On a reasonable PC (a 2-year old laptop) a single
iteration of that process takes many hours.

	  paul

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-13 10:36 Robert Dewar
  2002-08-13 13:46 ` Kai Henningsen
@ 2002-08-13 16:53 ` Joe Buck
  2002-08-13 17:24   ` Paul Koning
  1 sibling, 1 reply; 256+ messages in thread
From: Joe Buck @ 2002-08-13 16:53 UTC (permalink / raw)
  To: Robert Dewar; +Cc: Theodore.Papadopoulo, dewar, gcc, mrs, shebs

Robert Dewar writes:
> But remember that work you put in on speeding up the compiler is work
> that you do not put in on improving the compiler. As time goes on, quality
> of generated code continues to be critical, compiler speed is less critical.

Um, possibly you forget that in order to get a change accepted, the
contributor has to do a three-stage bootstrap with the compiler and run
the regressions.  If that process ran three times faster, contributors
could try three times as many versions of possible patches in the same
time period.

Given this, making gcc build itself and run the regressions faster would
lead to faster improvement in the compiler.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-13 15:00 Tim Josling
@ 2002-08-13 15:48 ` Russ Allbery
  0 siblings, 0 replies; 256+ messages in thread
From: Russ Allbery @ 2002-08-13 15:48 UTC (permalink / raw)
  To: gcc

Tim Josling <tej@melbpc.org.au> writes:

> The COBOL spec is about 1500 pages in a smallish font (including addenda
> and the "intrinsic functions"). My copy of the C standard, for example,
> runs to about 200 pages(1).

> (1) Excluding the library.

ISO C99 is 554 pages including the standard library, all of the annexes,
and the index, so you can even make the comparison on that basis; it's
about a third of the size of COBOL including a decent chunk of the
standard library.  (glibc includes a full POSIX implementation plus a
bunch of other stuff, so it's hard to make an apples and apples
comparison.)

-- 
Russ Allbery (rra@stanford.edu)             < http://www.eyrie.org/~eagle/ >

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
@ 2002-08-13 15:00 Tim Josling
  2002-08-13 15:48 ` Russ Allbery
  0 siblings, 1 reply; 256+ messages in thread
From: Tim Josling @ 2002-08-13 15:00 UTC (permalink / raw)
  To: gcc

>>File size is not the only parameter. Modern languages do more
>> complicated things than the average Cobol compiler I suppose....
>>

> You suppose dramatically wrong (it is amazing how little people know about
> COBOL and how much they are willing to guess). Modern COBOL is an extremely
> complex language, certainly more complex than Ada, and probably more complex
> than C++.

The COBOL spec is about 1500 pages in a smallish font (including addenda and
the "intrinsic functions"). My copy of the C standard, for example, runs to
about 200 pages(1). 'Modern' languages are a lot more regular and were
designed with the compiler writer in mind. The concerns of the compiler writer
were definitely not at the forefront of the COBOL language designers' minds.

> The point is that GCC has a really terrible time if you throw a single
> procedure with tens of thousands of lines of code in it at the compiler.

Correct. The largest single function written in COBOL, that I have been able
to find, is several *hundred thousand* lines long. Even the slightest
non-linearity is a major problem.

Tim Josling

(1) Excluding the library. You could argue that some COBOL verbs are
similar to the library, which is true, but the C library hardly affects the
compiler itself. In GNU the C library is even a separate project. In COBOL the
verbs are part of the language syntax and require their own parse trees and so
forth, so it would be very difficult to have a separate project. Even the
intrinsic functions, though they look like functions, are just more syntax in a
slightly more regular form.

Some of the C library functions are tightly coupled to the compiler e.g.
setjmp, va_*, memset (if inlined), printf (for parameter checking). But by and
large the library is independent.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-13 11:53 ` Stan Shebs
@ 2002-08-13 14:53   ` Joe Buck
  0 siblings, 0 replies; 256+ messages in thread
From: Joe Buck @ 2002-08-13 14:53 UTC (permalink / raw)
  To: Stan Shebs; +Cc: Robert Dewar, drow, Theodore.Papadopoulo, gcc, mrs, phil

Stan writes:
> I'm not very keen on trying to start from scratch; the people that have
> tried over the past few years haven't done so well.  Also, since we don't
> seem to have a good understanding of why GCC is slow, then why would we
> expect a redesign to somehow avoid those problems?  And if we do get an
> understanding, then we can estimate the effort to change it incrementally.

David Edelsohn's numbers suggest that bad cache behavior may be the
primary culprit.  Figuring out how to improve data locality might be a
good starting point.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-13 10:36 Robert Dewar
@ 2002-08-13 13:46 ` Kai Henningsen
  2002-08-13 16:53 ` Joe Buck
  1 sibling, 0 replies; 256+ messages in thread
From: Kai Henningsen @ 2002-08-13 13:46 UTC (permalink / raw)
  To: gcc

dewar@gnat.com (Robert Dewar)  wrote on 13.08.02 in < 20020813173609.3DC80F2D49@nile.gnat.com >:

> But remember that work you put in on speeding up the compiler is work
> that you do not put in on improving the compiler. As time goes on, quality
> of generated code continues to be critical, compiler speed is less critical.

Not necessarily. It depends on what you do to get the speed up.

If, for example, you get the speed up by noticing that gcc does some
useless work, and you eliminate that, that is most definitely an
improvement. Or if you find that another algorithm gives at least as good
a result in much shorter time.

On the other hand, if you find you can improve the speed by writing lots
of spaghetti code, that is probably not an improvement.

> Very little in practice. You do not rebuild a million line system every
> two minutes after all, and in practice once the build time for a large
> system is down in the ten minute range, the gains in making it faster
> diminish rapidly. This is not a guess, as I say, this is an observation

Well, my personal observation is that a change much like this (ten minutes
to one) has recently made me *much* more productive. For one thing, if the
right change will cause a complete rebuild of the project (because a
common header file changes), I no longer look for alternate changes just
to avoid that. Nor do I have as much trouble remembering what exactly that
last change was that I'm supposed to test now, as I have already spent
time thinking about (and possibly implementing) three more changes.

Sure, ten hours to one hour gives you back much more time. But ten minutes  
to one is still enough for a significant change in development style.

Ten seconds to one does not look like it would be all that important - but  
not having lived with it, I can't say for sure.

MfG Kai

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-12 15:25 ` David S. Miller
@ 2002-08-13 13:46   ` Kai Henningsen
  0 siblings, 0 replies; 256+ messages in thread
From: Kai Henningsen @ 2002-08-13 13:46 UTC (permalink / raw)
  To: gcc

davem@redhat.com (David S. Miller)  wrote on 12.08.02 in < 20020812.151215.81906053.davem@redhat.com >:

>    From: dewar@gnat.com (Robert Dewar)
>    Date: Mon, 12 Aug 2002 18:21:28 -0400 (EDT)
>
>    Of course the issue is what happens if there is a lapse in
>    discipline. If it is only a matter of efficiency, that's one thing,
>    if it becomes a focus of bugs then that's another.
>
> This is why it is important to use something, such as the existing RTL
> walking GC infrastructure, to verify the reference counts.  And to
> have this verification mechanism enabled during development cycles.

I've recently found quite a number of allocation bugs (in another project)  
by using a malloc implementation that filled any unallocated or freshly- 
allocated memory with a value that makes for a bad pointer. (That's  
hardware specific, of course.) It might pay to have at least one  
regression tester running with such a beast.
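
A minimal sketch of such a wrapper; the poison byte is the machine-specific
part, and 0xDE is only an example (a real wrapper would also record the
block size rather than take it as an argument):

#include <stdlib.h>
#include <string.h>

#define POISON_BYTE 0xDE   /* pick something that repeats into a bad pointer */

static void *
poison_malloc (size_t size)
{
  void *p = malloc (size);
  if (p)
    memset (p, POISON_BYTE, size);     /* uninitialized reads yield bad pointers */
  return p;
}

static void
poison_free (void *p, size_t size)
{
  if (p)
    {
      memset (p, POISON_BYTE, size);   /* use-after-free yields bad pointers */
      free (p);
    }
}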

MfG Kai

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
@ 2002-08-13 12:49 Robert Dewar
  2002-08-14 10:17 ` Dale Johannesen
  0 siblings, 1 reply; 256+ messages in thread
From: Robert Dewar @ 2002-08-13 12:49 UTC (permalink / raw)
  To: gcc, robertlipe

<<Now will you please quit arguing with Apple that GCC is not really too
slow for them today based solely on counterarguments that either some
other compiler for some other language was fast or that processors will
be faster sooner than the compiler can be made faster?
>>

You miss my point, which is that it is only worth doing things that have a
really substantial impact and can be done on a reasonably short time scale.
You are simply not going to get anywhere by, for example, worrying about
avoiding refolding expressions.

I would guess the two big opportunities are the persistent front end and PCH,
but from what I understand Apple has already done these two steps, so the 
question is where to go from there, and that is far from clear.

I have always found GCC awfully slow myself. Remember that I am used to
using compilers that are far far faster than the code warrior compilers :-)

The thing to avoid is putting in a large amount of work that results in little
real speed up, at the expense of reliability and other improvements.

One thing that would be interesting is to know, for one of these giant OS
projects (which I assume are in the million line but not dozens of million
line range) what the division between front end time and back end time is.

In the case of Ada most of the time is spent in the back end for large programs
so there is not much we can do in the front end if optimization is turned on.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-13 12:32 ` Robert Lipe
@ 2002-08-13 12:45   ` Gabriel Dos Reis
  0 siblings, 0 replies; 256+ messages in thread
From: Gabriel Dos Reis @ 2002-08-13 12:45 UTC (permalink / raw)
  To: Robert Lipe; +Cc: gcc

Robert Lipe <robertlipe@usa.net> writes:

[...]

| Now will you please quit arguing with Apple that GCC is not really too
| slow for them today based solely on counterarguments that either some
| other compiler for some other language was fast or that processors will
| be faster sooner than the compiler can be made faster?
| 
| This is getting pretty silly...

I do remember -- in a recent discussion on this list -- someone arguing
something similar...

-- Gaby

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-13 12:02 Robert Dewar
@ 2002-08-13 12:32 ` Robert Lipe
  2002-08-13 12:45   ` Gabriel Dos Reis
  2002-08-14  2:55 ` Daniel Egger
  1 sibling, 1 reply; 256+ messages in thread
From: Robert Lipe @ 2002-08-13 12:32 UTC (permalink / raw)
  To: gcc

Robert Dewar wrote:

> compilers are indeed slower. 10 hours should certainly be enough to build
> any project at this stage.

It's past the threshold of pain, for sure, but plenty of things
take waaay longer than that to build.  (Hint: remember that Stan's
first-order customers are in the OS business.)

Now will you please quit arguing with Apple that GCC is not really too
slow for them today based solely on counterarguments that either some
other compiler for some other language was fast or that processors will
be faster sooner than the compiler can be made faster?

This is getting pretty silly...

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
@ 2002-08-13 12:02 Robert Dewar
  2002-08-13 12:32 ` Robert Lipe
  2002-08-14  2:55 ` Daniel Egger
  0 siblings, 2 replies; 256+ messages in thread
From: Robert Dewar @ 2002-08-13 12:02 UTC (permalink / raw)
  To: austern, dewar; +Cc: Theodore.Papadopoulo, drow, gcc, mrs, phil, shebs

<<For some of the things Apple does, a 10-hour build time would be
a major improvement.  I don't think we're alone in that.  Machines
are faster than they once were, but projects are now much larger.
>>

and compilers are indeed slower. 10 hours should certainly be enough to build
any project at this stage.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-13  9:10 Robert Dewar
  2002-08-13 10:20 ` Theodore Papadopoulo
  2002-08-13 10:50 ` Matt Austern
@ 2002-08-13 11:53 ` Stan Shebs
  2002-08-13 14:53   ` Joe Buck
  2 siblings, 1 reply; 256+ messages in thread
From: Stan Shebs @ 2002-08-13 11:53 UTC (permalink / raw)
  To: Robert Dewar; +Cc: drow, Theodore.Papadopoulo, gcc, mrs, phil

Robert Dewar wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-13  9:10 Robert Dewar
  2002-08-13 10:20 ` Theodore Papadopoulo
@ 2002-08-13 10:50 ` Matt Austern
  2002-08-13 11:53 ` Stan Shebs
  2 siblings, 0 replies; 256+ messages in thread
From: Matt Austern @ 2002-08-13 10:50 UTC (permalink / raw)
  To: Robert Dewar; +Cc: drow, Theodore.Papadopoulo, gcc, mrs, phil, shebs

On Tuesday, August 13, 2002, at 09:10 AM, Robert Dewar wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
@ 2002-08-13 10:36 Robert Dewar
  2002-08-13 13:46 ` Kai Henningsen
  2002-08-13 16:53 ` Joe Buck
  0 siblings, 2 replies; 256+ messages in thread
From: Robert Dewar @ 2002-08-13 10:36 UTC (permalink / raw)
  To: Theodore.Papadopoulo, dewar; +Cc: gcc, mrs, shebs

<<We should see any speed improvement as a possibility to add
more functionality into the compiler without much changing the
speed increase the user expects to see. Even though for the time
being (and given the current state of gcc compared to the competition),
it looks like a lot of people just want to see the compiler go faster...
>>

But remember that work you put in on speeding up the compiler is work
that you do not put in on improving the compiler. As time goes on, quality
of generated code continues to be critical, compiler speed is less critical.

<<Now, you may well be right in this case; you certainly know more
than I do. Are you sure, though, that the quality of the code generated by
these compilers was equal?!? I suppose so, but I am just asking for
confirmation.
>>

Well, Philippe Kahn, in the keynote address at one big PC meeting, asked
the audience if they knew which compiler for any language on the PC
generated the best code for the popular sieve benchmark. He surprised
the audience by telling them it was Realia COBOL. Now I don't know if
the guys at Computer Associates have kept up, but certainly that data
point shows that fast compilers can generate efficient code.

<<File size is not the only parameter. Modern languages do more
complicated things than the average Cobol compiler I suppose....
>>

You suppose dramatically wrong (it is amazing how little people know about
COBOL and how much they are willing to guess). Modern COBOL is an extremely
complex language, certainly more complex than Ada, and probably more complex
than C++.

The point is that GCC has a really terrible time if you throw a single
procedure with tens of thousands of lines of code in it at the compiler.

<<At the same time, people are getting new machines and expect their
programs to compile faster... not to mention that the "average
source code" (to be defined by someone ;-) ) is also probably growing
in size and complexity...
>>

Actually compilers have in general got slower with time (see my SIGPLAN
compiler tutorial of many years ago, where I talked about the spectacular
advances in technology of slow compilers :-) Few modern compilers can
match Fastran on the IBM 7094.

<<And, it also depends on what the nine minutes you gained allow you
to do on your computer.... If the nine minutes can be used to do what
the average user considers to be a very important task, then nine
minutes is a lot !!!
>>

Very little in practice. You do not rebuild a million line system every
two minutes after all, and in practice once the build time for a large
system is down in the ten minute range, the gains in making it faster
diminish rapidly. This is not a guess; as I say, this is an observation
of market forces over a period of years in the competition between
Realia COBOL and Microfocus COBOL, where Realia always had a factor of
ten or more in compile speed to compete with.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-13  9:10 Robert Dewar
@ 2002-08-13 10:20 ` Theodore Papadopoulo
  2002-08-13 21:44   ` Fergus Henderson
  2002-08-13 10:50 ` Matt Austern
  2002-08-13 11:53 ` Stan Shebs
  2 siblings, 1 reply; 256+ messages in thread
From: Theodore Papadopoulo @ 2002-08-13 10:20 UTC (permalink / raw)
  To: Robert Dewar; +Cc: gcc, shebs, mrs

dewar@gnat.com said:
> I seriously doubt that incremental compilation can help. Usually it is
> far better to aim at the simplest fastest possible compilation path
> without bothering with the extra bookkeeping needed for IC.

I'm not convinced that the bookkeeping should be that expensive
(difficult to get right, possibly, but expensive??).
 
> Historically the fastest compilers have not been incremental, and IC
> has only been used to make painfully slow compilers a little less
> painful 

History has its value and must be considered. Unfortunately, often
only the conclusions are kept and not all the premises that led to
them. I have often seen progress come from realizing that some
historical rule that no-one was questioning was no longer true...

Now, you may well be right in this case; you certainly know more
than I do. Are you sure, though, that the quality of the code generated by
these compilers was equal?!? I suppose so, but I am just asking for
confirmation.

IC just looks to go one step beyond PCH, and it looks like PCH is
nowadays an often-used technique to speed up compilation. But (see
below), I agree that this (IC or PCH or whatever else) should not be
done at any cost...

dewar@gnat.com said:
> Actually COBOL programs are FAR FAR larger than C or C++ programs in
> practice. In particular, single files of hundreds of thousands of
> lines are common, and complete systems of millions of lines are
> common. That's why there is so much legacy COBOL around :-)

> My point is that a factor of ten is relative. 

File size is not the only parameter. Modern languages do more
complicated things than the average Cobol compiler I suppose....

> My point is that a factor of ten is relative. 
> If you have a million-line COBOL program and it takes 10 hours to
> compile, then cutting it down to 1 hour is a real win. If it takes 10
> minutes to compile, then cutting it down to 1 minute is a much smaller
> win in practice. 

Relativity is a strange beast...
Of course, your argument is sensible, but the problem is that we
humans often do not work like this:

- Something that is faster than our reaction time is considered
  zero cost.

- Something that is slow enough to bore us is considered
  unacceptably slow. The limit is somewhat fuzzy, but for some tasks,
  people will always tend to push towards this limit.

And, it also depends on what the nine minutes you gained allow you 
to do on your computer.... If the nine minutes can be used to do what 
the average user considers to be a very important task, then nine 
minutes is a lot !!!

> My point is that if you embark on a big project that will take you two
> years to complete successfully, that speeds up the compiler by a
> factor of two, then it probably will seem not that worth while when it
> is finished. 

Well, I tend to slightly disagree. If you made your algorithm/compiler
faster, that is always a net gain for the future. Computers are
faster, but eventually also require more complex techniques for code
generation, so that what is gained in terms of raw speed from the 
processor might be lost in terms of the more expensive algorithms 
that are needed for extracting all the possible power out of the 
newest beasts. In some way, this is what happened to gcc, a lot of 
good things have been added (adding more reliability or better 
optimisation, ...) but somehow that seems to have counter-balanced
the increase in computing power (at least since 2.95 and possibly even
for previous releases).

At the same time, people are getting new machines and expect their
programs to compile faster... not to mention that the "average
source code" (to be defined by someone ;-) ) is also probably growing
in size and complexity...

Now, I agree that, whatever is done, it has to be done in the proper
way, so that it is maintainable and reliable (that's the first
concern) AND that the speed improvement is there for good, or at least
for an amount of time much larger than the amount of development/
debugging/maintenance.

We should see any speed improvement as a possibility to add
more functionality into the compiler without much changing the
speed increase the user expects to see. Even though for the time
being (and given the current state of gcc compared to the competition),
it looks like a lot of people just want to see the compiler go faster...


--------------------------------------------------------------------
Theodore Papadopoulo
Email: Theodore.Papadopoulo@sophia.inria.fr Tel: (33) 04 92 38 76 01
 --------------------------------------------------------------------


^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
@ 2002-08-13 10:08 Robert Dewar
  0 siblings, 0 replies; 256+ messages in thread
From: Robert Dewar @ 2002-08-13 10:08 UTC (permalink / raw)
  To: Theodore.Papadopoulo, mrs; +Cc: gcc, phil, shebs

incidentally, I find the idea of a persistent front end for the compiler that
keeps compiled stuff around a very good one. This is something we have 
considered for GNAT for years :-)

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
@ 2002-08-13  9:10 Robert Dewar
  2002-08-13 10:20 ` Theodore Papadopoulo
                   ` (2 more replies)
  0 siblings, 3 replies; 256+ messages in thread
From: Robert Dewar @ 2002-08-13  9:10 UTC (permalink / raw)
  To: dewar, drow; +Cc: Theodore.Papadopoulo, gcc, mrs, phil, shebs

<<Yes it is - projects have grown correspondingly.  Maybe not for COBOL,
but for the sorts of things GCC is used for.  A factor of ten is
still very significant, which is the whole point of Apple's efforts!
>>

Actually COBOL programs are FAR FAR larger than C or C++ programs in practice.
In particular, single files of hundreds of thousands of lines are common, and
complete systems of millions of lines are common. That's why there is so much
legacy COBOL around :-)

My point is that a factor of ten is relative.

If you have a million-line COBOL program and it takes 10 hours to compile,
then cutting it down to 1 hour is a real win. If it takes 10 minutes to
compile, then cutting it down to 1 minute is a much smaller win in practice.

Remember, I am a great fan of fast compilers. Realia COBOL is certainly the
fastest compiler for arbitrarily large programs ever written for the PC, and
when I used to bootstrap the compiler (it was about 100,000 lines of COBOL)
on a 386 in a couple of minutes, that was definitely pleasant. I certainly
agree that GCC is slow :-)

My point is that if you embark on a big project that will take you two
years to complete successfully, and that speeds up the compiler by a factor
of two, then it will probably seem not that worthwhile when it is finished.

You have to look for easy opportunities for big gains. Nothing else is
worthwhile. In general you cannot design a slow compiler and then molest it
into being a fast compiler; you have to design in speed as a major criterion
from the start. Small incremental changes just don't get you where you want
to be.

Obviously in our situation PCH are a good target of opportunity (though I
will say again that if you designed a really fast C++ compiler, one that
compiled code at millions of lines/minute, then PCH would not be such an
obvious win, but that's not what we are dealing with here).

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-13  8:07 Robert Dewar
@ 2002-08-13  8:40 ` Daniel Jacobowitz
  0 siblings, 0 replies; 256+ messages in thread
From: Daniel Jacobowitz @ 2002-08-13  8:40 UTC (permalink / raw)
  To: Robert Dewar; +Cc: Theodore.Papadopoulo, mrs, gcc, phil, shebs

On Tue, Aug 13, 2002 at 11:07:07AM -0400, Robert Dewar wrote:
> >>Why not make incremental compilation a standard for gcc...
> 
> I seriously doubt that incremental compilation can help. Usually it is far
> better to aim at the simplest fastest possible compilation path without
> bothering with the extra bookkeeping needed for IC.
> 
> Historically the fastest compilers have not been incremental, and IC has
> only been used to make painfully slow compilers a little less painful
> 
> (I realize that some would put GCC into the second category here, but I 
> would prefer that we keep efforts focussed on moving it into the first
> category).
> 
> That being said, I still wonder over time whether the effort to speed up
> gcc is effort well spent. Or rather, put that another way, let's try to make
> sure that it is effort well spent. If there are obvious opportunities, then
> certainly it makes sense to take advantage of them.
> 
> But there are definite effort tradeoffs, and continued increase in speed of
> machines does tend to mute the requirements for faster compilation.
> 
> When Realia COBOL ran 10,000 lpm on a PC-1, with the major competitor running
> at 1,000 lpm, then the speed difference was a major marketing advantage, but
> nowadays with essentially the same compiler running over a million lines a
> minute, and essentially the same competitive compiler running at 100,000 lpm
> the difference is no longer nearly so significant :-)

Yes it is - projects have grown correspondingly.  Maybe not for COBOL,
but for the sorts of things GCC is used for.  A factor of ten is
still very significant, which is the whole point of Apple's efforts!

-- 
Daniel Jacobowitz
MontaVista Software                         Debian GNU/Linux Developer

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
@ 2002-08-13  8:07 Robert Dewar
  2002-08-13  8:40 ` Daniel Jacobowitz
  0 siblings, 1 reply; 256+ messages in thread
From: Robert Dewar @ 2002-08-13  8:07 UTC (permalink / raw)
  To: Theodore.Papadopoulo, mrs; +Cc: gcc, phil, shebs

>>Why not make incremental compilation a standard for gcc...

I seriously doubt that incremental compilation can help. Usually it is far
better to aim at the simplest fastest possible compilation path without
bothering with the extra bookkeeping needed for IC.

Historically the fastest compilers have not been incremental, and IC has
only been used to make painfully slow compilers a little less painful

(I realize that some would put GCC into the second category here, but I 
would prefer that we keep efforts focussed on moving it into the first
category).

That being said, I still wonder over time whether the effort to speed up
gcc is effort well spent. Or rather, put that another way, let's try to make
sure that it is effort well spent. If there are obvious opportunities, then
certainly it makes sense to take advantage of them.

But there are definite effort tradeoffs, and continued increase in speed of
machines does tend to mute the requirements for faster compilation.

When Realia COBOL ran 10,000 lpm on a PC-1, with the major competitor running
at 1,000 lpm, then the speed difference was a major marketing advantage, but
nowadays with essentially the same compiler running over a million lines a
minute, and essentially the same competitive compiler running at 100,000 lpm
the difference is no longer nearly so significant :-)

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
@ 2002-08-12 23:39 Tim Josling
  0 siblings, 0 replies; 256+ messages in thread
From: Tim Josling @ 2002-08-12 23:39 UTC (permalink / raw)
  To: gcc

> On Sat, 10 Aug 2002, Noel Yap spake:
>>  parser                :   6.12 (65%) usr   0.75 (53%) sys  10.85 (63%) wall
>> ...
>>  parser                :   6.46 (65%) usr   0.63 (53%) sys   9.98 (62%) wall
>> ...
> Thanks,
> Noel

I have trouble believing that bison is taking that amount of time. There are a
lot of calls from the parser that are counted as PARSE. And flag_syntax_only
doesn't turn off as much as you might think. 

In my COBOL front end, all I do in the parse file is build a 'tree'. Although
many people told me bison would be too slow, let alone using flex, profiling
shows them to be a non-issue. The problem is the code generation.

According to a gprof on the largest gcc module (insn-recog.c) the parser is
only 0.43% of the total run time. On the other hand the GC figures very
prominently in the top 100 functions. This is of course without taking into
account the additional effect on cache hit rates of the larger working set
that results from using GC. On my system, this program takes about 90 seconds
to compile, but preprocessing takes less than one second. The RTL time is very
large.

The largest hand-coded gcc module (combine.c) shows broadly similar
results. The parser remains negligible. The GC is somewhat lower, presumably
due to the smaller size of the program. GC remains significant, even apart
from working set/cache effects.

Compiling combine.c takes 7 seconds with -O0, 15 seconds with -O1 and 25
seconds with -O2. Nearly everyone uses -O2 so it is clear where the time is
being spent in most cases - doing optimisation. Even in -O0 a fair bit of
time, maybe 2-3 seconds, is spent optimising. 

Conclusion:

1. The fault, dear Bison, is in ourselves, not in you.

2. Same for the preprocessor, except maybe for C++ where many headers are
included. This is one of many design problems with the C++ language IMHO but
maybe something can be done to help.

3. GC chews up a substantial amount of time, especially in non-optimised
compiles. GC needs to be improved, but any further changes to GC should be
evidence based and subject to peer review. This would have two beneficial
effects: firstly reduced thrashing of front end developers keeping up with
significant changes of unknown benefit; and secondly we could be confident
that changes represent significant progress.

4. We do need some good numbers on how much GCC is affected by cache misses.
This would give us an idea how much effort should be devoted to improving
working set size and locality. There are lots of ways to improve locality and
reduce working sets. But let's find out if it is needed before we start
coding.

5. Most of the time in GCC compiles is spent in optimisation. So, the focus
should be there. The RTL phase of GCC is poorly understood, by anyone. Code
that is not well understood and that people are afraid to touch is invariably
inefficient. 

Two gprof outputs follow.

Tim Josling

insn-recog.c:
 %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
  4.84      2.14     2.14 23088715     0.00     0.00  ggc_set_mark
  3.87      3.85     1.71  2153853     0.00     0.00  ggc_mark_rtx_children_1
  3.40      5.35     1.50  1070849     0.00     0.01  cse_insn
  3.28      6.80     1.45     1169     1.24     1.47  verify_flow_info
  2.83      8.05     1.25  5252491     0.00     0.00  for_each_rtx
  2.67      9.23     1.18                             htab_traverse
  1.90     10.07     0.84      456     1.84     2.99  init_alias_analysis
  1.81     10.87     0.80 10643243     0.00     0.00  find_reg_note
  1.72     11.63     0.76  6645463     0.00     0.00  side_effects_p
  1.68     12.37     0.74  2561968     0.00     0.00  fold_rtx
  1.52     13.04     0.67  6523785     0.00     0.00  ggc_alloc
  1.47     13.69     0.65   799692     0.00     0.00  gt_ggc_mx_lang_tree_node
  1.45     14.33     0.64  2176585     0.00     0.00  canon_reg
  1.38     14.94     0.61  2676602     0.00     0.00  rtx_cost
  1.15     15.45     0.51  1967977     0.00     0.00  ggc_mark_rtx_children
  1.06     15.92     0.47  1747317     0.00     0.00  insert
  1.00     16.36     0.44  4171744     0.00     0.00  canon_hash
  1.00     16.80     0.44  1526861     0.00     0.00  exp_equiv_p
  0.95     17.22     0.42  1356669     0.00     0.00  propagate_one_insn
  0.91     17.62     0.40  7510741     0.00     0.00  canon_rtx
  0.88     18.01     0.39   554639     0.00     0.00  count_reg_usage
  0.86     18.39     0.38  3718565     0.00     0.00  note_stores
  0.82     18.75     0.36    88538     0.00     0.01  find_reloads
  0.77     19.09     0.34       43     7.91     7.91  poison_pages
  0.72     19.41     0.32    49907     0.01     0.01  preprocess_constraints
  0.70     19.72     0.31  1011445     0.00     0.00  invalidate
  0.70     20.03     0.31   558511     0.00     0.00  reg_scan_mark_refs
  0.66     20.32     0.29   650208     0.00     0.00  constrain_operands
  0.63     20.60     0.28   774372     0.00     0.00  mark_used_regs
  0.63     20.88     0.28    24927     0.01     0.01  count_or_remove_death_notes
  0.61     21.15     0.27  2880996     0.00     0.00  mark_set_1
  0.59     21.41     0.26  7613625     0.00     0.00  approx_reg_cost_1
  0.57     21.66     0.25  1516590     0.00     0.00  simplify_binary_operation
  0.57     21.91     0.25  1014709     0.00     0.00  mention_regs
  0.57     22.16     0.25   177560     0.00     0.00  validate_value_data
  0.57     22.41     0.25     7063     0.04     0.10  compute_transp
  0.54     22.65     0.24   539000     0.00     0.00  copy_rtx
  0.52     22.88     0.23  1172954     0.00     0.00  insn_extract
  0.52     23.11     0.23   109544     0.00     0.00  record_reg_classes
  0.50     23.33     0.22  1495174     0.00     0.00  reg_mentioned_p
  0.48     23.54     0.21    51125     0.00     0.23  cse_basic_block
  0.48     23.75     0.21    51125     0.00     0.00  cse_end_of_basic_block
  0.45     23.95     0.20  1886796     0.00     0.00  legitimate_address_p
  0.45     24.15     0.20   501459     0.00     0.00  find_best_addr
  0.45     24.35     0.20   354991     0.00     0.00  mark_jump_label
  0.43     24.54     0.19  6597365     0.00     0.00  get_cse_reg_info
  0.43     24.73     0.19   598028     0.00     0.00  copy_rtx_if_shared
  0.43     24.92     0.19        1   190.00 40549.99  yyparse
  0.41     25.10     0.18   279766     0.00     0.00  simplify_plus_minus
...

combine.c:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
  2.63      0.29     0.29   146391     0.00     0.01  cse_insn
  2.45      0.56     0.27  2791299     0.00     0.00  find_reg_note
  2.45      0.83     0.27   872878     0.00     0.00  for_each_rtx
  2.18      1.07     0.24  1844867     0.00     0.00  side_effects_p
  2.00      1.29     0.22  2403378     0.00     0.00  ggc_set_mark
  2.00      1.51     0.22     1779     0.12     0.18  verify_flow_info
  1.72      1.70     0.19  2090080     0.00     0.00  ggc_alloc
  1.63      1.88     0.18                             htab_traverse
  1.45      2.04     0.16   196228     0.00     0.00  gt_ggc_mx_lang_tree_node
  1.36      2.19     0.15  2787969     0.00     0.00  bitmap_bit_p
  1.36      2.34     0.15    21830     0.01     0.01  preprocess_constraints
  1.27      2.48     0.14  2235546     0.00     0.00  canon_rtx
  1.27      2.62     0.14    42175     0.00     0.01  find_reloads
  1.18      2.75     0.13  1707121     0.00     0.00  mark_set_1
  1.18      2.88     0.13   328751     0.00     0.00  fold_rtx
  1.09      3.00     0.12   624046     0.00     0.00  propagate_one_insn
  1.09      3.12     0.12   288895     0.00     0.00  count_reg_usage
  1.00      3.23     0.11   276995     0.00     0.00  constrain_operands
  1.00      3.34     0.11   128667     0.00     0.00  ggc_mark_rtx_children_1
  1.00      3.45     0.11    77278     0.00     0.00  validate_value_data
  1.00      3.56     0.11      786     0.14     0.43  init_alias_analysis
  0.82      3.65     0.09  1502223     0.00     0.00  note_stores
  0.82      3.74     0.09  1093219     0.00     0.00  get_cse_reg_info
  0.82      3.83     0.09   291031     0.00     0.00  m16m
  0.82      3.92     0.09   157999     0.00     0.00  mark_jump_label
  0.82      4.01     0.09    43257     0.00     0.00  reload_cse_simplify_operands
  0.82      4.10     0.09    42404     0.00     0.00  record_reg_classes
  0.73      4.18     0.08   513121     0.00     0.00  find_base_term
  0.73      4.26     0.08   497589     0.00     0.00  insn_extract
  0.73      4.34     0.08   256522     0.00     0.00  reg_scan_mark_refs
  0.64      4.41     0.07  1126581     0.00     0.00  returnjump_p_1
  0.64      4.48     0.07   450028     0.00     0.00  mark_used_reg
  0.64      4.55     0.07   417871     0.00     0.00  mark_used_regs
  0.64      4.62     0.07   386598     0.00     0.00  loc_mentioned_in_p
  0.64      4.69     0.07   299594     0.00     0.00  bitmap_operation
  0.64      4.76     0.07   298619     0.00     0.00  canon_reg
  0.64      4.83     0.07   227797     0.00     0.00  copy_rtx_if_shared
  0.64      4.90     0.07   150028     0.00     0.00  ggc_mark_rtx_children
  0.64      4.97     0.07                             htab_find_slot_with_hash
  0.54      5.03     0.06  1747388     0.00     0.00  bitmap_set_bit
  0.54      5.09     0.06   794615     0.00     0.00  record_set
  0.54      5.15     0.06   538361     0.00     0.00  canon_hash
  0.54      5.21     0.06   497904     0.00     0.00  extract_insn
  0.54      5.27     0.06   322376     0.00     0.00  rtx_cost
  0.54      5.33     0.06   139171     0.00     0.00  try_forward_edges
  0.54      5.39     0.06   111797     0.00     0.00  cselib_subst_to_values
  0.54      5.45     0.06    61617     0.00     0.00  fold
  0.54      5.51     0.06    12204     0.00     0.01  cse_end_of_basic_block
  0.54      5.57     0.06     9853     0.01     0.01  count_or_remove_death_notes
  0.54      5.63     0.06        1    60.00 10319.57  yyparse
  0.45      5.68     0.05  1149752     0.00     0.00  rtx_equal_p
  0.45      5.73     0.05   544172     0.00     0.00  ix86_decompose_address
  0.45      5.78     0.05   405992     0.00     0.00  insns_for_mem_walk
  0.45      5.83     0.05   308718     0.00     0.00  volatile_refs_p
...

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-10 11:17 ` Linus Torvalds
@ 2002-08-12 23:33   ` Kai Henningsen
  0 siblings, 0 replies; 256+ messages in thread
From: Kai Henningsen @ 2002-08-12 23:33 UTC (permalink / raw)
  To: gcc

torvalds@transmeta.com (Linus Torvalds)  wrote on 10.08.02 in < Pine.LNX.4.44.0208101102380.2197-100000@home.transmeta.com >:

> On Sat, 10 Aug 2002, Richard Kenner wrote:

> > It also assumes that certain other RTL is *not* shared, so that it can
> > be changed without affecting any others insns.

> So my claim is that if you _were_ to have real rtx memory management, you
> wouldn't need any of the ad-hoc rules. You could just mark the RTX as
> being shared (the same way you can mark a file mapping as being shared),
> and then that tells the copy-on-write routines that no copy is needed,
> exactly because everybody wants one single shared object. But even when it
> is shared, you still need to have a reference count - to know when there
> are no people interested in it any more.

Or, the other way around, when you *need* to be the sole owner, you can  
make sure of that by looking at that very same reference count, and doing  
a copy exactly when it is not 1. (Assuming your reference count isn't  
inflated by internal pointers. Design again.)
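
In code, "copy exactly when the count is not 1" is about this much (a
minimal C sketch; the refcount field and the cow() helper are
hypothetical -- nothing like them exists on GCC's rtx today):

  #include <stdlib.h>

  /* Hypothetical refcounted object, standing in for an rtx.  */
  struct obj { int refcount; int payload; };

  /* Return an object the caller may mutate in place.  */
  static struct obj *
  cow (struct obj *o)
  {
    struct obj *copy;
    if (o->refcount == 1)
      return o;                   /* sole owner: edit in place */
    o->refcount--;                /* drop our share of the old one */
    copy = malloc (sizeof *copy);
    *copy = *o;                   /* shallow copy of the contents */
    copy->refcount = 1;           /* the new object has one owner: us */
    return copy;
  }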

MfG Kai

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-12 15:21 Robert Dewar
@ 2002-08-12 15:25 ` David S. Miller
  2002-08-13 13:46   ` Kai Henningsen
  0 siblings, 1 reply; 256+ messages in thread
From: David S. Miller @ 2002-08-12 15:25 UTC (permalink / raw)
  To: dewar; +Cc: terra, gcc

   From: dewar@gnat.com (Robert Dewar)
   Date: Mon, 12 Aug 2002 18:21:28 -0400 (EDT)
   
   Of course the issue is what happens if there is a lapse in
   discipline. If it is only a matter of efficiency, that's one thing,
   if it becomes a focus of bugs then that's another.

This is why it is important to use something, such as the existing RTL
walking GC infrastructure, to verify the reference counts.  And to
have this verification mechanism enabled during development cycles.
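
The shape such a checker could take, as a generic sketch (everything
here is hypothetical stand-in code, not the actual GCC walker):

  #include <assert.h>
  #include <stddef.h>

  /* Hypothetical refcounted node, standing in for an rtx.  */
  struct node {
    int refcount;            /* the stored, hand-maintained count */
    int tally;               /* scratch: count recomputed by walking */
    struct node *kids[2];
  };

  /* Tally one reference per incoming pointer, recursing into a node
     only on its first visit -- the same shape as a GC mark walk.  */
  static void
  tally_refs (struct node *n)
  {
    if (n == NULL)
      return;
    if (n->tally++ > 0)
      return;                /* seen before; the edge is still counted */
    tally_refs (n->kids[0]);
    tally_refs (n->kids[1]);
  }

  /* After walking from every root (each root holds one implicit
     reference), the tallies must match the stored counts.  */
  static void
  verify_refs (struct node **nodes, size_t count)
  {
    size_t i;
    for (i = 0; i < count; i++)
      assert (nodes[i]->tally == nodes[i]->refcount);
  }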
   

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
@ 2002-08-12 15:21 Robert Dewar
  2002-08-12 15:25 ` David S. Miller
  0 siblings, 1 reply; 256+ messages in thread
From: Robert Dewar @ 2002-08-12 15:21 UTC (permalink / raw)
  To: davem, terra; +Cc: gcc

> Frankly, nobody who wants to improve GCC's runtime performance can
> reasonably complain about this "discipline" in the same breath :-)
> Others can feel free to disagree.


Of course the issue is what happens if there is a lapse in discipline. If it
is only a matter of efficiency, that's one thing, if it becomes a focus of bugs
then that's another. For me reliability of the code generator is far far more
important than speed.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-12 14:10 Morten Welinder
@ 2002-08-12 15:01 ` David S. Miller
  0 siblings, 0 replies; 256+ messages in thread
From: David S. Miller @ 2002-08-12 15:01 UTC (permalink / raw)
  To: terra; +Cc: gcc

   From: Morten Welinder <terra@diku.dk>
   Date: 12 Aug 2002 21:09:41 -0000

   The hardest part probably is that ref-counting requires more
   discipline than a lot of people can muster.

Frankly, nobody who wants to improve GCC's runtime performance can
reasonably complain about this "discipline" in the same breath :-)
Others can feel free to disagree.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
@ 2002-08-12 14:10 Morten Welinder
  2002-08-12 15:01 ` David S. Miller
  0 siblings, 1 reply; 256+ messages in thread
From: Morten Welinder @ 2002-08-12 14:10 UTC (permalink / raw)
  To: davem; +Cc: gcc

Hi there,

> 5) If you are still bored at this point, add the machinery to use the
>    RTX walking of the current garbage collector to verify the
>    reference counts.  This will basically be required in order to
>    make, and sufficiently correctness-check, a final implementation.

There are other ways.

* Excess unrefs and missing refs will show wherever the ref count goes
  below zero.
* Excess refs and missing unrefs will show as leaks.
* An evil combination might not show.  (Tough.)

Take a look at Gnumeric's chunk allocator (in src/gutils.c) which
has an almost-for-free leak walker, see gnm_mem_chunk_foreach_leak,
which is always turned on for gnumeric.  (It's linear-time in the
number of leaks.)  If we leak an expression tree, we will be told.
And we will be told what that expression was.  Same thing for all
the other structured objects we have in Gnumeric.

ftp://ftp.gnome.org/pub/GNOME/pre-gnome2/sources/gnumeric/gnumeric-1.1.6.tar.gz
(or 1.1.7 if you wait half an hour)
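
The mechanism is simple enough to sketch generically (a toy that scans
the whole pool, so linear in the pool size; the real thing is cleverer
and linear in the number of leaks):

  #include <stdio.h>

  #define POOL_SIZE 1024

  /* Toy chunk allocator: fixed-size slots plus an in-use map.  */
  static struct expr { int id; } pool[POOL_SIZE];
  static unsigned char in_use[POOL_SIZE];

  /* Walk everything still marked in use at exit and report it.  */
  static void
  foreach_leak (void (*cb) (struct expr *))
  {
    int i;
    for (i = 0; i < POOL_SIZE; i++)
      if (in_use[i])
        cb (&pool[i]);
  }

  static void
  report_leak (struct expr *e)
  {
    fprintf (stderr, "leaked expr %d\n", e->id);
  }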

The hardest part probably is that ref-counting requires more discipline
than a lot of people can muster.

Morten

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
@ 2002-08-10 13:47 Robert Dewar
  0 siblings, 0 replies; 256+ messages in thread
From: Robert Dewar @ 2002-08-10 13:47 UTC (permalink / raw)
  To: dewar, gmariani; +Cc: dje, gcc

>>People might be anti c++, but this is where I think it shines.

Or any other language with a smidgeon of abstraction :-)

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-10 10:45 Robert Dewar
@ 2002-08-10 13:26 ` Gianni Mariani
  0 siblings, 0 replies; 256+ messages in thread
From: Gianni Mariani @ 2002-08-10 13:26 UTC (permalink / raw)
  To: Robert Dewar; +Cc: dje, gcc

Robert Dewar wrote:

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
@ 2002-08-10 11:51 Robert Dewar
  0 siblings, 0 replies; 256+ messages in thread
From: Robert Dewar @ 2002-08-10 11:51 UTC (permalink / raw)
  To: kenner, torvalds; +Cc: gcc

<<Now, I'm probably very biased, because in a kernel you really have to be
very very careful indeed about never leaking memory, and about being able
to reclaim stuff when new situations arise. So to me, memory management is
the basis of anything working _at_all_.
>>

Many compilers don't bother with memory management; they simply don't use
that much memory and there is nothing worth reclaiming (the front end
of GNAT is certainly in this category for instance).

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-10 10:47 Richard Kenner
@ 2002-08-10 11:17 ` Linus Torvalds
  2002-08-12 23:33   ` Kai Henningsen
  0 siblings, 1 reply; 256+ messages in thread
From: Linus Torvalds @ 2002-08-10 11:17 UTC (permalink / raw)
  To: Richard Kenner; +Cc: gcc

On Sat, 10 Aug 2002, Richard Kenner wrote:
> 
> No, it doesn't.  As I said, it has to do with *correctness* issues.  For
> example, GCC assumes that there is exactly one copy of the RTL for each
> pseudo-register so that when the pseudo is forced to memory, only that
> RTL needs to be changed.
> 
> It also assumes that certain other RTL is *not* shared, so that it can
> be changed without affecting any others insns.
> 
> Nothing whatsoever to do with memory management.

But the above is _exactly_ what memory management is all about. 

Memory management has almost _nothing_ to do with "malloc()" and "free()".  

Those are the trivial parts. All the interesting stuff is knowing _when_
to call them, and that very much means (a) maintaining a count of users
(so you know when you can call free()) and (b) maintaining a "sharedness"  
of users (so you know when you need to copy and when you can just re-use). 

THAT is what memory management is all about. 

Let's take an example. In obstacks, the real memory management is not the
malloc that the internal obstack routines do when they need more memory.
No. The real MM is the decision to have a stack-based allocator, and the 
decision to say that all allocations get free'd when a previous one was 
freed. That's the _management_ part. 

(Admittedly it's _bad_ management, but hey, according to Dilbert that's a
most inherent part of management ;).
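
That decision, in a few lines of standard obstack usage (my example,
but the macros are the real GNU obstack API):

  #include <obstack.h>
  #include <stdlib.h>

  #define obstack_chunk_alloc malloc
  #define obstack_chunk_free free

  static struct obstack ob;

  static void
  demo (void)
  {
    char *a, *b, *c;
    obstack_init (&ob);
    a = obstack_alloc (&ob, 100);
    b = obstack_alloc (&ob, 200);
    c = obstack_alloc (&ob, 300);
    /* Freeing b also frees c (everything allocated after it),
       while a survives: lifetimes must nest like a stack.  */
    obstack_free (&ob, b);
    /* obstack_free (&ob, NULL) releases the entire obstack.  */
  }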

And when it comes to rtx's, gcc has no real memory management, and to me
that looks like a design mistake. And exactly _because_ gcc doesn't really
"manage" the rtx's, you end up having these ad-hoc "correctness" issues.

Now, I'm probably very biased, because in a kernel you really have to be 
very very careful indeed about never leaking memory, and about being able 
to reclaim stuff when new situations arise. So to me, memory management is 
the basis of anything working _at_all_. 

So my claim is that if you _were_ to have real rtx memory management, you
wouldn't need any of the ad-hoc rules. You could just mark the RTX as 
being shared (the same way you can mark a file mapping as being shared), 
and then that tells the copy-on-write routines that no copy is needed, 
exactly because everybody wants one single shared object. But even when it 
is shared, you still need to have a reference count - to know when there 
are no people interested in it any more.

(And yes, some objects stay around forever, like the atoms in lisp. They
should still have reference counts, it's just that they get created with
an implicit reference so the count never goes down to zero).

			Linus

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-10 10:43 Robert Dewar
@ 2002-08-10 11:02 ` Linus Torvalds
  0 siblings, 0 replies; 256+ messages in thread
From: Linus Torvalds @ 2002-08-10 11:02 UTC (permalink / raw)
  To: Robert Dewar; +Cc: dberlin, gcc, kevin

On Sat, 10 Aug 2002, Robert Dewar wrote:
>
> <<Well, at some point space "optimizations" do actually become functional
> requirements. When you need to have a gigabyte of real memory in order to
> compile some things in a reasonable timeframe, it has definitely become
> functional ;)
> >>
> 
> Interesting example, because this is just on the edge. We are just on the point
> where cheap machines have less than a gigabyte, but not by much (my notebook
> has a gigabyte of real memory). In two years time, a gigabyte of real memory
> will sound small.

Careful. 

That's an extremely slippery slope, as I'm sure you are well aware.

Yes, all the machines I work on daily have a gigabyte of RAM these days,
and usually at least two CPU's. So it should be ok to have a compiler use
it up, assuming that the end result of the compilation is a really well-
optimized program, right?

Well, even if you could assume that machines have gigabytes of RAM (and
I'll give you that you probably _can_ assume it in another few years, and
not just on the kinds of machines I play with), it takes quite a while to
access that gigabyte. 

Yeah, the machine I'm working on gets memory throughputs of 1.5GB/s right
now. That's assuming good access patterns, though - it's a lot less if you
chase pointers and only use a few bytes per 128-byte cacheline loaded. 
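
(Do the arithmetic: at 128 bytes per line, touching only -- say -- 8
useful bytes per line turns that 1.5GB/s of streaming bandwidth into
1.5GB/s * 8/128, i.e. less than 100MB/s of useful data.)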

The difference between cache access times and memory access times is
already on the order of a factor of 200, and likely to go up. But since
nobody can expect hot gcc data to fit in the L1, it's probably fairer to
compare L2 times to main memory, which is "only" a factor of 20 or so. 

And quite frankly, I _would_ expect gcc data to fit in a reasonable L2 in 
the same timeframe that you can sanely assume that machines have at least 
a gigabyte of memory.

So if we're talking about performance, I still say that gcc should aim at
fitting in the L2 (and maybe the TLB) of any reasonable CPU. Right now
that means that you want to try to fit the real working set in half a meg
or so, to reliably get the 20-times increase in performance.

			Linus

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
@ 2002-08-10 10:56 Robert Dewar
  0 siblings, 0 replies; 256+ messages in thread
From: Robert Dewar @ 2002-08-10 10:56 UTC (permalink / raw)
  To: dewar, kenner; +Cc: gcc

Also we have had obstack problems which were NOT front end problems; if you
look through the fixed bugs for obstack, you will find quite a few of these.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
@ 2002-08-10 10:55 Robert Dewar
  0 siblings, 0 replies; 256+ messages in thread
From: Robert Dewar @ 2002-08-10 10:55 UTC (permalink / raw)
  To: dewar, kenner; +Cc: gcc

<<Yes, but whenever that happened, it represented a scoping problem in the
front end.  If the entities in question only involved constants, switching to
GC indeed "removed" the bug.  But in most of these cases, the problem could
also occur where non-constants are involved.  In that case, what we've done
is to replace a memory corruption problem in the compiler which causes a
crash with a bug that generates subtly wrong code.  Not a good trade, in my
opinion.  In most cases, though, what this does is that it makes the scoping
bug become latent.
>>

Of course in retrospect the fierce rules on scoping were a HUGE mistake, and
it is too bad that they cannot be fixed in gigi. Almost all of the time, the
requirement for "correct" scoping is entirely artificial, since there is
no code for elaboration of the declaration (this is true for instance of
almost all itypes).

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
@ 2002-08-10 10:52 Richard Kenner
  0 siblings, 0 replies; 256+ messages in thread
From: Richard Kenner @ 2002-08-10 10:52 UTC (permalink / raw)
  To: dewar; +Cc: gcc

    It also removes a pernicious variety of bug that often caused nasty memory
    corruption in earlier versions of GCC. 

Yes, but whenever that happened, it represented a scoping problem in the
front end.  If the entities in question only involved constants, switching to
GC indeed "removed" the bug.  But in most of these cases, the problem could
also occur where non-constants are involved.  In that case, what we've done
is to replace a memory corruption problem in the compiler which causes a
crash with a bug that generates subtly wrong code.  Not a good trade, in my
opinion.  In most cases, though, what this does is that it makes the scoping
bug become latent.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
@ 2002-08-10 10:47 Richard Kenner
  2002-08-10 11:17 ` Linus Torvalds
  0 siblings, 1 reply; 256+ messages in thread
From: Richard Kenner @ 2002-08-10 10:47 UTC (permalink / raw)
  To: torvalds; +Cc: gcc

    Or am I wrong?

Yes.

    Basically, what I'm saying is that it _does_ have everything to do with
    allocation efficiency. The gcc allocators have just always been bad.

No, it doesn't.  As I said, it has to do with *correctness* issues.  For
example, GCC assumes that there is exactly one copy of the RTL for each
pseudo-register so that when the pseudo is forced to memory, only that
RTL needs to be changed.

It also assumes that certain other RTL is *not* shared, so that it can
be changed without affecting any others insns.

Nothing whatsoever to do with memory management.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
@ 2002-08-10 10:45 Robert Dewar
  2002-08-10 13:26 ` Gianni Mariani
  0 siblings, 1 reply; 256+ messages in thread
From: Robert Dewar @ 2002-08-10 10:45 UTC (permalink / raw)
  To: dje, torvalds; +Cc: gcc

<        GCC did not switch from obstacks to garbage collection because of
any inherent love for garbage collection.  Using garbage collection
instead of obstacks was the most efficient way to support other features
which were added to GCC 3.0.
>

It also removes a pernicious variety of bug that often caused nasty memory
corruption in earlier versions of GCC. Our experience with the back end of
GCC (from the point of view of GNAT) is that code generation errors have
been a much more serious problem than time and space requirements.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
@ 2002-08-10 10:43 Robert Dewar
  2002-08-10 11:02 ` Linus Torvalds
  0 siblings, 1 reply; 256+ messages in thread
From: Robert Dewar @ 2002-08-10 10:43 UTC (permalink / raw)
  To: dewar, torvalds; +Cc: dberlin, gcc, kevin

<<Well, at some point space "optimizations" do actually become functional
requirements. When you need to have a gigabyte of real memory in order to
compile some things in a reasonable timeframe, it has definitely become
functional ;)
>>

Interesting example, because this is just on the edge. We are just at the point
where cheap machines have less than a gigabyte, but not by much (my notebook
has a gigabyte of real memory). In two years time, a gigabyte of real memory
will sound small.

It is always hard to know how to target main memory requirements (Realia
COBOL, one of the fastest compilers ever written for the PC, compiling
100,000 lines/minute on a 386, was targeted to work in 64K bytes; we did
not make that, it required 130K bytes :-)

But of course it is not clear that caches get larger that quickly, so the
point Linus is making about cache usage is certainly valid, though it would
be nice to have measurements rather than just rhetoric [on both sides of
the issue].

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-10  9:52 Richard Kenner
@ 2002-08-10 10:41 ` Linus Torvalds
  0 siblings, 0 replies; 256+ messages in thread
From: Linus Torvalds @ 2002-08-10 10:41 UTC (permalink / raw)
  To: Richard Kenner; +Cc: gcc

On Sat, 10 Aug 2002, Richard Kenner wrote:
> 
> Bad example. That code predates any sort of GC and has to do with 
> *correctness* issues involving the semantics of RTL, not anything having
> to do with allocation efficiency.

Heh.

  garbage collection (gär'bij kolek'shon)
	noun.

	The act of not managing your memory explicitly, but trusting 
	some other power to free the memory after you're done with it.

	See also: lazy bum, religion, trust in higher powers, flame wars

gcc has always depended on garbage collection, it's just that it generated 
the garbage, and the OS collected it when it exited.

The obstacks _are_ a real memory management technique, but obstacks are
clearly broken. The ordering constraints are too tight to be useful for
most real life situations, which means that when you rely exclusively on
obstacks you end up usually not freeing that obstack at all, until your
whole phase is done (very few problems are _so_ clearly nested that a 
stack is a sufficient memory management technique).

Or am I wrong?

Basically, what I'm saying is that it _does_ have everything to do with 
allocation efficiency. The gcc allocators have just always been bad.

		Linus

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-10  9:45 ` Linus Torvalds
@ 2002-08-10 10:24   ` David Edelsohn
  0 siblings, 0 replies; 256+ messages in thread
From: David Edelsohn @ 2002-08-10 10:24 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: gcc

	GCC did not switch from obstacks to garbage collection because of
any inherent love for garbage collection.  Using garbage collection
instead of obstacks was the most efficient way to support other features
which were added to GCC 3.0.

David

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
@ 2002-08-10  9:52 Richard Kenner
  2002-08-10 10:41 ` Linus Torvalds
  0 siblings, 1 reply; 256+ messages in thread
From: Richard Kenner @ 2002-08-10  9:52 UTC (permalink / raw)
  To: torvalds; +Cc: gcc

    Just to make a point: look at copy_rtx_if_shared(), which tries to do 
    this (yeah, I have an older tree, maybe this is fixed these days. I 
    seriously doubt it).

    The code is CRAP. Total and utter sh*t. The damn thing should just
    test a reference count and be done with it. Instead, it has this
    heuristic that knows about some rtx's that might be shared, and knows
    which never can be.  And that _crap_ comes directly from the fact that
    the code uses a lazy GC scheme instead of a more intelligent memory
    manager.

Bad example. That code predates any sort of GC and has to do with 
*correctness* issues involving the semantics of RTL, not anything having
to do with allocation efficiency.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-10  4:38 Robert Dewar
@ 2002-08-10  9:47 ` Linus Torvalds
  0 siblings, 0 replies; 256+ messages in thread
From: Linus Torvalds @ 2002-08-10  9:47 UTC (permalink / raw)
  To: Robert Dewar; +Cc: dberlin, gcc, kevin

On Sat, 10 Aug 2002, Robert Dewar wrote:
>
> <<Hmm. I can't imagine what is there that is inherently cyclic, but breaking
> the cycles might be more painful than it's worth, so I'll take your word
> for it.
> >>
> 
> Indeed it may be perfectly acceptable to simply ignore the cycles, garbage
> collection is not a functional requirement here, just a space optimization.

Well, at some point space "optimizations" do actually become functional
requirements. When you need to have a gigabyte of real memory in order to
compile some things in a reasonable timeframe, it has definitely become
functional ;)

		Linus

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-10  4:35 Robert Dewar
@ 2002-08-10  9:45 ` Linus Torvalds
  2002-08-10 10:24   ` David Edelsohn
  0 siblings, 1 reply; 256+ messages in thread
From: Linus Torvalds @ 2002-08-10  9:45 UTC (permalink / raw)
  To: Robert Dewar; +Cc: dberlin, gcc, kevin

On Sat, 10 Aug 2002, Robert Dewar wrote:
>
> If garbage collection is taking a significant amount of time (is this really
> the case?),

Note that my whole argument was that the problem with GC is not the 
_collection_, but the side effects of GC.

You can always make the collection take basically zero time by just not 
doing it very often. Problem solved. In fact, this is the very problem 
that most of the papers and implementations seem to have focused on, yet I 
don't think that particular problem is very interesting at all.

So the real problem with GC is two-fold: the lack of spatial and in
particular temporal locality, because lazy de-allocation (which you need
to keep the collection overhead acceptably low) "smears" out the locality
over time (and over space).  The bigger the GC cycle, the bigger the
smear.

Many GC schemes try to help the spatial locality by doing compaction, but 
that incurs extra cache overhead, and it still totally misses the temporal 
locality you get from re-using allocations quickly.

Another way of saying the same thing in a gcc-centric manner: this is
equivalent to stack slot re-use. Clearly stack slot re-use is a good 
thing, but it requires exact liveness analysis. Similarly, allocation 
re-use is a good thing, but it requires exact (non-lazy) collection.
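
The allocation-reuse half is trivial to write down (a generic sketch,
not gcc code): a LIFO free list hands back the most recently freed
block, which is exactly the block most likely still hot in cache --
and it only works because the free is exact, not deferred to a
collector:

  #include <stdlib.h>

  struct blk { struct blk *next; char data[56]; };
  static struct blk *free_list;

  /* Reuse the most recently freed block first (LIFO), so the memory
     we hand out is the memory most likely still in cache.  */
  static struct blk *
  blk_alloc (void)
  {
    struct blk *b = free_list;
    if (b != NULL)
      {
        free_list = b->next;
        return b;
      }
    return malloc (sizeof (struct blk));
  }

  static void
  blk_free (struct blk *b)
  {
    b->next = free_list;
    free_list = b;
  }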

And the thing is, the "smearing" of locality you get by lazy collection
kills your caches and your TLB footprint, but it does NOT show up as "gc
overhead".  It shows up as overhead everywhere else, which is probably why
so many GC proponents just ignore it.

The other argument I have against GC is the mindset it fosters. See my
point about how you can _trivially_ do copy-on-write with a manual (or
automatic but _exposed_) refcounting scheme. Doing the same with lazy
collection is an exercise in futility - so you don't do it.

Just to make a point: look at copy_rtx_if_shared(), which tries to do 
this (yeah, I have an older tree, maybe this is fixed these days. I 
seriously doubt it).

The code is CRAP. Total and utter sh*t. The damn thing should just test a 
reference count and be done with it. Instead, it has this heuristic that 
knows about some rtx's that might be shared, and knows which never can be. 
And that _crap_ comes directly from the fact that the code uses a lazy GC 
scheme instead of a more intelligent memory manager.

THAT is my beef with GC. I don't care at _all_ about the collection cost. 
The real problems are elsewhere, and GC proponents never even acknowledge 
them.

(And please - the person who wrote copy_rtx_if_shared() - don't take my
complaint personally. I'm not trying to rag on the poor soul who had to
implement that silly switch statement etc. It's not your fault. It's the
fault of a much more fundamental mistake, and copy_rtx_if_shared is an
innocent bystander, a victim of circumstance and a stupid decision to use 
GC.)

			Linus

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
@ 2002-08-10  4:38 Robert Dewar
  2002-08-10  9:47 ` Linus Torvalds
  0 siblings, 1 reply; 256+ messages in thread
From: Robert Dewar @ 2002-08-10  4:38 UTC (permalink / raw)
  To: dberlin, torvalds; +Cc: gcc, kevin

<<Hmm. I can't imagine what is there that is inherently cyclic, but breaking
the cycles might be more painful than it's worth, so I'll take your word
for it.
>>

Indeed it may be perfectly acceptable to simply ignore the cycles; garbage
collection is not a functional requirement here, just a space optimization.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
@ 2002-08-10  4:35 Robert Dewar
  2002-08-10  9:45 ` Linus Torvalds
  0 siblings, 1 reply; 256+ messages in thread
From: Robert Dewar @ 2002-08-10  4:35 UTC (permalink / raw)
  To: dberlin, torvalds; +Cc: gcc, kevin

If garbage collection is taking a significant amount of time (is this really
the case?), then concentrating on speeding it up may make sense, but I am
quite dubious that reference counting would speed things up. It very rarely
does, speaking from long experience in implementing garbage collected
languages, because the distributed overhead is high -- one of the interesting
things about reference counting is that, since it distributes the overhead,
it becomes very difficult to measure that overhead accurately.

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
  2002-08-09 19:45 Robert Dewar
@ 2002-08-09 20:24 ` Daniel Berlin
  0 siblings, 0 replies; 256+ messages in thread
From: Daniel Berlin @ 2002-08-09 20:24 UTC (permalink / raw)
  To: Robert Dewar; +Cc: dje, shebs, gcc, mrs

On Fri, 9 Aug 2002, Robert Dewar wrote:

> <<        Saying "do not run any optimization at -O0" shows a tremendous
> lack of understanding or investigation.  One wants minimal optimization
> even at -O0 to decrease the size of the IL representation of the function
> being compiled.  The little bit of computation to perform trivial
> optimization more than makes up for itself with the decreased size of the
> IL that needs to be processed to generate the output.
> >>
> 
> There are two reasons to run at -O0
> 
> a) make the code as easy to debug as possible
> b) speedy compilation
> 
> There is also a third reason that is relevant to safety critical code
> 
> c) avoid optimization, on the grounds that it interferes with verification
> 
> Now with respect to a), the trouble with GCC is that the code generated
> with no optimization is really horrible. Much worse than typical competing
> compilers operating in no optimization mode. Now of course we can say
> "yes, but gcc is really doing what you want, the other compiler is not"
> but the fact remains that you are stuck between two unpleasant choices
> 
>   -O0 generates far too much code and giant executables
>   -O1 already loses debugging information
> 
> I think there is a real need for a mode which would do all possible
> optimizations that do NOT interfere with debugging. I would probably
> use this as my default development mode all the time.

Um, a *lot* of our -O1 optimizations would not interfere with debugging in 
a substantial manner if the stuff necessary to do var-tracking (and thus, 
location lists) gets accepted.  This is the register attribute stuff.


This stuff is done (mostly; there were one or two places in dwarf2out I 
think I might need to copy and paste some code).
Try the cfg-branch, and it'll generate location lists to track variables 
as they move through registers. It's on by default for -O2 or above.

Of course, gdb won't consume location lists (hopefully location 
expressions soon), but that's another matter.

readelf will happily list the location lists.

Heck, in fact, you actually get better debugging because it currently does 
it at an insn level, rather than a source line level, so variables should 
still be described properly even if you step by instructions.

Though once I can move loclists over to the mainline, this will probably 
move to a "-g4" level of debugging, and -g will only output the info 
necessary to track variable location changes over source lines.
 --Dan

^ permalink raw reply	[flat|nested] 256+ messages in thread

* Re: Faster compilation speed
@ 2002-08-09 19:45 Robert Dewar
  2002-08-09 20:24 ` Daniel Berlin
  0 siblings, 1 reply; 256+ messages in thread
From: Robert Dewar @ 2002-08-09 19:45 UTC (permalink / raw)
  To: dje, shebs; +Cc: gcc, mrs

<<        Saying "do not run any optimization at -O0" shows a tremendous
lack of understanding or investigation.  One wants minimal optimization
even at -O0 to decrease the size of the IL representation of the function
being compiled.  The little bit of computation to perform trivial
optimization more than makes up for itself with the decreased size of the
IL that needs to be processed to generate the output.
>>

There are two reasons to run at -O0

a) make the code as easy to debug as possible
b) speedy compilation

There is also a third reason that is relevant to safety critical code

c) avoid optimization, on the grounds that it interferes with verification

Now with respect to a), the trouble with GCC is that the code generated
with no optimization is really horrible. Much worse than typical competing
compilers operating in no optimization mode. Now of course we can say
"yes, but gcc is really doing what you want, the other compiler is not"
but the fact remains that you are stuck between two unpleasant choices

  -O0 generates far too much code and giant executables
  -O1 already loses debugging information

I think there is a real need for a mode which would do all possible
optimizations that do NOT interfere with debugging. I would probably
use this as my default development mode all the time.

With respect to b) one has to be careful that sometimes some limited
amount of optimization (e.g. simple register tracking, and slightly
reasonable register allocation) can cut down the size of the code
enough that compilation time suffers very little, or is even improved.

With respect to c), we find in practice that -O1 mode is manageable for
a lot of certification needs, but probably it is a good idea to retain
the absolutely-no-optimization mode.

^ permalink raw reply	[flat|nested] 256+ messages in thread

end of thread, other threads:[~2002-08-23 15:39 UTC | newest]

Thread overview: 256+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-08-09 12:17 Faster compilation speed Mike Stump
2002-08-09 13:04 ` Noel Yap
2002-08-09 13:10   ` Matt Austern
2002-08-09 14:22   ` Neil Booth
2002-08-09 14:44     ` Noel Yap
2002-08-09 15:14       ` Neil Booth
2002-08-10 15:54         ` Noel Yap
2002-08-09 15:13   ` Stan Shebs
2002-08-09 15:18     ` Neil Booth
2002-08-10 16:12       ` Noel Yap
2002-08-10 18:00         ` Nix
2002-08-10 20:36           ` Noel Yap
2002-08-11  4:30             ` Nix
2002-08-12 15:08           ` Mike Stump
2002-08-09 15:19     ` Ziemowit Laski
2002-08-09 15:25       ` Neil Booth
2002-08-10 16:16       ` Noel Yap
2002-08-10 16:07     ` Noel Yap
2002-08-10 16:18       ` Neil Booth
2002-08-10 20:27         ` Noel Yap
2002-08-11  0:11           ` Neil Booth
2002-08-12 12:04             ` Devang Patel
2002-08-09 18:57   ` Linus Torvalds
2002-08-09 19:12     ` Phil Edwards
2002-08-09 19:34     ` Kevin Atkinson
2002-08-09 20:28       ` Linus Torvalds
2002-08-09 21:12         ` Daniel Berlin
2002-08-09 21:52           ` Linus Torvalds
2002-08-10  6:32         ` Robert Lipe
2002-08-10 14:26           ` Cyrille Chepelov
2002-08-10 17:33             ` Daniel Berlin
2002-08-10 18:21               ` Linus Torvalds
2002-08-10 18:38                 ` Daniel Berlin
2002-08-10 18:39                 ` Cyrille Chepelov
2002-08-10 18:28               ` Cyrille Chepelov
2002-08-10 18:30                 ` John Levon
2002-08-11  1:03             ` Florian Weimer
2002-08-10 19:20     ` Noel Yap
2002-08-09 13:10 ` Aldy Hernandez
2002-08-09 15:28   ` Mike Stump
2002-08-09 16:00     ` Aldy Hernandez
2002-08-09 16:26       ` Stan Shebs
2002-08-09 16:31         ` Aldy Hernandez
2002-08-09 16:51           ` Stan Shebs
2002-08-09 16:54             ` Aldy Hernandez
2002-08-09 17:44             ` Daniel Berlin
2002-08-09 18:35               ` David S. Miller
2002-08-09 18:39                 ` Aldy Hernandez
2002-08-09 18:59                   ` David S. Miller
2002-08-09 20:01                   ` Per Bothner
2002-08-09 18:25             ` David S. Miller
2002-08-13  0:50               ` Loren James Rittle
2002-08-13 21:46                 ` Fergus Henderson
2002-08-13 22:40                   ` David S. Miller
2002-08-13 23:44                     ` Fergus Henderson
2002-08-14  7:58                     ` Jeff Sturm
2002-08-14  9:52                     ` Richard Henderson
2002-08-14 10:00                       ` David Edelsohn
2002-08-14 12:01                         ` Andreas Schwab
2002-08-14 12:07                           ` David Edelsohn
2002-08-14 13:20                             ` Jamie Lokier
2002-08-14 16:01                               ` Nix
2002-08-14 13:20                             ` Michael Matz
2002-08-14 16:31                               ` Faster compilation speed [zone allocation] Per Bothner
2002-08-15 11:34                                 ` Aldy Hernandez
2002-08-15 11:39                                   ` David Edelsohn
2002-08-15 12:01                                     ` Lynn Winebarger
2002-08-15 12:11                                       ` David Edelsohn
2002-08-15 11:41                                   ` Michael Matz
2002-08-16  8:44                                     ` Kai Henningsen
2002-08-15 11:43                                   ` Per Bothner
2002-08-15 11:57                                   ` Kevin Handy
2002-08-14 10:15                       ` Faster compilation speed David Edelsohn
2002-08-14 16:35                         ` Richard Henderson
2002-08-14 17:02                           ` David Edelsohn
2002-08-20  4:15                         ` Richard Earnshaw
2002-08-20  5:38                           ` Jeff Sturm
2002-08-20  5:53                             ` Richard Earnshaw
2002-08-20 13:42                               ` Jeff Sturm
2002-08-22  1:55                                 ` Richard Earnshaw
2002-08-22  2:03                                   ` David S. Miller
2002-08-23 15:39                                   ` Jeff Sturm
2002-08-20  8:00                           ` David Edelsohn
2002-08-14  7:36                   ` Jeff Sturm
2002-08-10 10:02             ` Neil Booth
2002-08-09 17:36         ` Daniel Berlin
2002-08-12 16:23         ` Mike Stump
2002-08-12 16:05       ` Mike Stump
2002-08-09 19:07     ` David Edelsohn
2002-08-09 14:29 ` Neil Booth
2002-08-09 15:02   ` Nathan Sidwell
2002-08-09 17:05     ` Stan Shebs
2002-08-10  2:21     ` Gabriel Dos Reis
2002-08-12 12:11   ` Mike Stump
2002-08-12 12:41     ` David Edelsohn
2002-08-12 12:47       ` Matt Austern
2002-08-12 12:56         ` David S. Miller
2002-08-12 13:56           ` Matt Austern
2002-08-12 14:27             ` Daniel Berlin
2002-08-12 15:26               ` David Edelsohn
2002-08-13 10:49                 ` David Edelsohn
2002-08-13 10:52                   ` David S. Miller
2002-08-13 14:03                   ` David Edelsohn
2002-08-13 14:46                     ` Geoff Keating
2002-08-13 15:10                       ` David Edelsohn
2002-08-13 15:26                         ` Neil Booth
2002-08-14  9:25                     ` Kevin Handy
2002-08-18 12:58                     ` Jeff Sturm
2002-08-19 12:55                       ` Mike Stump
2002-08-20 11:22                       ` Will Cohen
2002-08-13 15:32                   ` Daniel Berlin
2002-08-13 15:58                     ` David Edelsohn
2002-08-13 16:49                       ` David S. Miller
2002-08-12 14:59             ` David S. Miller
2002-08-12 16:00             ` Geoff Keating
2002-08-13  2:58               ` Nick Ing-Simmons
2002-08-13 10:47               ` Richard Henderson
2002-08-12 14:28           ` Stan Shebs
2002-08-12 15:05             ` David S. Miller
2002-08-12 19:17     ` Mike Stump
2002-08-12 23:28       ` Neil Booth
2002-08-09 14:51 ` Stan Shebs
2002-08-09 15:03   ` David Edelsohn
2002-08-09 15:43     ` Stan Shebs
2002-08-09 16:43     ` Alan Lehotsky
2002-08-09 16:49       ` Matt Austern
2002-08-10  2:24         ` Gabriel Dos Reis
2002-08-09 15:26   ` Geoff Keating
2002-08-09 16:06     ` Stan Shebs
2002-08-09 16:14       ` Terry Flannery
2002-08-09 16:29         ` Neil Booth
2002-08-09 16:29       ` Phil Edwards
2002-08-12 16:24         ` Mike Stump
2002-08-12 18:38           ` Phil Edwards
2002-08-13  5:27           ` Theodore Papadopoulo
2002-08-13 10:03             ` Mike Stump
2002-08-12 15:55     ` Mike Stump
2002-08-09 14:59 ` Timothy J. Wood
2002-08-16 13:31   ` Problem with PFE approach [Was: Faster compilation speed] Timothy J. Wood
2002-08-16 13:44     ` Devang Patel
2002-08-16 14:31       ` Timothy J. Wood
2002-08-16 14:39         ` Neil Booth
2002-08-16 14:46         ` Devang Patel
2002-08-16 13:54     ` Devang Patel
2002-08-16 14:42       ` Neil Booth
2002-08-16 14:57         ` Devang Patel
2002-08-17 15:31           ` Timothy J. Wood
2002-08-17 20:04             ` Daniel Berlin
2002-08-17 20:07               ` Andrew Pinski
2002-08-17 20:14               ` Timothy J. Wood
2002-08-17 20:21                 ` Daniel Berlin
2002-08-18  3:17                   ` Kai Henningsen
2002-08-18  7:36                     ` Daniel Berlin
2002-08-18 11:20                       ` jepler
2002-08-18 13:20                         ` Daniel Berlin
2002-08-18 14:31                           ` Timothy J. Wood
2002-08-18 14:35                             ` Andrew Pinski
2002-08-18 14:55                               ` Timothy J. Wood
2002-08-19  2:41                             ` Michael Matz
2002-08-19  6:26                               ` jepler
2002-08-19  6:40                                 ` Daniel Berlin
2002-08-19 11:50                                 ` Devang Patel
2002-08-19 12:55                                   ` Jeff Epler
2002-08-19 13:03                                     ` Ziemowit Laski
2002-08-19 11:53                               ` Devang Patel
2002-08-19 11:59                 ` Devang Patel
2002-08-17 20:15             ` Daniel Berlin
2002-08-19  7:07             ` Stan Shebs
2002-08-19  8:52               ` Timothy J. Wood
2002-08-16 14:45       ` Timothy J. Wood
2002-08-09 16:01 ` Faster compilation speed Richard Henderson
2002-08-10 17:48 ` Aaron Lehmann
2002-08-12 10:36   ` Dale Johannesen
2002-08-09 19:45 Robert Dewar
2002-08-09 20:24 ` Daniel Berlin
2002-08-10  4:35 Robert Dewar
2002-08-10  9:45 ` Linus Torvalds
2002-08-10 10:24   ` David Edelsohn
2002-08-10  4:38 Robert Dewar
2002-08-10  9:47 ` Linus Torvalds
2002-08-10  9:52 Richard Kenner
2002-08-10 10:41 ` Linus Torvalds
2002-08-10 10:43 Robert Dewar
2002-08-10 11:02 ` Linus Torvalds
2002-08-10 10:45 Robert Dewar
2002-08-10 13:26 ` Gianni Mariani
2002-08-10 10:47 Richard Kenner
2002-08-10 11:17 ` Linus Torvalds
2002-08-12 23:33   ` Kai Henningsen
2002-08-10 10:52 Richard Kenner
2002-08-10 10:55 Robert Dewar
2002-08-10 10:56 Robert Dewar
2002-08-10 11:51 Robert Dewar
2002-08-10 13:47 Robert Dewar
2002-08-12 14:10 Morten Welinder
2002-08-12 15:01 ` David S. Miller
2002-08-12 15:21 Robert Dewar
2002-08-12 15:25 ` David S. Miller
2002-08-13 13:46   ` Kai Henningsen
2002-08-12 23:39 Tim Josling
2002-08-13  8:07 Robert Dewar
2002-08-13  8:40 ` Daniel Jacobowitz
2002-08-13  9:10 Robert Dewar
2002-08-13 10:20 ` Theodore Papadopoulo
2002-08-13 21:44   ` Fergus Henderson
2002-08-14  4:00     ` Noel Yap
2002-08-14  4:36       ` Michael Matz
2002-08-14  4:45         ` Noel Yap
2002-08-14 10:06           ` Janis Johnson
2002-08-19  4:58     ` Nick Ing-Simmons
2002-08-13 10:50 ` Matt Austern
2002-08-13 11:53 ` Stan Shebs
2002-08-13 14:53   ` Joe Buck
2002-08-13 10:08 Robert Dewar
2002-08-13 10:36 Robert Dewar
2002-08-13 13:46 ` Kai Henningsen
2002-08-13 16:53 ` Joe Buck
2002-08-13 17:24   ` Paul Koning
2002-08-13 12:02 Robert Dewar
2002-08-13 12:32 ` Robert Lipe
2002-08-13 12:45   ` Gabriel Dos Reis
2002-08-14  2:55 ` Daniel Egger
2002-08-13 12:49 Robert Dewar
2002-08-14 10:17 ` Dale Johannesen
2002-08-14 10:56   ` David S. Miller
2002-08-14 11:04     ` Dale Johannesen
2002-08-14 11:08       ` David S. Miller
2002-08-19  5:15     ` Nick Ing-Simmons
2002-08-19  7:06       ` David S. Miller
2002-08-19 10:29         ` Richard Henderson
2002-08-19 11:33           ` David S. Miller
2002-08-19  9:20       ` Daniel Egger
2002-08-14 11:27   ` Timothy J. Wood
2002-08-14 11:42     ` David S. Miller
2002-08-14 13:13       ` Jamie Lokier
2002-08-19  5:20         ` Nick Ing-Simmons
2002-08-14 13:16     ` Jamie Lokier
2002-08-14 13:29       ` Timothy J. Wood
2002-08-14 13:35         ` Jamie Lokier
2002-08-14 13:43           ` Tim Hollebeek
2002-08-14 13:57             ` Jamie Lokier
2002-08-13 15:00 Tim Josling
2002-08-13 15:48 ` Russ Allbery
2002-08-14 19:11 Tim Josling
     [not found] <1029475232.9572.ezmlm@gcc.gnu.org>
2002-08-16  1:28 ` Mat Hounsell
2002-08-16  5:08 Joe Wilson
2002-08-16  5:51 ` Noel Yap
2002-08-16 11:04 ` Mike Stump
     [not found] <1029519609.8400.ezmlm@gcc.gnu.org>
2002-08-17 22:23 ` Mat Hounsell
2002-08-18  6:27   ` Michael S. Zick
2002-08-20 14:11 Tim Josling
2002-08-20 14:13 ` David S. Miller
2002-08-20 14:43 ` Stan Shebs
2002-08-21  6:59 Richard Kenner
2002-08-21 15:04 ` David S. Miller
2002-08-21 15:35 Tim Josling

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).