public inbox for gcc@gcc.gnu.org
* Proposal for a 'gcc -O4': interprocedural optimization
@ 2002-08-23 15:31 Chris Lattner
  0 siblings, 0 replies; 7+ messages in thread
From: Chris Lattner @ 2002-08-23 15:31 UTC (permalink / raw)
  To: gcc

One aspect of GCC that is not very well developed is its support for
interprocedural analysis and optimization (especially across translation
unit boundaries).  Here I describe one reasonable (and tested) approach to
adding this to GCC, which would be invoked when compiling at a level of
"-O4" (or perhaps some other interface, the actual command line option
doesn't really matter).  At the end of this email there is some discussion
of how this interacts with the established momentum in the GCC community.

Currently, the GCC pipeline (on the tree-ssa branch) looks like this:

   LanguageSpecificAST     SIMPLE trees               RTL
parser     ->      simplifier -> optimizer -> expander -> RTL Opt -> machine code

Essentially, a front-end generates a language specific version of an AST
for the file being compiled, performs language specific optimizations on
it, and runs it through a language specific simplifier.  The output of the
simplifier is a common language, currently SIMPLE.  Once in the language
independent SIMPLE form, it is optimized, converted to RTL, optimized in
RTL form, then emitted as machine code.  The machine code is then assembled
and linked by the system utilities.

I propose that when compiling with "-O4" enabled, the compiler pipeline
be reduced to this:

   LanguageSpecificAST     SIMPLE trees
parser     ->      simplifier -> optimizer -> "simple code"

... preserving higher-level information (higher level, that is, than RTL or
machine code)
until link time.  The .o files generated by the modified compiler would
not be populated with machine code, but instead would contain a serialized
form of the SIMPLE trees (or some other representation).

The link step would then look like this:

"simple code"  -\
"simple code"  -/  SIMPLE link -> IP optimize -> expander -> RTL opt ->
                       machine code -> assembler--\
object code      ------------------------------------ native link
native library   ---------------------------------/

In other words, we preserve as much code as possible in SIMPLE form until
link time where we can do more aggressive interprocedural optimizations
(for example, real IP alias analysis & inlining).  After these
optimizations are performed, the code is compiled with the traditional GCC
backend just as it is today.

As mentioned in another thread, my group has already done this (with the
exception of reusing the GCC backend for native code generation), so this
is not a pie-in-the-sky idea, and can yield excellent performance for the
compiled application.

There are a few things that are key to making this work:
 1. The representation chosen has to be low level enough that a large part
    of the traditional optimizations can be performed at compile time, to
    avoid really expensive relink times.  SIMPLE fits well here.
 2. Ideally the representation should be cleanly defined so that there is
    a well defined mapping between all of the front-ends involved.  I
    believe this is already a goal with the SIMPLE work.
 3. The representation should have a fast way to serialize and deserialize
    it.  At present, I think that there isn't really a good way for this
    to happen with SIMPLE, although it may be possible to add this in the
    future.  Currently, if I understand correctly, the in-memory
    representation is effectively serialized onto disk, which does not
    seem like a very future-proof or structured format for storing code.

I think that adding this ability to GCC would make a lot of the user base
happy.  Faster executables are always good, and there are often requests
for adding interprocedural optimizations to GCC (mostly from academic type
people that are not very familiar with the GCC architecture).

So I promised to talk about current momentum in the GCC sourcebase:

The tree-ssa/ast-optimizer branch people seem to be doing almost all of
the work required to make this possible already: They are working on
reimplementing optimizations on the SIMPLE representation so that they can
be done more efficiently or better than on RTL.  RTL is wonderful for
machine specific optimization and very late code generation, but is not a
wonderful target for mid-level optimization.

The pfe/precompiled header project addresses one aspect of this, and
it has been proposed in the past that it be used for interprocedural
optimization: basically it provides a mechanism for serializing the
in-memory representation of the tree data structures at any given time.
This can also be used to serialize the tree representations for entire
functions at a time, to be used for interprocedural optimization at link
time.  The way my proposal differs from this possible strategy is that it
is a bit more structured and is more likely to be implementable in
practice.  This is open for discussion of course.

As I mentioned, our group has a lot of experience with this kind of thing
and a good sized library of efficient optimizations... I would really like
to see GCC benefit from our work, especially if it makes
GCC a more powerful and useful compiler.

If this was implemented, how likely is it to run into political problems
that could prevent it from being accepted into the official GCC CVS tree?
Are there technical details I should expound upon?

Any thoughts/comments?

-Chris

http://llvm.cs.uiuc.edu/
http://www.nondot.org/~sabre/Projects/

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Proposal for a 'gcc -O4': interprocedural optimization
  2002-08-24 15:02         ` Chris Lattner
@ 2002-08-24 18:04           ` Dave Hudson
  0 siblings, 0 replies; 7+ messages in thread
From: Dave Hudson @ 2002-08-24 18:04 UTC (permalink / raw)
  To: Chris Lattner; +Cc: gcc

Hi Chris,

Chris Lattner wrote:


* Re: Proposal for a 'gcc -O4': interprocedural optimization
  2002-08-24 14:45       ` Dave Hudson
@ 2002-08-24 15:02         ` Chris Lattner
  2002-08-24 18:04           ` Dave Hudson
  0 siblings, 1 reply; 7+ messages in thread
From: Chris Lattner @ 2002-08-24 15:02 UTC (permalink / raw)
  To: Dave Hudson; +Cc: gcc

On Sat, 24 Aug 2002, Dave Hudson wrote:
> > Ok, that makes sense.  In this case you would still benefit a lot from
> > elimination of loads and stores.  Not only do the actual loads and stores
> > consume instruction space, but folding two loads together has a nice
> > cascading effect on other optimizations that can be performed (mostly
> > scalar simplifications arising from better value #ing information).
>
> Hmm - interesting.  Part of what I've set out to understand with my
> recent code is exactly how much we can gain from such situations.
> "Register" moves are so absurdly expensive that every time I can
> eliminate one it makes me very happy (e.g. for the IP2022 each 16-bit
> reg-to-reg, reg-to-mem or mem-to-mem copy costs 4 opcodes).  Just
> recently I've had some pretty surprising success with some constant
> propagation code I wrote and so allowing this to span multiple functions
> could be *very* useful.

Wow.  It certainly sounds like an interesting architecture! :)  I haven't
actually played with interprocedural constant propagation myself, but I
expect that the opportunities are fairly limited without using function
cloning: basically you would only use it when the parameter to a function
is _always_ the _same_ constant.  That said, it sounds like you could get
some impressive savings just from some simple interprocedural register
allocation...

> Hmm - some of our apps have started to use dlls because we've been tight
> on space, but if the interprocedural wins were sufficiently large then
> this would certainly make a very strong case to eliminate their use, at
> least in places.  I guess though that even in these cases because the
> dlls are being used more as a paging mechanism than anything else then
> their use could be worked out statically anyway.

Sure you could do that as well.  The only reason dynamic loading is
problematic is because it allows users of the program to load code that
was not available at static compilation time.  If you _do_ know all of the
code in use, you can obviously lift this restriction.

> >>This is why I think this sort of optimization has great potential in
> >>many of these sorts of embedded apps.
> >
> > Not _just_ embedded apps!  :)
>
> True, but I think embedded apps are very good examples of where the wins
> are huge - typically code and data spaces are limited and also product
> volumes are sufficiently large and costs sufficiently tight that moving
> to the next sized processor up just isn't an option.

That's actually a good point that I hadn't considered.  With desktops and
scientific applications, it's nice for things to be a few percent faster,
but not critical.  With embedded apps, if you bump over the line of what
your architecture can support, you end up having to move to a different
processor/architecture/system, which could add to the final cost of the
product...

-Chris

http://llvm.cs.uiuc.edu/
http://www.nondot.org/~sabre/Projects/


* Re: Proposal for a 'gcc -O4': interprocedural optimization
  2002-08-24 14:11     ` Chris Lattner
@ 2002-08-24 14:45       ` Dave Hudson
  2002-08-24 15:02         ` Chris Lattner
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Hudson @ 2002-08-24 14:45 UTC (permalink / raw)
  To: Chris Lattner; +Cc: gcc

Hi Chris,

Chris Lattner wrote:


* Re: Proposal for a 'gcc -O4': interprocedural optimization
  2002-08-24 14:02   ` Dave Hudson
@ 2002-08-24 14:11     ` Chris Lattner
  2002-08-24 14:45       ` Dave Hudson
  0 siblings, 1 reply; 7+ messages in thread
From: Chris Lattner @ 2002-08-24 14:11 UTC (permalink / raw)
  To: Dave Hudson; +Cc: gcc

On Sat, 24 Aug 2002, Dave Hudson wrote:

> > Certainly that.  With a reasonable alias analysis infrastructure you can
> > expect a lot more loads to be folded and stores to be eliminated: that
> > alone can save a lot more than your 1%.  The nice thing about having the

> For pretty much most of the targets I'm interested in right now memory
> traffic isn't a problem - everything's on-chip and accessing memory
> costs exactly the same as accessing registers (when every instruction
> costs two bytes of code space and pretty much all non-branching
> instructions cost exactly one clock cycle then determining costs is
> pretty easy :-)).  With the ip2k backend I go to a reasonable amount of

Ok, that makes sense.  In this case you would still benefit a lot from
elimination of loads and stores.  Not only do the actual loads and stores
consume instruction space, but folding two loads together has a nice
cascading effect on other optimizations that can be performed (mostly
scalar simplifications arising from better value #ing information).

> What I'd not thought of here though is that in situations where
> arguments are simply propagated through to another function then that
> could give enormous opportunities for localized improvements.  Much of
> the code I'm interested in already uses explicit inlining within modules
> but getting much of the same advantage across modules would be a big win
> at times.

Sure, you could take advantage of these situations as well.  Although GCC
could do some reasonable interprocedural optimization within translation
units, I don't think it currently does (it's mostly function-at-a-time).
I am probably wrong though, because development has certainly been picking
up, especially on the branches.  :)

> > I've found that one of the interesting (optional) transformations that can
> > be run at link time is an "internalizing" pass.  Basically if your
> > application doesn't have shared libraries calling back into it (which I
> > expect is rare in embedded apps :), you can mark all of the functions in
> > the program (except main) as "static".  This allows the optimizer a lot

> You'd be amazed how much of our code uses library code that calls back
> into places - almost every event that doesn't happen synchronously
> triggers callbacks.  With that said, however, I can suddenly see a whole
> range of potential improvements here (at the moment such improvements

Sure that's absolutely no problem.  Library code is really easy to handle:
just compile the library code and interprocedurally optimize it with the
rest of the application.  Shared objects are the problem, because
currently there is no good way to specify which externally visible
functions may be called by dynamically loaded code.

> generally require more experienced developers to refactor significant
> chunks of code within modules and of course none of this works across
> modules).  One interesting problem however is how a developer would go
> about debugging under such circumstances.  Ideally some means would be
> needed to allow some code to be compiled and marked as being unavailable
> for significant transformations - when code space is constrained then
> building and testing things with -O0 just isn't an option :-(

That too is no problem at all.  In my initial proposal I explicitly
allowed the linker to combine some native .o files and some "high-level"
.o files together.  Of course the linker can do better optimization if
more code is in the high level format, but this is an easy way to control
debuggability.  Code you want to debug can just be compiled as normal,
without IPO.

> > Obviously, this is not appropriate for all applications, but gives a
> > flavor of nice things that can be done with bigger scope for analysis and
> > transformation...
>
> This is why I think this sort of optimization has great potential in
> many of these sorts of embedded apps.

Not _just_ embedded apps!  :)

-Chris

http://llvm.cs.uiuc.edu/
http://www.nondot.org/~sabre/Projects/


* Re: Proposal for a 'gcc -O4': interprocedural optimization
  2002-08-24 13:04 ` Chris Lattner
@ 2002-08-24 14:02   ` Dave Hudson
  2002-08-24 14:11     ` Chris Lattner
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Hudson @ 2002-08-24 14:02 UTC (permalink / raw)
  To: Chris Lattner; +Cc: gcc

Hi Chris,

Chris Lattner wrote:


* Re: Proposal for a 'gcc -O4': interprocedural optimization
       [not found] <3D676E05.7050701@cyclicode.net>
@ 2002-08-24 13:04 ` Chris Lattner
  2002-08-24 14:02   ` Dave Hudson
  0 siblings, 1 reply; 7+ messages in thread
From: Chris Lattner @ 2002-08-24 13:04 UTC (permalink / raw)
  To: Dave Hudson; +Cc: gcc

On Sat, 24 Aug 2002, Dave Hudson wrote:

[ Note, I'm CC'ing the list, because the discussion may be interesting for
  others ]

> One group of users whom I'm sure would love to see interprocedural
> optimization are those of embedded processors.  The AVR, 68HC11 and
> IP2022 ports all run on platforms where just expanding code memory
> isn't an option - similar things happen with some SoCs too.

Sure, there are a lot of interprocedural optimizations focused solely on
removing dead code, interprocedural GCSE, etc... which are powerful ways
to shrink the size of the final program.  Even inlining functions only
called from one place (but across translation unit boundaries) can be a
simple way to reduce the size of programs.

> My current estimate is that even trivial interprocedural optimizations
> could be worth well over 1% to us on code size and around 2% on speed.
> In addition, being able to better optimize the use of GPRs and
> consequently avoiding using stack slots where not necessary could save
> about 1% of our total data usage (and around 4% of stack space usage).

Certainly that.  With a reasonable alias analysis infrastructure you can
expect a lot more loads to be folded and stores to be eliminated: that
alone can save a lot more than your 1%.  The nice thing about having the
whole program available for analysis is that you can write more powerful
analyses, and reuse the existing transformations without change (assuming
well defined interfaces exist of course).  Of course there are a variety
of techniques that can also give you more than 2% speed, for example
redundant load/store elimination can relieve a lot of memory traffic
alone.

> Much more significant though is that by being able to do things like
> constant propagation through our library code I'd anticipate that in
> some applications we could see huge improvements in code size and data
> usage though - our software has some interesting similarities to much
> C++ code even though it's written in C and consequently has quite a lot
> of places where things could be eliminated in the majority of applications.

I've found that one of the interesting (optional) transformations that can
be run at link time is an "internalizing" pass.  Basically if your
application doesn't have shared libraries calling back into it (which I
expect is rare in embedded apps :), you can mark all of the functions in
the program (except main) as "static".  This allows the optimizer a lot
more freedom to, for example, simplify the prototypes for functions, hoist
conditions out of functions into their caller (for example, because the
predicate is a function of a constant argument, etc), and other
transformations.

Obviously, this is not appropriate for all applications, but gives a
flavor of nice things that can be done with bigger scope for analysis and
transformation...

-Chris

http://llvm.cs.uiuc.edu/
http://www.nondot.org/~sabre/Projects/

