Inter-CU DWARF size optimizations and gcc -flto

public inbox for archer@sourceware.org

* Inter-CU DWARF size optimizations and gcc -flto
@ 2012-02-01 13:23 Jan Kratochvil
  2012-02-01 13:32 ` Jakub Jelinek
  2012-02-22 21:56 ` Tom Tromey
  0 siblings, 2 replies; 8+ messages in thread
From: Jan Kratochvil @ 2012-02-01 13:23 UTC (permalink / raw)
  To: archer; +Cc: Jakub Jelinek

Hi,

I am sorry if it is clear to everyone but I admit I played with it only
yesterday.

With
	gcc -flto -flto-partition=none

gcc outputs only a single CU (Compilation Unit).  With the default (omitting
-flto-partition) there are multiple CUs, but still only a few compared to the
number of .o files.

-flto is AFAIK the future for all the compilations.  It is well known -flto
debug info is somehow broken now but that needs to be fixed anyway.

As the DWARF size has been discussed for the 5+ years I have been in Tools,
this is a long-term project, and waiting for (helping, heh) a working -flto is
an acceptable solution.

This has some implications:

(a) DWARF post-processing optimization tool no longer makes sense with -flto.

    (a1) Intra-CU optimizations in GCC make sense as it is the final output.

(b) .gdb_index will have limited scope, only to select which objfiles to expand,
    no longer to select which CUs to expand.

(c) Partial CU expansion Tom Tromey talks about is a must in such case,
    although the smaller LTO debug info takes only 63% of GDB memory
    requirements compared to the non-LTO (many-CUs) debug info.
    (GDB memory requirement is not directly proportional to the DWARF size.)

With -flto-partition=none linking of GDB took about 900MB of memory.  I am not
sure how Honza Hubicka's memory requirements for LTO (2.7GB for Mozilla) were
related to -flto-partition.  Still, some GBs of cheap memory for the few hosts
in the build farm (Koji) for Mozilla + LibreOffice should not be such a concern
IMO.

FYI for gdb with Rawhide -O2-style CFLAGS (-gdwarf-4 -fno-debug-types-section):

-fno-debug-types-section:
                       |  non-LTO  |    LTO
stripped binary size   |   5023064 |   4985864
separate .debug size   |  19190280 |  12484312 =65%
GDB RSS -readnow       | 160136 KB | 106252 KB
GDB RSS without .debug |  14964 KB |  14972 KB
GDB RSS difference     | 145172 KB |  91280 KB =63%

I had an idea that those 65% (a 35% reduction) could be the magic ratio
achievable by the hypothetically optimal "Roland's" DWARF optimizer.  But
struct range_bounds alone is still defined there (including all its fields)
49x, so this is still far from optimal/"Roland's one".

Additionally with -fdebug-types-section (first column repeated from above):
                       |  non-LTO  | non-LTO .debug_types | LTO .debug_types
stripped binary size   |   5023064 |   5023064            |   4985864
separate .debug size   |  19190280 |  12789960 = 67%      |  12170080 = 63%
GDB RSS -readnow       | 160136 KB |  77524 KB            | 227876 KB
GDB RSS without .debug |  14964 KB |  14968 KB            |  14964 KB
GDB RSS difference     | 145172 KB |  62556 KB = 43%      | 212912 KB = 147%
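
For readers who want to check the percentages, here is a small sketch that
recomputes them from the figures in the two tables.  It assumes each percentage
is taken relative to the plain non-LTO column, which is how the numbers appear
to have been derived; the dictionary keys and function names are made up for
this illustration.

```python
# Figures copied from the two tables (sizes in bytes, RSS in KB).
# The key names are illustrative only.
BASELINE = {"debug_size": 19190280, "rss_readnow": 160136, "rss_no_debug": 14964}

VARIANTS = {
    "LTO": {
        "debug_size": 12484312, "rss_readnow": 106252, "rss_no_debug": 14972,
    },
    "non-LTO .debug_types": {
        "debug_size": 12789960, "rss_readnow": 77524, "rss_no_debug": 14968,
    },
    "LTO .debug_types": {
        "debug_size": 12170080, "rss_readnow": 227876, "rss_no_debug": 14964,
    },
}


def rss_difference(row):
    """RSS attributable to debug info: -readnow RSS minus RSS without .debug."""
    return row["rss_readnow"] - row["rss_no_debug"]


def percent_of_baseline(value, baseline):
    """Value as a rounded percentage of the non-LTO baseline."""
    return round(100 * value / baseline)


def summarize():
    """Map each variant to (.debug size %, GDB RSS difference %)."""
    base_diff = rss_difference(BASELINE)
    return {
        name: (
            percent_of_baseline(row["debug_size"], BASELINE["debug_size"]),
            percent_of_baseline(rss_difference(row), base_diff),
        )
        for name, row in VARIANTS.items()
    }


if __name__ == "__main__":
    for name, (size_pct, rss_pct) in summarize().items():
        print(f"{name}: .debug size {size_pct}%, GDB RSS difference {rss_pct}%")
```
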

This has IMO some implications:

(z) gcc/dwarf2out.c is a viable place where to implement "Roland's" DWARF
    optimizer.


Regards,
Jan

* Re: Inter-CU DWARF size optimizations and gcc -flto
  2012-02-01 13:23 Inter-CU DWARF size optimizations and gcc -flto Jan Kratochvil
@ 2012-02-01 13:32 ` Jakub Jelinek
  2012-02-22 21:56 ` Tom Tromey
  1 sibling, 0 replies; 8+ messages in thread
From: Jakub Jelinek @ 2012-02-01 13:32 UTC (permalink / raw)
  To: Jan Kratochvil; +Cc: archer, Jason Merrill

On Wed, Feb 01, 2012 at 02:23:09PM +0100, Jan Kratochvil wrote:
> I am sorry if it is clear to everyone but I admit I played with it only
> yesterday.
> 
> With
> 	gcc -flto -flto-partition=none
> 
> gcc outputs only a single CU (Compilation Unit).  With the default (omitting
> -flto-partition) there are multiple CUs, but still only a few compared to the
> number of .o files.
> 
> -flto is AFAIK the future for all the compilations.  It is well known -flto
> debug info is somehow broken now but that needs to be fixed anyway.

It isn't only somehow broken, it is quite fundamentally broken.  And even
with LTO GCC should output CUs matching the original source, one CU per
source IMHO, which is admittedly going to be very difficult though,
especially when partitioning the compilation, because multiple partitions
might need to add stuff to a single CU.  IMHO at least for us -flto is a
no-go until these problems are solved though.

	Jakub

* Re: Inter-CU DWARF size optimizations and gcc -flto
  2012-02-01 13:23 Inter-CU DWARF size optimizations and gcc -flto Jan Kratochvil
  2012-02-01 13:32 ` Jakub Jelinek
@ 2012-02-22 21:56 ` Tom Tromey
  2012-02-26 15:09   ` Daniel Jacobowitz
  1 sibling, 1 reply; 8+ messages in thread
From: Tom Tromey @ 2012-02-22 21:56 UTC (permalink / raw)
  To: Jan Kratochvil; +Cc: archer, Jakub Jelinek

Jan> (b) .gdb_index will have limited scope, only to select which
Jan> objfiles to expand, no longer to select which CUs to expand.

I suspect we are going to need a better approach here anyway.
I sometimes hear about programs with more than 800 shared libraries.
If you assume separate debuginfo this means 1600 objfiles.
I think this will just crush most of the existing algorithms in gdb.

Jan> (c) Partial CU expansion Tom Tromey talks about is a must in such case.

I realized I never wrote up how this could work.  The below is sort of a
sketch that devolves into random thoughts.

I have been thinking about it since we discussed it and I think it has a
potentially severe problem.

The basic idea is simple: right now we have two DWARF readers in
dwarf2read.c, the psymtab reader and the full symbol reader.

Right now when we find a psymbol, we expand the whole CU to full
symbols.  This normally isn't too bad -- but there are some CUs out
there in practice that are quite large, and the delay reading them is
noticeable.

So, what if we unified the two readers -- eliminating one source of bugs
-- and also changed CU expansion to be DIE-based.  That is, in symtab.c,
before returning a symbol from a symtab, we would call some back-end
function to expand the symbol.  The DWARF reader would then just read
the DIEs needed to instantiate that one particular symbol plus whatever
dependencies (types usually) it has.

Ok, that sounds good, but there is a problem: struct symbol is really
big, much bigger than a psymbol.  We could just read psymbol-like
structs on our first pass, but we need somewhere to store the DIE offset
for efficient expansion.

We can solve that by updating and applying an old patch that shrinks
psymbol.  Then we can use the saved space to store the DIE -- so this
change can be space-neutral.

However, this neglects the bcache.  In fact, the bcache sinks the whole
project, since DIE offsets will vary by definition.
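
To illustrate the bcache problem, here is a toy sketch.  It assumes the bcache
behaves like a content-addressed store that keeps one copy of each distinct
byte string; the ByteCache class and the record layout in pack_psymbol are
invented for this example and are not GDB's actual structures.

```python
import struct


class ByteCache:
    """Toy content-addressed cache: identical byte strings are stored once."""

    def __init__(self):
        self._entries = {}

    def intern(self, data: bytes) -> bytes:
        # Return the already-stored copy if an identical record exists.
        return self._entries.setdefault(data, data)

    def __len__(self):
        return len(self._entries)


def pack_psymbol(name, address_class, die_offset=None):
    """Serialize a hypothetical psymbol record, optionally with a DIE offset."""
    payload = name.encode() + b"\0" + struct.pack("<B", address_class)
    if die_offset is not None:
        payload += struct.pack("<Q", die_offset)
    return payload


def count_unique(records, with_die_offset):
    """Number of distinct records the cache ends up holding."""
    cache = ByteCache()
    for name, address_class, die_offset in records:
        offset = die_offset if with_die_offset else None
        cache.intern(pack_psymbol(name, address_class, offset))
    return len(cache)


# The same symbol seen in three CUs, each at a different DIE offset.
RECORDS = [("shared_type", 1, 0x100), ("shared_type", 1, 0x2340), ("shared_type", 1, 0x9f08)]

if __name__ == "__main__":
    print("without DIE offset:", count_unique(RECORDS, with_die_offset=False))
    print("with DIE offset:   ", count_unique(RECORDS, with_die_offset=True))
```

Without the offset the three records collapse into one cache entry; once the
offset is part of the record, every record is distinct and nothing is shared.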

Well, the DIE offset sinks this particular approach.  Maybe there is
another approach, not space-neutral but also not too bad, that can be
used.  For example, keeping the bcache but having the symtabs contain
both {psymbol+DIE} pairs and fully-expanded symbols (depending on what
has been expanded).
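
A rough sketch of that alternative, under heavy assumptions: a symtab that
stores either a compact {name + DIE offset} entry or a fully-expanded symbol,
and expands a single entry on lookup.  All names here (PartialEntry,
FullSymbol, LazySymtab, fake_die_reader) are hypothetical, and the DIE reader
is a stand-in table rather than a real DWARF parser.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Union


@dataclass(frozen=True)
class PartialEntry:
    """Compact psymbol-like record: just enough to find the DIE later."""
    name: str
    die_offset: int


@dataclass
class FullSymbol:
    """Stand-in for the much larger fully-expanded symbol."""
    name: str
    type_name: str
    die_offset: int


class LazySymtab:
    """Symtab whose entries are expanded one DIE at a time, on demand."""

    def __init__(self, expand_die: Callable[[int], FullSymbol]):
        self._expand_die = expand_die
        self._entries: Dict[str, Union[PartialEntry, FullSymbol]] = {}
        self.expansions = 0

    def add_partial(self, name: str, die_offset: int) -> None:
        self._entries[name] = PartialEntry(name, die_offset)

    def lookup(self, name: str) -> Optional[FullSymbol]:
        entry = self._entries.get(name)
        if entry is None:
            return None
        if isinstance(entry, PartialEntry):
            # Expand only this symbol, not the whole CU.
            entry = self._expand_die(entry.die_offset)
            self._entries[name] = entry
            self.expansions += 1
        return entry

    def expanded_count(self) -> int:
        return sum(isinstance(e, FullSymbol) for e in self._entries.values())


# Pretend DWARF contents: DIE offset -> (name, type name).
FAKE_DIES = {0x10: ("main", "int"), 0x40: ("helper", "void")}


def fake_die_reader(die_offset: int) -> FullSymbol:
    name, type_name = FAKE_DIES[die_offset]
    return FullSymbol(name=name, type_name=type_name, die_offset=die_offset)


if __name__ == "__main__":
    symtab = LazySymtab(fake_die_reader)
    for offset, (name, _) in FAKE_DIES.items():
        symtab.add_partial(name, offset)
    print(symtab.lookup("main"))
    print("expanded entries:", symtab.expanded_count())
```

A real reader would also have to pull in the symbol's dependencies (types,
usually), which this sketch leaves out.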

If we went a bit deeper and had hierarchical symbol tables, we could
skip whole DIE subtrees even in the partial reader.

A related idea here that I was idly wondering about is whether we could
make the psymtab reader hierarchical without touching full symbols.

The deeper rewrite seems eventually necessary.  The symbol table code is
pretty horrible, in multiple ways.  However, at least for me it hasn't
yet reached the pain point where we can justify spending months and
months on it, which I think is what it would take.

Your thoughts welcome.

Tom

* Re: Inter-CU DWARF size optimizations and gcc -flto
  2012-02-22 21:56 ` Tom Tromey
@ 2012-02-26 15:09   ` Daniel Jacobowitz
  2012-03-03  2:54     ` Tom Tromey
  0 siblings, 1 reply; 8+ messages in thread
From: Daniel Jacobowitz @ 2012-02-26 15:09 UTC (permalink / raw)
  To: Tom Tromey; +Cc: Jan Kratochvil, archer, Jakub Jelinek

On Wed, Feb 22, 2012 at 4:56 PM, Tom Tromey <tromey@redhat.com> wrote:
> Jan> (b) .gdb_index will have limited scope, only to select which
> Jan> objfiles to expand, no longer to select which CUs to expand.
>
> I suspect we are going to need a better approach here anyway.
> I sometimes hear about programs with more than 800 shared libraries.
> If you assume separate debuginfo this means 1600 objfiles.
> I think this will just crush most of the existing algorithms in gdb.

You are correct, it does crush GDB :-)  I routinely try - emphasis on
try - to use GDB on programs with between 2500 and 5500 shared
libraries.  It's agonizing.  I have another project I want to work on
first, and not much time for GDB lately, but this is absolutely on my
list to improve.

-- 
Thanks,
Daniel

* Re: Inter-CU DWARF size optimizations and gcc -flto
  2012-02-26 15:09   ` Daniel Jacobowitz
@ 2012-03-03  2:54     ` Tom Tromey
  2012-03-05  0:25       ` Daniel Jacobowitz
  0 siblings, 1 reply; 8+ messages in thread
From: Tom Tromey @ 2012-03-03  2:54 UTC (permalink / raw)
  To: Daniel Jacobowitz; +Cc: Jan Kratochvil, archer, Jakub Jelinek

>>>>> "Daniel" == Daniel Jacobowitz <drow@false.org> writes:

Daniel> You are correct, it does crush GDB :-)  I routinely try - emphasis on
Daniel> try - to use GDB on programs with between 2500 and 5500 shared
Daniel> libraries.  It's agonizing.  I have another project I want to work on
Daniel> first, and not much time for GDB lately, but this is absolutely on my
Daniel> list to improve.

I am curious how you plan to improve it.

The plan I mentioned upthread is probably pretty good for scaling to
distro-sized programs, say 200 shared libraries or less (this is
LibreOffice or Mozilla).  Maybe we could get a bit more by putting
minsyms into the index.

I am not so confident it would let gdb scale to 5000 shared libraries
though.

For that size I've had two ideas.

First, and simplest, punt.  Make the user disable automatic reading of
shared library debuginfo (or even minsyms) and make the user explicitly
mention which ones should be used -- either by 'sharedlibrary' or by a
linespec extension.

I guess this one would sort of work today.  (I haven't tried.)

Second, and harder, is the "big data" approach.  This would be something
like -- load all the debuginfo into a server, tagged by build-id,
ideally with global type- and symbol-interning; then change gdb to send
queries to the server and get back the minimal DWARF (or DWARF-esque
bits) needed; crucially, this would be a global operation instead of
per-objfile, so that gdb could exploit parallelism on the server side.
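
A very rough sketch of what such a server could look like, purely as an
illustration: an in-memory store keyed by build-id, with type descriptions
interned globally, answering one symbol query across many objfiles in
parallel.  The class, its methods and the string-based "type descriptions" are
all assumptions made for this example; no such gdb interface exists in this
thread.

```python
from concurrent.futures import ThreadPoolExecutor


class DebugInfoServer:
    """Toy debuginfo store: symbols tagged by build-id, types interned globally."""

    def __init__(self):
        self._by_build_id = {}     # build-id -> {symbol name: type description}
        self._interned_types = {}  # type description -> single shared instance

    def load(self, build_id, symbols):
        """Register one objfile's symbols, sharing identical type descriptions."""
        table = {}
        for name, type_desc in symbols.items():
            table[name] = self._interned_types.setdefault(type_desc, type_desc)
        self._by_build_id[build_id] = table

    def _search_one(self, build_id, name):
        type_desc = self._by_build_id.get(build_id, {}).get(name)
        return None if type_desc is None else (build_id, name, type_desc)

    def query(self, name, build_ids, workers=4):
        """One global query, fanned out over all requested build-ids."""
        with ThreadPoolExecutor(max_workers=workers) as pool:
            hits = pool.map(lambda b: self._search_one(b, name), build_ids)
            # pool.map preserves the order of build_ids.
            return [hit for hit in hits if hit is not None]

    def interned_type_count(self):
        return len(self._interned_types)


if __name__ == "__main__":
    server = DebugInfoServer()
    server.load("aa11", {"foo": "struct foo { int x; }", "bar": "int"})
    server.load("bb22", {"foo": "struct foo { int x; }"})
    print(server.query("foo", ["aa11", "bb22", "cc33"]))
    print("distinct types stored:", server.interned_type_count())
```

The point of the sketch is only the shape of the interface: the client names a
symbol and a set of build-ids once, and the server decides how to spread the
work.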

Parallelism seems key to me.  Parallelism on the machine running gdb
probably wouldn't work out, though, on the theory that there'd be too
much disk contention.  Dunno, maybe worth trying.

Tom

* Re: Inter-CU DWARF size optimizations and gcc -flto
  2012-03-03  2:54     ` Tom Tromey
@ 2012-03-05  0:25       ` Daniel Jacobowitz
  2012-03-05 22:03         ` Tom Tromey
  2012-03-15 12:51         ` Gary Benson
  0 siblings, 2 replies; 8+ messages in thread
From: Daniel Jacobowitz @ 2012-03-05  0:25 UTC (permalink / raw)
  To: Tom Tromey; +Cc: Jan Kratochvil, archer, Jakub Jelinek

On Fri, Mar 2, 2012 at 9:54 PM, Tom Tromey <tromey@redhat.com> wrote:
>>>>>> "Daniel" == Daniel Jacobowitz <drow@false.org> writes:
>
> Daniel> You are correct, it does crush GDB :-)  I routinely try - emphasis on
> Daniel> try - to use GDB on programs with between 2500 and 5500 shared
> Daniel> libraries.  It's agonizing.  I have another project I want to work on
> Daniel> first, and not much time for GDB lately, but this is absolutely on my
> Daniel> list to improve.
>
> I am curious how you plan to improve it.

I have no idea.  One thing I'd like to revisit is your work on
threaded symbol load; I have plenty of cores available, and the
machine is pretty much useless to me until my test starts.  There's
also a lot of room for profiling to identify bad algorithms; I think
we spend a lot of time reading the solib list from the inferior
(something I thought I and others had fixed thoroughly already...) and
I routinely hit inefficient algorithms e.g. during "next".

>
>
> The plan I mentioned upthread is probably pretty good for scaling to
> distro-sized programs, say 200 shared libraries or less (this is
> LibreOffice or Mozilla).  Maybe we could get a bit more by putting
> minsyms into the index.
>
> I am not so confident it would let gdb scale to 5000 shared libraries
> though.
>
> For that size I've had two ideas.
>
> First, and simplest, punt.  Make the user disable automatic reading of
> shared library debuginfo (or even minsyms) and make the user explicitly
> mention which ones should be used -- either by 'sharedlibrary' or by a
> linespec extension.
>
> I guess this one would sort of work today.  (I haven't tried.)

I am hugely unexcited by this.  Even if we did basic usability work on
top of that - e.g. automatically load all solibs that appear in the
backtrace - the inability to find sources by file:line is a huge
problem for me.

>
>
> Second, and harder, is the "big data" approach.  This would be something
> like -- load all the debuginfo into a server, tagged by build-id,
> ideally with global type- and symbol-interning; then change gdb to send
> queries to the server and get back the minimal DWARF (or DWARF-esque
> bits) needed; crucially, this would be a global operation instead of
> per-objfile, so that gdb could exploit parallelism on the server side.
>
> Parallelism seems key to me.  Parallelism on the machine running gdb
> probably wouldn't work out, though, on the theory that there'd be too
> much disk contention.  Dunno, maybe worth trying.

This is an idea I'm excited by.  It works well along with Cary's
http://gcc.gnu.org/wiki/DebugFission, too; a separate process could
handle the changes as individual shared libraries are rebuilt.

Something I've been thinking about is that incrementalism is hard in
GDB because the symbol tables are so entwined... adding any sort of
client/server interface would force us to detangle them, and then
individual objects could have a longer life.

-- 
Thanks,
Daniel

* Re: Inter-CU DWARF size optimizations and gcc -flto
  2012-03-05  0:25       ` Daniel Jacobowitz
@ 2012-03-05 22:03         ` Tom Tromey
  2012-03-15 12:51         ` Gary Benson
  1 sibling, 0 replies; 8+ messages in thread
From: Tom Tromey @ 2012-03-05 22:03 UTC (permalink / raw)
  To: Daniel Jacobowitz; +Cc: Jan Kratochvil, archer, Jakub Jelinek

Daniel> I have no idea.  One thing I'd like to revisit is your work on
Daniel> threaded symbol load; I have plenty of cores available, and the
Daniel> machine is pretty much useless to me until my test starts.

This might help, it would be worth trying at least.
I am mildly skeptical about it working well with a very big program.
It seems like you could get into memory trouble, which would need a
different sort of scaling approach.

Also, with .gdb_index, in my tests the startup time of gdb is dominated
by minsym reading, even banal stuff like sorting them.  I think you'd
have to insert some threading bits in there too... easy though.

Daniel> There's
Daniel> also a lot of room for profiling to identify bad algorithms; I think
Daniel> we spend a lot of time reading the solib list from the inferior
Daniel> (something I thought I and others had fixed thoroughly already...) and
Daniel> I routinely hit inefficient algorithms e.g. during "next".

Yeah, I hadn't even gotten to thinking about anything other than the
symbol tables.

Tom> First, and simplest, punt.  Make the user disable automatic reading of
Tom> shared library debuginfo (or even minsyms) and make the user explicitly
Tom> mention which ones should be used -- either by 'sharedlibrary' or by a
Tom> linespec extension.

Daniel> I am hugely unexcited by this.

Yeah, me too.  It would "work" but the user experience would not be
good.

Daniel> Something I've been thinking about is that incrementalism is hard in
Daniel> GDB because the symbol tables are so entwined... adding any sort of
Daniel> client/server interface would force us to detangle them, and then
Daniel> individual objects could have a longer life.

The symbol tables are my least favorite part of gdb right now, wresting
the crown from linespec this year.  Though maybe that is just because I
don't know all parts equally well ;)

Tom

* Re: Inter-CU DWARF size optimizations and gcc -flto
  2012-03-05  0:25       ` Daniel Jacobowitz
  2012-03-05 22:03         ` Tom Tromey
@ 2012-03-15 12:51         ` Gary Benson
  1 sibling, 0 replies; 8+ messages in thread
From: Gary Benson @ 2012-03-15 12:51 UTC (permalink / raw)
  To: Daniel Jacobowitz; +Cc: Tom Tromey, Jan Kratochvil, archer, Jakub Jelinek

Daniel Jacobowitz wrote:
> There's also a lot of room for profiling to identify bad algorithms;
> I think we spend a lot of time reading the solib list from the
> inferior (something I thought I and others had fixed thoroughly
> already...) and I routinely hit inefficient algorithms e.g. during
> "next".

I did some work on this recently.  On my setup (with gdb and the
inferior on the same machine) it was spending a huge chunk of time
regenerating symbol tables every time the solib_event_breakpoint
hit.  The final patch I committed is here:

  http://www.cygwin.com/ml/gdb-patches/2011-10/msg00068.html

If you're seeing some sort of qsort comparison function at the top
of the profile it could be that something is bypassing this.

If you find the time is taken up mostly with transferring data from
the inferior to gdb (I never tried remote, for instance) then you
might be interested in some work I did last year on a SystemTap based
interface between glibc and gdb that should be able to be extended to
allow selective reading of the solib list.  That's waiting on Sergio's
SystemTap stuff... also the glibc maintainers seem hostile to the
idea of us inserting SystemTap probes in there.  I can dig up the code
I had for this if you're interested.

I also had a patch floating around that disabled the solib event
breakpoint under certain conditions, but I think the ambiguous
linespec stuff makes this patch invalid as you always have to be
looking out for new functions turning up.  If you're interested the
thread is http://www.cygwin.com/ml/gdb-patches/2011-09/msg00156.html
but it's probably useless :(

Cheers,
Gary

-- 
http://gbenson.net/
