* Inter-CU DWARF size optimizations and gcc -flto
From: Jan Kratochvil @ 2012-02-01 13:23 UTC
To: archer; +Cc: Jakub Jelinek

Hi,

I am sorry if it is clear to everyone but I admit I played with it only
yesterday.

With
	gcc -flto -flto-partition=none

gcc outputs only single CU (Compilation Unit). With default (omitting)
-flto-partition there are multiple CUs but still a few compared to the number
of .o files.

-flto is AFAIK the future for all the compilations. It is well known -flto
debug info is somehow broken now but that needs to be fixed anyway.

As the DWARF size has been discussed for the 5+ years I have been in Tools,
this is a long-term project, and waiting for (helping, heh) working -flto is
an acceptable solution.

This has some implications:

(a) DWARF post-processing optimization tool no longer makes sense with -flto.
(a1) Intra-CU optimizations in GCC make sense as it is the final output.
(b) .gdb_index will have limited scope, only to select which
    objfiles to expand, no longer to select which CUs to expand.
(c) Partial CU expansion Tom Tromey talks about is a must in such case.
    Although the smaller LTO debug info takes only 63% of GDB memory
    requirements compared to the non-LTO (many-CUs) debug info.
    (GDB memory requirement is not directly proportional to the DWARF size.)

With -flto-partition=none linking of GDB took about 900MB. I am not sure how
Honza Hubicka's memory requirements for LTO (2.7GB for Mozilla) were related
to -flto-partition. Still some GBs of cheap memory for the few hosts in build
farm (Koji) for Mozilla + LibreOffice should not be such a concern IMO.
FYI for gdb with Rawhide -O2-style CFLAGS (-gdwarf-4 -fno-debug-types-section):

-fno-debug-types-section:
                       | non-LTO   | LTO
stripped binary size   |   5023064 |   4985864
separate .debug size   |  19190280 |  12484312 = 65%
GDB RSS -readnow       | 160136 KB | 106252 KB
GDB RSS without .debug |  14964 KB |  14972 KB
GDB RSS difference     | 145172 KB |  91280 KB = 63%

I had an idea those 65% (35% reduction) could be the magic ratio achievable by
the hypothetically optimal "Roland's" DWARF optimizer. But at least
struct range_bounds is there defined (including all its fields) 49x so this is
still far from optimal/"Roland's one".

Additionally with -fdebug-types-section:
                       | non-LTO      | non-LTO         | LTO
                       | (like above) | .debug_types    | .debug_types
stripped binary size   |      5023064 |  5023064        |   4985864
separate .debug size   |     19190280 | 12789960 = 67%  |  12170080 = 63%
GDB RSS -readnow       |    160136 KB | 77524 KB        | 227876 KB
GDB RSS without .debug |     14964 KB | 14968 KB        |  14964 KB
GDB RSS difference     |    145172 KB | 62556 KB = 43%  | 212912 KB = 147%

This has IMO some implications:

(z) gcc/dwarf2out.c is a viable place where to implement "Roland's" DWARF
    optimizer.


Regards,
Jan
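The percentages in Jan's figures can be re-derived from the raw numbers. The sketch below is only a cross-check: the dictionary keys and function names are made up for illustration, while the values are copied from the message (sizes in bytes, RSS in KB). "GDB RSS difference" is taken to be the -readnow RSS minus the RSS without .debug, and each percentage is relative to the non-LTO column.

```python
# Figures copied from the message above (sizes in bytes, RSS in KB).
# The key names are illustrative, not taken from any tool.
NON_LTO = {"debug_size": 19190280, "rss_readnow": 160136, "rss_no_debug": 14964}

VARIANTS = {
    "LTO": {
        "debug_size": 12484312, "rss_readnow": 106252, "rss_no_debug": 14972},
    "non-LTO .debug_types": {
        "debug_size": 12789960, "rss_readnow": 77524, "rss_no_debug": 14968},
    "LTO .debug_types": {
        "debug_size": 12170080, "rss_readnow": 227876, "rss_no_debug": 14964},
}


def rss_difference(row):
    """RSS attributable to debug info: -readnow RSS minus RSS without .debug."""
    return row["rss_readnow"] - row["rss_no_debug"]


def percent_of_baseline(value, baseline):
    """Express value as a whole-number percentage of baseline."""
    return round(100 * value / baseline)


def summarize():
    """Recompute the derived rows of both tables relative to non-LTO."""
    base_diff = rss_difference(NON_LTO)
    summary = {}
    for name, row in VARIANTS.items():
        diff = rss_difference(row)
        summary[name] = {
            "rss_difference_kb": diff,
            "debug_size_pct": percent_of_baseline(
                row["debug_size"], NON_LTO["debug_size"]),
            "rss_difference_pct": percent_of_baseline(diff, base_diff),
        }
    return summary


if __name__ == "__main__":
    for name, stats in summarize().items():
        print(name, stats)
```

The recomputed values agree with the message: .debug size shrinks to 65%, 67% and 63% of the non-LTO size, while the RSS difference is 63%, 43% and 147% of non-LTO, which is the sense in which GDB memory use is not proportional to DWARF size.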
* Re: Inter-CU DWARF size optimizations and gcc -flto
From: Jakub Jelinek @ 2012-02-01 13:32 UTC
To: Jan Kratochvil; +Cc: archer, Jason Merrill

On Wed, Feb 01, 2012 at 02:23:09PM +0100, Jan Kratochvil wrote:
> I am sorry if it is clear to everyone but I admit I played with it only
> yesterday.
>
> With
> 	gcc -flto -flto-partition=none
>
> gcc outputs only single CU (Compilation Unit). With default (omitting)
> -flto-partition there are multiple CUs but still a few compared to the number
> of .o files.
>
> -flto is AFAIK the future for all the compilations. It is well known -flto
> debug info is somehow broken now but that needs to be fixed anyway.

It isn't only somehow broken, it is quite fundamentally broken.
And even with LTO GCC should output CUs matching the original source,
one CU per source IMHO, which is admittedly going to be very difficult
though, especially when partitioning the compilation, because multiple
partitions might need to add stuff to a single CU.
IMHO at least for us -flto is a no-go until these problems are solved though.

	Jakub
* Re: Inter-CU DWARF size optimizations and gcc -flto
From: Tom Tromey @ 2012-02-22 21:56 UTC
To: Jan Kratochvil; +Cc: archer, Jakub Jelinek

Jan> (b) .gdb_index will have limited scope, only to select which
Jan>     objfiles to expand, no longer to select which CUs to expand.

I suspect we are going to need a better approach here anyway.
I sometimes hear about programs with more than 800 shared libraries.
If you assume separate debuginfo this means 1600 objfiles.
I think this will just crush most of the existing algorithms in gdb.

Jan> (c) Partial CU expansion Tom Tromey talks about is a must in such case.

I realized I never wrote up how this could work. The below is sort of a
sketch that devolves into random thoughts. I have been thinking about
it since we discussed it and I think it has a potentially severe
problem.

The basic idea is simple: right now we have two DWARF readers in
dwarf2read.c, the psymtab reader and the full symbol reader. Right now
when we find a psymbol, we expand the whole CU to full symbols. This
normally isn't too bad -- but there are some CUs out there in practice
that are quite large, and the delay reading them is noticeable.

So, what if we unified the two readers -- eliminating one source of
bugs -- and also changed CU expansion to be DIE-based. That is, in
symtab.c, before returning a symbol from a symtab, we would call some
back-end function to expand the symbol. The DWARF reader would then
just read the DIEs needed to instantiate that one particular symbol plus
whatever dependencies (types usually) it has.

Ok, that sounds good, but there is a problem: struct symbol is really
big, much bigger than a psymbol.

We could just read psymbol-like structs on our first pass, but we need
somewhere to store the DIE offset for efficient expansion. We can solve
that by updating and applying an old patch that shrinks psymbol. Then
we can use the saved space to store the DIE -- so this change can be
space-neutral.

However, this neglects the bcache. In fact, the bcache sinks the whole
project, since DIE offsets will vary by definition.

Well, the DIE offset sinks this particular approach. Maybe there is
another approach, not space-neutral but also not too bad, that can be
used. For example, keeping the bcache but having the symtabs contain
both {psymbol+DIE} pairs and fully-expanded symbols (depending on what
has been expanded).

If we went a bit deeper and had hierarchical symbol tables, we could
skip whole DIE subtrees even in the partial reader. A related idea here
that I was idly wondering about is whether we could make the psymtab
reader hierarchical without touching full symbols.

The deeper rewrite seems eventually necessary. The symbol table code is
pretty horrible, in multiple ways. However, at least for me it hasn't
yet reached the pain point where we can justify spending months and
months on it, which I think is what it would take.

Your thoughts welcome.

Tom
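Tom's point about the bcache can be illustrated with a toy model. This is a sketch under stated assumptions, not GDB code: `InternTable` is a hypothetical stand-in for a deduplicating cache that keeps one copy of each distinct record, the record layout is invented, and the DIE offsets are made-up numbers. The count of 49 merely echoes the `struct range_bounds` figure from Jan's message.

```python
class InternTable:
    """Toy stand-in for a deduplicating cache: one copy per distinct record."""

    def __init__(self):
        self._entries = {}

    def intern(self, record):
        # Return the stored record equal to this one, adding it if absent.
        return self._entries.setdefault(record, record)

    def __len__(self):
        return len(self._entries)


def make_psymbols(num_cus, with_die_offset):
    """Build hypothetical psymbol records for one struct seen in every CU.

    Each record is (name, domain); with_die_offset appends an invented
    per-CU offset, mimicking an offset into that CU's debug info.
    """
    records = []
    for cu in range(num_cus):
        record = ("range_bounds", "STRUCT_DOMAIN")
        if with_die_offset:
            record += (0x1000 * cu + 0x2A,)
        records.append(record)
    return records


def count_stored(records):
    """Intern every record and report how many distinct copies are kept."""
    table = InternTable()
    for record in records:
        table.intern(record)
    return len(table)


if __name__ == "__main__":
    print(count_stored(make_psymbols(49, with_die_offset=False)))  # prints 1
    print(count_stored(make_psymbols(49, with_die_offset=True)))   # prints 49
```

Without the offset, all 49 records collapse into one stored copy; once each record carries its own offset, none of them are equal and the cache saves nothing. Under this model, that is why keeping the offset beside the cached record, as in the {psymbol+DIE} pairs suggested above, preserves the sharing.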
* Re: Inter-CU DWARF size optimizations and gcc -flto
From: Daniel Jacobowitz @ 2012-02-26 15:09 UTC
To: Tom Tromey; +Cc: Jan Kratochvil, archer, Jakub Jelinek

On Wed, Feb 22, 2012 at 4:56 PM, Tom Tromey <tromey@redhat.com> wrote:
> Jan> (b) .gdb_index will have limited scope, only to select which
> Jan>     objfiles to expand, no longer to select which CUs to expand.
>
> I suspect we are going to need a better approach here anyway.
> I sometimes hear about programs with more than 800 shared libraries.
> If you assume separate debuginfo this means 1600 objfiles.
> I think this will just crush most of the existing algorithms in gdb.

You are correct, it does crush GDB :-) I routinely try - emphasis on
try - to use GDB on programs with between 2500 and 5500 shared
libraries. It's agonizing. I have another project I want to work on
first, and not much time for GDB lately, but this is absolutely on my
list to improve.

--
Thanks,
Daniel
* Re: Inter-CU DWARF size optimizations and gcc -flto
From: Tom Tromey @ 2012-03-03 2:54 UTC
To: Daniel Jacobowitz; +Cc: Jan Kratochvil, archer, Jakub Jelinek

>>>>> "Daniel" == Daniel Jacobowitz <drow@false.org> writes:

Daniel> You are correct, it does crush GDB :-) I routinely try - emphasis on
Daniel> try - to use GDB on programs with between 2500 and 5500 shared
Daniel> libraries. It's agonizing. I have another project I want to work on
Daniel> first, and not much time for GDB lately, but this is absolutely on my
Daniel> list to improve.

I am curious how you plan to improve it.


The plan I mentioned upthread is probably pretty good for scaling to
distro-sized programs, say 200 shared libraries or less (this is
LibreOffice or Mozilla). Maybe we could get a bit more by putting
minsyms into the index.

I am not so confident it would let gdb scale to 5000 shared libraries
though.

For that size I've had two ideas.

First, and simplest, punt. Make the user disable automatic reading of
shared library debuginfo (or even minsyms) and make the user explicitly
mention which ones should be used -- either by 'sharedlibrary' or by a
linespec extension.

I guess this one would sort of work today. (I haven't tried.)


Second, and harder, is the "big data" approach. This would be something
like -- load all the debuginfo into a server, tagged by build-id,
ideally with global type- and symbol-interning; then change gdb to send
queries to the server and get back the minimal DWARF (or DWARF-esque
bits) needed; crucially, this would be a global operation instead of
per-objfile, so that gdb could exploit parallelism on the server side.

Parallelism seems key to me. Parallelism on the machine running gdb
probably wouldn't work out, though, on the theory that there'd be too
much disk contention. Dunno, maybe worth trying.

Tom
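As a rough illustration of the "big data" idea, the sketch below keeps everything in one process and omits the network layer entirely. Every name in it (`DebugInfoServer`, `load`, `lookup`, the build-id strings and the payload strings) is hypothetical; the payloads merely stand in for the "minimal DWARF" a real server would return. It shows only the two properties the message asks for: symbol names interned once across all objfiles, and a lookup that is one global query rather than a walk over each objfile.

```python
from collections import defaultdict


class DebugInfoServer:
    """Hypothetical in-memory index of debuginfo keyed by build-id."""

    def __init__(self):
        self._names = {}                     # globally interned symbol names
        self._by_symbol = defaultdict(set)   # name -> build-ids defining it
        self._details = {}                   # (build_id, name) -> payload

    def load(self, build_id, symbols):
        """Register one objfile's symbols under its build-id."""
        for name, payload in symbols.items():
            name = self._names.setdefault(name, name)  # share one name object
            self._by_symbol[name].add(build_id)
            self._details[(build_id, name)] = payload

    def lookup(self, name, loaded_build_ids):
        """Answer one global query, restricted to the inferior's objfiles."""
        hits = self._by_symbol.get(name, set()) & set(loaded_build_ids)
        return {bid: self._details[(bid, name)] for bid in sorted(hits)}


if __name__ == "__main__":
    server = DebugInfoServer()
    server.load("aaaa", {"main": "payload for main", "helper": "payload A"})
    server.load("bbbb", {"helper": "payload B"})
    server.load("cccc", {"helper": "payload C"})
    # Only objfiles actually loaded in the inferior are considered.
    print(server.lookup("helper", ["aaaa", "bbbb"]))
```

Because the index is keyed by symbol first, the cost of a query does not grow with the number of objfiles that lack the symbol, which is the property a per-objfile search cannot offer at 5000 shared libraries.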
* Re: Inter-CU DWARF size optimizations and gcc -flto
From: Daniel Jacobowitz @ 2012-03-05 0:25 UTC
To: Tom Tromey; +Cc: Jan Kratochvil, archer, Jakub Jelinek

On Fri, Mar 2, 2012 at 9:54 PM, Tom Tromey <tromey@redhat.com> wrote:
>>>>>> "Daniel" == Daniel Jacobowitz <drow@false.org> writes:
>
> Daniel> You are correct, it does crush GDB :-) I routinely try - emphasis on
> Daniel> try - to use GDB on programs with between 2500 and 5500 shared
> Daniel> libraries. It's agonizing. I have another project I want to work on
> Daniel> first, and not much time for GDB lately, but this is absolutely on my
> Daniel> list to improve.
>
> I am curious how you plan to improve it.

I have no idea. One thing I'd like to revisit is your work on
threaded symbol load; I have plenty of cores available, and the
machine is pretty much useless to me until my test starts. There's
also a lot of room for profiling to identify bad algorithms; I think
we spend a lot of time reading the solib list from the inferior
(something I thought I and others had fixed thoroughly already...) and
I routinely hit inefficient algorithms e.g. during "next".

>
>
> The plan I mentioned upthread is probably pretty good for scaling to
> distro-sized programs, say 200 shared libraries or less (this is
> LibreOffice or Mozilla). Maybe we could get a bit more by putting
> minsyms into the index.
>
> I am not so confident it would let gdb scale to 5000 shared libraries
> though.
>
> For that size I've had two ideas.
>
> First, and simplest, punt. Make the user disable automatic reading of
> shared library debuginfo (or even minsyms) and make the user explicitly
> mention which ones should be used -- either by 'sharedlibrary' or by a
> linespec extension.
>
> I guess this one would sort of work today. (I haven't tried.)

I am hugely unexcited by this. Even if we did basic usability work on
top of that - e.g. automatically load all solibs that appear in the
backtrace - the inability to find sources by file:line is a huge
problem for me.

>
>
> Second, and harder, is the "big data" approach. This would be something
> like -- load all the debuginfo into a server, tagged by build-id,
> ideally with global type- and symbol-interning; then change gdb to send
> queries to the server and get back the minimal DWARF (or DWARF-esque
> bits) needed; crucially, this would be a global operation instead of
> per-objfile, so that gdb could exploit parallelism on the server side.
>
> Parallelism seems key to me. Parallelism on the machine running gdb
> probably wouldn't work out, though, on the theory that there'd be too
> much disk contention. Dunno, maybe worth trying.

This is an idea I'm excited by. It works well along with Cary's
http://gcc.gnu.org/wiki/DebugFission, too; a separate process could
handle the changes as individual shared libraries are rebuilt.

Something I've been thinking about is that incrementalism is hard in
GDB because the symbol tables are so entwined... adding any sort of
client/server interface would force us to detangle them, and then
individual objects could have a longer life.

--
Thanks,
Daniel
* Re: Inter-CU DWARF size optimizations and gcc -flto
From: Tom Tromey @ 2012-03-05 22:03 UTC
To: Daniel Jacobowitz; +Cc: Jan Kratochvil, archer, Jakub Jelinek

Daniel> I have no idea. One thing I'd like to revisit is your work on
Daniel> threaded symbol load; I have plenty of cores available, and the
Daniel> machine is pretty much useless to me until my test starts.

This might help, it would be worth trying at least.

I am mildly skeptical about it working well with a very big program.
It seems like you could get into memory trouble, which would need a
different sort of scaling approach.

Also, with .gdb_index, in my tests the startup time of gdb is dominated
by minsym reading, even banal stuff like sorting them. I think you'd
have to insert some threading bits in there too... easy though.

Daniel> There's
Daniel> also a lot of room for profiling to identify bad algorithms; I think
Daniel> we spend a lot of time reading the solib list from the inferior
Daniel> (something I thought I and others had fixed thoroughly already...) and
Daniel> I routinely hit inefficient algorithms e.g. during "next".

Yeah, I hadn't even gotten to thinking about anything other than the
symbol tables.

Tom> First, and simplest, punt. Make the user disable automatic reading of
Tom> shared library debuginfo (or even minsyms) and make the user explicitly
Tom> mention which ones should be used -- either by 'sharedlibrary' or by a
Tom> linespec extension.

Daniel> I am hugely unexcited by this.

Yeah, me too. It would "work" but the user experience would not be
good.

Daniel> Something I've been thinking about is that incrementalism is hard in
Daniel> GDB because the symbol tables are so entwined... adding any sort of
Daniel> client/server interface would force us to detangle them, and then
Daniel> individual objects could have a longer life.

The symbol tables are my least favorite part of gdb right now, wresting
the crown from linespec this year. Though maybe that is just because I
don't know all parts equally well ;)

Tom
* Re: Inter-CU DWARF size optimizations and gcc -flto
From: Gary Benson @ 2012-03-15 12:51 UTC
To: Daniel Jacobowitz; +Cc: Tom Tromey, Jan Kratochvil, archer, Jakub Jelinek

Daniel Jacobowitz wrote:
> There's also a lot of room for profiling to identify bad algorithms;
> I think we spend a lot of time reading the solib list from the
> inferior (something I thought I and others had fixed thoroughly
> already...) and I routinely hit inefficient algorithms e.g. during
> "next".

I did some work on this recently. On my setup (with gdb and the
inferior on the same machine) it was spending a huge chunk of time
regenerating symbol tables every time the solib_event_breakpoint
hit. The final patch I committed is here:

  http://www.cygwin.com/ml/gdb-patches/2011-10/msg00068.html

If you're seeing some sort of qsort comparison function at the top
of the profile it could be that something is bypassing this.

If you find the time is taken up mostly with transferring data from
the inferior to gdb (I never tried remote, for instance) then you
might be interested in some work I did last year on a SystemTap
based interface between glibc and gdb that should be able to be
extended to allow selective reading of the solib list. That's
waiting on Sergio's SystemTap stuff... also the glibc maintainers
seem hostile to the idea of us inserting SystemTap probes in there.
I can dig up the code I had for this if you're interested.

I also had a patch floating around that disabled the solib event
breakpoint under certain conditions, but I think the ambiguous
linespec stuff makes this patch invalid as you always have to be
looking out for new functions turning up. If you're interested
the thread is

  http://www.cygwin.com/ml/gdb-patches/2011-09/msg00156.html

but it's probably useless :(

Cheers,
Gary

--
http://gbenson.net/