From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 92058 invoked by alias); 13 Dec 2019 15:55:37 -0000 Mailing-List: contact dwz-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Post: List-Help: List-Subscribe: Sender: dwz-owner@sourceware.org Received: (qmail 91993 invoked by uid 48); 13 Dec 2019 15:55:31 -0000 From: "vries at gcc dot gnu.org" To: dwz@sourceware.org Subject: [Bug default/25276] New: create_import_tree performance improvements Date: Tue, 01 Jan 2019 00:00:00 -0000 X-Bugzilla-Reason: CC X-Bugzilla-Type: new X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: dwz X-Bugzilla-Component: default X-Bugzilla-Version: unspecified X-Bugzilla-Keywords: X-Bugzilla-Severity: enhancement X-Bugzilla-Who: vries at gcc dot gnu.org X-Bugzilla-Status: NEW X-Bugzilla-Resolution: X-Bugzilla-Priority: P2 X-Bugzilla-Assigned-To: nobody at sourceware dot org X-Bugzilla-Target-Milestone: --- X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status bug_severity priority component assigned_to reporter cc target_milestone Message-ID: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: http://sourceware.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-SW-Source: 2019-q4/txt/msg00160.txt.bz2 https://sourceware.org/bugzilla/show_bug.cgi?id=3D25276 Bug ID: 25276 Summary: create_import_tree performance improvements Product: dwz Version: unspecified Status: NEW Severity: enhancement Priority: P2 Component: default Assignee: nobody at sourceware dot org Reporter: vries at gcc dot gnu.org CC: dwz at sourceware dot org Target Milestone: --- The test-case clang-10 as attached in PR25024, contains ~111 million DIEs. This is larger than the default max die-limit of 50 million, but we can overcome this using -Lnone. Using -Lnone, as well as -lnone to avoid low-mem mode, we get an execution = time of: ... $ time.sh dwz clang-10.debug -o 1 -lnone -Lnone maxmem: 14141844 real: 148.32 user: 131.79 system: 7.07 ... When compiling with -DDEVEL additionally, we get: ... $ time.sh dwz clang-10.debug -o 1 -lnone -Lnone maxmem: 14069588 real: 148.35 user: 132.95 system: 7.31 ... When compiling with the odr patches applied, we get slightly more memory us= ed: ... $ time.sh dwz clang-10.debug -o 1 -lnone -Lnone --no-odr maxmem: 14605608 real: 528.92 user: 142.62 system: 16.78 ... but 'real time' grows with a factor 3.5, presumably because of running into swap space usage. The user time only grows with 7% though. Finally, enabling odr in it's basic mode sees an increase of user time by a factor 3.3, in total 336 seconds extra: ... $ time.sh dwz clang-10.debug -o 1 -lnone -Lnone --odr --odr-mode=3Dbasic maxmem: 14513848 real: 533.34 user: 478.39 system: 12.06 ... Using a patch adding --devel-progress, we can see where we spent the time. Without odr: ... $ time.sh dwz clang-10.debug -o 1 -lnone -Lnone --no-odr --devel-progress read_debug_info .debug_info user: 58.74 sys : 1.84 partition_dups user: 1.17 sys : 0.01 partition_dups qsort user: 1.35 sys : 0.00 partition_dups after qsort user: 1.37 sys : 0.35 create_import_tree phase 1 user: 0.36 sys : 0.02 create_import_tree phase 2 user: 1.33 sys : 0.00 create_import_tree phase 3 user: 20.15 sys : 0.00 create_import_tree phase 4 (create partial units) user: 0.09 sys : 0.00 read_debug_info .debug_types user: 1.29 sys : 0.00 compute_abbrevs user: 36.34 sys : 0.24 write_abbrev user: 0.09 sys : 0.00 write_info user: 11.87 sys : 0.72 write_loc user: 0.74 sys : 0.68 write_types user: 0.00 sys : 0.00 write_gdb_index user: 0.25 sys : 0.32 write_dso user: 0.19 sys : 2.00 maxmem: 15172216 real: 174.49 user: 135.33 system: 6.41 ... And with odr: ... $ time.sh dwz clang-10.debug -o 1 -lnone -Lnone --odr --odr-mode=3Dbasic --devel-progress=20=20=20 read_debug_info .debug_info user: 43.07 sys : 1.60 partition_dups user: 1.88 sys : 0.01 partition_dups split_dups user: 93.43 sys : 0.04 partition_dups qsort user: 4.10 sys : 0.00 partition_dups after qsort user: 3.71 sys : 0.19 create_import_tree phase 1 user: 0.89 sys : 0.01 create_import_tree phase 2 user: 48.29 sys : 0.00 create_import_tree phase 3 user: 222.30 sys : 0.00 create_import_tree phase 4 (create partial units) user: 0.32 sys : 0.03 read_debug_info .debug_types user: 1.35 sys : 0.00 compute_abbrevs user: 30.36 sys : 0.09 write_abbrev user: 0.08 sys : 0.00 write_info user: 5.70 sys : 0.11 write_loc user: 0.80 sys : 0.65 write_types user: 0.00 sys : 0.00 write_gdb_index user: 0.14 sys : 0.17 write_dso user: 0.00 sys : 1.27 maxmem: 15491032 real: 465.45 user: 456.42 system: 4.29 ... Focussing on the larger bits, without odr: ... read_debug_info .debug_info user: 58.74 create_import_tree phase 3 user: 20.15 compute_abbrevs user: 36.34 write_info user: 11.87 ... and with odr: ... read_debug_info .debug_info user: 43.07 sys : 1.60 partition_dups split_dups user: 93.43 sys : 0.04 create_import_tree phase 2 user: 48.29 sys : 0.00 create_import_tree phase 3 user: 222.30 sys : 0.00 compute_abbrevs user: 30.36 sys : 0.09 write_info user: 5.70 sys : 0.11 ... It's clear something is slow in odr: the split_dups bit takes 93s. But create_import_tree phase 3 increases from 20s to 222s. Looking at the perf data, we spend 51% of the time in dwz, and of that 51%,= we spent 43% here, in the subset/superset edge comparison loop: ... Percent=E2=94=82 e4 =3D ipu2->incoming; 0,01 =E2=94=82 mov %r14,%rax =E2=94=82 while ((maybe_subset || maybe_superset)= && e3 && e4) 0,00 =E2=94=821395: mov %esi,%edi 0,01 =E2=94=82 or %cl,%dil 0,01 =E2=94=82 =E2=86=93 je 13d8 0,02 =E2=94=82139c: test %rdx,%rdx 0,00 =E2=94=82 =E2=86=93 je 1730 0,07 =E2=94=82 test %rax,%rax 0,00 =E2=94=82 =E2=86=93 je 1730 =E2=94=82 if (e3->icu =3D=3D e4->icu) 0,53 =E2=94=82 mov (%rdx),%r8 35,28 =E2=94=82 mov (%rax),%rdi 0,62 =E2=94=82 cmp %rdi,%r8 0,01 =E2=94=82 =E2=86=93 je 1718 =E2=94=82 if (e3->icu->idx < e4->icu->idx) 7,37 =E2=94=82 mov 0x28(%rdi),%edi 0,55 =E2=94=82 cmp %edi,0x28(%r8) 0,04 =E2=94=82 =E2=86=93 jb 1708 =E2=94=82 maybe_superset =3D false; ... This is a draft of a short-cut that might prevent going into the hot loop: ... @@ -7683,6 +7684,26 @@ create_import_tree (void) maybe_subset =3D (e1 =3D=3D ipu->incoming && ipu->incoming_count <=3D ipu2->incoming_co= unt); maybe_superset =3D ipu->incoming_count >=3D ipu2->incoming_co= unt; + if (maybe_superset) + { + /* If the referrer nodes of ipu are a superset of the + referrer nodes of ipu2, then ipu's last referrer node + should have index larger or equal to the last referrer + node of ipu2. */ + maybe_superset + =3D (ipu->incoming[ipu->incoming_count - 1].icu->idx + >=3D ipu2->incoming[ipu2->incoming_count - 1].icu->i= dx); + } + if (maybe_subset) + { + /* If the referrer nodes of ipu are a subset of the + referrer nodes of ipu2, then ipu's last referrer node + should have index smaller or equal to the last referrer + node of ipu2. */ + maybe_subset + =3D (ipu->incoming[ipu->incoming_count - 1].icu->idx + <=3D ipu2->incoming[ipu2->incoming_count -1].icu->id= x); + } e3 =3D e1; e4 =3D ipu2->incoming; intersection =3D 0; ... but it doesn't work currently because we cannot access ipu->incoming[ipu->incoming_count - 1] at this point. Instead, we need to w= alk the next chain. Keeping track of the tail of the edge chain would fix this, and might yield speed improvement. Furthermore, we spent 28% here: ... 7,02 =E2=94=821481: movb $0x0,0x2c(%rax) =E2=94=82 for (icu =3D ipu->next; icu; icu =3D icu->next) 28,04 =E2=94=82 mov 0x18(%rax),%rax 0,00 =E2=94=82 test %rax,%rax 0,87 =E2=94=82 =E2=86=91 jne 1481 ... which is this loop here: ... for (icu =3D ipu->next; icu; icu =3D icu->next) icu->seen =3D false; ... It might be possible to improve here by keeping track of this data in anoth= er way, say an array of bool or a bitvector, which could be memset to zero. It would also be interesting to know how sparse this array is, which could also be exploited. --=20 You are receiving this mail because: You are on the CC list for the bug.