From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <dwz-return-545-listarch-dwz=sourceware.org@sourceware.org>
Received: (qmail 92058 invoked by alias); 13 Dec 2019 15:55:37 -0000
Mailing-List: contact dwz-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <dwz.sourceware.org>
List-Post: <mailto:dwz@sourceware.org>
List-Help: <mailto:dwz-help@sourceware.org>
List-Subscribe: <mailto:dwz-subscribe@sourceware.org>
Sender: dwz-owner@sourceware.org
Received: (qmail 91993 invoked by uid 48); 13 Dec 2019 15:55:31 -0000
From: "vries at gcc dot gnu.org" <sourceware-bugzilla@sourceware.org>
To: dwz@sourceware.org
Subject: [Bug default/25276] New: create_import_tree performance improvements
Date: Tue, 01 Jan 2019 00:00:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: dwz
X-Bugzilla-Component: default
X-Bugzilla-Version: unspecified
X-Bugzilla-Keywords: 
X-Bugzilla-Severity: enhancement
X-Bugzilla-Who: vries at gcc dot gnu.org
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: nobody at sourceware dot org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status
 bug_severity priority component assigned_to reporter cc target_milestone
Message-ID: <bug-25276-11298@http.sourceware.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://sourceware.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2019-q4/txt/msg00160.txt.bz2

https://sourceware.org/bugzilla/show_bug.cgi?id=3D25276

            Bug ID: 25276
           Summary: create_import_tree performance improvements
           Product: dwz
           Version: unspecified
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: default
          Assignee: nobody at sourceware dot org
          Reporter: vries at gcc dot gnu.org
                CC: dwz at sourceware dot org
  Target Milestone: ---

The test-case clang-10 as attached in PR25024, contains ~111 million DIEs.

This is larger than the default max die-limit of 50 million, but we can
overcome this using -Lnone.

Using -Lnone, as well as -lnone to avoid low-mem mode, we get an execution =
time
of:
...
$ time.sh dwz clang-10.debug -o 1 -lnone -Lnone
maxmem: 14141844
real: 148.32
user: 131.79
system: 7.07
...

When compiling with -DDEVEL additionally, we get:
...
$ time.sh dwz clang-10.debug -o 1 -lnone -Lnone
maxmem: 14069588
real: 148.35
user: 132.95
system: 7.31
...

When compiling with the odr patches applied, we get slightly more memory us=
ed:
...
$ time.sh dwz clang-10.debug -o 1 -lnone -Lnone --no-odr
maxmem: 14605608
real: 528.92
user: 142.62
system: 16.78
...
but 'real time' grows with a factor 3.5, presumably because of running into
swap space usage.  The user time only grows with 7% though.

Finally, enabling odr in it's basic mode sees an increase of user time by a
factor 3.3, in total 336 seconds extra:
...
$ time.sh dwz clang-10.debug -o 1 -lnone -Lnone --odr --odr-mode=3Dbasic
maxmem: 14513848
real: 533.34
user: 478.39
system: 12.06
...

Using a patch adding --devel-progress, we can see where we spent the time.

Without odr:
...
$ time.sh dwz clang-10.debug -o 1 -lnone -Lnone --no-odr --devel-progress
read_debug_info .debug_info
user: 58.74
sys : 1.84
partition_dups
user: 1.17
sys : 0.01
partition_dups qsort
user: 1.35
sys : 0.00
partition_dups after qsort
user: 1.37
sys : 0.35
create_import_tree phase 1
user: 0.36
sys : 0.02
create_import_tree phase 2
user: 1.33
sys : 0.00
create_import_tree phase 3
user: 20.15
sys : 0.00
create_import_tree phase 4 (create partial units)
user: 0.09
sys : 0.00
read_debug_info .debug_types
user: 1.29
sys : 0.00
compute_abbrevs
user: 36.34
sys : 0.24
write_abbrev
user: 0.09
sys : 0.00
write_info
user: 11.87
sys : 0.72
write_loc
user: 0.74
sys : 0.68
write_types
user: 0.00
sys : 0.00
write_gdb_index
user: 0.25
sys : 0.32
write_dso
user: 0.19
sys : 2.00
maxmem: 15172216
real: 174.49
user: 135.33
system: 6.41
...

And with odr:
...
$ time.sh dwz clang-10.debug -o 1 -lnone -Lnone --odr --odr-mode=3Dbasic
--devel-progress=20=20=20
read_debug_info .debug_info
user: 43.07
sys : 1.60
partition_dups
user: 1.88
sys : 0.01
partition_dups split_dups
user: 93.43
sys : 0.04
partition_dups qsort
user: 4.10
sys : 0.00
partition_dups after qsort
user: 3.71
sys : 0.19
create_import_tree phase 1
user: 0.89
sys : 0.01
create_import_tree phase 2
user: 48.29
sys : 0.00
create_import_tree phase 3
user: 222.30
sys : 0.00
create_import_tree phase 4 (create partial units)
user: 0.32
sys : 0.03
read_debug_info .debug_types
user: 1.35
sys : 0.00
compute_abbrevs
user: 30.36
sys : 0.09
write_abbrev
user: 0.08
sys : 0.00
write_info
user: 5.70
sys : 0.11
write_loc
user: 0.80
sys : 0.65
write_types
user: 0.00
sys : 0.00
write_gdb_index
user: 0.14
sys : 0.17
write_dso
user: 0.00
sys : 1.27
maxmem: 15491032
real: 465.45
user: 456.42
system: 4.29
...

Focussing on the larger bits, without odr:
...
read_debug_info .debug_info
user: 58.74
create_import_tree phase 3
user: 20.15
compute_abbrevs
user: 36.34
write_info
user: 11.87
...
and with odr:
...
read_debug_info .debug_info
user: 43.07
sys : 1.60
partition_dups split_dups
user: 93.43
sys : 0.04
create_import_tree phase 2
user: 48.29
sys : 0.00
create_import_tree phase 3
user: 222.30
sys : 0.00
compute_abbrevs
user: 30.36
sys : 0.09
write_info
user: 5.70
sys : 0.11
...

It's clear something is slow in odr: the split_dups bit takes 93s.

But create_import_tree phase 3 increases from 20s to 222s.

Looking at the perf data, we spend 51% of the time in dwz, and of that 51%,=
 we
spent 43% here, in the subset/superset edge comparison loop:
...
Percent=E2=94=82                    e4 =3D ipu2->incoming;
  0,01 =E2=94=82        mov    %r14,%rax
       =E2=94=82                    while ((maybe_subset || maybe_superset)=
 && e3 &&
e4)
  0,00 =E2=94=821395:   mov    %esi,%edi
  0,01 =E2=94=82        or     %cl,%dil
  0,01 =E2=94=82      =E2=86=93 je     13d8
  0,02 =E2=94=82139c:   test   %rdx,%rdx
  0,00 =E2=94=82      =E2=86=93 je     1730
  0,07 =E2=94=82        test   %rax,%rax
  0,00 =E2=94=82      =E2=86=93 je     1730
       =E2=94=82                        if (e3->icu =3D=3D e4->icu)
  0,53 =E2=94=82        mov    (%rdx),%r8
 35,28 =E2=94=82        mov    (%rax),%rdi
  0,62 =E2=94=82        cmp    %rdi,%r8
  0,01 =E2=94=82      =E2=86=93 je     1718
       =E2=94=82                        if (e3->icu->idx < e4->icu->idx)
  7,37 =E2=94=82        mov    0x28(%rdi),%edi
  0,55 =E2=94=82        cmp    %edi,0x28(%r8)
  0,04 =E2=94=82      =E2=86=93 jb     1708
       =E2=94=82                        maybe_superset =3D false;
...

This is a draft of a short-cut that might prevent going into the hot loop:
...
@@ -7683,6 +7684,26 @@ create_import_tree (void)
              maybe_subset =3D (e1 =3D=3D ipu->incoming
                              && ipu->incoming_count <=3D ipu2->incoming_co=
unt);
              maybe_superset =3D ipu->incoming_count >=3D ipu2->incoming_co=
unt;
+             if (maybe_superset)
+               {
+                 /* If the referrer nodes of ipu are a superset of the
+                    referrer nodes of ipu2, then ipu's last referrer node
+                    should have index larger or equal to the last referrer
+                    node of ipu2.  */
+                 maybe_superset
+                   =3D (ipu->incoming[ipu->incoming_count - 1].icu->idx
+                      >=3D ipu2->incoming[ipu2->incoming_count - 1].icu->i=
dx);
+               }
+             if (maybe_subset)
+               {
+                 /* If the referrer nodes of ipu are a subset of the
+                    referrer nodes of ipu2, then ipu's last referrer node
+                    should have index smaller or equal to the last referrer
+                    node of ipu2.  */
+                 maybe_subset
+                   =3D (ipu->incoming[ipu->incoming_count - 1].icu->idx
+                      <=3D ipu2->incoming[ipu2->incoming_count -1].icu->id=
x);
+               }
              e3 =3D e1;
              e4 =3D ipu2->incoming;
              intersection =3D 0;
...
but it doesn't work currently because we cannot access
ipu->incoming[ipu->incoming_count - 1] at this point. Instead, we need to w=
alk
the next chain.  Keeping track of the tail of the edge chain would fix this,
and might yield speed improvement.

Furthermore, we spent 28% here:
...
  7,02 =E2=94=821481:   movb   $0x0,0x2c(%rax)
       =E2=94=82            for (icu =3D ipu->next; icu; icu =3D icu->next)
 28,04 =E2=94=82        mov    0x18(%rax),%rax
  0,00 =E2=94=82        test   %rax,%rax
  0,87 =E2=94=82      =E2=86=91 jne    1481
...
which is this loop here:
...
      for (icu =3D ipu->next; icu; icu =3D icu->next)
        icu->seen =3D false;
...

It might be possible to improve here by keeping track of this data in anoth=
er
way, say an array of bool or a bitvector, which could be memset to zero. It
would also be interesting to know how sparse this array is, which could also
be exploited.

--=20
You are receiving this mail because:
You are on the CC list for the bug.