public inbox for dwz@sourceware.org
 help / color / mirror / Atom feed
* [Bug default/25276] create_import_tree performance improvements
  2019-01-01  0:00 [Bug default/25276] New: create_import_tree performance improvements vries at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2019-01-01  0:00 ` vries at gcc dot gnu.org
@ 2019-01-01  0:00 ` vries at gcc dot gnu.org
  2019-01-01  0:00 ` vries at gcc dot gnu.org
  2019-01-01  0:00 ` vries at gcc dot gnu.org
  7 siblings, 0 replies; 9+ messages in thread
From: vries at gcc dot gnu.org @ 2019-01-01  0:00 UTC (permalink / raw)
  To: dwz

https://sourceware.org/bugzilla/show_bug.cgi?id=25276

--- Comment #1 from Tom de Vries <vries at gcc dot gnu.org> ---
(In reply to Tom de Vries from comment #0)
> This is a draft of a short-cut that might prevent going into the hot loop:
> ...
> @@ -7683,6 +7684,26 @@ create_import_tree (void)
>               maybe_subset = (e1 == ipu->incoming
>                               && ipu->incoming_count <=
> ipu2->incoming_count);
>               maybe_superset = ipu->incoming_count >= ipu2->incoming_count;
> +             if (maybe_superset)
> +               {
> +                 /* If the referrer nodes of ipu are a superset of the
> +                    referrer nodes of ipu2, then ipu's last referrer node
> +                    should have index larger or equal to the last referrer
> +                    node of ipu2.  */
> +                 maybe_superset
> +                   = (ipu->incoming[ipu->incoming_count - 1].icu->idx
> +                      >= ipu2->incoming[ipu2->incoming_count - 1].icu->idx);
> +               }
> +             if (maybe_subset)
> +               {
> +                 /* If the referrer nodes of ipu are a subset of the
> +                    referrer nodes of ipu2, then ipu's last referrer node
> +                    should have index smaller or equal to the last referrer
> +                    node of ipu2.  */
> +                 maybe_subset
> +                   = (ipu->incoming[ipu->incoming_count - 1].icu->idx
> +                      <= ipu2->incoming[ipu2->incoming_count -1].icu->idx);
> +               }
>               e3 = e1;
>               e4 = ipu2->incoming;
>               intersection = 0;
> ...
> but it doesn't work currently because we cannot access
> ipu->incoming[ipu->incoming_count - 1] at this point. Instead, we need to
> walk
> the next chain.  Keeping track of the tail of the edge chain would fix this,
> and might yield speed improvement.
> 

Committed here:
https://sourceware.org/git/?p=dwz.git;a=commit;h=83addcc7c2eadeffb4fe27e433695d316101f6d1

-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug default/25276] create_import_tree performance improvements
  2019-01-01  0:00 [Bug default/25276] New: create_import_tree performance improvements vries at gcc dot gnu.org
  2019-01-01  0:00 ` [Bug default/25276] " vries at gcc dot gnu.org
@ 2019-01-01  0:00 ` vries at gcc dot gnu.org
  2019-01-01  0:00 ` vries at gcc dot gnu.org
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: vries at gcc dot gnu.org @ 2019-01-01  0:00 UTC (permalink / raw)
  To: dwz

https://sourceware.org/bugzilla/show_bug.cgi?id=25276

--- Comment #4 from Tom de Vries <vries at gcc dot gnu.org> ---
(In reply to Tom de Vries from comment #0)
> which is this loop here:
> ...
>       for (icu = ipu->next; icu; icu = icu->next)
>         icu->seen = false;
> ...
> 
> It might be possible to improve here by keeping track of this data in another
> way, say an array of bool or a bitvector, which could be memset to zero. It
> would also be interesting to know how sparse this array is, which could also
> be exploited.

I've measured on the clang-10 example with --odr --odr-mode=basic, and got an
average occupancy of 1.93%.

-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug default/25276] create_import_tree performance improvements
  2019-01-01  0:00 [Bug default/25276] New: create_import_tree performance improvements vries at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2019-01-01  0:00 ` vries at gcc dot gnu.org
@ 2019-01-01  0:00 ` vries at gcc dot gnu.org
  2019-01-01  0:00 ` vries at gcc dot gnu.org
  7 siblings, 0 replies; 9+ messages in thread
From: vries at gcc dot gnu.org @ 2019-01-01  0:00 UTC (permalink / raw)
  To: dwz

https://sourceware.org/bugzilla/show_bug.cgi?id=25276

--- Comment #5 from Tom de Vries <vries at gcc dot gnu.org> ---
(In reply to Tom de Vries from comment #0)
> which is this loop here:
> ...
>       for (icu = ipu->next; icu; icu = icu->next)
>         icu->seen = false;
> ...
> 
> It might be possible to improve here by keeping track of this data in another
> way, say an array of bool or a bitvector, which could be memset to zero.

The array of bool + memset approach reduces create_import_tree phase 3 time
from 104.19 to 23.00.

-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug default/25276] create_import_tree performance improvements
  2019-01-01  0:00 [Bug default/25276] New: create_import_tree performance improvements vries at gcc dot gnu.org
@ 2019-01-01  0:00 ` vries at gcc dot gnu.org
  2019-01-01  0:00 ` vries at gcc dot gnu.org
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: vries at gcc dot gnu.org @ 2019-01-01  0:00 UTC (permalink / raw)
  To: dwz

https://sourceware.org/bugzilla/show_bug.cgi?id=25276

--- Comment #3 from Tom de Vries <vries at gcc dot gnu.org> ---
(In reply to Tom de Vries from comment #0)
> Looking at the perf data, we spend 51% of the time in dwz, 

> Furthermore, we spent 28% here:
> ...
>   7,02 │1481:   movb   $0x0,0x2c(%rax)
>        │            for (icu = ipu->next; icu; icu = icu->next)
>  28,04 │        mov    0x18(%rax),%rax
>   0,00 │        test   %rax,%rax
>   0,87 │      ↑ jne    1481
> ...
> which is this loop here:
> ...
>       for (icu = ipu->next; icu; icu = icu->next)
>         icu->seen = false;
> ...
> 

With current master, we now spent 41% of execution time in dwz, and of that 41%
we spent 59% in this loop:
...
  1,39 │2895:┌─→test   %rax,%rax
  0,01 │     │↑ je     1fcd
       │     │        icu->seen = false;
 11,10 │     │  movb   $0x0,0x34(%rax)
       │     │      for (icu = ipu->next; icu; icu = icu->next)
 47,87 │     │  mov    0x20(%rax),%rax
  0,36 │     └──jmp    2895
...

-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug default/25276] create_import_tree performance improvements
  2019-01-01  0:00 [Bug default/25276] New: create_import_tree performance improvements vries at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2019-01-01  0:00 ` vries at gcc dot gnu.org
@ 2019-01-01  0:00 ` vries at gcc dot gnu.org
  7 siblings, 0 replies; 9+ messages in thread
From: vries at gcc dot gnu.org @ 2019-01-01  0:00 UTC (permalink / raw)
  To: dwz

https://sourceware.org/bugzilla/show_bug.cgi?id=25276

Tom de Vries <vries at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #8 from Tom de Vries <vries at gcc dot gnu.org> ---
(In reply to Tom de Vries from comment #0)
> which is this loop here:
> ...
>       for (icu = ipu->next; icu; icu = icu->next)
>         icu->seen = false;
> ...
> 
> It might be possible to improve here by keeping track of this data in another
> way, say an array of bool or a bitvector, which could be memset to zero. 

Committed as
https://sourceware.org/git/?p=dwz.git;a=commit;h=94e78911f39d3c256331d396def9dcedc80b9f65
.

-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug default/25276] New: create_import_tree performance improvements
@ 2019-01-01  0:00 vries at gcc dot gnu.org
  2019-01-01  0:00 ` [Bug default/25276] " vries at gcc dot gnu.org
                   ` (7 more replies)
  0 siblings, 8 replies; 9+ messages in thread
From: vries at gcc dot gnu.org @ 2019-01-01  0:00 UTC (permalink / raw)
  To: dwz

https://sourceware.org/bugzilla/show_bug.cgi?id=25276

            Bug ID: 25276
           Summary: create_import_tree performance improvements
           Product: dwz
           Version: unspecified
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: default
          Assignee: nobody at sourceware dot org
          Reporter: vries at gcc dot gnu.org
                CC: dwz at sourceware dot org
  Target Milestone: ---

The test-case clang-10 as attached in PR25024, contains ~111 million DIEs.

This is larger than the default max die-limit of 50 million, but we can
overcome this using -Lnone.

Using -Lnone, as well as -lnone to avoid low-mem mode, we get an execution time
of:
...
$ time.sh dwz clang-10.debug -o 1 -lnone -Lnone
maxmem: 14141844
real: 148.32
user: 131.79
system: 7.07
...

When compiling with -DDEVEL additionally, we get:
...
$ time.sh dwz clang-10.debug -o 1 -lnone -Lnone
maxmem: 14069588
real: 148.35
user: 132.95
system: 7.31
...

When compiling with the odr patches applied, we get slightly more memory used:
...
$ time.sh dwz clang-10.debug -o 1 -lnone -Lnone --no-odr
maxmem: 14605608
real: 528.92
user: 142.62
system: 16.78
...
but 'real time' grows with a factor 3.5, presumably because of running into
swap space usage.  The user time only grows with 7% though.

Finally, enabling odr in it's basic mode sees an increase of user time by a
factor 3.3, in total 336 seconds extra:
...
$ time.sh dwz clang-10.debug -o 1 -lnone -Lnone --odr --odr-mode=basic
maxmem: 14513848
real: 533.34
user: 478.39
system: 12.06
...

Using a patch adding --devel-progress, we can see where we spent the time.

Without odr:
...
$ time.sh dwz clang-10.debug -o 1 -lnone -Lnone --no-odr --devel-progress
read_debug_info .debug_info
user: 58.74
sys : 1.84
partition_dups
user: 1.17
sys : 0.01
partition_dups qsort
user: 1.35
sys : 0.00
partition_dups after qsort
user: 1.37
sys : 0.35
create_import_tree phase 1
user: 0.36
sys : 0.02
create_import_tree phase 2
user: 1.33
sys : 0.00
create_import_tree phase 3
user: 20.15
sys : 0.00
create_import_tree phase 4 (create partial units)
user: 0.09
sys : 0.00
read_debug_info .debug_types
user: 1.29
sys : 0.00
compute_abbrevs
user: 36.34
sys : 0.24
write_abbrev
user: 0.09
sys : 0.00
write_info
user: 11.87
sys : 0.72
write_loc
user: 0.74
sys : 0.68
write_types
user: 0.00
sys : 0.00
write_gdb_index
user: 0.25
sys : 0.32
write_dso
user: 0.19
sys : 2.00
maxmem: 15172216
real: 174.49
user: 135.33
system: 6.41
...

And with odr:
...
$ time.sh dwz clang-10.debug -o 1 -lnone -Lnone --odr --odr-mode=basic
--devel-progress   
read_debug_info .debug_info
user: 43.07
sys : 1.60
partition_dups
user: 1.88
sys : 0.01
partition_dups split_dups
user: 93.43
sys : 0.04
partition_dups qsort
user: 4.10
sys : 0.00
partition_dups after qsort
user: 3.71
sys : 0.19
create_import_tree phase 1
user: 0.89
sys : 0.01
create_import_tree phase 2
user: 48.29
sys : 0.00
create_import_tree phase 3
user: 222.30
sys : 0.00
create_import_tree phase 4 (create partial units)
user: 0.32
sys : 0.03
read_debug_info .debug_types
user: 1.35
sys : 0.00
compute_abbrevs
user: 30.36
sys : 0.09
write_abbrev
user: 0.08
sys : 0.00
write_info
user: 5.70
sys : 0.11
write_loc
user: 0.80
sys : 0.65
write_types
user: 0.00
sys : 0.00
write_gdb_index
user: 0.14
sys : 0.17
write_dso
user: 0.00
sys : 1.27
maxmem: 15491032
real: 465.45
user: 456.42
system: 4.29
...

Focussing on the larger bits, without odr:
...
read_debug_info .debug_info
user: 58.74
create_import_tree phase 3
user: 20.15
compute_abbrevs
user: 36.34
write_info
user: 11.87
...
and with odr:
...
read_debug_info .debug_info
user: 43.07
sys : 1.60
partition_dups split_dups
user: 93.43
sys : 0.04
create_import_tree phase 2
user: 48.29
sys : 0.00
create_import_tree phase 3
user: 222.30
sys : 0.00
compute_abbrevs
user: 30.36
sys : 0.09
write_info
user: 5.70
sys : 0.11
...

It's clear something is slow in odr: the split_dups bit takes 93s.

But create_import_tree phase 3 increases from 20s to 222s.

Looking at the perf data, we spend 51% of the time in dwz, and of that 51%, we
spent 43% here, in the subset/superset edge comparison loop:
...
Percent│                    e4 = ipu2->incoming;
  0,01 │        mov    %r14,%rax
       │                    while ((maybe_subset || maybe_superset) && e3 &&
e4)
  0,00 │1395:   mov    %esi,%edi
  0,01 │        or     %cl,%dil
  0,01 │      ↓ je     13d8
  0,02 │139c:   test   %rdx,%rdx
  0,00 │      ↓ je     1730
  0,07 │        test   %rax,%rax
  0,00 │      ↓ je     1730
       │                        if (e3->icu == e4->icu)
  0,53 │        mov    (%rdx),%r8
 35,28 │        mov    (%rax),%rdi
  0,62 │        cmp    %rdi,%r8
  0,01 │      ↓ je     1718
       │                        if (e3->icu->idx < e4->icu->idx)
  7,37 │        mov    0x28(%rdi),%edi
  0,55 │        cmp    %edi,0x28(%r8)
  0,04 │      ↓ jb     1708
       │                        maybe_superset = false;
...

This is a draft of a short-cut that might prevent going into the hot loop:
...
@@ -7683,6 +7684,26 @@ create_import_tree (void)
              maybe_subset = (e1 == ipu->incoming
                              && ipu->incoming_count <= ipu2->incoming_count);
              maybe_superset = ipu->incoming_count >= ipu2->incoming_count;
+             if (maybe_superset)
+               {
+                 /* If the referrer nodes of ipu are a superset of the
+                    referrer nodes of ipu2, then ipu's last referrer node
+                    should have index larger or equal to the last referrer
+                    node of ipu2.  */
+                 maybe_superset
+                   = (ipu->incoming[ipu->incoming_count - 1].icu->idx
+                      >= ipu2->incoming[ipu2->incoming_count - 1].icu->idx);
+               }
+             if (maybe_subset)
+               {
+                 /* If the referrer nodes of ipu are a subset of the
+                    referrer nodes of ipu2, then ipu's last referrer node
+                    should have index smaller or equal to the last referrer
+                    node of ipu2.  */
+                 maybe_subset
+                   = (ipu->incoming[ipu->incoming_count - 1].icu->idx
+                      <= ipu2->incoming[ipu2->incoming_count -1].icu->idx);
+               }
              e3 = e1;
              e4 = ipu2->incoming;
              intersection = 0;
...
but it doesn't work currently because we cannot access
ipu->incoming[ipu->incoming_count - 1] at this point. Instead, we need to walk
the next chain.  Keeping track of the tail of the edge chain would fix this,
and might yield speed improvement.

Furthermore, we spent 28% here:
...
  7,02 │1481:   movb   $0x0,0x2c(%rax)
       │            for (icu = ipu->next; icu; icu = icu->next)
 28,04 │        mov    0x18(%rax),%rax
  0,00 │        test   %rax,%rax
  0,87 │      ↑ jne    1481
...
which is this loop here:
...
      for (icu = ipu->next; icu; icu = icu->next)
        icu->seen = false;
...

It might be possible to improve here by keeping track of this data in another
way, say an array of bool or a bitvector, which could be memset to zero. It
would also be interesting to know how sparse this array is, which could also
be exploited.

-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug default/25276] create_import_tree performance improvements
  2019-01-01  0:00 [Bug default/25276] New: create_import_tree performance improvements vries at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2019-01-01  0:00 ` vries at gcc dot gnu.org
@ 2019-01-01  0:00 ` vries at gcc dot gnu.org
  2019-01-01  0:00 ` vries at gcc dot gnu.org
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: vries at gcc dot gnu.org @ 2019-01-01  0:00 UTC (permalink / raw)
  To: dwz

https://sourceware.org/bugzilla/show_bug.cgi?id=25276

--- Comment #2 from Tom de Vries <vries at gcc dot gnu.org> ---
(In reply to Tom de Vries from comment #0)
> But create_import_tree phase 3 increases from 20s to 222s.
> 

With current master, we get with --no-odr:
...
create_import_tree phase 3
user: 5.42
sys : 0.00
...
and with --odr --odr-mode=basic:
...
create_import_tree phase 3
user: 104.19
sys : 0.00
...

-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug default/25276] create_import_tree performance improvements
  2019-01-01  0:00 [Bug default/25276] New: create_import_tree performance improvements vries at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2019-01-01  0:00 ` vries at gcc dot gnu.org
@ 2019-01-01  0:00 ` vries at gcc dot gnu.org
  2019-01-01  0:00 ` vries at gcc dot gnu.org
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: vries at gcc dot gnu.org @ 2019-01-01  0:00 UTC (permalink / raw)
  To: dwz

https://sourceware.org/bugzilla/show_bug.cgi?id=25276

--- Comment #7 from Tom de Vries <vries at gcc dot gnu.org> ---
(In reply to Tom de Vries from comment #6)
> (In reply to Tom de Vries from comment #5)
> > (In reply to Tom de Vries from comment #0)
> > > which is this loop here:
> > > ...
> > >       for (icu = ipu->next; icu; icu = icu->next)
> > >         icu->seen = false;
> > > ...
> > > 
> > > It might be possible to improve here by keeping track of this data in another
> > > way, say an array of bool or a bitvector, which could be memset to zero.
> > 
> > The array of bool + memset approach reduces create_import_tree phase 3 time
> > from 104.19 to 23.00.
> 
> Keeping track of min_seen and max_seen, and only memsetting the affected
> region gets us 21.77.

And the replacing the bool array with a bitvector with base type unsigned int,
gets us 11.66.

-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug default/25276] create_import_tree performance improvements
  2019-01-01  0:00 [Bug default/25276] New: create_import_tree performance improvements vries at gcc dot gnu.org
  2019-01-01  0:00 ` [Bug default/25276] " vries at gcc dot gnu.org
  2019-01-01  0:00 ` vries at gcc dot gnu.org
@ 2019-01-01  0:00 ` vries at gcc dot gnu.org
  2019-01-01  0:00 ` vries at gcc dot gnu.org
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: vries at gcc dot gnu.org @ 2019-01-01  0:00 UTC (permalink / raw)
  To: dwz

https://sourceware.org/bugzilla/show_bug.cgi?id=25276

--- Comment #6 from Tom de Vries <vries at gcc dot gnu.org> ---
(In reply to Tom de Vries from comment #5)
> (In reply to Tom de Vries from comment #0)
> > which is this loop here:
> > ...
> >       for (icu = ipu->next; icu; icu = icu->next)
> >         icu->seen = false;
> > ...
> > 
> > It might be possible to improve here by keeping track of this data in another
> > way, say an array of bool or a bitvector, which could be memset to zero.
> 
> The array of bool + memset approach reduces create_import_tree phase 3 time
> from 104.19 to 23.00.

Keeping track of min_seen and max_seen, and only memsetting the affected region
gets us 21.77.

-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2019-12-17 23:30 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-01  0:00 [Bug default/25276] New: create_import_tree performance improvements vries at gcc dot gnu.org
2019-01-01  0:00 ` [Bug default/25276] " vries at gcc dot gnu.org
2019-01-01  0:00 ` vries at gcc dot gnu.org
2019-01-01  0:00 ` vries at gcc dot gnu.org
2019-01-01  0:00 ` vries at gcc dot gnu.org
2019-01-01  0:00 ` vries at gcc dot gnu.org
2019-01-01  0:00 ` vries at gcc dot gnu.org
2019-01-01  0:00 ` vries at gcc dot gnu.org
2019-01-01  0:00 ` vries at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).