public inbox for elfutils@sourceware.org
 help / color / mirror / Atom feed
* Can dwarf_getscopes{,_die} performance be improved?
@ 2020-06-13 17:40 Milian Wolff
  2020-06-15 16:54 ` Josh Stone
  0 siblings, 1 reply; 5+ messages in thread
From: Milian Wolff @ 2020-06-13 17:40 UTC (permalink / raw)
  To: elfutils-devel

[-- Attachment #1: Type: text/plain, Size: 2529 bytes --]

Hey all!

In perfparser we are running into a performance issue with elfutils when we 
try to resolve inline frames. We are following the procedure outlined by eu-
addr2line, i.e. for every IP we basically do:

```
Dwarf_Addr bias = 0;
Dwarf_Die *cudie = dwfl_module_addrdie(module, ip, &bias);

Dwarf_Die *subroutine = nullptr;
Dwarf_Die *scopes = nullptr;
int nscopes = dwarf_getscopes(cudie, ip - bias, &scopes);
for (int i = 0; i < nscopes; ++i) {
    Dwarf_Die *scope = &scopes[i];
    const int tag = dwarf_tag(scope);
    if (tag == DW_TAG_subprogram || tag == DW_TAG_inlined_subroutine) {
        subroutine = scope;
        break;
    }
}

Dwarf_Die *scopes_die = nullptr;
int nscopes_die = dwarf_getscopes_die(subroutine, &scopes_die);
for (int i = 0; i < nscopes_die; ++i) {
    Dwarf_Die *scope = &scopes_die[i];
    const int tag = dwarf_tag(scope);
    if (tag == DW_TAG_subprogram || tag == DW_TAG_inlined_subroutine) {
        // do stuff
    }
}

free(scopes_die);
free(scopes);
```

Profiling shows that both, the calls to dwarf_getscopes and 
dwarf_getscopes_die can take a really long time. I have seen cases (in 
libclang.so.11) where a single call can take up to ~50ms on a fast desktop 
machine (Ryzen 3900X CPU, fast SSD, 32GB of RAM).

Now while 50ms may not sound sound too problematic, we have to repeat these 
calls for every IP we encounter in a perf.data file. We already apply heavy 
caching, but even then we can easily encounter tens of thousands of individual 
addresses, which can add up to minutes or even hours of time required to 
process the data - only to get information on inlined frames.

I have learned that the DWARF format is mostly meant for efficient storage and 
that one cannot assume that it is efficient for such mass post processing. 
This is the reason why I'm writing this email:

Has anyone an idea on how to to post-process the DWARF data to optimize the 
lookup of inlined frames?

Or is there some room for optimizations / caching within elfutils to amortize 
the repeated DWARF hierarchy walking that happens when one calls 
dwarf_getscopes{,_die}? From what I'm understanding, both calls will start a 
top-down search within the CU DIE via __libdw_visit_scopes. Once a match is 
found, the DIE scope chain is reported. I guess many times (parts of the) DIE 
scope chain will be shared across different IPs, but I don't see any way to 
leverage this to speed up the processing task.

Thanks, any input would be welcome
-- 
Milian Wolff
mail@milianw.de
http://milianw.de

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Can dwarf_getscopes{,_die} performance be improved?
  2020-06-13 17:40 Can dwarf_getscopes{,_die} performance be improved? Milian Wolff
@ 2020-06-15 16:54 ` Josh Stone
  2020-06-19 23:45   ` Mark Wielaard
  2020-06-22  8:29   ` Milian Wolff
  0 siblings, 2 replies; 5+ messages in thread
From: Josh Stone @ 2020-06-15 16:54 UTC (permalink / raw)
  To: Milian Wolff, elfutils-devel

On 6/13/20 10:40 AM, Milian Wolff wrote:
> Has anyone an idea on how to to post-process the DWARF data to optimize the 
> lookup of inlined frames?

SystemTap implements its own cache for repeated lookups -- see
dwflpp::get_die_parents().


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Can dwarf_getscopes{,_die} performance be improved?
  2020-06-15 16:54 ` Josh Stone
@ 2020-06-19 23:45   ` Mark Wielaard
  2020-06-22  8:29   ` Milian Wolff
  1 sibling, 0 replies; 5+ messages in thread
From: Mark Wielaard @ 2020-06-19 23:45 UTC (permalink / raw)
  To: Josh Stone, Milian Wolff, elfutils-devel

On Mon, 2020-06-15 at 09:54 -0700, Josh Stone via Elfutils-devel wrote:
> On 6/13/20 10:40 AM, Milian Wolff wrote:
> > Has anyone an idea on how to to post-process the DWARF data to
> > optimize the 
> > lookup of inlined frames?
> 
> SystemTap implements its own cache for repeated lookups -- see
> dwflpp::get_die_parents().

Right, libabigail does something similar.

The issue is indeed that we don't have a parent DIE chain/cache. So
each call of dwarf_getscopes[_die] first goes down the CU die tree to
build up the parent chain on each call...

It would probably make sense to build a parent DIE chain cache for a
CU. The question is when/how we build the cache. There are also lots of
programs that won't need it, or only for some, but not all CUs. It will
take space and time to create.

The dwarf_getscopes[_die] calls might be a good trigger to create/keep
them. And/or have some explicit way to create them, maybe triggered by
some helper function to get at the parent of a DIE.

Cheers,

Mark

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Can dwarf_getscopes{,_die} performance be improved?
  2020-06-15 16:54 ` Josh Stone
  2020-06-19 23:45   ` Mark Wielaard
@ 2020-06-22  8:29   ` Milian Wolff
  2020-06-25 22:38     ` Mark Wielaard
  1 sibling, 1 reply; 5+ messages in thread
From: Milian Wolff @ 2020-06-22  8:29 UTC (permalink / raw)
  To: elfutils-devel, Josh Stone

[-- Attachment #1: Type: text/plain, Size: 2006 bytes --]

On Montag, 15. Juni 2020 18:54:41 CEST Josh Stone wrote:
> On 6/13/20 10:40 AM, Milian Wolff wrote:
> > Has anyone an idea on how to to post-process the DWARF data to optimize
> > the
> > lookup of inlined frames?
> 
> SystemTap implements its own cache for repeated lookups -- see
> dwflpp::get_die_parents().

Thanks, I've come up with something similar over the weekend before reading 
your mail. The performance boost is huge (5x and more).

Looking at your code, I think that I'm not yet handling a few corner cases 
(such as imported units). That, paired with the fact that at least three users 
of this API have apparently by now come up with a similar solution clearly 
makes a case for upstreaming this into a common API.

As Mark wrote:

> It would probably make sense to build a parent DIE chain cache for a
> CU. The question is when/how we build the cache. There are also lots of
> programs that won't need it, or only for some, but not all CUs. It will
> take space and time to create.
> 
> The dwarf_getscopes[_die] calls might be a good trigger to create/keep
> them. And/or have some explicit way to create them, maybe triggered by
> some helper function to get at the parent of a DIE.

I believe that there is a lot of data that potentially needs to be cached. 
Additionally, doing it behind the scenes may raise questions regarding multi 
threaded usage of the API (see my other mail relating to that).

Which means: an explicit API to opt-in to this behavior is safest and best I 
believe. Maybe something as simple as the following would be sufficient?

```
/* Cache parent DIEs chain to speed up repeated dwarf_getscopes calls.

   Returns -1 for errors or 0 if the parent chain was cached already. */
extern int dwarf_cache_parent_dies(Dwarf_Die *cudie);
```

Alternatively a function that returns the cache could be considered, which 
would then require new versions of dwarf_getscopes* that take the cache as an 
argument.

Cheers
-- 
Milian Wolff
mail@milianw.de
http://milianw.de

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Can dwarf_getscopes{,_die} performance be improved?
  2020-06-22  8:29   ` Milian Wolff
@ 2020-06-25 22:38     ` Mark Wielaard
  0 siblings, 0 replies; 5+ messages in thread
From: Mark Wielaard @ 2020-06-25 22:38 UTC (permalink / raw)
  To: Milian Wolff, elfutils-devel, Josh Stone

Hi Milian,

On Mon, 2020-06-22 at 10:29 +0200, Milian Wolff wrote:
> On Montag, 15. Juni 2020 18:54:41 CEST Josh Stone wrote:
> > On 6/13/20 10:40 AM, Milian Wolff wrote:
> > > Has anyone an idea on how to to post-process the DWARF data to optimize
> > > the
> > > lookup of inlined frames?
> > 
> > SystemTap implements its own cache for repeated lookups -- see
> > dwflpp::get_die_parents().
> 
> Thanks, I've come up with something similar over the weekend before reading 
> your mail. The performance boost is huge (5x and more).
> 
> Looking at your code, I think that I'm not yet handling a few corner cases 
> (such as imported units). That, paired with the fact that at least three users 
> of this API have apparently by now come up with a similar solution clearly 
> makes a case for upstreaming this into a common API.

Yes, I think having an elfutils/libdw API for this would be very
useful. And it would also be useful for eu-addr2line and eu-stack when
looking up inlined functions.

The imported (partial) units are a little tricky because they cross CUs
(and sometimes even Dwarfs for example when dealing with dwz/multi-
files). It also means that a Die inside the partial unit can have
multiple parents, because they might have been imported through
different imports. But I guess that if we associate a parent cache with
one CU, then this is clean (unless the CU imports the same partial unit
multiple times...).

> I believe that there is a lot of data that potentially needs to be cached. 
> Additionally, doing it behind the scenes may raise questions regarding multi 
> threaded usage of the API (see my other mail relating to that).
> 
> Which means: an explicit API to opt-in to this behavior is safest and best I 
> believe. Maybe something as simple as the following would be sufficient?
> 
> ```
> /* Cache parent DIEs chain to speed up repeated dwarf_getscopes calls.
> 
>    Returns -1 for errors or 0 if the parent chain was cached already. */
> extern int dwarf_cache_parent_dies(Dwarf_Die *cudie);
> ```
> 
> Alternatively a function that returns the cache could be considered, which 
> would then require new versions of dwarf_getscopes* that take the cache as an 
> argument.

I think an API that makes the "caching" explicit might be best. Maybe
we can call it a DieTree? We would then have a function to create a
DieTree and (new) functions that take a DieTree (and a Dwarf_Die) to
operate on it. The user can then also destroy the DieTree again when
done.

Cheers,

Mark

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2020-06-25 22:38 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-13 17:40 Can dwarf_getscopes{,_die} performance be improved? Milian Wolff
2020-06-15 16:54 ` Josh Stone
2020-06-19 23:45   ` Mark Wielaard
2020-06-22  8:29   ` Milian Wolff
2020-06-25 22:38     ` Mark Wielaard

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).