public inbox for elfutils@sourceware.org
* Performance issue with systemd-coredump and container process linking 2000 shared libraries.
@ 2023-06-14 13:39 Romain Geissler
  2023-06-19 15:08 ` Mark Wielaard
  0 siblings, 1 reply; 19+ messages in thread
From: Romain Geissler @ 2023-06-14 13:39 UTC (permalink / raw)
  To: elfutils-devel; +Cc: Romain Geissler

Hi,

I am reaching out to the elfutils mailing list to discuss an issue we have seen in one
of our OpenShift production workloads. Note that if you are a Red Hat employee,
this corresponds to Red Hat support case 03536980, which I have yet to
update with this information.

Description of the issue:
While moving from OpenShift 4.12 to 4.13, we also moved the system from a RHEL 8
base to a RHEL 9 one. Among the changes, this upgrades systemd to version 252.
Historically systemd-coredump used to care only about build-ids when analysing
core dumps; however, now it also tries to parse the package metadata note section
(cf https://systemd.io/ELF_PACKAGE_METADATA/).
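
For reference, that spec defines a ".note.package" ELF note with owner
"FDO" and type 0xcafe1a7e, whose payload is a JSON document along these
lines (illustrative values):

{
    "type": "rpm",
    "name": "hello",
    "version": "1.0-2.fc38",
    "architecture": "x86_64",
    "osCpe": "cpe:/o:fedoraproject:fedora:38"
}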

We use systemd-coredump as our default core handler, and since this is an OpenShift
node it mainly runs containers. As part of our container, we have a particular
binary which links against close to 2000 shared libraries. When this program
crashes, it takes several minutes for systemd-coredump to "analyze" the core,
most of that time being spent parsing the ELF part of it.

I have been able to extract the logic from the systemd code (cf
https://github.com/systemd/systemd/blob/22afd5cb2cf06418e5a84907b1ff782d29a33e47/src/shared/elf-util.c#L482 )
into a simplified C program that shows how "slow" this is. I will attach this
program a bit further down, and also explain my findings and the "fix" I
suggested internally that we propose to the systemd folks.

First, to reproduce, we can create a dummy binary that links against 2000 libraries.
Credit for finding and writing this simple reproducer goes to Francois Rigault,
one of my colleagues:

cat > lib.c << EOF
int TOTO = 42;
EOF

cat > generate-core.c << EOF
#include <stdio.h>
int main() {
        printf("started\n");
        int* i = 0;
        *i = 42;
}
EOF

N_LIBS=2000
LIBS=""
# Note: bash does not brace-expand {1..${N_LIBS}} when N_LIBS is a variable,
# so use seq and build the -l flags explicitly.
for i in $(seq 1 ${N_LIBS}); do
        [[ -f liba$i.so ]] || gcc -shared -fPIC -o liba$i.so lib.c -D TOTO=var$i
        LIBS="$LIBS -la$i"
done
echo libs built

gcc -o generate-core generate-core.c -L. -Wl,-rpath,$(pwd) $LIBS
echo running binary
time ./generate-core

# Here a core is generated, rename it to "the-core" in the current directory.

# Finally, remove all libaN.so files, to simulate the fact that systemd-coredump
# runs in a different mount filesystem as the container process which crashed
# (otherwise the following reproducer might be "fast"):
rm liba*.so



Once a core dump has been generated, I tried to mimic the code of systemd-coredump.
It looks like this:

cat > mimic-systemd-coredump.c << EOF
#include "elfutils/libdwfl.h"
#include <fcntl.h>
#include <assert.h>

static const Dwfl_Callbacks callbacks = {
    .find_elf = dwfl_build_id_find_elf,
    .section_address = dwfl_offline_section_address,
    .find_debuginfo = dwfl_standard_find_debuginfo,
};

void print_elf_error()
{
    printf("Elf error: %s\n", elf_errmsg(elf_errno()));
}

void print_dwfl_error()
{
    printf("Dwfl error: %s\n", dwfl_errmsg(dwfl_errno()));
}

// In systemd, this tries to retrieve the JSON package metadata note.
// Returns a value > 0 when data is found.
// Returns 0 when no data is found, but no error happened.
// Returns a value < 0 when a hard error happens and we have to stop.
int parse_module_package_metadata_best_effort(Elf* elf, const char* name)
{
    size_t n_program_headers;
    int r = elf_getphdrnum(elf, &n_program_headers);

    if (r == 0)
    {
        printf("Found %zu program headers for module %s\n", n_program_headers, name);
        // Don't implement the full systemd logic here.

        return 1;
    }
    else
    {
        // Best effort parsing: silently ignore the error.
        return 0;
    }
}

static int module_callback(Dwfl_Module *mod __attribute__ ((unused)),
        void **userdata __attribute__ ((unused)),
        const char *name __attribute__ ((unused)),
        Dwarf_Addr start, void *arg)
{
    Elf* elf = NULL;

    GElf_Addr bias;
    // Comment out the next line to "improve" performance.
    elf = dwfl_module_getelf(mod, &bias);

    if (elf)
    {
        printf("dwfl_module_getelf has found the elf program %s\n", name);
        int ret = parse_module_package_metadata_best_effort(elf, name);

        if (ret > 0)
            return DWARF_CB_OK;
        else if (ret < 0)
            return DWARF_CB_ABORT;
        else
            elf = arg;
    }
    else
    {
        printf("dwfl_module_getelf returned NULL for module %s\n", name);
        //print_dwfl_error();
        elf = arg;
    }

    size_t n_program_headers;
    assert(elf_getphdrnum(elf, &n_program_headers) >= 0);
    printf("iterate: %s: %lli\n", name, n_program_headers);

    for (size_t i = 0; i < n_program_headers; ++i) {
        GElf_Phdr mem, *program_header;
        Elf_Data *data;

        /* The core file stores the ELF files in the PT_LOAD segment. */
        program_header = gelf_getphdr(elf, i, &mem);
        if (!program_header || program_header->p_type != PT_LOAD)
            continue;

        /* This PT_LOAD section doesn't contain the start address, so it can't be the module we are looking for. */
        if (start < program_header->p_vaddr || start >= program_header->p_vaddr + program_header->p_memsz)
        {
            //printf("skip !!!\n");
            continue;
        }

        /* Now get a usable Elf reference, and parse the notes from it. */
        data = elf_getdata_rawchunk(elf,
                program_header->p_offset,
                program_header->p_filesz,
                ELF_T_NHDR);
        if (!data)
            continue;

        Elf *memelf = elf_memory(data->d_buf, data->d_size);
    if (!memelf)
        continue;

        int ret = parse_module_package_metadata_best_effort(memelf, name);
        elf_end(memelf);

        if (ret > 0)
            return DWARF_CB_OK;
        else if (ret < 0)
            return DWARF_CB_ABORT;
    }

    return DWARF_CB_OK;
}

int main()
{
    elf_version(EV_CURRENT);

    int fd = open ("the-core", O_RDONLY);
    assert(fd >= 0);

    Elf *elf = elf_begin(fd, ELF_C_READ_MMAP, NULL);
    assert(elf);

    int ret;
    GElf_Ehdr elf_header;
    assert(gelf_getehdr(elf, &elf_header) != NULL);
    printf("elf_header.e_type = %i\n", elf_header.e_type);
    assert(elf_header.e_type == ET_CORE);

    Dwfl *dwfl = dwfl_begin(&callbacks);

    ret = dwfl_core_file_report(dwfl, elf, NULL);
    printf("dwfl_core_file_report: %i\n", ret);
    assert(ret >= 0);

    ret = dwfl_report_end(dwfl, NULL, NULL);
    assert(ret == 0);

    ptrdiff_t offset = dwfl_getmodules (dwfl, &module_callback, elf, 0);
    if (offset < 0)
        assert(false);

    dwfl_end (dwfl);

    return 0;
}
EOF


And it can be compiled with:
gcc -O2 -o mimic-systemd-coredump  mimic-systemd-coredump.c -lelf -ldw


When running ./mimic-systemd-coredump we can see that the call to dwfl_module_getelf
always returns NULL for the libaN.so files, so we fall back on the systemd logic to try
to parse the package metadata section from the core directly. It takes minutes
to run the full program. If however I comment out the call to dwfl_module_getelf,
it runs in less than a second.

My understanding is that the systemd developers hoped that libdwfl would
return some "partial" Elf* reference when calling dwfl_module_getelf, built from the elf
headers found in the core for each and every shared library (the first page of the
PT_LOAD mappings, which the kernel always dumps even when the mapping is file backed).
However it seems that under the hood it doesn't (is this linked to core_file_read_eagerly,
which seems to always return false in this case?), and instead it uses the
.find_elf = dwfl_build_id_find_elf callback, which tries to find the file by build-id
on the filesystem. For some reason unknown to me, calling dwfl_module_getelf is very
slow (I wouldn't expect looking up a file by build-id on the filesystem to be that slow).
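
One way to see those truncated file-backed mappings is to list the core's
program headers with readelf -l the-core (or eu-readelf -l the-core): for
the file-backed libraries, the FileSiz of the matching LOAD entry is
typically a single page, just enough to cover the ELF header of the
mapped library.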

So, is this behavior of dwfl_module_getelf expected? If yes, do you agree that we should
advise the systemd-coredump developers to invert their logic: first try to look for a partial
elf header in the core's PT_LOAD segments, and only fall back to the more reliable
dwfl_module_getelf if that didn't work? In practice, we have tried the following patch
applied to systemd v253 and it seems to "fix" the above-mentioned case:


commit 9463e1f22cbaaf68de9b547da243e70f1b2f622a
Author: Romain Geissler <romain.geissler@amadeus.com>
Date:   Wed Jun 14 11:59:31 2023 +0000

    First try to fix core dump issue on v253.

diff --git a/src/shared/elf-util.c b/src/shared/elf-util.c
index 8da16f528f..fadb3e81a3 100644
--- a/src/shared/elf-util.c
+++ b/src/shared/elf-util.c
@@ -503,26 +503,13 @@ static int module_callback(Dwfl_Module *mod, void **userdata, const char *name,
         if (r < 0)
                 return DWARF_CB_ABORT;
 
-        /* The .note.package metadata is more difficult. From the module, we need to get a reference
-         * to the ELF object first. We might be lucky and just get it from elfutils. */
-        elf = sym_dwfl_module_getelf(mod, &bias);
-        if (elf) {
-                r = parse_package_metadata(name, id_json, elf, NULL, c);
-                if (r < 0)
-                        return DWARF_CB_ABORT;
-                if (r > 0)
-                        return DWARF_CB_OK;
-        } else
-                elf = c->elf;
-
-        /* We did not get the ELF object, or it's just a reference to the core. That is likely
-         * because we didn't get direct access to the executable, and the version of elfutils does
-         * not yet support parsing it out of the core file directly.
-         * So fallback to manual extraction - get the PT_LOAD section from the core,
+        /* First try with the (maybe partial) ELF object we can get from the core itself. Indeed
+         * the crashed process might have run in a different mount namespace, in which case
+         * dwfl_module_getelf would try (and fail) to open the file by build-id in the current
+         * mount namespace. So first use manual extraction - get the PT_LOAD section from the core,
          * and if it's the right one we can interpret it as an Elf object, and parse
          * its notes manually. */
-
-        r = sym_elf_getphdrnum(elf, &n_program_headers);
+        r = sym_elf_getphdrnum(c->elf, &n_program_headers);
         if (r < 0) {
                 log_warning("Could not parse number of program headers from core file: %s",
                             sym_elf_errmsg(-1)); /* -1 retrieves the most recent error */
@@ -536,12 +523,16 @@ static int module_callback(Dwfl_Module *mod, void **userdata, const char *name,
                 Elf_Data *data;
 
                 /* The core file stores the ELF files in the PT_LOAD segment. */
-                program_header = sym_gelf_getphdr(elf, i, &mem);
+                program_header = sym_gelf_getphdr(c->elf, i, &mem);
                 if (!program_header || program_header->p_type != PT_LOAD)
                         continue;
 
+                /* This PT_LOAD section doesn't contain the start address, so it can't be the module we are looking for. */
+                if (start < program_header->p_vaddr || start >= program_header->p_vaddr + program_header->p_memsz)
+                        continue;
+
                 /* Now get a usable Elf reference, and parse the notes from it. */
-                data = sym_elf_getdata_rawchunk(elf,
+                data = sym_elf_getdata_rawchunk(c->elf,
                                                 program_header->p_offset,
                                                 program_header->p_filesz,
                                                 ELF_T_NHDR);
@@ -555,7 +546,18 @@ static int module_callback(Dwfl_Module *mod, void **userdata, const char *name,
                 if (r < 0)
                         return DWARF_CB_ABORT;
                 if (r > 0)
-                        break;
+                        return DWARF_CB_OK;
+        }
+
+        /* Retrieving package_metadata only from the core file failed. Try to get a reference
+         * to the ELF object instead. We might be lucky and just get it from elfutils or the filesystem. */
+        elf = sym_dwfl_module_getelf(mod, &bias);
+        if (elf) {
+                r = parse_package_metadata(name, id_json, elf, NULL, c);
+                if (r < 0)
+                        return DWARF_CB_ABORT;
+                if (r > 0)
+                        return DWARF_CB_OK;
         }
 
         return DWARF_CB_OK;


Some other side question: in the long run, wouldn't it make sense for elfutils to parse the
JSON package metadata section by itself, just like it does for the build-id, rather than having
this logic implemented in systemd?

Thanks for reading ;)

Cheers,
Romain


* Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.
  2023-06-14 13:39 Performance issue with systemd-coredump and container process linking 2000 shared libraries Romain Geissler
@ 2023-06-19 15:08 ` Mark Wielaard
  2023-06-19 19:56   ` Romain GEISSLER
  2023-06-20 21:37   ` Mark Wielaard
  0 siblings, 2 replies; 19+ messages in thread
From: Mark Wielaard @ 2023-06-19 15:08 UTC (permalink / raw)
  To: Romain Geissler, elfutils-devel

Hi Romain,

Just to let you know I am looking at this. But haven't made much
progress in understanding it yet. Thanks so much for the reproducer. I
have been able to see the (very slow) parsing of the core file with it.

$ time ./mimic-systemd-coredump
[...]
real	3m35.965s
user	0m0.722s
sys	0m0.345s

Note however that a lot of time is "missing".
And in fact running it again is fast!?!

$ time ./mimic-systemd-coredump
real	0m0.327s
user	0m0.272s
sys	0m0.050s

This is because of the kernel inode/dentry cache.
If I do $ echo 2 | sudo tee /proc/sys/vm/drop_caches
before running ./mimic-systemd-coredump it is always slow.

I'll try to figure out what we do to make it so hard for the kernel to
do these lookups.

But that doesn't invalidate the other observation you made, that the
dwfl_module_getelf call always returns NULL.

> My understanding is that the systemd developers hoped that libdwfl would
> return some "partial" Elf* reference when calling dwfl_module_getelf, built from the elf
> headers found in the core for each and every shared library (the first page of the
> PT_LOAD mappings, which the kernel always dumps even when the mapping is file backed).

Right, that is a reasonable hope. And I don't actually know why it
always fails in this case.

> However it seems that under the hood it doesn't (is this linked to core_file_read_eagerly,
> which seems to always return false in this case?), and instead it uses the
> .find_elf = dwfl_build_id_find_elf callback, which tries to find the file by build-id
> on the filesystem. For some reason unknown to me, calling dwfl_module_getelf is very
> slow (I wouldn't expect looking up a file by build-id on the filesystem to be that slow).

Apparently we do it in some really slow way if the inodes/dentries
aren't in kernel cache (and the files are not actually on disk).

Which does bring up the question why systemd-coredump isn't running in
the same mount space as the crashing program. Then it would simply find
the files that the crashing program is using. Or it might install a
.find_elf callback that (also) looks under /proc/pid/root/ ?
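
Such a callback could look roughly like this (an untested sketch;
crash_pid is a hypothetical variable the caller would fill in, e.g. from
the %P core_pattern specifier, plus the usual fcntl/limits/stdio/string
headers):

  static pid_t crash_pid;

  static int
  find_elf_in_crash_ns (Dwfl_Module *mod, void **userdata,
                        const char *modname, Dwarf_Addr base,
                        char **file_name, Elf **elfp)
  {
    /* First try the normal build-id based lookup.  */
    int fd = dwfl_build_id_find_elf (mod, userdata, modname, base,
                                     file_name, elfp);
    if (fd >= 0)
      return fd;

    /* Fall back to the module path as seen from inside the crashing
       process's mount namespace.  */
    char path[PATH_MAX];
    snprintf (path, sizeof path, "/proc/%d/root%s",
              (int) crash_pid, modname);
    fd = open (path, O_RDONLY);
    if (fd >= 0)
      *file_name = strdup (path);
    return fd;
  }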

> So, is this behavior of dwfl_module_getelf expected? If yes, do you agree that we should
> advise the systemd-coredump developers to invert their logic: first try to look for a partial
> elf header in the core's PT_LOAD segments, and only fall back to the more reliable
> dwfl_module_getelf if that didn't work? In practice, we have tried the following patch
> applied to systemd v253 and it seems to "fix" the above-mentioned case:

I don't think dwfl_module_getelf should always return NULL in this
case. Nor should it be this slow. But given that it does and given that
it is slow, that is certainly reasonable advice.

> Some other side question: in the long run, wouldn't it make sense for elfutils to parse the
> JSON package metadata section by itself, just like it does for the build-id, rather than having
> this logic implemented in systemd?

Maybe we could provide this functionality. You are right that we have
no problem getting the build-ids with $ eu-unstrip --core=./the-core -n,
so providing some other "static data" might be possible with a simpler
interface.
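
For instance something like this (purely hypothetical, no such function
exists in elfutils today):

  /* Hypothetical: return the payload of a loaded ELF note such as
     .note.package for a module, without fetching the full ELF file.  */
  extern const void *dwfl_module_notedata (Dwfl_Module *mod,
                                           const char *note_owner,
                                           GElf_Word note_type,
                                           size_t *size);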

Thanks for this extensive bug report and reproducer. I'll play some more
with it to hopefully get you some real answers/fixes.

Cheers,

Mark


* Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.
  2023-06-19 15:08 ` Mark Wielaard
@ 2023-06-19 19:56   ` Romain GEISSLER
  2023-06-19 20:54     ` Luca Boccassi
  2023-06-20 13:15     ` Mark Wielaard
  2023-06-20 21:37   ` Mark Wielaard
  1 sibling, 2 replies; 19+ messages in thread
From: Romain GEISSLER @ 2023-06-19 19:56 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: elfutils-devel

> On 19 June 2023 at 17:08, Mark Wielaard <mark@klomp.org> wrote:
> 
> Hi Romain,
> 
> Just to let you know I am looking at this. But haven't made much
> progress in understanding it yet. Thanks so much for the reproducer. I
> have been able to see the (very slow) parsing of the core file with it.

Hi,

Thanks ! And sorry that Laurent had pinged you directly on Slack, I
wanted to reach you via this mailing list instead of through the Red
Hat customer network ;)

I don’t know if you read the Red Hat case too. There you can find
things clarified a bit more, and split into what I think are potentially
3 distinct "problems" with 3 distinct possible fixes. Since there is nothing
private, I can write about this here as well on this public mailing list.

So in the end I see 3 points (in addition to the question of why
finding the elf header returns NULL while it should not, which I
guess you are currently looking at):
 - the idea that systemd developers should invert their logic: first
try to parse the elf/program headers from the (maybe partial) core dump
PT_LOAD segments
 - this special "if" condition that I have added to the original systemd
code:

+                /* This PT_LOAD section doesn't contain the start address, so it can't be the module we are looking for. */
+                if (start < program_header->p_vaddr || start >= program_header->p_vaddr + program_header->p_memsz)
+                        continue;

to be added near this line: https://github.com/systemd/systemd/blob/72e7bfe02d7814fff15602726c7218b389324159/src/shared/elf-util.c#L540

on which I would like to ask whether it indeed seems like the "right" fix,
given your knowledge of how core dumps and elf files are shaped.
 - the idea that maybe this commit https://sourceware.org/git/?p=elfutils.git;a=commitdiff;h=8db849976f07046d27b4217e9ebd08d5623acc4f
which assumed that the number of program headers is normally of the order of
10 for a "normal" elf file, so a linked list would be enough, might be
wrong in the special case of core dumps, which may have many more
program headers. And if it indeed makes sense to call elf_getdata_rawchunk
for each and every program header of a core, should this
linked list be changed into some set/hashmap indexed by start
address/size?
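
(Rough arithmetic on why this matters: with ~2000 modules and a core that
has ~2000 PT_LOAD headers, module_callback without the vaddr filter above
ends up calling elf_getdata_rawchunk on the order of 2000 x 2000 = 4
million times, each call scanning a linked list that keeps growing, i.e.
a quadratic pattern.)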

> 
> $ time ./mimic-systemd-coredump
> [...]
> real    3m35.965s
> user    0m0.722s
> sys     0m0.345s
> 
> Note however that a lot of time is "missing".
> And in fact running it again is fast!?!
> 
> $ time ./mimic-systemd-coredump
> real    0m0.327s
> user    0m0.272s
> sys     0m0.050s
> 
> This is because of the kernel inode/dentry cache.
> If I do $ echo 2 | sudo tee /proc/sys/vm/drop_caches
> before running ./mimic-systemd-coredump it is always slow.

Interesting ! I didn’t see that (actually I never let the program run till the
end !).

> Which does bring up the question why systemd-coredump isn't running in
> the same mount space as the crashing program. Then it would simply find
> the files that the crashing program is using.

On this point that systemd-coredump might not run in the same mount
namespace, don’t blindly believe me. I think I saw this while reviewing the
systemd code, but it was the first time I looked at it to investigate this issue,
so I may be wrong. But I am sure you have access to some systemd
colleagues at Red Hat to double-check the details ;)

Cheers,
Romain


* Re: Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.
  2023-06-19 19:56   ` Romain GEISSLER
@ 2023-06-19 20:54     ` Luca Boccassi
  2023-06-20 11:59       ` Mark Wielaard
  2023-06-20 13:15     ` Mark Wielaard
  1 sibling, 1 reply; 19+ messages in thread
From: Luca Boccassi @ 2023-06-19 20:54 UTC (permalink / raw)
  To: elfutils-devel


> > Which does bring up the question why systemd-coredump isn't running
> > in the same mount space as the crashing program. Then it would simply
> > find the files that the crashing program is using.
> 
> On this point that systemd-coredump might not run in the same mount
> namespace, don’t blindly believe me. I think I saw this while
> reviewing the systemd code, but it was the first time I looked at it
> to investigate this issue, so I may be wrong.

This is correct: in the case of containers, sd-coredump will run on the host
and collect from all the guests, so they are going to be in different
namespaces. And even when they are not, the original binary might be
long gone by the time it has a chance to run.

-- 
Kind regards,
Luca Boccassi



* Re: Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.
  2023-06-19 20:54     ` Luca Boccassi
@ 2023-06-20 11:59       ` Mark Wielaard
  2023-06-20 12:12         ` Luca Boccassi
  0 siblings, 1 reply; 19+ messages in thread
From: Mark Wielaard @ 2023-06-20 11:59 UTC (permalink / raw)
  To: Luca Boccassi, elfutils-devel

Hi Luca,

On Mon, 2023-06-19 at 21:54 +0100, Luca Boccassi wrote:
> > > Which does bring up the question why systemd-coredump isn't running
> > > in the same mount space as the crashing program. Then it would simply
> > > find the files that the crashing program is using.
> > 
> > On this point that systemd-coredump might not run in the same mount
> > namespace, don’t blindly believe me. I think I saw this while
> > reviewing the systemd code, but it was the first time I looked at it
> > to investigate this issue, so I may be wrong.
> 
> This is correct: in the case of containers, sd-coredump will run on the host
> and collect from all the guests, so they are going to be in different
> namespaces. And even when they are not, the original binary might be
> long gone by the time it has a chance to run.

Since sd-coredump is run on the host it could transition into the mount
namespace of the process it is capturing the coredump for. This makes
sense if it is then going to do some path based lookups. Alternatively
we could look into making the path based lookups look under
/proc/PID/root/. Since sd-coredump is run as a special kernel helper
through /proc/sys/kernel/core_pattern, the original executable is still
there (it cannot be reaped till sd-coredump has done its thing).

The core (5) manpage states:

       *  The process runs in the initial namespaces (PID, mount, user,
          and so on) and not in the namespaces of the crashing process.
          One can utilize specifiers such as %P to find the right
          /proc/[pid] directory and probe/enter the crashing process's
          namespaces if needed.

https://man7.org/linux/man-pages/man5/core.5.html
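
Probing/entering the crashing process's namespaces could then be done
roughly like this (a sketch; needs CAP_SYS_ADMIN, and 1234 stands in for
the pid taken from %P):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <sched.h>

  int nsfd = open ("/proc/1234/ns/mnt", O_RDONLY);
  if (nsfd >= 0 && setns (nsfd, CLONE_NEWNS) == 0)
    {
      /* Path lookups now resolve inside the crashing process's
         mount namespace.  */
    }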

Cheers,

Mark


* Re: Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.
  2023-06-20 11:59       ` Mark Wielaard
@ 2023-06-20 12:12         ` Luca Boccassi
  0 siblings, 0 replies; 19+ messages in thread
From: Luca Boccassi @ 2023-06-20 12:12 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: elfutils-devel

On Tue, 20 Jun 2023 at 13:07, Mark Wielaard <mark@klomp.org> wrote:
>
> Hi Luca,
>
> On Mon, 2023-06-19 at 21:54 +0100, Luca Boccassi wrote:
> > > > Which does bring up the question why systemd-coredump isn't running
> > > > in the same mount space as the crashing program. Then it would simply
> > > > find the files that the crashing program is using.
> > >
> > > On this point that systemd-coredump might not run in the same mount
> > > namespace, don’t blindly believe me. I think I saw this while
> > > reviewing the systemd code, but it was the first time I looked at it
> > > to investigate this issue, so I may be wrong.
> >
> > This is correct: in the case of containers, sd-coredump will run on the host
> > and collect from all the guests, so they are going to be in different
> > namespaces. And even when they are not, the original binary might be
> > long gone by the time it has a chance to run.
>
> Since sd-coredump is run on the host it could transition into the mount
> namespace of the process it is capturing the coredump for. This makes
> sense if it is then going to do some path based lookups. Alternatively
> we could look into making the path based lookups look under
> /proc/PID/root/ Since sd-coredump is run as special kernel helper
> through /proc/sys/kernel/core_pattern the original executable is still
> there (it cannot be reaped till sd-coredump has done its thing).

This requires additional privileges that we really don't want to give
to sd-coredump, as it fundamentally processes untrusted inputs. Also,
as mentioned, there's no guarantee anyway that the original binary will
still be around when the processing happens; it's all asynchronous, so
there would still be common corner cases where that doesn't help.
There were other suggestions in this thread to fix this issue; to me
it seems better to pursue those instead.

Kind regards,
Luca Boccassi


* Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.
  2023-06-19 19:56   ` Romain GEISSLER
  2023-06-19 20:54     ` Luca Boccassi
@ 2023-06-20 13:15     ` Mark Wielaard
  1 sibling, 0 replies; 19+ messages in thread
From: Mark Wielaard @ 2023-06-20 13:15 UTC (permalink / raw)
  To: Romain GEISSLER; +Cc: elfutils-devel

Hi Romain,

On Mon, 2023-06-19 at 19:56 +0000, Romain GEISSLER via Elfutils-devel
wrote:
> 
> Thanks ! And sorry that Laurent had pinged you directly on Slack, I
> wanted to reach you via this mailing list instead of through the Red
> Hat customer network ;)

Slack isn't a very effective way to reach me. Most elfutils hackers do
hang out on the Libera.Chat irc channel #elfutils.

> I don’t know if you read the Red Hat case too. There you can find
> things clarified a bit more, and split into what I think are potentially
> 3 distinct "problems" with 3 distinct possible fixes. Since there is nothing
> private, I can write about this here as well on this public mailing list.

I haven't checked whether I have access to the customer case, since you
provided such a great reproducer.

> So in the end I see 3 points (in addition to the question of why
> finding the elf header returns NULL while it should not, which I
> guess you are currently looking at):
>  - the idea that systemd developers should invert their logic: first
> try to parse the elf/program headers from the (maybe partial) core dump
> PT_LOAD segments

yes, that could in theory also be done through a custom
callbacks->find_elf.

>  - this special "if" condition that I have added to the original systemd
> code:
> 
> +                /* This PT_LOAD section doesn't contain the start address, so it can't be the module we are looking for. */
> +                if (start < program_header->p_vaddr || start >= program_header->p_vaddr + program_header->p_memsz)
> +                        continue;
> 
> to be added near this line: https://github.com/systemd/systemd/blob/72e7bfe02d7814fff15602726c7218b389324159/src/shared/elf-util.c#L540
> 
> on which I would like to ask whether it indeed seems like the "right" fix,
> given your knowledge of how core dumps and elf files are shaped.

Yes, that does make sense.

>  - the idea that maybe this commit https://sourceware.org/git/?p=elfutils.git;a=commitdiff;h=8db849976f07046d27b4217e9ebd08d5623acc4f
> which assumed that the number of program headers is normally of the order of
> 10 for a "normal" elf file, so a linked list would be enough, might be
> wrong in the special case of core dumps, which may have many more
> program headers. And if it indeed makes sense to call elf_getdata_rawchunk
> for each and every program header of a core, should this
> linked list be changed into some set/hashmap indexed by start
> address/size?

Interesting. Yeah, a linked list is not the ideal data structure here.
But I don't think it is causing the really long delay. That clearly
comes from the (negative) inode/dentry file search cache. But we could
look at this after we solve the other issues, if we want to speed
things up a bit more.

> 
Cheers,

Mark


* Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.
  2023-06-19 15:08 ` Mark Wielaard
  2023-06-19 19:56   ` Romain GEISSLER
@ 2023-06-20 21:37   ` Mark Wielaard
  2023-06-20 22:05     ` Romain GEISSLER
  1 sibling, 1 reply; 19+ messages in thread
From: Mark Wielaard @ 2023-06-20 21:37 UTC (permalink / raw)
  To: Romain Geissler, elfutils-devel

Hi,

On Mon, Jun 19, 2023 at 05:08:50PM +0200, Mark Wielaard wrote:
> Just to let you know I am looking at this. But haven't made much
> progress in understanding it yet. Thanks so much for the reproducer. I
> have been able to see the (very slow) parsing of the core file with it.
> 
> $ time ./mimic-systemd-coredump
> [...]
> real	3m35.965s
> user	0m0.722s
> sys	0m0.345s
> 
> Note however that a lot of time is "missing".
> And in fact running it again is fast!?!
> 
> $ time ./mimic-systemd-coredump
> real	0m0.327s
> user	0m0.272s
> sys	0m0.050s
> 
> This is because of the kernel inode/dentry cache.
> If I do $ echo 2 | sudo tee /proc/sys/vm/drop_caches
> before running ./mimic-systemd-coredump it is always slow.
> 
> I'll try to figure out what we do to make it so hard for the kernel to
> do these lookups.

So I made a mistake here: I was testing on Fedora 38, which has
DEBUGINFOD_URLS set. Without DEBUGINFOD_URLS set there is no big
slowdown.

Do you have the DEBUGINFOD_URLS environment variable set?

The real sd-coredump will not have DEBUGINFOD_URLS set (I hope).
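
An easy way to check/compare is to clear the variable for a single run,
since an empty value disables the debuginfod client:

$ DEBUGINFOD_URLS= ./mimic-systemd-coredump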

Thanks,

Mark


* Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.
  2023-06-20 21:37   ` Mark Wielaard
@ 2023-06-20 22:05     ` Romain GEISSLER
  2023-06-21 16:24       ` Mark Wielaard
  2023-06-22  2:40       ` Frank Ch. Eigler
  0 siblings, 2 replies; 19+ messages in thread
From: Romain GEISSLER @ 2023-06-20 22:05 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: elfutils-devel, Francois RIGAULT

> On 20 June 2023 at 23:37, Mark Wielaard <mark@klomp.org> wrote:
> 
> Hi,
> 
> On Mon, Jun 19, 2023 at 05:08:50PM +0200, Mark Wielaard wrote:
> 
> So I made a mistake here: I was testing on Fedora 38, which has
> DEBUGINFOD_URLS set. Without DEBUGINFOD_URLS set there is no big
> slowdown.
> 
> Do you have the DEBUGINFOD_URLS environment variable set?
> 
> The real sd-coredump will not have DEBUGINFOD_URLS set (I hope).
> 
> Thanks,
> 
> Mark

Hi,

Our real use case happens on an OpenShift 4.13 node, so the OS is Red Hat CoreOS 9 (which I assume shares a lot of foundations with RHEL 9).

On our side, Francois also told me this afternoon that my reproducer posted here doesn't really reproduce the same thing as the real systemd-coredump issue he witnessed live, and he also noticed that with DEBUGINFOD_URLS unset/set to the empty string my reproducer has no problem anymore. What he witnessed in the real case (using perf/gdb) was that apparently lots of time was spent in elf_getdata_rawchunk, often with this kind of stack:

Samples: 65K of event 'cpu-clock:pppH', Event count (approx.): 16468500000                                                                                                                                 
Overhead  Command         Shared Object             Symbol                                                                                                                                                 
  98.24%  (sd-parse-elf)  libelf-0.188.so           [.] elf_getdata_rawchunk
   0.48%  (sd-parse-elf)  libelf-0.188.so           [.] 0x00000000000048a3
   0.27%  (sd-parse-elf)  libelf-0.188.so           [.] gelf_getphdr
   0.11%  (sd-parse-elf)  libc.so.6                 [.] _int_malloc
   0.10%  (sd-parse-elf)  libelf-0.188.so           [.] gelf_getnote
   0.06%  (sd-parse-elf)  libc.so.6                 [.] __libc_calloc
   0.05%  (sd-parse-elf)  [kernel.kallsyms]         [k] __softirqentry_text_start
   0.05%  (sd-parse-elf)  libc.so.6                 [.] _int_free


(gdb) bt
#0  0x00007f0ba8a88194 in elf_getdata_rawchunk () from target:/lib64/libelf.so.1
#1  0x00007f0ba98e5013 in module_callback.lto_priv () from target:/usr/lib64/systemd/libsystemd-shared-252.so
#2  0x00007f0ba8ae7291 in dwfl_getmodules () from target:/lib64/libdw.so.1
#3  0x00007f0ba98e6dc0 in parse_elf_object () from target:/usr/lib64/systemd/libsystemd-shared-252.so
#4  0x0000562c474f2d5e in submit_coredump ()
#5  0x0000562c474f57d1 in process_socket.constprop ()
#6  0x0000562c474efbf8 in main ()

My reproducer actually doesn’t fully re-implement what systemd implements (the parsing of the package metadata is clearly omitted), so I thought I had reproduced the same problem while apparently I didn’t, sorry for that. We will also have to double check if really just using 2000 dummy libraries is enough or if this also needs to have a more complex binary like we have in our real case.

Tomorrow on our side we will have to play a bit with a local build of systemd-coredump and try to run it manually to better understand what’s going wrong.


Note: when I wrote and tested my reproducer, I used a fedora:38 container, which doesn’t have DEBUGINFOD_URLS set (which may differ from a real, non-containerized Fedora 38).

[root@7563ccfb7a39 /]# printenv|grep DEBUGINFOD_URLS
[root@7563ccfb7a39 /]# find /etc/profile.d/|grep debug
[root@7563ccfb7a39 /]# cat /etc/os-release
NAME="Fedora Linux"
VERSION="38 (Container Image)"

Cheers,
Romain



* Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.
  2023-06-20 22:05     ` Romain GEISSLER
@ 2023-06-21 16:24       ` Mark Wielaard
  2023-06-21 18:14         ` Romain GEISSLER
  2023-06-27 14:09         ` Mark Wielaard
  2023-06-22  2:40       ` Frank Ch. Eigler
  1 sibling, 2 replies; 19+ messages in thread
From: Mark Wielaard @ 2023-06-21 16:24 UTC (permalink / raw)
  To: Romain GEISSLER; +Cc: elfutils-devel, Francois RIGAULT


Hi Romain,

On Tue, 2023-06-20 at 22:05 +0000, Romain GEISSLER wrote:
> 
> Our real use case happens on an OpenShift 4.13 node, so the OS is Red Hat CoreOS 9 (which I assume shares a lot of foundations with RHEL 9).
> 
> On our side, Francois also told me this afternoon that my reproducer posted here doesn't really reproduce the same thing as the real systemd-coredump issue he witnessed live, and he also noticed that with DEBUGINFOD_URLS unset/set to the empty string my reproducer has no problem anymore. What he witnessed in the real case (using perf/gdb) was that apparently lots of time was spent in elf_getdata_rawchunk, often with this kind of stack:
> 
> Samples: 65K of event 'cpu-clock:pppH', Event count (approx.): 16468500000                                                                                                                                 
> Overhead  Command         Shared Object             Symbol                                                                                                                                                 
>   98.24%  (sd-parse-elf)  libelf-0.188.so           [.] elf_getdata_rawchunk
>    0.48%  (sd-parse-elf)  libelf-0.188.so           [.] 0x00000000000048a3
>    0.27%  (sd-parse-elf)  libelf-0.188.so           [.] gelf_getphdr
>    0.11%  (sd-parse-elf)  libc.so.6                 [.] _int_malloc
>    0.10%  (sd-parse-elf)  libelf-0.188.so           [.] gelf_getnote
>    0.06%  (sd-parse-elf)  libc.so.6                 [.] __libc_calloc
>    0.05%  (sd-parse-elf)  [kernel.kallsyms]         [k] __softirqentry_text_start
>    0.05%  (sd-parse-elf)  libc.so.6                 [.] _int_free
> 
> 
> (gdb) bt
> #0  0x00007f0ba8a88194 in elf_getdata_rawchunk () from target:/lib64/libelf.so.1
> #1  0x00007f0ba98e5013 in module_callback.lto_priv () from target:/usr/lib64/systemd/libsystemd-shared-252.so
> #2  0x00007f0ba8ae7291 in dwfl_getmodules () from target:/lib64/libdw.so.1
> #3  0x00007f0ba98e6dc0 in parse_elf_object () from target:/usr/lib64/systemd/libsystemd-shared-252.so
> #4  0x0000562c474f2d5e in submit_coredump ()
> #5  0x0000562c474f57d1 in process_socket.constprop ()
> #6  0x0000562c474efbf8 in main ()
> 
> My reproducer actually doesn’t fully re-implement what systemd implements (the parsing of the package metadata is clearly omitted), so I thought I had reproduced the same problem while apparently I didn’t, sorry for that. We will also have to double check if really just using 2000 dummy libraries is enough or if this also needs to have a more complex binary like we have in our real case.
> 
> Tomorrow on our side we will have to play a bit with a local build of systemd-coredump and try to run it manually to better understand what’s going wrong.
> 

Seeing those performance results I understand why you were suspecting
the linked list data structure used in elf_getdata_rawchunk.

Would you be able to test a rebuilt libelf with the attached patch,
which replaces that data structure with a binary search tree?

It didn't really show much speedup locally (in the noise, maybe 0.01
sec faster on ~0.25 sec run). But if there are more than 2000 calls to
elf_getdata_rawchunk it should make things faster.

Thanks,

Mark

[-- Attachment #2: Type: text/x-patch, Size: 7186 bytes --]

From 3aca5b5f1f1617db2220022d9061dcaf129e54c4 Mon Sep 17 00:00:00 2001
From: Mark Wielaard <mark@klomp.org>
Date: Wed, 21 Jun 2023 18:05:12 +0200
Subject: [PATCH] libelf: Replace list of elf_getdata_rawchunk results with a
 tree

elf_getdata_rawchunks did a linear search to see if a chunk was
already fetched. Replace this list with a binary search tree to make
lookup faster when a lot of Elf_Data_Chunk were created.

       * libelf/libelfP.h (Elf_Data_Chunk): Remove next field.
       (struct Elf): Change the rawchunks type from Elf_Data_Chunk *
       to void *.
       * elf_getdata_rawchunk.c (chunk_compare): New static function.
       (elf_getdata_rawchunk): Use tsearch instead of a manual linked
       list.
       * elf_end.c (free_chunk): New static function.
       (elf_end): Call tdestroy instead of walking linked list.

Signed-off-by: Mark Wielaard <mark@klomp.org>
---
 libelf/elf_end.c              | 22 +++++++++-------
 libelf/elf_getdata_rawchunk.c | 47 +++++++++++++++++++++++++----------
 libelf/libelfP.h              | 13 ++++------
 3 files changed, 52 insertions(+), 30 deletions(-)

diff --git a/libelf/elf_end.c b/libelf/elf_end.c
index 5c451f36..3e5d4c86 100644
--- a/libelf/elf_end.c
+++ b/libelf/elf_end.c
@@ -1,5 +1,6 @@
 /* Free resources associated with Elf descriptor.
    Copyright (C) 1998,1999,2000,2001,2002,2004,2005,2007,2015,2016 Red Hat, Inc.
+   Copyright (C) 2023 Mark J. Wielaard <mark@klomp.org>
    This file is part of elfutils.
    Written by Ulrich Drepper <drepper@redhat.com>, 1998.
 
@@ -32,12 +33,22 @@
 #endif
 
 #include <assert.h>
+#include <search.h>
 #include <stddef.h>
 #include <stdlib.h>
 
 #include "libelfP.h"
 
 
+static void
+free_chunk (void *n)
+{
+  Elf_Data_Chunk *rawchunk = (Elf_Data_Chunk *)n;
+  if (rawchunk->dummy_scn.flags & ELF_F_MALLOCED)
+    free (rawchunk->data.d.d_buf);
+  free (rawchunk);
+}
+
 int
 elf_end (Elf *elf)
 {
@@ -112,20 +123,13 @@ elf_end (Elf *elf)
 
     case ELF_K_ELF:
       {
-	Elf_Data_Chunk *rawchunks
+	void *rawchunks
 	  = (elf->class == ELFCLASS32
 	     || (offsetof (struct Elf, state.elf32.rawchunks)
 		 == offsetof (struct Elf, state.elf64.rawchunks))
 	     ? elf->state.elf32.rawchunks
 	     : elf->state.elf64.rawchunks);
-	while (rawchunks != NULL)
-	  {
-	    Elf_Data_Chunk *next = rawchunks->next;
-	    if (rawchunks->dummy_scn.flags & ELF_F_MALLOCED)
-	      free (rawchunks->data.d.d_buf);
-	    free (rawchunks);
-	    rawchunks = next;
-	  }
+	tdestroy (rawchunks, free_chunk);
 
 	Elf_ScnList *list = (elf->class == ELFCLASS32
 			     || (offsetof (struct Elf, state.elf32.scns)
diff --git a/libelf/elf_getdata_rawchunk.c b/libelf/elf_getdata_rawchunk.c
index 5a35ccdc..cfd40396 100644
--- a/libelf/elf_getdata_rawchunk.c
+++ b/libelf/elf_getdata_rawchunk.c
@@ -1,6 +1,6 @@
 /* Return converted data from raw chunk of ELF file.
    Copyright (C) 2007, 2014, 2015 Red Hat, Inc.
-   Copyright (C) 2022 Mark J. Wielaard <mark@klomp.org>
+   Copyright (C) 2022, 2023 Mark J. Wielaard <mark@klomp.org>
    This file is part of elfutils.
 
    This file is free software; you can redistribute it and/or modify
@@ -33,12 +33,28 @@
 
 #include <assert.h>
 #include <errno.h>
+#include <search.h>
 #include <stdlib.h>
 #include <string.h>
 
 #include "libelfP.h"
 #include "common.h"
 
+static int
+chunk_compare (const void *a, const void *b)
+{
+  Elf_Data_Chunk *da = (Elf_Data_Chunk *)a;
+  Elf_Data_Chunk *db = (Elf_Data_Chunk *)b;
+
+  if (da->offset != db->offset)
+    return da->offset - db->offset;
+
+  if (da->data.d.d_size != db->data.d.d_size)
+    return da->data.d.d_size - db->data.d.d_size;
+
+  return da->data.d.d_type - db->data.d.d_type;
+}
+
 Elf_Data *
 elf_getdata_rawchunk (Elf *elf, int64_t offset, size_t size, Elf_Type type)
 {
@@ -75,19 +91,25 @@ elf_getdata_rawchunk (Elf *elf, int64_t offset, size_t size, Elf_Type type)
   rwlock_rdlock (elf->lock);
 
   /* Maybe we already got this chunk?  */
-  Elf_Data_Chunk *rawchunks = elf->state.elf.rawchunks;
-  while (rawchunks != NULL)
+  Elf_Data_Chunk key;
+  key.offset = offset;
+  key.data.d.d_size = size;
+  key.data.d.d_type = type;
+  Elf_Data_Chunk **found = tsearch (&key, &elf->state.elf.rawchunks,
+				    &chunk_compare);
+  if (found == NULL)
+    goto nomem;
+
+  /* Existing entry.  */
+  if (*found != &key && *found != NULL)
     {
-      if ((rawchunks->offset == offset || size == 0)
-	  && rawchunks->data.d.d_size == size
-	  && rawchunks->data.d.d_type == type)
-	{
-	  result = &rawchunks->data.d;
-	  goto out;
-	}
-      rawchunks = rawchunks->next;
+      result = &(*found)->data.d;
+      goto out;
     }
 
+  /* New entry.  */
+  *found = NULL;
+
   size_t align = __libelf_type_align (elf->class, type);
   if (elf->map_address != NULL)
     {
@@ -189,8 +211,7 @@ elf_getdata_rawchunk (Elf *elf, int64_t offset, size_t size, Elf_Type type)
   rwlock_unlock (elf->lock);
   rwlock_wrlock (elf->lock);
 
-  chunk->next = elf->state.elf.rawchunks;
-  elf->state.elf.rawchunks = chunk;
+  *found = chunk;
   result = &chunk->data.d;
 
  out:
diff --git a/libelf/libelfP.h b/libelf/libelfP.h
index 6624f38a..d3c241e5 100644
--- a/libelf/libelfP.h
+++ b/libelf/libelfP.h
@@ -1,5 +1,6 @@
 /* Internal interfaces for libelf.
    Copyright (C) 1998-2010, 2015, 2016 Red Hat, Inc.
+   Copyright (C) 2023 Mark J. Wielaard <mark@klomp.org>
    This file is part of elfutils.
    Contributed by Ulrich Drepper <drepper@redhat.com>, 1998.
 
@@ -262,11 +263,7 @@ typedef struct Elf_ScnList
 typedef struct Elf_Data_Chunk
 {
   Elf_Data_Scn data;
-  union
-  {
-    Elf_Scn dummy_scn;
-    struct Elf_Data_Chunk *next;
-  };
+  Elf_Scn dummy_scn;
   int64_t offset;		/* The original raw offset in the Elf image.  */
 } Elf_Data_Chunk;
 
@@ -324,7 +321,7 @@ struct Elf
       Elf_ScnList *scns_last;	/* Last element in the section list.
 				   If NULL the data has not yet been
 				   read from the file.  */
-      Elf_Data_Chunk *rawchunks; /* List of elf_getdata_rawchunk results.  */
+      void *rawchunks;		/* Tree of elf_getdata_rawchunk results.  */
       unsigned int scnincr;	/* Number of sections allocate the last
 				   time.  */
       int ehdr_flags;		/* Flags (dirty) for ELF header.  */
@@ -343,7 +340,7 @@ struct Elf
       Elf_ScnList *scns_last;	/* Last element in the section list.
 				   If NULL the data has not yet been
 				   read from the file.  */
-      Elf_Data_Chunk *rawchunks; /* List of elf_getdata_rawchunk results.  */
+      void *rawchunks;		/* Tree of elf_getdata_rawchunk results.  */
       unsigned int scnincr;	/* Number of sections allocate the last
 				   time.  */
       int ehdr_flags;		/* Flags (dirty) for ELF header.  */
@@ -368,7 +365,7 @@ struct Elf
       Elf_ScnList *scns_last;	/* Last element in the section list.
 				   If NULL the data has not yet been
 				   read from the file.  */
-      Elf_Data_Chunk *rawchunks; /* List of elf_getdata_rawchunk results.  */
+      void *rawchunks;		/* Tree of elf_getdata_rawchunk results.  */
       unsigned int scnincr;	/* Number of sections allocate the last
 				   time.  */
       int ehdr_flags;		/* Flags (dirty) for ELF header.  */
-- 
2.40.1



* Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.
  2023-06-21 16:24       ` Mark Wielaard
@ 2023-06-21 18:14         ` Romain GEISSLER
  2023-06-21 19:39           ` Mark Wielaard
  2023-06-27 14:09         ` Mark Wielaard
  1 sibling, 1 reply; 19+ messages in thread
From: Romain GEISSLER @ 2023-06-21 18:14 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: elfutils-devel, Francois RIGAULT

> On 21 June 2023 at 18:24, Mark Wielaard <mark@klomp.org> wrote:
> 
> 
> 
> Hi Romain,
> 
> Seeing those performance results I understand why you were suspecting
> the linked list data structure used in elf_getdata_rawchunk.
> 
> Would you be able to test a rebuilt libelf with the attached patch,
> which replaces that data structure with a binary search tree?
> 
> It didn't really show much speedup locally (in the noise, maybe 0.01
> sec faster on ~0.25 sec run). But if there are more than 2000 calls to
> elf_getdata_rawchunk it should make things faster.

Hi Mark,

Actually I have spent some time today reproducing the issue for real, using
systemd directly. The details can be found in
https://github.com/systemd/systemd/pull/28093 which was just merged. This
change in systemd is enough to fix the biggest part of the "slow" parsing of
cores. Yet even with my systemd patch, we will still call elf_getdata_rawchunk
2000 times if the crashing process had 2000 shared libraries, so indeed your
patch in elfutils might still be welcome. I will test it later when I go home, or
tomorrow.

Thanks,
Romain


* Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.
  2023-06-21 18:14         ` Romain GEISSLER
@ 2023-06-21 19:39           ` Mark Wielaard
  2023-06-22  8:10             ` Romain GEISSLER
  0 siblings, 1 reply; 19+ messages in thread
From: Mark Wielaard @ 2023-06-21 19:39 UTC (permalink / raw)
  To: Romain GEISSLER; +Cc: elfutils-devel, Francois RIGAULT

Hi Romain,

On Wed, Jun 21, 2023 at 06:14:27PM +0000, Romain GEISSLER wrote:
> > On 21 June 2023 at 18:24, Mark Wielaard <mark@klomp.org> wrote:
> > 
> > Seeing those performance results I understand why you were suspecting
> > the linked list data structure used in elf_getdata_rawchunk.
> > 
> > Would you be able to test a rebuilt libelf with the attached patch,
> > which replaces that data structure with a binary search tree?
> > 
> > It didn't really show much speedup locally (in the noise, maybe 0.01
> > sec faster on ~0.25 sec run). But if there are more than 2000 calls to
> > elf_getdata_rawchunk it should make things faster.
> 
> Actually I have spent some time today reproducing the issue for
> real, using systemd directly. The details can be found in
> https://github.com/systemd/systemd/pull/28093 which was just
> merged. This change in systemd is enough to fix the biggest part of
> the "slow" parsing of cores. Yet even with my systemd patch, we will
> still call elf_getdata_rawchunk 2000 times if the crashing process
> had 2000 shared libraries, so indeed your patch in elfutils might
> still be welcome. I will test it later when I go home, or tomorrow.

That patch looks good. It should reduce the number of calls to
elf_getdata_rawchunk a lot, making it less urgent that that function is
fast. But if you could test it, that would be appreciated. I'll also
test it a bit more and will probably merge it if no issues show up,
since it does seem better than keeping the linear list.

Thanks,

Mark


* Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.
  2023-06-20 22:05     ` Romain GEISSLER
  2023-06-21 16:24       ` Mark Wielaard
@ 2023-06-22  2:40       ` Frank Ch. Eigler
  2023-06-22 10:59         ` Mark Wielaard
  1 sibling, 1 reply; 19+ messages in thread
From: Frank Ch. Eigler @ 2023-06-22  2:40 UTC (permalink / raw)
  To: Romain GEISSLER; +Cc: Mark Wielaard, elfutils-devel, Francois RIGAULT

Hi -

> On our side Francois also told me this afternoon that he didn’t
> really reproduce the same thing with my reproducer posted here and
> the real systemd-coredump issue he witnessed live, and also noticed
> that with DEBUGINFOD_URLS unset/set to the empty string my
> reproducer has no problem anymore. [...]

Just doing the math from the debuginfod-client point of view (ignoring
the later systemd-side fix that moots this):

For an application that processes these elf/dwarf files sequentially,
queries for each synthetic solib are going to result in 2000 https-404
transactions, sans debuginfod caching.  If you're lucky (reusing a
dwfl object), elfutils may be able to reuse a long-lived https
connection to a server, otherwise a new https connection might have to
be spun up for each.  But even with reuse, we're talking about 2000
pingponging messages.  That will take a handful of minutes of elapsed
time just by itself.
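
(Back-of-the-envelope: at, say, ~100ms per request/response round trip,
2000 sequential lookups come to roughly 200 seconds, consistent with the
3m35s measured earlier in this thread.)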

If the calling code made these queries in parallel batches, it would
be much faster overall.

- FChE



* Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.
  2023-06-21 19:39           ` Mark Wielaard
@ 2023-06-22  8:10             ` Romain GEISSLER
  2023-06-22 12:03               ` Mark Wielaard
  0 siblings, 1 reply; 19+ messages in thread
From: Romain GEISSLER @ 2023-06-22  8:10 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: elfutils-devel, Francois RIGAULT

> On 21 June 2023 at 21:39, Mark Wielaard <mark@klomp.org> wrote:
> 
> 
> Hi Romain,
> 
> That patch looks good. It should reduce the number of calls to
> elf_getdata_rawchunk a lot, making it less urgent that that function is
> fast. But if you could test it, that would be appreciated. I'll also
> test it a bit more and will probably merge it if no issues show up,
> since it does seem better than keeping the linear list.
> 
> Thanks,
> 
> Mark


Hi,

So I have done some testing, running the command:

[root@fa28b3d254fd systemd]# rm -rf /var/lib/systemd/coredump/* && journalctl --flush --rotate && journalctl --vacuum-time=1s && time cat the-core | build/systemd-coredump $$ 0 0 11 1687360485 1000000000000000 localhost && journalctl

Where "the-core" is our real core dump we had in production, with around 1700+ shared libraries loaded, and the uncompressed core dump size is 2GB+.

Without the systemd patch, without the elfutils patch.
real    3m42.308s
user    3m39.579s
sys     0m2.294s


Without the systemd patch, with the elfutils patch (3 runs, first one excluded to make sure the kernel caches what it caches):
real    0m15.543s
user    0m13.662s
sys     0m2.405s

real    0m15.976s                                     
user    0m13.832s                                     
sys     0m2.481s

real    0m15.612s
user    0m13.687s                                     
sys     0m2.470s


With the systemd patch, without the elfutils patch (3 runs, first one excluded to make sure the kernel caches what it caches):
real    0m2.994s                                      
user    0m1.104s
sys     0m2.477s

real    0m3.011s                                      
user    0m1.154s                                      
sys     0m2.447s

real    0m3.009s
user    0m1.141s                                      
sys     0m2.457s


With the systemd patch, with the elfutils patch (3 runs, first one excluded to make sure the kernel caches what it caches):
real    0m2.921s                                      
user    0m1.093s
sys     0m2.399s

real    0m2.950s
user    0m1.129s                                      
sys     0m2.401s

real    0m2.933s                                      
user    0m1.136s                                      
sys     0m2.371s


Overall we can see that both fixes independently fix the performance issue (yet the systemd patch seems a bit more effective), so I guess both fixes are welcome.

Mark, do you think it's worth backporting this to CentOS Stream 9/RHEL 9 on the elfutils side? If you need a ticket, we have Red Hat case 03536980 which led to RHBZ 2215412.

Thanks !
Romain


* Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.
  2023-06-22  2:40       ` Frank Ch. Eigler
@ 2023-06-22 10:59         ` Mark Wielaard
  0 siblings, 0 replies; 19+ messages in thread
From: Mark Wielaard @ 2023-06-22 10:59 UTC (permalink / raw)
  To: Frank Ch. Eigler, Romain GEISSLER; +Cc: elfutils-devel, Francois RIGAULT

Hi Frank,

On Wed, 2023-06-21 at 22:40 -0400, Frank Ch. Eigler wrote:
> For an application that processes these elf/dwarf files sequentially,
> queries for each synthetic solib are going to result in 2000 https-404
> transactions, sans debuginfod caching.  If you're lucky (reusing a
> dwfl object), elfutils may be able to reuse a long-lived https
> connection to a server, otherwise a new https connection might have to
> be spun up for each.  But even with reuse, we're talking about 2000
> pingponging messages.  That will take a handful of minutes of elapsed
> time just by itself.
> 
> If the calling code made these queries in parallel batches, it would
> be much faster overall.

I have been thinking about this too. But I don't know of a good solution
that doesn't negate the (iterative) lazy model that Dwfl uses.

libdwfl tries to do the least amount of work possible so that you don't
"pay" for looking up extra information for Dwfl_Modules (libraries) you
don't care about. So for the use case of getting a backtrace of a
particular thread in that core file (Dwfl) you only fetch/load the
register notes and stack memory from the core file. Then, only if you
translate those stack addresses to symbols, only for those modules
associated with those stack addresses it will try to fetch/load the
symbol tables, which might involve resolving build-ids to Elf or
separate Dwarf debug files. This is normally an iterative process and
for something like generating a backtrace it often involves just a
handful of Dwfl_Modules (libraries), not all 2000.
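
In code, the lazy flow looks roughly like this (a sketch, error handling
omitted; thread_cb is the application's per-thread callback, which would
walk frames with dwfl_thread_getframes):

  Dwfl *dwfl = dwfl_begin (&callbacks);
  dwfl_core_file_report (dwfl, core, NULL);  /* cheap: reads the notes */
  dwfl_report_end (dwfl, NULL, NULL);
  dwfl_core_file_attach (dwfl, core);        /* registers + stack memory */
  dwfl_getthreads (dwfl, thread_cb, NULL);   /* unwind per thread */
  /* Only when a frame address is symbolized, e.g. with
     dwfl_module_addrname (), is that module's symbol table fetched.  */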

In this case this falls down a bit since the application creates a Dwfl
from a core file and then requests information (the elf file) from all
Dwfl_Modules, so it can get at the package note description for each of
them. As Romain noted it would be really nice if elfutils/libdwfl had a
way to get at the package note description (just like it has for
getting the build-id, which is another loaded ELF note). So such a
function would know it doesn't need to get the full ELF file.

Maybe another solution might be an "get me everything for this Dwfl,
all symbol tables, all elf files, all separate Dwarf debug files, etc."
function so an application can "pay upfront" for not having to fetch
each item lazily? Such a function could then do a "parallel/batched
fetch" through debuginfod.

Cheers,

Mark


* Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.
  2023-06-22  8:10             ` Romain GEISSLER
@ 2023-06-22 12:03               ` Mark Wielaard
  0 siblings, 0 replies; 19+ messages in thread
From: Mark Wielaard @ 2023-06-22 12:03 UTC (permalink / raw)
  To: Romain GEISSLER; +Cc: elfutils-devel, Francois RIGAULT

Hi Romain,

On Thu, 2023-06-22 at 08:10 +0000, Romain GEISSLER via Elfutils-devel
wrote:

> 
> Overall we can see that both fixes independently fix the performance issue (yet the systemd patch seems a bit more effective), so I guess both fixes are welcome.

Yes, thanks so much for the testing of the individual fixes.

> Mark, do you think it's worth backporting this to CentOS Stream 9/RHEL 9 on the elfutils side? If you need a ticket, we have Red Hat case 03536980 which led to RHBZ 2215412.

Yes, I'll look into it. The systemd fix seems most important since it
also fixes a correctness issue and addresses the root cause of the
performance issue. But I'll see if we can backport the elfutils fix
too. I can add it to Fedora first.

Thanks,

Mark


* Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.
  2023-06-21 16:24       ` Mark Wielaard
  2023-06-21 18:14         ` Romain GEISSLER
@ 2023-06-27 14:09         ` Mark Wielaard
  2023-06-30 16:09           ` Romain GEISSLER
  1 sibling, 1 reply; 19+ messages in thread
From: Mark Wielaard @ 2023-06-27 14:09 UTC (permalink / raw)
  To: Romain GEISSLER; +Cc: elfutils-devel, Francois RIGAULT

Hi,

On Wed, 2023-06-21 at 18:24 +0200, Mark Wielaard wrote:
> libelf: Replace list of elf_getdata_rawchunk results with a
>  tree
> 
> elf_getdata_rawchunks did a linear search to see if a chunk was
> already fetched. Replace this list with a binary search tree to make
> lookup faster when a lot of Elf_Data_Chunk were created.
> 
>        * libelf/libelfP.h (Elf_Data_Chunk): Remove next field.
>        (struct Elf): Change the rawchunks type from Elf_Data_Chunk *
>        to void *.
>        * elf_getdata_rawchunk.c (chunk_compare): New static function.
>        (elf_getdata_rawchunk): Use tsearch instead of a manual linked
>        list.
>        * elf_end.c (free_chunk): New static function.
>        (elf_end): Call tdestroy instead of walking linked list.

I pushed this now. It is also in Fedora Rawhide:
https://bodhi.fedoraproject.org/updates/FEDORA-2023-1026cbdd9a
But there are some gating issues (nothing to do with the new code
though, it seems to be dnf5/annobin/rpminspect issues).

I'll try to also push it to centos stream.

Cheers,

Mark


* Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.
  2023-06-27 14:09         ` Mark Wielaard
@ 2023-06-30 16:09           ` Romain GEISSLER
  2023-06-30 22:35             ` Mark Wielaard
  0 siblings, 1 reply; 19+ messages in thread
From: Romain GEISSLER @ 2023-06-30 16:09 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: elfutils-devel, Francois RIGAULT

> On 27 June 2023 at 16:09, Mark Wielaard <mark@klomp.org> wrote:
>
> Hi,
>
> On Wed, 2023-06-21 at 18:24 +0200, Mark Wielaard wrote:
>
> I pushed this now. It is also in Fedora Rawhide:
> https://bodhi.fedoraproject.org/updates/FEDORA-2023-1026cbdd9a
> But there are some gating issues (nothing to do with the new code
> though, it seems to be dnf5/annobin/rpminspect issues).
>
> I'll try to also push it to centos stream.
>
> Cheers,
>
> Mark

Thanks ;)


* Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.
  2023-06-30 16:09           ` Romain GEISSLER
@ 2023-06-30 22:35             ` Mark Wielaard
  0 siblings, 0 replies; 19+ messages in thread
From: Mark Wielaard @ 2023-06-30 22:35 UTC (permalink / raw)
  To: Romain GEISSLER; +Cc: elfutils-devel, Francois RIGAULT

On Fri, Jun 30, 2023 at 04:09:37PM +0000, Romain GEISSLER wrote:
> > On 27 June 2023 at 16:09, Mark Wielaard <mark@klomp.org> wrote:
> > On Wed, 2023-06-21 at 18:24 +0200, Mark Wielaard wrote:
> >
> > I pushed this now. It is also in Fedora Rawhide:
> > https://bodhi.fedoraproject.org/updates/FEDORA-2023-1026cbdd9a
> > But there are some gating issues (nothing to do with the new code
> > though, it seems to be dnf5/annobin/rpminspect issues).
> >
> > I'll try to also push it to centos stream.
> >
> Thanks ;)

https://kojihub.stream.centos.org/koji/buildinfo?buildID=34064
https://kojihub.stream.centos.org/koji/buildinfo?buildID=34065

Cheers,

Mark

