Hi Romain, On Tue, 2023-06-20 at 22:05 +0000, Romain GEISSLER wrote: > > Our real use case happens on a Openshift 4.13 node, so the OS is Red Hat Core OS 9 (which I assume shares a lot of foundations with RHEL 9). > > On our side Francois also told me this afternoon that he didn’t really reproduce the same thing with my reproducer posted here and the real systemd-coredump issue he witnessed live, and also noticed that with DEBUGINFOD_URLS unset/set to the empty string my reproducer has no problem anymore. What he witnessed on the real case (using perf/gdb) was that apparently lots of time was spend in elf_getdata_rawchunk and often in this kind of stack: > > Samples: 65K of event 'cpu-clock:pppH', Event count (approx.): 16468500000 > Overhead Command Shared Object Symbol >   98.24% (sd-parse-elf) libelf-0.188.so [.] elf_getdata_rawchunk >    0.48% (sd-parse-elf) libelf-0.188.so [.] 0x00000000000048a3 >    0.27% (sd-parse-elf) libelf-0.188.so [.] gelf_getphdr >    0.11% (sd-parse-elf) libc.so.6 [.] _int_malloc >    0.10% (sd-parse-elf) libelf-0.188.so [.] gelf_getnote >    0.06% (sd-parse-elf) libc.so.6 [.] __libc_calloc >    0.05% (sd-parse-elf) [kernel.kallsyms] [k] __softirqentry_text_start >    0.05% (sd-parse-elf) libc.so.6 [.] _int_free > > > (gdb) bt > #0 0x00007f0ba8a88194 in elf_getdata_rawchunk () from target:/lib64/libelf.so.1 > #1 0x00007f0ba98e5013 in module_callback.lto_priv () from target:/usr/lib64/systemd/libsystemd-shared-252.so > #2 0x00007f0ba8ae7291 in dwfl_getmodules () from target:/lib64/libdw.so.1 > #3 0x00007f0ba98e6dc0 in parse_elf_object () from target:/usr/lib64/systemd/libsystemd-shared-252.so > #4 0x0000562c474f2d5e in submit_coredump () > #5 0x0000562c474f57d1 in process_socket.constprop () > #6 0x0000562c474efbf8 in main () > > My reproducer actually doesn’t fully re-implement what systemd implements (the parsing of the package metadata is clearly omitted), so I thought I had reproduced the same problem while apparently I didn’t, sorry for that. We will also have to double check if really just using 2000 dummy libraries is enough or if this also needs to have a more complex binary like we have in our real case. > > Tomorrow on our side we will have to play a bit with a local build of systemd-coredump and try to run it manually to better understand what’s going wrong. > Seeing those performance results I understand why you were suspecting the linked list data structure used in elf_getdata_rawchunk. Would you be able to test a rebuild libelf with the attached patch, which replaces that datastructure with a binary search tree? It didn't really show much speedup locally (in the noise, maybe 0.01 sec faster on ~0.25 sec run). But if there are more than 2000 calls to elf_getdata_rawchunk it should make things faster. Thanks, Mark