From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from EUR05-VI1-obe.outbound.protection.outlook.com (mail-vi1eur05on2071.outbound.protection.outlook.com [40.107.21.71]) by sourceware.org (Postfix) with ESMTPS id 9A6FC3858D1E for ; Wed, 14 Jun 2023 13:39:32 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 9A6FC3858D1E Authentication-Results: sourceware.org; dmarc=pass (p=reject dis=none) header.from=amadeus.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=amadeus.com ARC-Seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=bzuixxQ9bRZ3lOt9oeUPtpqrMHIdCTd1rIpyJHEWCqcmIvasNivixoXF4T0Shek75EVnXWSYxMeDpGw3OAereOUUWDQSZreayjcAxO/FipNI2R83kRFlforZ/bGIIjNU3VM5Z02U/QPJsBeFPdUd/Ju3gb8Y0cKhm++sSxVaorXEJzOwAEBHhNxbP/zWaYktVGucovBi4t/noNJtyhbGBcM8cKv4iW4LXhwmvK5Md+8bD7oPWU17DPEfpC6qIgY/vIzYl1ZVCfFdw9QvUI6qFFNEopJ7mOaoH/V8CFU92hUUGcDBDFyA2hBRxO5OOHBAY8R2aPnfcs1T2OA/jBf63g== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=Eu3G+KmK4CQDl71LvNxTtaHrnOOT0KF5JpbOET5mhnU=; b=IduA107X/qCKn47ovjV25SqhO7lJvSWt6SYu/KH+mjwnyJ+s3US3+gtywboBwwPpYFZAL8To0bbl0bndzc+kkbX+9i8J0RsK/u5+TpRygK0kV4aFuBYNU8qmWqh/ey+q1QaDVc7WE0SJFbO+mFFNtZ1soJfNSxRFNjW60SmR4gmtU4UkypHgEn9HFM+VBJwB2Zc85ALaphLgcwcq9mU8tD9P1yCglBkx+R0Eu/hOruKF958ElDwJHZUkvfADBlZnDnaruDN4om1DJ3gyygAxZxgGXNjxRMq4vQgIXFDclIspCKbFVsDpJmBqNc/wlvql8y1OUYwR7cQrNaLBqDkDFw== ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=pass (sender ip is 171.17.131.35) smtp.rcpttodomain=sourceware.org smtp.mailfrom=amadeus.com; dmarc=pass (p=reject sp=reject pct=100) action=none header.from=amadeus.com; dkim=none (message not signed); arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amadeus.com; s=selector1; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=Eu3G+KmK4CQDl71LvNxTtaHrnOOT0KF5JpbOET5mhnU=; b=FeRPGO9+UXDwHgU6pwpDxyfLtCW+GfmHFSlKDztG3VEtIEBICV4RPO7hPxHE5e4QRBIxFwTUcQ8LHU8p/yLyw+sN0FGF/xtwMqhDDGqyyD3VTOr+qwaQ/EV0klBBBK2GmfSnhQHVeimRXlLVB0hHOpcmbuLpp6qRSk6jm65NvAjX8OEvv21YBuz4lWHgUwujHsySw0/8peMz1Ywp+Nt1h79I6d/Ujwn+ZvqY40U2lujFp9HizsZKZLwUADXbJl/Iz6ZH5smQUarKtgdMCgZeLxJS44gRPghn8aRlm7BNLPgC3xbkXssHTuuHEKuUwrS/yxijOcwhyIfvS5Tc/Jw/bA== Received: from GV3P280CA0117.SWEP280.PROD.OUTLOOK.COM (2603:10a6:150:8::33) by AS8PR10MB6339.EURPRD10.PROD.OUTLOOK.COM (2603:10a6:20b:53d::5) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.6455.39; Wed, 14 Jun 2023 13:39:29 +0000 Received: from HE1EUR01FT007.eop-EUR01.prod.protection.outlook.com (2603:10a6:150:8:cafe::ba) by GV3P280CA0117.outlook.office365.com (2603:10a6:150:8::33) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.6477.37 via Frontend Transport; Wed, 14 Jun 2023 13:39:29 +0000 X-MS-Exchange-Authentication-Results: spf=pass (sender IP is 171.17.131.35) smtp.mailfrom=amadeus.com; dkim=none (message not signed) header.d=none;dmarc=pass action=none header.from=amadeus.com; Received-SPF: Pass (protection.outlook.com: domain of amadeus.com designates 171.17.131.35 as permitted sender) receiver=protection.outlook.com; client-ip=171.17.131.35; helo=smtpexch.amadeus.com; pr=C Received: from smtpexch.amadeus.com (171.17.131.35) by HE1EUR01FT007.mail.protection.outlook.com (10.152.1.243) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.6500.25 via Frontend Transport; Wed, 14 Jun 2023 13:39:29 +0000 Received: from MUCEX20MBX001.iis.amadeus.net (172.19.131.74) by smtpexch.amadeus.com (172.19.134.54) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.26; Wed, 14 Jun 2023 13:38:32 +0000 Received: from 3b4057da50a8.rnd.amadeus.net (10.64.176.26) by MUCEX20MBX001.iis.amadeus.net (172.19.131.74) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.26; Wed, 14 Jun 2023 13:39:28 +0000 From: Romain Geissler To: CC: Romain Geissler Subject: Performance issue with systemd-coredump and container process linking 2000 shared libraries. Date: Wed, 14 Jun 2023 13:39:20 +0000 Message-ID: <20230614133920.10-1-romain.geissler@amadeus.com> X-Mailer: git-send-email 2.39.3 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain X-ClientProxiedBy: MUCEXHYBP01.iis.amadeus.net (172.19.131.107) To MUCEX20MBX001.iis.amadeus.net (172.19.131.74) X-EOPAttributedMessage: 0 X-MS-PublicTrafficType: Email X-MS-TrafficTypeDiagnostic: HE1EUR01FT007:EE_|AS8PR10MB6339:EE_ X-MS-Office365-Filtering-Correlation-Id: 6c60fdd7-3bb7-47c4-369b-08db6cdcc713 X-MS-Exchange-SenderADCheck: 1 X-MS-Exchange-AntiSpam-Relay: 0 X-Microsoft-Antispam: BCL:0; X-Microsoft-Antispam-Message-Info: fCnPAqiJq7I/+TFWwfmm2k9fRfx/fIR17wiQxtePlilgUzzayyaN9aC61szucumoR0GwZZ8+dGmLX0H0KPv8IhmS+MhPYjE9lOA6dyqFgjMfJ2IO0MZaDddbjiAxUsxW23vulpTT7xdUUsy/MIqQQgccalbKFwtYyn56raOGoU8j4u01x3u8d0TaDEqDm8ca71gVwLt6Mu3zSUoBwN4ZfQs4Pruik4KUcThVbZA0VO3CLV+X7m5vxBUk+0gNf1q01mFM9iPLUTrxQtFTguf8WLxrEPQpIWjM+zpMhc7otJYNRAvF7yxzeF0f+bcUvtKoM3xyhwQ9WcR/feTJ438JeOFgpiZFK2AIk2V5UD52NHdtgzmWPy+JhL3ypjTei4TSC1dmnXCX3rRCa4ifhxva2EmjNFPPfYWcdz+Z8P7J91p/Yp9zRyuR6FBVv+fdlTqCypnZEkno1omESySLRatmECizOb03EkEC+QFOCuW3xduuKgxG2CaGLERchtQSxnEM8CZyV1Trxe/TLQe/yTcLcjXSRRaVOWcMNAf7PV9mEOfuuLku2pBSYbsQQ4Rq6zoBvIvPV/IycIn4bSEdUt3doybwJmiJ14lOqofIwTofvSCYmygWP/pe7PHYsJZaMaT5SgzNfwejSehTPDpAOk1bxekR4zQAcNCwCLZK2fHxM9eV3TKoR/ZlrwDuq3976NhI+VU4y4iNxaLn01lgQawXVK46/dqDmqsK3xmfg3vl5uGyJpfrtrPpkvaF+dXm3/tLTWrhl7tHy3B3UDcdDsfhqQ== X-Forefront-Antispam-Report: CIP:171.17.131.35;CTRY:DE;LANG:en;SCL:1;SRV:;IPV:CAL;SFV:NSPM;H:smtpexch.amadeus.com;PTR:InfoDomainNonexistent;CAT:NONE;SFS:(13230028)(4636009)(39860400002)(136003)(346002)(396003)(376002)(451199021)(46966006)(36840700001)(40470700004)(36756003)(4326008)(70206006)(478600001)(86362001)(70586007)(966005)(6916009)(6666004)(107886003)(316002)(44832011)(30864003)(40480700001)(41300700001)(2906002)(8936002)(5660300002)(8676002)(81166007)(356005)(82960400001)(2616005)(82740400003)(186003)(1076003)(26005)(83380400001)(36860700001)(336012)(47076005)(82310400005)(40460700003)(36900700001);DIR:OUT;SFP:1101; X-OriginatorOrg: amadeus.com X-MS-Exchange-CrossTenant-OriginalArrivalTime: 14 Jun 2023 13:39:29.0337 (UTC) X-MS-Exchange-CrossTenant-Network-Message-Id: 6c60fdd7-3bb7-47c4-369b-08db6cdcc713 X-MS-Exchange-CrossTenant-Id: b3f4f7c2-72ce-4192-aba4-d6c7719b5766 X-MS-Exchange-CrossTenant-OriginalAttributedTenantConnectingIp: TenantId=b3f4f7c2-72ce-4192-aba4-d6c7719b5766;Ip=[171.17.131.35];Helo=[smtpexch.amadeus.com] X-MS-Exchange-CrossTenant-AuthSource: HE1EUR01FT007.eop-EUR01.prod.protection.outlook.com X-MS-Exchange-CrossTenant-AuthAs: Anonymous X-MS-Exchange-CrossTenant-FromEntityHeader: HybridOnPrem X-MS-Exchange-Transport-CrossTenantHeadersStamped: AS8PR10MB6339 X-Spam-Status: No, score=-10.7 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FORGED_SPF_HELO,GIT_PATCH_0,RCVD_IN_DNSWL_NONE,RCVD_IN_MSPIKE_H2,SPF_HELO_PASS,SPF_NONE,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: Hi, I am reaching out the elfutils mailing to discuss one issue we have seen in one of our OpenShift production workload. Note that if you are a Red Hat employee, this actually corresponds to Red Hat support case 03536980 which I have yet to update with this informations. Description of the issue: While moving from OpenShift 4.12 to 4.13, we also move the system from a RHEL 8 base to a RHEL 9 one. Among the changes, it upgrades systemd to version 252. Historically systemd-coredump use to mind only about buildids when analysing core dumps, however now it also tries to parse the package metadata note section (cf https://systemd.io/ELF_PACKAGE_METADATA/). We use sytemd-coredump as our default core handler, and being an OpenShift node we run there mainly containers. As part of our container, we have a particular binary which links against close to 2000 shared libraries. And when this program crashes, it takes several minutes for systemd-coredump to "analyze" the core, which is spent in parsing the elf part of it. I have been able to extract from the systemd code (cf https://github.com/systemd/systemd/blob/22afd5cb2cf06418e5a84907b1ff782d29a33e47/src/shared/elf-util.c#L482 ) the logic of systemd into a simplified C program that links against systemd to show how "slow" is systemd. I will attach this program a bit after, and also explain a bit my findings and what "fix" I suggested internally we propose to systemd folks. First to reproduce, we can create a dummy binary that links against 2000 lib. Credits for finding and writing this simple reproducer goes to Francois Rigault, one of my colleague: cat > lib.c << EOF int TOTO = 42; EOF cat > generate-core.c << EOF #include int main() { printf("started\n"); int* i = 0; *i = 42; } EOF N_LIBS=2000 LIBS="" for i in {1..${N_LIBS}}; do [[ -f liba$i.so ]] || gcc -shared -fPIC -o liba$i.so lib.c -D TOTO=var$i done echo libs built gcc -o generate-core generate-core.c -L. -Wl,-rpath,$(pwd) -la{1..${N_LIBS}} echo running binary time ./generate-core # Here a core is generated, rename it to "the-core" in the current directory. # Finally, remove all libaN.so files, to simulate the fact that systemd-coredump # runs in a different mount filesystem as the container process which crashed # (otherwise the following reproducer might be "fast"): rm liba*.so Once a core dump has been generated, I tried to mimic the code of systemd-coredump. It looks like this: cat > mimic-systemd-coredump.c << EOF #include "elfutils/libdwfl.h" #include #include static const Dwfl_Callbacks callbacks = { .find_elf = dwfl_build_id_find_elf, .section_address = dwfl_offline_section_address, .find_debuginfo = dwfl_standard_find_debuginfo, }; void print_elf_error() { printf("Elf error: %s\n", elf_errmsg(elf_errno())); } void print_dwfl_error() { printf("Dwfl error: %s\n", dwfl_errmsg(dwfl_errno())); } // In Systemd, this tries to retrieve the Json package metadata note. // returns a value > 0 when data is found // returns 0 when no data is found, but no error happened // returns a value < 0 when a strong error happen and we have to stop. int parse_module_package_metadata_best_effort(Elf* elf, const char* name) { size_t n_program_headers; size_t r = elf_getphdrnum(elf, &n_program_headers); if (r == 0) { printf("Found %lli program header for module %s\n", n_program_headers, name); // Don't implement the full systemd logic here. return 1; } else { // Best effort parsing: silently ignore the error. return 0; } } static int module_callback(Dwfl_Module *mod __attribute__ ((unused)), void **userdata __attribute__ ((unused)), const char *name __attribute__ ((unused)), Dwarf_Addr start, void *arg) { Elf* elf = NULL; GElf_Addr bias; // Comment the next line to "improve" performances. elf = dwfl_module_getelf(mod, &bias); if (elf) { printf("dwfl_module_getelf has found the elf program %s\n", name); int ret = parse_module_package_metadata_best_effort(elf, name); if (ret > 0) return DWARF_CB_OK; else if (ret < 0) return DWARF_CB_ABORT; else elf = arg; } else { printf("dwfl_module_getelf returned NULL for module %s\n", name); //print_dwfl_error(); elf = arg; } size_t n_program_headers; assert(elf_getphdrnum(elf, &n_program_headers) >= 0); printf("iterate: %s: %lli\n", name, n_program_headers); for (size_t i = 0; i < n_program_headers; ++i) { GElf_Phdr mem, *program_header; Elf_Data *data; /* The core file stores the ELF files in the PT_LOAD segment. */ program_header = gelf_getphdr(elf, i, &mem); if (!program_header || program_header->p_type != PT_LOAD) continue; /* This PT_LOAD section doesn't contain the start address, so it can't be the module we are looking for. */ if (start < program_header->p_vaddr || start >= program_header->p_vaddr + program_header->p_memsz) { //printf("skip !!!\n"); continue; } /* Now get a usable Elf reference, and parse the notes from it. */ data = elf_getdata_rawchunk(elf, program_header->p_offset, program_header->p_filesz, ELF_T_NHDR); if (!data) continue; Elf *memelf = elf_memory(data->d_buf, data->d_size); if (!memelf) { elf_end(memelf); continue; } int ret = parse_module_package_metadata_best_effort(memelf, name); elf_end(memelf); if (ret > 0) return DWARF_CB_OK; else if (ret < 0) return DWARF_CB_ABORT; } return DWARF_CB_OK; } int main() { elf_version(EV_CURRENT); int fd = open ("the-core", O_RDONLY); assert(fd != 0); Elf *elf = elf_begin(fd, ELF_C_READ_MMAP, NULL); assert(elf); int ret; GElf_Ehdr elf_header; assert(gelf_getehdr(elf, &elf_header) != NULL); printf("elf_header.e_type = %i\n", elf_header.e_type); assert(elf_header.e_type == ET_CORE); Dwfl *dwfl = dwfl_begin(&callbacks); ret = dwfl_core_file_report(dwfl, elf, NULL); printf("dwfl_core_file_report: %i\n", ret); assert(ret >= 0); ret = dwfl_report_end(dwfl, NULL, NULL); assert(ret == 0); ptrdiff_t offset = dwfl_getmodules (dwfl, &module_callback, elf, 0); if (offset < 0) assert(false); dwfl_end (dwfl); return 0; } EOF And it can be compiled with: gcc -O2 -o mimic-systemd-coredump mimic-systemd-coredump.c -lelf -ldw When running ./mimic-systemd-coredump we can see that the call to dwfl_module_getelf always return NULL for libaN.so files, so we fallback on the systemd logic to try to parse the package metadata section from the core directly. It takes minutes to run the full program. If however I comment the call to dwfl_module_getelf, then it runs in less than a second. My understanding of the will of systemd developers is that they hoped that libdwfl would return some "partial" Elf* reference when calling dwfl_module_getelf, from the elf headers found in the core for each and every shared library (the first page of the PT_LOAD mappings that the kernel always dumps even when the mapping is file backed). However it seems that behind the hood it doesn't (is it linked to core_file_read_eagerly which seems to always return false in this case ?), and instead it uses the .find_elf = dwfl_build_id_find_elf callback which tries to find the file by buildid on the filesystem. For some unknown reason to me, calling dwfl_module_getelf is very slow (I wouldn't expect that looking on the filesytem by buildid is that slow actually). So, is this behavior of dwfl_module_getelf expected ? If yes, do you agree that we shall advise systemd-coredump developer to invert their logic, to first try to look for partial elf header from the core's PT_LOAD section, then only fallback to the more reliable dwfl_module_getelf if it didn't work ? In practice, we have tried the following patch applied to systemd v253 and it seems ot "fix" the above mentionned case: commit 9463e1f22cbaaf68de9b547da243e70f1b2f622a Author: Romain Geissler Date: Wed Jun 14 11:59:31 2023 +0000 First try to fix core dump issue on v253. diff --git a/src/shared/elf-util.c b/src/shared/elf-util.c index 8da16f528f..fadb3e81a3 100644 --- a/src/shared/elf-util.c +++ b/src/shared/elf-util.c @@ -503,26 +503,13 @@ static int module_callback(Dwfl_Module *mod, void **userdata, const char *name, if (r < 0) return DWARF_CB_ABORT; - /* The .note.package metadata is more difficult. From the module, we need to get a reference - * to the ELF object first. We might be lucky and just get it from elfutils. */ - elf = sym_dwfl_module_getelf(mod, &bias); - if (elf) { - r = parse_package_metadata(name, id_json, elf, NULL, c); - if (r < 0) - return DWARF_CB_ABORT; - if (r > 0) - return DWARF_CB_OK; - } else - elf = c->elf; - - /* We did not get the ELF object, or it's just a reference to the core. That is likely - * because we didn't get direct access to the executable, and the version of elfutils does - * not yet support parsing it out of the core file directly. - * So fallback to manual extraction - get the PT_LOAD section from the core, + /* First try with the (maybe partial) ELF object we can get from the core itself. Indeed + * the core might run in a different namespace in which case the dwfl_module_getelf + * function to try to open the file by build id in the current mount namespace. + * So first use manual extraction - get the PT_LOAD section from the core, * and if it's the right one we can interpret it as an Elf object, and parse * its notes manually. */ - - r = sym_elf_getphdrnum(elf, &n_program_headers); + r = sym_elf_getphdrnum(c->elf, &n_program_headers); if (r < 0) { log_warning("Could not parse number of program headers from core file: %s", sym_elf_errmsg(-1)); /* -1 retrieves the most recent error */ @@ -536,12 +523,16 @@ static int module_callback(Dwfl_Module *mod, void **userdata, const char *name, Elf_Data *data; /* The core file stores the ELF files in the PT_LOAD segment. */ - program_header = sym_gelf_getphdr(elf, i, &mem); + program_header = sym_gelf_getphdr(c->elf, i, &mem); if (!program_header || program_header->p_type != PT_LOAD) continue; + /* This PT_LOAD section doesn't contain the start address, so it can't be the module we are looking for. */ + if (start < program_header->p_vaddr || start >= program_header->p_vaddr + program_header->p_memsz) + continue; + /* Now get a usable Elf reference, and parse the notes from it. */ - data = sym_elf_getdata_rawchunk(elf, + data = sym_elf_getdata_rawchunk(c->elf, program_header->p_offset, program_header->p_filesz, ELF_T_NHDR); @@ -555,7 +546,18 @@ static int module_callback(Dwfl_Module *mod, void **userdata, const char *name, if (r < 0) return DWARF_CB_ABORT; if (r > 0) - break; + return DWARF_CB_OK; + } + + /* Retrieving package_metadata only from the core fiel failed. Try to to get a reference + * to the ELF object instead. We might be lucky and just get it from elfutils or the filesystem. */ + elf = sym_dwfl_module_getelf(mod, &bias); + if (elf) { + r = parse_package_metadata(name, id_json, elf, NULL, c); + if (r < 0) + return DWARF_CB_ABORT; + if (r > 0) + return DWARF_CB_OK; } return DWARF_CB_OK; Some other side question: on the long run, wouldn't it make sense that elfutils tries to parse the json package metadata section by itself, just like it does for the buildid, rather than implementing this logic in systemd ? Thanks for reading ;) Cheers, Romain