From: "mail at roychan dot org"
To: glibc-bugs@sourceware.org
Subject: [Bug malloc/30945] New: Core affinity setting incurs lock contentions between threads
Date: Fri, 06 Oct 2023 00:24:43 +0000

https://sourceware.org/bugzilla/show_bug.cgi?id=30945

            Bug ID: 30945
           Summary: Core affinity setting incurs lock contentions between
                    threads
           Product: glibc
           Version: 2.38
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: malloc
          Assignee: unassigned at sourceware dot org
          Reporter: mail at roychan dot org
  Target Milestone: ---

Created attachment 15156
  --> https://sourceware.org/bugzilla/attachment.cgi?id=15156&action=edit
the example program to reproduce the issue

Hi,

I recently encountered poor malloc/free performance while building a
data-intensive application. The deserialization library we use ran about
10x slower than expected. Investigation showed that this is because the
arena_get2 function uses __get_nprocs_sched instead of __get_nprocs.
Without changing core affinity settings, that call returns the real number
of cores, so the upper limit on the total number of arenas is set
correctly. However, if a thread is pinned to a core, subsequent malloc
calls only see n = 1, because the function counts only the cores the
thread can be scheduled on. As a result, the maximum number of arenas
becomes 8 on 64-bit platforms (a sketch of the limit computation follows
the list below).

This leads to arena lock contention between threads if:

- The program spans multiple cores (say, more than 8 cores).
- Threads are pinned to cores before any malloc calls, so they have not
  attached to any arenas yet.
- Later memory allocations are served from the arenas.
- The MALLOC_ARENA_MAX tunable is not set to manually raise the limit.
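To make the arithmetic concrete, here is a stand-alone illustration of how
the arena limit scales with the CPU count the allocator observes. This is
not glibc's source: __get_nprocs_sched is internal, so sched_getaffinity()
with CPU_COUNT() stands in for what it measures, get_nprocs() stands in for
__get_nprocs(), and the 8-arenas-per-core factor on 64-bit reflects my
reading of malloc/arena.c and may differ between versions.

/* Illustration only: compare the arena limit derived from the total core
   count with the one derived from the schedulable core count.  */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/sysinfo.h>

static long
arenas_from_ncores (long n)
{
  /* Assumed factor: 2 arenas per core on 32-bit, 8 on 64-bit.  */
  return n * (sizeof (long) == 4 ? 2 : 8);
}

int
main (void)
{
  cpu_set_t set;
  CPU_ZERO (&set);

  long total = get_nprocs ();                 /* analogue of __get_nprocs */
  long sched = sched_getaffinity (0, sizeof set, &set) == 0
                 ? CPU_COUNT (&set)           /* analogue of __get_nprocs_sched */
                 : 1;

  printf ("limit from total cores      : %ld\n", arenas_from_ncores (total));
  printf ("limit from schedulable cores: %ld\n", arenas_from_ncores (sched));

  /* Pin the process to CPU 0 and recompute: on 64-bit the second figure
     collapses to 8, which is the contention ceiling described above.  */
  CPU_ZERO (&set);
  CPU_SET (0, &set);
  if (sched_setaffinity (0, sizeof set, &set) == 0
      && sched_getaffinity (0, sizeof set, &set) == 0)
    printf ("after pinning to one core   : %ld\n",
            arenas_from_ncores (CPU_COUNT (&set)));
  return 0;
}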
A mailing list thread briefly discussed this issue last year:
https://sourceware.org/pipermail/libc-alpha/2022-June/140123.html

However, it did not include a program that can be used to easily reproduce
the (un)expected behavior. Here I would like to provide a minimal example
that exposes the problem and, if possible, to initiate further discussion
about whether the core counting in arena_get can be implemented better.

The program accepts three arguments: the number of cores, whether each
thread is pinned to a core right after its creation, and whether to apply
a small "fix". The fix is to add a free(malloc(8)) call right before we
set the affinity in each thread; in that case each thread still sees all
the cores, so it can create and attach to a "local" arena that is not
shared. The output is the average time each thread takes to finish a batch
of malloc/free calls.

The following are the results I collected on my PC with a 16-core Ryzen 9
5950X, running Linux kernel 6.5.5 and glibc 2.38. The program was compiled
with gcc 13.2.1 without optimization flags.

./a.out 32 false false
---
nr_cpu: 32
pin: no
fix: no
thread average (ms): 16.233663

./a.out 32 true false
---
nr_cpu: 32
pin: yes
fix: no
thread average (ms): 1360.919047

./a.out 32 true true
---
nr_cpu: 32
pin: yes
fix: yes
thread average (ms): 15.505453

env GLIBC_TUNABLES='glibc.malloc.arena_max=32' ./a.out 32 true false
---
nr_cpu: 32
pin: yes
fix: no
thread average (ms): 16.036667

I also recorded a few runs with perf. They show massive overhead in
__lll_lock_wait_private and __lll_lock_wake_private calls.
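For readers who cannot open the attachment, the following is a condensed
sketch of the per-thread pattern it exercises. It is not the attached
program: the thread count, allocation size, iteration count, and the
APPLY_FIX switch are placeholders, and timing and error handling are
omitted for brevity. Building it with gcc -pthread and toggling APPLY_FIX
(or setting glibc.malloc.arena_max) should show the same gap as the
numbers above.

/* Sketch: each thread optionally performs free(malloc(8)) to attach to an
   arena while it still sees every core, then pins itself to one core and
   hammers malloc/free.  Without the fix, threads pinned before their first
   allocation share the handful of arenas and contend on their locks.  */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NTHREADS   32
#define ITERATIONS 1000000
#define APPLY_FIX  1        /* 1: touch malloc before pinning; 0: pin first */

static void *
worker (void *arg)
{
  long cpu = (long) arg;

  if (APPLY_FIX)
    free (malloc (8));      /* attach to an arena while all cores are visible */

  cpu_set_t set;
  CPU_ZERO (&set);
  CPU_SET (cpu, &set);
  pthread_setaffinity_np (pthread_self (), sizeof set, &set);

  for (int i = 0; i < ITERATIONS; i++)
    free (malloc (128));    /* the contended path when arenas are shared */

  return NULL;
}

int
main (void)
{
  pthread_t tids[NTHREADS];
  long ncpu = sysconf (_SC_NPROCESSORS_ONLN);

  for (long i = 0; i < NTHREADS; i++)
    pthread_create (&tids[i], NULL, worker, (void *) (i % ncpu));
  for (long i = 0; i < NTHREADS; i++)
    pthread_join (tids[i], NULL);
  return 0;
}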