From: "mail at roychan dot org"
To: glibc-bugs@sourceware.org
Subject: [Bug malloc/30945] New: Core affinity setting incurs lock contentions between threads
Date: Fri, 06 Oct 2023 00:24:43 +0000

https://sourceware.org/bugzilla/show_bug.cgi?id=30945

            Bug ID: 30945
           Summary: Core affinity setting incurs lock contentions between
                    threads
           Product: glibc
           Version: 2.38
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: malloc
          Assignee: unassigned at sourceware dot org
          Reporter: mail at roychan dot org
  Target Milestone: ---

Created attachment 15156
  --> https://sourceware.org/bugzilla/attachment.cgi?id=15156&action=edit
the example program to reproduce the issue

Hi,

I recently encountered poor malloc/free performance while building a
data-intensive application. The deserialization library we use ran about
10x slower than expected. Investigation showed that this is because the
arena_get2 function uses __get_nprocs_sched instead of __get_nprocs.
Without changing core affinity settings, that call returns the real number
of cores, so the upper limit on the total number of arenas is set
correctly. However, if a thread is pinned to a core, subsequent malloc
calls only see n = 1, because the function counts only the cores the
thread can be scheduled on. As a result, the maximum number of arenas
becomes 8 on 64-bit platforms (a sketch of the limit computation follows
the list below).

This leads to arena lock contention between threads if:

- The program spans multiple cores (say, more than 8 cores).
- Threads are pinned to cores before any malloc calls, so they have not
  attached to any arenas yet.
- Later memory allocations are served from the arenas.
- The MALLOC_ARENA_MAX tunable is not set to manually raise the limit.
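To make the arithmetic concrete, here is a stand-alone illustration of how
the arena limit scales with the CPU count the allocator observes. This is
not glibc's source: __get_nprocs_sched is internal, so sched_getaffinity()
with CPU_COUNT() stands in for what it measures, get_nprocs() stands in for
__get_nprocs(), and the 8-arenas-per-core factor on 64-bit reflects my
reading of malloc/arena.c and may differ between versions.

/* Illustration only: compare the arena limit derived from the total core
   count with the one derived from the schedulable core count.  */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/sysinfo.h>

static long
arenas_from_ncores (long n)
{
  /* Assumed factor: 2 arenas per core on 32-bit, 8 on 64-bit.  */
  return n * (sizeof (long) == 4 ? 2 : 8);
}

int
main (void)
{
  cpu_set_t set;
  CPU_ZERO (&set);

  long total = get_nprocs ();                 /* analogue of __get_nprocs */
  long sched = sched_getaffinity (0, sizeof set, &set) == 0
                 ? CPU_COUNT (&set)           /* analogue of __get_nprocs_sched */
                 : 1;

  printf ("limit from total cores      : %ld\n", arenas_from_ncores (total));
  printf ("limit from schedulable cores: %ld\n", arenas_from_ncores (sched));

  /* Pin the process to CPU 0 and recompute: on 64-bit the second figure
     collapses to 8, which is the contention ceiling described above.  */
  CPU_ZERO (&set);
  CPU_SET (0, &set);
  if (sched_setaffinity (0, sizeof set, &set) == 0
      && sched_getaffinity (0, sizeof set, &set) == 0)
    printf ("after pinning to one core   : %ld\n",
            arenas_from_ncores (CPU_COUNT (&set)));
  return 0;
}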
A mailing list thread briefly discussed this issue last year:
https://sourceware.org/pipermail/libc-alpha/2022-June/140123.html

However, it did not include a program that can be used to easily reproduce
the (un)expected behavior. Here I would like to provide a minimal example
that exposes the problem and, if possible, to initiate further discussion
about whether the core counting in arena_get can be implemented better.

The program accepts three arguments: the number of cores, whether each
thread is pinned to a core right after its creation, and whether to apply
a small "fix". The fix is to add a free(malloc(8)) call right before we
set the affinity in each thread; in that case each thread still sees all
the cores, so it can create and attach to a "local" arena that is not
shared. The output is the average time each thread takes to finish a batch
of malloc/free calls.

The following are the results I collected on my PC with a 16-core Ryzen 9
5950X, running Linux kernel 6.5.5 and glibc 2.38. The program was compiled
with gcc 13.2.1 without optimization flags.

./a.out 32 false false
---
nr_cpu: 32
pin: no
fix: no
thread average (ms): 16.233663

./a.out 32 true false
---
nr_cpu: 32
pin: yes
fix: no
thread average (ms): 1360.919047

./a.out 32 true true
---
nr_cpu: 32
pin: yes
fix: yes
thread average (ms): 15.505453

env GLIBC_TUNABLES='glibc.malloc.arena_max=32' ./a.out 32 true false
---
nr_cpu: 32
pin: yes
fix: no
thread average (ms): 16.036667

I also recorded a few runs with perf. They show massive overhead in
__lll_lock_wait_private and __lll_lock_wake_private calls.
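For readers who cannot open the attachment, the following is a condensed
sketch of the per-thread pattern it exercises. It is not the attached
program: the thread count, allocation size, iteration count, and the
APPLY_FIX switch are placeholders, and timing and error handling are
omitted for brevity. Building it with gcc -pthread and toggling APPLY_FIX
(or setting glibc.malloc.arena_max) should show the same gap as the
numbers above.

/* Sketch: each thread optionally performs free(malloc(8)) to attach to an
   arena while it still sees every core, then pins itself to one core and
   hammers malloc/free.  Without the fix, threads pinned before their first
   allocation share the handful of arenas and contend on their locks.  */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NTHREADS   32
#define ITERATIONS 1000000
#define APPLY_FIX  1        /* 1: touch malloc before pinning; 0: pin first */

static void *
worker (void *arg)
{
  long cpu = (long) arg;

  if (APPLY_FIX)
    free (malloc (8));      /* attach to an arena while all cores are visible */

  cpu_set_t set;
  CPU_ZERO (&set);
  CPU_SET (cpu, &set);
  pthread_setaffinity_np (pthread_self (), sizeof set, &set);

  for (int i = 0; i < ITERATIONS; i++)
    free (malloc (128));    /* the contended path when arenas are shared */

  return NULL;
}

int
main (void)
{
  pthread_t tids[NTHREADS];
  long ncpu = sysconf (_SC_NPROCESSORS_ONLN);

  for (long i = 0; i < NTHREADS; i++)
    pthread_create (&tids[i], NULL, worker, (void *) (i % ncpu));
  for (long i = 0; i < NTHREADS; i++)
    pthread_join (tids[i], NULL);
  return 0;
}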