From: "vries at gcc dot gnu.org"
To: gdb-prs@sourceware.org
Subject: [Bug gdb/31832] [gdb] FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 3: attach (timeout)
Date: Fri, 21 Jun 2024 10:18:40 +0000

https://sourceware.org/bugzilla/show_bug.cgi?id=31832

--- Comment #8 from Tom de Vries ---
(In reply to Thiago Jung Bauermann from comment #7)
> My machine has many cores but it's an older CPU model (Neoverse N1). These
> numbers show that the POWER10 system has a much higher capacity to churn
> out new threads than my system (no surprise there). My understanding is
> that GDB is overwhelmed by the constant stream of newly spawned threads
> and takes a while to attach to all of them.
>

Agreed.

[ FYI, one particular thing about my setup is that I build at -O0. I just
tried at -O2, but I still run into this problem. ]

> As Pedro mentioned elsewhere¹, Linux doesn't provide a way for GDB to stop
> all of a process' threads, or cause new ones to spawn in a
> "ptrace-stopped" state. Without such a mechanism, the only way I can see
> of addressing this problem is by making GDB parallelize the job of
> attaching to all inferior threads using its worker threads -- i.e., fight
> fire with fire. :)
>
> That wouldn't be a trivial change though. IIUC it would mean that
> different inferior threads would have different tracers (the various GDB
> worker threads), and GDB would need to take care to use the correct worker
> thread to send ptrace commands to each inferior thread.
>

One approach could be to have the gdb main thread do only the ptrace bit and
offload the rest of the loop to another thread. But I'm not sure if that
actually addresses the bottleneck.

> Another approach would be to see if there's a way to make
> attach_proc_task_lwp_callback () faster, but from reading the code it
> doesn't look like there's anything too slow there -- except perhaps the
> call to linux_proc_pid_is_gone (), which reads /proc/$LWP/status. Though
> even that would be just mitigation since the fundamental limitation would
> still be there.
>

I've played around a bit with this for half a day or so, but didn't get
anywhere.
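
To make the attach/re-scan cycle concrete, below is a minimal standalone
sketch in C of the kind of loop the linux-nat layer has to perform when
attaching: scan /proc/PID/task, PTRACE_ATTACH every LWP not seen before, and
re-scan until a full pass adds nothing. This is not GDB's actual code -- the
already_seen () helper and the fixed-size table are made up for the sketch,
and a real tracer would also have to waitpid () for each attach stop -- but
it shows why a program that spawns short-lived threads fast enough can keep
the loop from converging within the test's timeout.

/* Sketch only, not GDB's implementation: attach to every LWP of a
   running process by rescanning /proc/PID/task until a full pass
   finds no new LWPs.  Error handling and the post-attach waitpid ()
   are omitted.  */

#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/types.h>

#define MAX_LWPS 65536

static pid_t seen[MAX_LWPS];
static size_t n_seen;

/* Hypothetical helper: have we already attached to LWP?  */
static int
already_seen (pid_t lwp)
{
  for (size_t i = 0; i < n_seen; i++)
    if (seen[i] == lwp)
      return 1;
  return 0;
}

int
main (int argc, char **argv)
{
  if (argc < 2)
    return 2;

  pid_t pid = atoi (argv[1]);
  char path[64];
  snprintf (path, sizeof (path), "/proc/%d/task", (int) pid);

  int new_lwps;
  do
    {
      new_lwps = 0;
      DIR *dir = opendir (path);
      if (dir == NULL)
        return 1;               /* Process is gone.  */

      struct dirent *de;
      while ((de = readdir (dir)) != NULL)
        {
          if (de->d_name[0] == '.')
            continue;
          pid_t lwp = atoi (de->d_name);
          if (already_seen (lwp))
            continue;

          /* The LWP may already have exited; a failing attach is
             expected here and simply skipped.  */
          if (n_seen < MAX_LWPS
              && ptrace (PTRACE_ATTACH, lwp, NULL, NULL) == 0)
            {
              seen[n_seen++] = lwp;
              new_lwps++;
              printf ("attached to LWP %d\n", (int) lwp);
            }
        }
      closedir (dir);
    }
  while (new_lwps != 0);        /* Rescan until a pass adds nothing.  */

  return 0;
}

The "while (new_lwps != 0)" condition is the part that never becomes false
quickly enough on a machine that can churn out new threads faster than the
tracer can attach to them.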
> Alternatively, (considering that the testcase is contrived) can the
> testcase increase the timeout proportionally to the number of CPUs on the
> system?
>

Or conversely, put a proportional limit on the number of threads in the
test-case.

> > dir_entries: 594518
> > no_lwp: 4412
> > lookup: 119037
> > skipped: 118355
> > insert: 682
> > attach: 471751
> > start_over: 2091
> > Cannot attach to lwp 2340832: Operation not permitted (1)
> > ...
> >
> > I'm not sure what this means, but I do notice the big difference between
> > dir_entries and lookup. So only 20% of the time we find the starttime and
> > can use the cache.
>
> I thought that not being able to read starttime from /proc meant that the
> thread was gone. But from the statistics I pasted above, about 34% of the
> time GDB didn't find the starttime and still was able to attach to all but
> one of the new threads. My understanding is that there's a race condition
> between GDB and the Linux kernel when reading the stat file for a newly
> created thread.
>
> This is harmless though: if starttime can't be obtained, GDB will try to
> attach to the thread anyway.
>

Agreed, it's harmless (though slow of course).
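
For reference, here is a sketch of what reading that starttime amounts to: it
is field 22 of /proc/<pid>/task/<lwp>/stat, after the comm field in
parentheses. read_starttime () below is a hypothetical helper written for
illustration, not the code in linux-nat.c; the point is the failure path --
for a thread created a moment ago the open or parse can fail, and the caller
then treats starttime as unknown and attaches anyway, which is the harmless
case above.

/* Sketch only: return 1 and set *STARTTIME on success, 0 if the field
   could not be read (thread already gone, or we lost the race with the
   kernel for a just-created thread).  */

#include <stdio.h>
#include <string.h>

static int
read_starttime (int pid, int lwp, unsigned long long *starttime)
{
  char path[128];
  snprintf (path, sizeof (path), "/proc/%d/task/%d/stat", pid, lwp);

  FILE *f = fopen (path, "r");
  if (f == NULL)
    return 0;

  char buf[1024];
  size_t len = fread (buf, 1, sizeof (buf) - 1, f);
  fclose (f);
  if (len == 0)
    return 0;
  buf[len] = '\0';

  /* The comm field (field 2) is in parentheses and may contain spaces,
     so scan from the last ')'.  starttime is the 20th field after it.  */
  char *p = strrchr (buf, ')');
  if (p == NULL)
    return 0;
  p++;

  for (int field = 3; field < 22; field++)
    {
      p = strchr (p + 1, ' ');
      if (p == NULL)
        return 0;
    }

  return sscanf (p, "%llu", starttime) == 1;
}

A caller matching the behaviour described above would treat a 0 return as
"starttime unknown" and go ahead with the attach anyway, rather than
concluding that the thread is gone.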