From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <sourceware-bugzilla@sourceware.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 89FC73858D37; Tue, 19 Mar 2024 19:10:17 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 89FC73858D37
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org;
	s=default; t=1710875417;
	bh=LBZ9Vxt9eqiL27HSFxxiWF9x6MuTqlHigisIq2aaA4I=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=EEFPHZPI9rmYXh/tYPdzyl/dofDrVWr6ogkOqwKolNo1Nxcxfq6WQBkwlojfqU39y
	 7QQGJWyds9qNKzwGSTiRIlzvEcpGUnyS0GtwOTLcc0A6W0HpHGzjzA6dMqHHA+3HPI
	 5z1BUCB52Yv5rz6qeIOzXXlT0zIT5hsye9CcHWB8=
From: "thiago.bauermann at linaro dot org"
 <sourceware-bugzilla@sourceware.org>
To: gdb-prs@sourceware.org
Subject: [Bug testsuite/31312] attach-many-short-lived-threads gives
 inconsistent results
Date: Tue, 19 Mar 2024 19:10:16 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gdb
X-Bugzilla-Component: testsuite
X-Bugzilla-Version: HEAD
X-Bugzilla-Keywords: 
X-Bugzilla-Severity: normal
X-Bugzilla-Who: thiago.bauermann at linaro dot org
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: cel at linux dot ibm.com
X-Bugzilla-Target-Milestone: 15.1
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-31312-4717-FDwHluC5iL@http.sourceware.org/bugzilla/>
In-Reply-To: <bug-31312-4717@http.sourceware.org/bugzilla/>
References: <bug-31312-4717@http.sourceware.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://sourceware.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gdb-prs.sourceware.org>

https://sourceware.org/bugzilla/show_bug.cgi?id=31312

--- Comment #23 from Thiago Jung Bauermann <thiago.bauermann at linaro dot org> ---
(In reply to Thiago Jung Bauermann from comment #19)
> 1. The issue I mentioned in comment #17 (which I have since confirmed is
>    what is going on), where the linux_proc_attach_tgid_threads () never
>    ends when there are zombie threads present in the inferior. Since
>    attach-many-short-lived-threads.c constantly creates and finishes
>    joinable threads, the chance of having zombie threads is high.
>
>    From looking at the gdb.log files Carl provided, I believe he is
>    seeing the same problem.
>
>    The solution is to make GDB remember when it has already visited the
>    /proc directory of a given LWP, and skip it in the following iterations.
>    I implemented the attached patch to do that, and now I don't observe GDB
>    hanging anymore in the aarch64-linux server in which I used to easily
>    reproduce this problem. If Carl could test it on POWER10, it would be
>    helpful. I'll clean up the code and post it on the mailing list.

Looking at the Power 10 gdb.log files attached to this bugzilla, and also
at Carl's results with my proposed fix, I believe this bugzilla is
specifically about the issue described above.
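
To illustrate the idea behind the fix, here is a minimal sketch in Python.
It is not the attached patch. It assumes the hang comes from zombie
entries looking "new" on every scan of the task list, and every name in it
(attach_all_threads, list_lwps, try_attach, quiet_scans_needed) is
hypothetical:

```python
from typing import Callable, Iterable, Set, Tuple


def attach_all_threads(
    list_lwps: Callable[[], Iterable[int]],
    try_attach: Callable[[int], bool],
    quiet_scans_needed: int = 2,
    max_scans: int = 1000,
) -> Tuple[Set[int], int]:
    """Rescan the thread list until several consecutive scans find nothing new.

    list_lwps stands in for reading the /proc task directory and try_attach
    for the attach attempt; both are hypothetical placeholders.
    """
    visited: Set[int] = set()   # every LWP already looked at, attachable or not
    attached: Set[int] = set()
    quiet = 0
    scans = 0
    while quiet < quiet_scans_needed and scans < max_scans:
        scans += 1
        found_new = False
        for lwp in list_lwps():
            if lwp in visited:
                # Already handled in an earlier scan (for example a zombie
                # that cannot be attached to), so it must not count as new.
                continue
            visited.add(lwp)
            found_new = True
            if try_attach(lwp):
                attached.add(lwp)
        quiet = 0 if found_new else quiet + 1
    return attached, scans


# LWP 101 plays the role of a zombie: it stays listed but never attaches.
attached, scans = attach_all_threads(
    lambda: [100, 101, 102], lambda lwp: lwp != 101
)
print(sorted(attached), scans)
```

Without the visited set, the zombie would be retried on every scan and the
loop would never see a quiet one. A real implementation would probably also
need to cope with LWP ids being reused, which this sketch ignores.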

> 2. Behaviour 2 which I described in comment #12. I'll repeat it here for

Sorry, I referenced the wrong comment. It's actually comment #16.

>    completeness:
>
>    (gdb) attach 2039552
>    Attaching to process 2039552
>    Cannot attach to lwp 2689792: Operation not permitted (1), process
>    2689792 is already traced by process 2039527
>
>    PID 2039552 is the testcase inferior, and 2039527 is GDB. GDB didn't
>    report any success in attaching to the process.
>
>    This is very rarely observed on my test system. I saw it only 3 times in
>    thousands of testcase runs. I wasn't able to investigate it yet.
>
>    I'll open a separate bugzilla about this.

I didn't find any existing bugzilla about this problem, so I opened
bug #31512 about it, and pasted there Tom Tromey's suggestion from
comment #18 about modifying the testcase to generate an strace log file
(thanks for the suggestion).

> 3. This one isn't a bug, but an issue that arises from the way
>    attach-many-short-lived-threads.c behaves: since it's constantly
>    creating new threads it's impossible for GDB to know when it has
>    attached to all of them so that it can finish the loop in
>    linux_proc_attach_tgid_threads (). Because of this, even with the fix
>    for issue #1 applied, the testcase fails once in a while; I left the
>    test running in a loop overnight and it failed after about 2500
>    iterations.

There is already bug #26286 about this issue, so I updated it with the
results reported here, and my understanding of the problem.

>    The only way I can see to improve GDB's behaviour is to increase the
>    number of iterations of the loop that checks for new threads. I suspect
>    that the ability of the inferior to create new threads is proportional
>    to the number of CPUs present in the system (my test machine has 160
>    cores), so I will propose a patch that makes the number of iterations
>    proportional to the number of CPUs.

As I mentioned in bug #26286, I've changed my mind about making the number
of iterations proportional to the number of CPUs, because on the machines I
have at hand, the one where it takes longest to reproduce the problem has
the most CPUs (160, versus 8 on the other machines). I'm not sure how to
move forward on this.
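
For reference, the trade-off in the iteration count can be sketched as
follows. This is a toy model in Python, not GDB code: scan_until_quiet and
the snapshot data are made up for illustration, and each snapshot stands
for the set of LWPs one scan of the task list would return.

```python
from typing import Iterable, Set, Tuple


def scan_until_quiet(
    snapshots: Iterable[Set[int]], quiet_scans_needed: int
) -> Tuple[Set[int], int]:
    """Consume scans until quiet_scans_needed in a row show no new LWP."""
    seen: Set[int] = set()
    quiet = 0
    scans = 0
    for snapshot in snapshots:
        if quiet >= quiet_scans_needed:
            break  # the loop decides it has attached to everything
        scans += 1
        new = set(snapshot) - seen
        seen |= new
        quiet = 0 if new else quiet + 1
    return seen, scans


# Hypothetical inferior: LWP 3 is created only in time for the fourth scan.
snapshots = [{1, 2}, {1, 2}, {1, 2}, {1, 2, 3}, {1, 2, 3}, {1, 2, 3}, {1, 2, 3}]

print(scan_until_quiet(snapshots, 2))  # stops after 3 scans, misses LWP 3
print(scan_until_quiet(snapshots, 3))  # keeps scanning and picks up LWP 3
```

A larger threshold narrows the window in which a late thread is missed, but
it cannot close it while the inferior keeps creating threads, and it makes
every attach slower.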

(In reply to Carl E Love from comment #22)
> Thiago:
>
> Yes, the log files where the failures "the program is no longer running"
> occur have the line:
>
> Program terminated with signal SIGTRAP, Trace/breakpoint trap.
> The program no longer exists.
>
> So yes, that does match issue 3, comment #19.

Nice, thank you for confirming.

> Fixing the detach issue would go a long way to making the test a lot more
> reliable.

Just a minor correction, to avoid confusion: this GDB hang happens at
attach time and is not related to any previous detach command.

> The SIGTRAP issue happens about 0.5% of the time.

Yes, it's also not very common on my machines. Somewhat surprisingly, my
experience is that it's easier to reproduce on x86_64-linux than on
aarch64-linux.

> I haven't seen issue 2 yet, at least not that I can tell.  But based on
> what you said it is really unlikely to hit.

Yes, it's very uncommon. Though I did hit it or something like it on an
x86_64-linux machine just now (reported on bug #31512).
