From: "thiago.bauermann at linaro dot org"
To: gdb-prs@sourceware.org
Subject: [Bug testsuite/31312] attach-many-short-lived-threads gives inconsistent results
Date: Tue, 19 Mar 2024 19:10:16 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Product: gdb
X-Bugzilla-Component: testsuite
X-Bugzilla-Version: HEAD
X-Bugzilla-Severity: normal
X-Bugzilla-Who: thiago.bauermann at linaro dot org
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: cel at linux dot ibm.com
X-Bugzilla-Target-Milestone: 15.1
X-Bugzilla-URL: http://sourceware.org/bugzilla/

https://sourceware.org/bugzilla/show_bug.cgi?id=31312

--- Comment #23 from Thiago Jung Bauermann ---
(In reply to Thiago Jung Bauermann from comment #19)
> 1. The issue I mentioned in comment #17 (which I have since confirmed is
> what is going on), where the linux_proc_attach_tgid_threads () never
> ends when there are zombie threads present in the inferior. Since
> attach-many-short-lived-threads.c constantly creates and finishes
> joinable threads, the chance of having zombie threads is high.
>
> From looking at the gdb.log files Carl provided, I believe he is
> seeing the same problem.
>
> The solution is to make GDB remember when it has already visited the
> /proc directory of a given LWP, and skip it in the following iterations.
> I implemented the attached patch to do that, and now I don't observe GDB
> hanging anymore in the aarch64-linux server in which I used to easily
> reproduce this problem. If Carl could test it on POWER10, it would be
> helpful. I'll clean up the code and post it on the mailing list.

From looking at the Power 10 gdb.log files attached to this bugzilla, and
also Carl's results with my proposed fix, I believe this bugzilla is
specifically about the issue described above.

> 2. Behaviour 2 which I described in comment #12. I'll repeat it here for

Sorry, I referenced the wrong comment. It's actually comment #16.

> completeness:
>
> (gdb) attach 2039552
> Attaching to process 2039552
> Cannot attach to lwp 2689792: Operation not permitted (1), process
> 2689792 is already traced by process 2039527
>
> PID 2039552 is the testcase inferior, and 2039527 is GDB. GDB didn't
> report any success in attaching to the process.
>
> This is very rarely observed on my test system. I saw it only 3 times in
> thousands of testcase runs. I wasn't able to investigate it yet.
>
> I'll open a separate bugzilla about this.

I didn't find any existing bugzilla about this problem, so I opened bug
#31512 about it, and pasted there Tom Tromey's suggestion from comment #18
about modifying the testcase to generate an strace log file (thanks for the
suggestion).

> 3. This one isn't a bug, but an issue that arises from the way
> attach-many-short-lived-threads.c behaves: since it's constantly
> creating new threads it's impossible for GDB to know when it has
> attached to all of them so that it can finish the loop in
> linux_proc_attach_tgid_threads ().
> Because of this, even with the fix
> for issue #1 applied, the testcase fails once in a while; I left the
> test running in a loop overnight and it failed after about 2500
> iterations.

There is already bug #26286 about this issue, so I updated it with the
results reported here, and my understanding of the problem.

> The only way I can see to improve GDB's behaviour is to increase the
> number of iterations of the loop that checks for new threads. I suspect
> that the ability of the inferior to create new threads is proportional
> to the number of CPUs present in the system (my test machine has 160
> cores), so I will propose a patch that makes the number of iterations
> proportional to the number of CPUs.

As I mentioned in bug #26286, I've changed my mind about making the number
of iterations proportional to the number of CPUs, because on the machines I
have at hand, the one where it takes longest to reproduce the problem has
the most CPUs (160, vs 8 CPUs on the other machines). I'm not sure how to
move forward on this.

(In reply to Carl E Love from comment #22)
> Thiago:
>
> Yes, the log files where the failures "the program is no longer running"
> occur has the line:
>
> Program terminated with signal SIGTRAP, Trace/breakpoint trap.
> The program no longer exists.
>
> So yes, that does match issue 3, comment #19.

Nice, thank you for confirming.

> Fixing the detach issue would go a long way to making the test a lot more
> reliable.

Just a minor correction, to avoid confusion: this GDB hang happens at attach
time and is not related to any previous detach command.

> The SIGTRAP issue happens about 0.5% of the time.

Yes, it's also not very common on my machines. Somewhat surprisingly, my
experience is that it's easier to reproduce on x86_64-linux than on
aarch64-linux.

> I haven't seen issue 2 yet, at least not that I can tell. But based on
> what you said it is really unlikely to hit.

Yes, it's very uncommon.
Though I did hit it or something like it on an x86_64-linux machine just now (reported on bug #31512).
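
For illustration only, the idea behind the fix for issue 1 (remember which
LWPs were already visited and skip them in later iterations) and the
iteration limit discussed under issue 3 can be sketched as below. This is a
minimal Python sketch, not GDB's code: attach_all_threads, list_lwps,
try_attach, quiet_passes_needed and max_passes are all hypothetical names.
list_lwps stands in for reading the task directory under /proc, and
try_attach stands in for the ptrace attach. As a simplification, the sketch
keys the visited set on the LWP number alone and ignores ID reuse.

```python
from typing import Callable, Iterable, List, Set


def attach_all_threads(
    list_lwps: Callable[[], Iterable[int]],
    try_attach: Callable[[int], bool],
    quiet_passes_needed: int = 2,
    max_passes: int = 100,
) -> List[int]:
    """Scan the thread list repeatedly until no new thread shows up.

    list_lwps and try_attach are hypothetical hooks; try_attach returns
    False for a thread that cannot be attached (for example a zombie).
    """
    visited: Set[int] = set()  # LWPs already handled in an earlier pass
    attached: List[int] = []
    quiet_passes = 0

    for _ in range(max_passes):
        new_thread_seen = False
        for lwp in list_lwps():
            if lwp in visited:
                # Skip it: without this check, a zombie that can never be
                # attached would look like a new thread on every pass and
                # the loop would never settle.
                continue
            visited.add(lwp)
            new_thread_seen = True
            if try_attach(lwp):
                attached.append(lwp)
        # Count consecutive passes in which nothing new appeared.
        quiet_passes = 0 if new_thread_seen else quiet_passes + 1
        if quiet_passes >= quiet_passes_needed:
            break
    return attached
```

With a zombie thread permanently present in the listing, the loop still
stops after quiet_passes_needed passes without a new LWP. An inferior that
keeps creating threads can still outrun any fixed limit, which is the
situation described under issue 3; max_passes only bounds the sketch.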