public inbox for gdb-prs@sourceware.org
* [Bug testsuite/31312] New: attach-many-short-lived-threads gives inconsistent results
@ 2024-01-29 18:06 cel at linux dot ibm.com
  2024-01-29 18:08 ` [Bug testsuite/31312] " cel at linux dot ibm.com
                   ` (30 more replies)
  0 siblings, 31 replies; 32+ messages in thread
From: cel at linux dot ibm.com @ 2024-01-29 18:06 UTC (permalink / raw)
  To: gdb-prs

https://sourceware.org/bugzilla/show_bug.cgi?id=31312

            Bug ID: 31312
           Summary: attach-many-short-lived-threads gives inconsistent
                    results
           Product: gdb
           Version: HEAD
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: testsuite
          Assignee: unassigned at sourceware dot org
          Reporter: cel at linux dot ibm.com
  Target Milestone: ---

Created attachment 15340
  --> https://sourceware.org/bugzilla/attachment.cgi?id=15340&action=edit
run 1, Power 10 gdb.log for the attach-many-short-lived-thread

The test, when run on Power, gives inconsistent results.  It seems that the
Power 10 system running:

 Fedora Linux 38 (Server Edition)
 GNU gdb (GDB) 15.0.50.20240126-git
 gcc (GCC) 13.2.1 20231011 (Red Hat 13.2.1-4)

is less stable than the Power 9 system where the gdb daily builds are run.

I have attached the gdb/testsuite/gdb.log file for the test run with the
command:

  make check RUNTESTFLAGS='GDB=/home/carll/bin/gdb gdb.threads/attach-many-short-lived-threads.exp'

-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Bug testsuite/31312] attach-many-short-lived-threads gives inconsistent results
  2024-01-29 18:06 [Bug testsuite/31312] New: attach-many-short-lived-threads gives inconsistent results cel at linux dot ibm.com
@ 2024-01-29 18:08 ` cel at linux dot ibm.com
  2024-01-29 18:20 ` tromey at sourceware dot org
                   ` (29 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: cel at linux dot ibm.com @ 2024-01-29 18:08 UTC (permalink / raw)
  To: gdb-prs

https://sourceware.org/bugzilla/show_bug.cgi?id=31312

--- Comment #1 from Carl E Love <cel at linux dot ibm.com> ---
Created attachment 15341
  --> https://sourceware.org/bugzilla/attachment.cgi?id=15341&action=edit
run 2, Power 10 gdb.log for the attach-many-short-lived-thread

Here is the log from a second run.  The number of failures differs slightly,
as does the point at which the tests fail.  I am adding the file in the hope
that seeing the variability in the failures will be helpful.

* [Bug testsuite/31312] attach-many-short-lived-threads gives inconsistent results
  2024-01-29 18:06 [Bug testsuite/31312] New: attach-many-short-lived-threads gives inconsistent results cel at linux dot ibm.com
  2024-01-29 18:08 ` [Bug testsuite/31312] " cel at linux dot ibm.com
@ 2024-01-29 18:20 ` tromey at sourceware dot org
  2024-01-29 20:55 ` vries at gcc dot gnu.org
                   ` (28 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: tromey at sourceware dot org @ 2024-01-29 18:20 UTC (permalink / raw)
  To: gdb-prs

https://sourceware.org/bugzilla/show_bug.cgi?id=31312

Tom Tromey <tromey at sourceware dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tromey at sourceware dot org

* [Bug testsuite/31312] attach-many-short-lived-threads gives inconsistent results
  2024-01-29 18:06 [Bug testsuite/31312] New: attach-many-short-lived-threads gives inconsistent results cel at linux dot ibm.com
  2024-01-29 18:08 ` [Bug testsuite/31312] " cel at linux dot ibm.com
  2024-01-29 18:20 ` tromey at sourceware dot org
@ 2024-01-29 20:55 ` vries at gcc dot gnu.org
  2024-01-29 21:35 ` cel at linux dot ibm.com
                   ` (27 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: vries at gcc dot gnu.org @ 2024-01-29 20:55 UTC (permalink / raw)
  To: gdb-prs

https://sourceware.org/bugzilla/show_bug.cgi?id=31312

Tom de Vries <vries at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |vries at gcc dot gnu.org

--- Comment #2 from Tom de Vries <vries at gcc dot gnu.org> ---
It would be good to understand whether the timeouts you see are just a question
of more or less time, or actual hangs.

Can you try out this:
...
diff --git a/gdb/testsuite/gdb.threads/attach-many-short-lived-threads.exp b/gdb/testsuite/gdb.threads/attach-many-short-lived-threads.exp
index 6bee0ad2e2c..6195c720d57 100644
--- a/gdb/testsuite/gdb.threads/attach-many-short-lived-threads.exp
+++ b/gdb/testsuite/gdb.threads/attach-many-short-lived-threads.exp
@@ -175,4 +175,6 @@ if {[prepare_for_testing "failed to prepare" $testfile $srcfile \
     return -1
 }

-test
+with_timeout_factor 2 {
+    test
+}
...
?

* [Bug testsuite/31312] attach-many-short-lived-threads gives inconsistent results
  2024-01-29 18:06 [Bug testsuite/31312] New: attach-many-short-lived-threads gives inconsistent results cel at linux dot ibm.com
                   ` (2 preceding siblings ...)
  2024-01-29 20:55 ` vries at gcc dot gnu.org
@ 2024-01-29 21:35 ` cel at linux dot ibm.com
  2024-01-29 21:44 ` cel at linux dot ibm.com
                   ` (26 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: cel at linux dot ibm.com @ 2024-01-29 21:35 UTC (permalink / raw)
  To: gdb-prs

https://sourceware.org/bugzilla/show_bug.cgi?id=31312

--- Comment #3 from Carl E Love <cel at linux dot ibm.com> ---
Per comment 2, I added the patch and have run the test 5 times with no
regression errors.  So, it would appear that changing the timeout factor fixes
things.

* [Bug testsuite/31312] attach-many-short-lived-threads gives inconsistent results
  2024-01-29 18:06 [Bug testsuite/31312] New: attach-many-short-lived-threads gives inconsistent results cel at linux dot ibm.com
                   ` (3 preceding siblings ...)
  2024-01-29 21:35 ` cel at linux dot ibm.com
@ 2024-01-29 21:44 ` cel at linux dot ibm.com
  2024-01-29 22:38 ` cel at linux dot ibm.com
                   ` (25 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: cel at linux dot ibm.com @ 2024-01-29 21:44 UTC (permalink / raw)
  To: gdb-prs

https://sourceware.org/bugzilla/show_bug.cgi?id=31312

--- Comment #4 from Carl E Love <cel at linux dot ibm.com> ---
Oops, comment 3 is wrong.  I ran the wrong test.

* [Bug testsuite/31312] attach-many-short-lived-threads gives inconsistent results
  2024-01-29 18:06 [Bug testsuite/31312] New: attach-many-short-lived-threads gives inconsistent results cel at linux dot ibm.com
                   ` (4 preceding siblings ...)
  2024-01-29 21:44 ` cel at linux dot ibm.com
@ 2024-01-29 22:38 ` cel at linux dot ibm.com
  2024-01-30  7:21 ` vries at gcc dot gnu.org
                   ` (24 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: cel at linux dot ibm.com @ 2024-01-29 22:38 UTC (permalink / raw)
  To: gdb-prs

https://sourceware.org/bugzilla/show_bug.cgi?id=31312

--- Comment #5 from Carl E Love <cel at linux dot ibm.com> ---
Created attachment 15344
  --> https://sourceware.org/bugzilla/attachment.cgi?id=15344&action=edit
run3, with longer time out.

Running the correct test in the correct window....with the patch

The test ran fine the first two times then had errors on the third run.
Fourth run, ran fine.  
Fifth run, had errors.

So the longer timeout might be helping some.

* [Bug testsuite/31312] attach-many-short-lived-threads gives inconsistent results
  2024-01-29 18:06 [Bug testsuite/31312] New: attach-many-short-lived-threads gives inconsistent results cel at linux dot ibm.com
                   ` (5 preceding siblings ...)
  2024-01-29 22:38 ` cel at linux dot ibm.com
@ 2024-01-30  7:21 ` vries at gcc dot gnu.org
  2024-01-30 10:13 ` vries at gcc dot gnu.org
                   ` (23 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: vries at gcc dot gnu.org @ 2024-01-30  7:21 UTC (permalink / raw)
  To: gdb-prs

https://sourceware.org/bugzilla/show_bug.cgi?id=31312

--- Comment #6 from Tom de Vries <vries at gcc dot gnu.org> ---
(In reply to Carl E Love from comment #5)
> Created attachment 15344 [details]
> run3, with longer time out.
> 
> Running the correct test in the correct window....with the patch
> 
> The test ran fine the first two times then had errors on the third run.
> Fourth run, ran fine.  
> Fifth run, had errors.
> 
> So the longer timeout might be helping some.

Does increasing the timeout factor further help?

Does increasing the scope of the timeout factor to include the timeout sampling
in "set options" help?

* [Bug testsuite/31312] attach-many-short-lived-threads gives inconsistent results
  2024-01-29 18:06 [Bug testsuite/31312] New: attach-many-short-lived-threads gives inconsistent results cel at linux dot ibm.com
                   ` (6 preceding siblings ...)
  2024-01-30  7:21 ` vries at gcc dot gnu.org
@ 2024-01-30 10:13 ` vries at gcc dot gnu.org
  2024-01-31 16:14 ` cel at linux dot ibm.com
                   ` (22 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: vries at gcc dot gnu.org @ 2024-01-30 10:13 UTC (permalink / raw)
  To: gdb-prs

https://sourceware.org/bugzilla/show_bug.cgi?id=31312

--- Comment #7 from Tom de Vries <vries at gcc dot gnu.org> ---
(In reply to Tom de Vries from comment #2)
> -test
> +with_timeout_factor 2 {
> +    test
> +}
> ...

FTR, with this I get 10/10 ok runs on cfarm120.

* [Bug testsuite/31312] attach-many-short-lived-threads gives inconsistent results
  2024-01-29 18:06 [Bug testsuite/31312] New: attach-many-short-lived-threads gives inconsistent results cel at linux dot ibm.com
                   ` (7 preceding siblings ...)
  2024-01-30 10:13 ` vries at gcc dot gnu.org
@ 2024-01-31 16:14 ` cel at linux dot ibm.com
  2024-02-06 18:59 ` cel at linux dot ibm.com
                   ` (21 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: cel at linux dot ibm.com @ 2024-01-31 16:14 UTC (permalink / raw)
  To: gdb-prs

https://sourceware.org/bugzilla/show_bug.cgi?id=31312

--- Comment #8 from Carl E Love <cel at linux dot ibm.com> ---
I put together a quick script to run the test 10 times.  

I tried it on a couple of Power 10 systems.  Here is the patch I have been
playing with to see if it helps:

+set timeout 20
+
 set options { "additional_flags=-DTIMEOUT=$timeout" debug pthreads }

 if {[prepare_for_testing "failed to prepare" $testfile $srcfile \
         $options] == -1} {
     return -1
 }

-test
+with_timeout_factor 3 {
+    test
+}

The default value of timeout was 10.

I have tried a couple of different distros to see if that makes any difference.
I tried to watch top to see what the load average was during the runs.  

Power 10, Fedora Linux 38
Failed 2 out of 10 times with the patch to increase the timeout
Failed 7 out of 10 times with patch, timeout factor of 2.  Note machine was
heavily used some of the time.
Passed 10 out of 10, with patch and timeout factor of 10.  Machine was not very
busy, ran overnight.


Power 10, Red Hat Enterprise Linux 9.3
Failed 3 out of 10 times without the patch

Power 10 Ubuntu 22.04
Failed 2 out of 10 times, without patch
Failed 2 out of 10 times with patch, and a time out factor of 10.

Power 9, Ubuntu 22.04
Success 20 out of 20 runs.
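
For reference, a loop of this kind could look roughly like the sketch below.
It is not the script that produced the numbers above.  The function names, the
default command line, and the decision to classify a run only by the
"# of unexpected failures" line of the DejaGnu summary are all assumptions
made for illustration.

```python
import re
import subprocess

# Hypothetical command line; adjust RUNTESTFLAGS (e.g. GDB=...) and the
# build directory to match the local setup.
DEFAULT_CMD = [
    "make", "check",
    "RUNTESTFLAGS=gdb.threads/attach-many-short-lived-threads.exp",
]


def unexpected_failures(summary: str) -> int:
    """Return the count on the '# of unexpected failures' line, or 0 if absent."""
    m = re.search(r"^# of unexpected failures\s+(\d+)", summary, re.MULTILINE)
    return int(m.group(1)) if m else 0


def run_many(runs: int = 10, cmd=DEFAULT_CMD, cwd=None) -> int:
    """Run the test `runs` times; return how many runs reported failures."""
    bad = 0
    for i in range(1, runs + 1):
        out = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True).stdout
        n = unexpected_failures(out)
        print(f"run {i}: {n} unexpected failure(s)")
        if n:
            bad += 1
    return bad


if __name__ == "__main__":
    print(f"failed {run_many()} out of 10 runs")
```

Only FAIL results are counted here; a fuller version would probably also look
at the other lines of the summary.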

* [Bug testsuite/31312] attach-many-short-lived-threads gives inconsistent results
  2024-01-29 18:06 [Bug testsuite/31312] New: attach-many-short-lived-threads gives inconsistent results cel at linux dot ibm.com
                   ` (8 preceding siblings ...)
  2024-01-31 16:14 ` cel at linux dot ibm.com
@ 2024-02-06 18:59 ` cel at linux dot ibm.com
  2024-02-12 18:58 ` tromey at sourceware dot org
                   ` (20 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: cel at linux dot ibm.com @ 2024-02-06 18:59 UTC (permalink / raw)
  To: gdb-prs

https://sourceware.org/bugzilla/show_bug.cgi?id=31312

--- Comment #9 from Carl E Love <cel at linux dot ibm.com> ---
I spent time playing with changing the timeout, with no luck.  So I looked more
carefully at the test and the output.  In the expect script we have the
following check:

            gdb_test_multiple "attach $testpid" $test {
 ...
                -re "Cannot attach to lwp $decimal: Operation not permitted" {
                    # On Linux, PTRACE_ATTACH sometimes fails with         
                    # EPERM, even though /proc/PID/status indicates        
                    # the thread is running.                               
                    set eperm 1
                    exp_continue
                }
...
                -re "$gdb_prompt $" {
                    if {$eperm} {
                        xfail "$test (EPERM)"
                    } else {
                        pass $test
                    }


When I look at the log file, with some additional puts statements added to
print the testpid, I see that once we hit the above result, we do an XFAIL.
Then the test loops around and tries to attach to the same testpid again.  This
time the attach times out, and all the remaining tests time out for all of
the remaining iterations.

The comment in the test seems to imply that the situation is expected to be
transient and that the next attempt to attach should succeed.  That doesn't
seem to be the case, at least on Power 10.  So it seems we need to "fix" the
handling for this error?

A few possibilities come to mind, 1) just exit the test on this failure; 2) try
sleeping a little in the hope that the "issue" will clear up and the next
attach will succeed; 3) get a new testpid and continue the test.

1)  I am not really excited by this option in that if the failure occurred on
the first iteration then we really haven't tested things properly.

2)  I tried putting a sleep 1 in before the exp_continue.  Unfortunately, that
didn't fix things.  In one case, I got messages on a subsequent iteration that
the "program is no longer running".  In another case, things just timed out as
before.

3)  This option basically throws out the problem testpid and gets a new one.  I
tried this with the following change to the test:

diff --git a/gdb/testsuite/gdb.threads/attach-many-short-lived-threads.exp b/gdb/testsuite/gdb.threads/attach-many-short-lived-threads.exp
index 872473aa550..2b5c80e4323 100644
--- a/gdb/testsuite/gdb.threads/attach-many-short-lived-threads.exp
+++ b/gdb/testsuite/gdb.threads/attach-many-short-lived-threads.exp
@@ -87,6 +87,15 @@ proc test {} {
                -re "$gdb_prompt $" {
                    if {$eperm} {
                        xfail "$test (EPERM)"
+                       # The attach failed.  No point in doing the rest
+                       # of the tests since we are not attached?  So
+                       # should we either 1) exit the test; or 2)
+                       # try again with a new testpid?
+                       puts "CARLL, xfail EPERM, testpid $testpid"
+
+                       # Try a new process
+                       set test_spawn_id [spawn_wait_for_attach $binfile]
+                       set testpid [spawn_id_get_pid $test_spawn_id]
                    } else {
                        pass $test
                    }

With this test we can complete all the test iterations but with different
testpids.  Output from the modified test for one of my test runs:

Running target unix
Using /usr/share/dejagnu/baseboards/unix.exp as board description file for target.
Using /usr/share/dejagnu/config/unix.exp as generic interface file for target.
Using /home/carll/GDB/build-current/gdb/testsuite/../../../binutils-gdb-current/gdb/testsuite/config/unix.exp as tool-and-target-specific interface file.
Running /home/carll/GDB/build-current/gdb/testsuite/../../../binutils-gdb-current/gdb/testsuite/gdb.threads/attach-many-short-lived-threads.exp ...
CARLL, timeout = 10
CARLL, run test on testpid = 3050726
CARLL, attempt = 1
CARLL, attempt = 2
CARLL, EPERM failue testpid = 3050726, attempt = 2
CARLL, xfail EPERM, testpid 3050726
CARLL, attempt = 3
CARLL, EPERM failue testpid = 3102706, attempt = 3
CARLL, xfail EPERM, testpid 3102706
CARLL, attempt = 4
CARLL, attempt = 5
CARLL, attempt = 6
CARLL, attempt = 7
CARLL, attempt = 8
CARLL, attempt = 9
CARLL, attempt = 10

                === gdb Summary ===

# of expected passes            87
# of expected failures          2


This fixes the failures on Power 10.  We still don't know the underlying reason
for the EPERM failure in the first place.  All we do is abandon that pid and
continue with a new one.  

Any thoughts of other ways to handle the case of EPERM failure?  Is there a
better solution?  Thoughts?
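
To make the control flow of option 3 easier to discuss, here is a small model
of it.  It is written in Python rather than DejaGnu, and `spawn` and `attach`
are hypothetical stand-ins for spawn_wait_for_attach and the gdb attach
command, so this is only a sketch of the logic, not test code.

```python
from typing import Callable, List, Tuple


def run_attach_loop(
    spawn: Callable[[], int],
    attach: Callable[[int], str],
    attempts: int = 10,
) -> List[Tuple[int, int, str]]:
    """Model of option 3: after an EPERM, abandon the pid and spawn a new one.

    `spawn` returns a new pid; `attach` returns "ok" or "eperm".  Both are
    hypothetical callables supplied by the caller.  Returns one
    (attempt, pid, result) record per attempt, with result "pass" or "xfail".
    """
    records = []
    pid = spawn()
    for attempt in range(1, attempts + 1):
        if attach(pid) == "eperm":
            records.append((attempt, pid, "xfail"))
            # The old process is simply abandoned, as in the patch above;
            # nothing here kills or waits for it.
            pid = spawn()
        else:
            records.append((attempt, pid, "pass"))
    return records
```

Note that the model makes one limitation visible: every EPERM leaves one more
process behind, since nothing cleans up the abandoned pid.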

* [Bug testsuite/31312] attach-many-short-lived-threads gives inconsistent results
  2024-01-29 18:06 [Bug testsuite/31312] New: attach-many-short-lived-threads gives inconsistent results cel at linux dot ibm.com
                   ` (9 preceding siblings ...)
  2024-02-06 18:59 ` cel at linux dot ibm.com
@ 2024-02-12 18:58 ` tromey at sourceware dot org
  2024-02-12 18:59 ` tromey at sourceware dot org
                   ` (19 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: tromey at sourceware dot org @ 2024-02-12 18:58 UTC (permalink / raw)
  To: gdb-prs

https://sourceware.org/bugzilla/show_bug.cgi?id=31312

Tom Tromey <tromey at sourceware dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |15.1

* [Bug testsuite/31312] attach-many-short-lived-threads gives inconsistent results
  2024-01-29 18:06 [Bug testsuite/31312] New: attach-many-short-lived-threads gives inconsistent results cel at linux dot ibm.com
                   ` (10 preceding siblings ...)
  2024-02-12 18:58 ` tromey at sourceware dot org
@ 2024-02-12 18:59 ` tromey at sourceware dot org
  2024-02-16  4:42 ` cel at linux dot ibm.com
                   ` (18 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: tromey at sourceware dot org @ 2024-02-12 18:59 UTC (permalink / raw)
  To: gdb-prs

https://sourceware.org/bugzilla/show_bug.cgi?id=31312

--- Comment #10 from Tom Tromey <tromey at sourceware dot org> ---
(In reply to Carl E Love from comment #9)

> This fixes the failures on Power 10.  We still don't know the underlying
> reason for the EPERM failure in the first place.  All we do is abandon that
> pid and continue with a new one.  
> 
> Any thoughts of other ways to handle the case of EPERM failure?  Is there a
> better solution?  Thoughts?

Based on comments I think it's a bug in the Linux kernel.

I marked this as required for 15.1, but TBH I think this mostly
means detecting the failure and moving on.

* [Bug testsuite/31312] attach-many-short-lived-threads gives inconsistent results
  2024-01-29 18:06 [Bug testsuite/31312] New: attach-many-short-lived-threads gives inconsistent results cel at linux dot ibm.com
                   ` (11 preceding siblings ...)
  2024-02-12 18:59 ` tromey at sourceware dot org
@ 2024-02-16  4:42 ` cel at linux dot ibm.com
  2024-03-09  0:45 ` tromey at sourceware dot org
                   ` (17 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: cel at linux dot ibm.com @ 2024-02-16  4:42 UTC (permalink / raw)
  To: gdb-prs

https://sourceware.org/bugzilla/show_bug.cgi?id=31312

--- Comment #11 from Carl E Love <cel at linux dot ibm.com> ---
Created attachment 15370
  --> https://sourceware.org/bugzilla/attachment.cgi?id=15370&action=edit
Proposed patch to continue after detecting error


Tom, you said "I think this mostly means detecting the failure and moving on".
Detecting the issue is easy, but it is not clear from your comments what you
think should be done to move on.  In option 3 of comment 9 I gave a possible
way to continue after detecting the issue.  I am not sure whether this is the
solution you have in mind.

I have attached a proposed patch.  If you feel this is an acceptable solution,
let me know and I will post it to the mailing list.  Otherwise, please let me
know if you have some other solution in mind.  Thanks.

* [Bug testsuite/31312] attach-many-short-lived-threads gives inconsistent results
  2024-01-29 18:06 [Bug testsuite/31312] New: attach-many-short-lived-threads gives inconsistent results cel at linux dot ibm.com
                   ` (12 preceding siblings ...)
  2024-02-16  4:42 ` cel at linux dot ibm.com
@ 2024-03-09  0:45 ` tromey at sourceware dot org
  2024-03-09  1:29 ` cel at linux dot ibm.com
                   ` (16 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: tromey at sourceware dot org @ 2024-03-09  0:45 UTC (permalink / raw)
  To: gdb-prs

https://sourceware.org/bugzilla/show_bug.cgi?id=31312

--- Comment #12 from Tom Tromey <tromey at sourceware dot org> ---
> It then attempts to attach again.  At this point, the attach
> times out and all subsequent gdb commands for the rest of the iterations
> all time out.

It seems unusual to me that a single failure could somehow cause
subsequent ones to fail.  Like, why would that be?

Anyway if you don't mind essentially disabling the test on this
platform, I guess you could just exit the loop after the first
EPERM, on Power 10.

I looked at the patch but IIUC it leaves the previous processes
running, so that seems bad.

Doing any of this on all platforms seems to make the test a bit
meaningless.  Really it ought to work fine; but the EPERM kernel
bug makes it slightly random -- but Power 10 seems worse in that
regard and also, IIUC, has an additional bug.

I wonder if anybody has ever tried reporting the kernel problem.

* [Bug testsuite/31312] attach-many-short-lived-threads gives inconsistent results
  2024-01-29 18:06 [Bug testsuite/31312] New: attach-many-short-lived-threads gives inconsistent results cel at linux dot ibm.com
                   ` (13 preceding siblings ...)
  2024-03-09  0:45 ` tromey at sourceware dot org
@ 2024-03-09  1:29 ` cel at linux dot ibm.com
  2024-03-09  6:59 ` brobecker at gnat dot com
                   ` (15 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: cel at linux dot ibm.com @ 2024-03-09  1:29 UTC (permalink / raw)
  To: gdb-prs

https://sourceware.org/bugzilla/show_bug.cgi?id=31312

--- Comment #13 from Carl E Love <cel at linux dot ibm.com> ---

I did reach out to the kernel community about this.  There are some timing
issues that can cause the kernel to legitimately return EPERM.  He pointed me
to the ptrace Linux man page.  Another possible cause for the EPERM is if
ptrace is already connected to the process.  I tried to determine whether this
was in fact the case, specifically whether the detach had not completed yet,
but I was not able to show that this was the failure case.

Basically, the kernel team didn't seem to think the EPERM result was a
kernel bug.  That discussion really didn't go very far.

I will put together a patch to have Power 10 just exit the test on an EPERM
error and attach it to the bugzilla for review.

Thanks for the feedback.

* [Bug testsuite/31312] attach-many-short-lived-threads gives inconsistent results
  2024-01-29 18:06 [Bug testsuite/31312] New: attach-many-short-lived-threads gives inconsistent results cel at linux dot ibm.com
                   ` (14 preceding siblings ...)
  2024-03-09  1:29 ` cel at linux dot ibm.com
@ 2024-03-09  6:59 ` brobecker at gnat dot com
  2024-03-09 16:43 ` tromey at sourceware dot org
                   ` (14 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: brobecker at gnat dot com @ 2024-03-09  6:59 UTC (permalink / raw)
  To: gdb-prs

https://sourceware.org/bugzilla/show_bug.cgi?id=31312

Joel Brobecker <brobecker at gnat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |brobecker at gnat dot com
           Assignee|unassigned at sourceware dot org   |cel at linux dot ibm.com

* [Bug testsuite/31312] attach-many-short-lived-threads gives inconsistent results
  2024-01-29 18:06 [Bug testsuite/31312] New: attach-many-short-lived-threads gives inconsistent results cel at linux dot ibm.com
                   ` (15 preceding siblings ...)
  2024-03-09  6:59 ` brobecker at gnat dot com
@ 2024-03-09 16:43 ` tromey at sourceware dot org
  2024-03-15 16:41 ` cel at linux dot ibm.com
                   ` (13 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: tromey at sourceware dot org @ 2024-03-09 16:43 UTC (permalink / raw)
  To: gdb-prs

https://sourceware.org/bugzilla/show_bug.cgi?id=31312

--- Comment #14 from Tom Tromey <tromey at sourceware dot org> ---
(In reply to Carl E Love from comment #13)
> There are some timing
> issues that can cause the kernel to legitimately return EPERM.  He pointed
> me to the ptrace Linux man page

Ugh :}  Disappointing but what can we really do about it, I suppose.
I guess if the race happens more for you, all we can conclude is
that Power 10 is just too darn fast.

> Another possible cause for the EPERM is if
> ptrace is already connected to the process.  I tried to determine if this
> was in fact the case.  Specifically if the detach hadn't completed yet but
> was not able to show that was the failure case.  

Yeah, I was wondering about this as a theory for why subsequent
attempts all fail.  I probably distracted us a bit by ranting but
IIUC this part still isn't understood.  Could you maybe verify
that gdb thinks it has detached all the threads it attached to?
If so and the bug persists, I think we can just write it off
as another kernel bug.

* [Bug testsuite/31312] attach-many-short-lived-threads gives inconsistent results
  2024-01-29 18:06 [Bug testsuite/31312] New: attach-many-short-lived-threads gives inconsistent results cel at linux dot ibm.com
                   ` (16 preceding siblings ...)
  2024-03-09 16:43 ` tromey at sourceware dot org
@ 2024-03-15 16:41 ` cel at linux dot ibm.com
  2024-03-15 21:57 ` thiago.bauermann at linaro dot org
                   ` (12 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: cel at linux dot ibm.com @ 2024-03-15 16:41 UTC (permalink / raw)
  To: gdb-prs

https://sourceware.org/bugzilla/show_bug.cgi?id=31312

--- Comment #15 from Carl E Love <cel at linux dot ibm.com> ---

Tommy:

I spent some time trying to dig into this again.

The gdb log says that it detached from the pid.  I don't find any way to verify
that.  I don't see any gdb attached thread status command that would verify it. 
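
One way this might be checked from outside gdb, assuming a Linux host with
procfs: the TracerPid field in /proc/PID/status (and in
/proc/PID/task/TID/status for each thread) holds the pid of the tracing
process, or 0 if there is none.  The sketch below is not part of the test or of
any patch in this bug, and the function names are made up for illustration.

```python
from pathlib import Path
from typing import Dict


def tracer_pid(status_text: str) -> int:
    """Return the TracerPid value from the text of a /proc/PID/status file.

    0 means no tracer is attached.  Raises ValueError if the field is missing.
    """
    for line in status_text.splitlines():
        if line.startswith("TracerPid:"):
            return int(line.split(":", 1)[1])
    raise ValueError("no TracerPid field found")


def traced_threads(pid: int) -> Dict[int, int]:
    """Map each thread id of `pid` that still has a tracer to the tracer's pid."""
    result = {}
    for task in Path(f"/proc/{pid}/task").iterdir():
        try:
            tracer = tracer_pid((task / "status").read_text())
        except (OSError, ValueError):
            continue  # the thread exited while we were looking at it
        if tracer:
            result[int(task.name)] = tracer
    return result
```

If traced_threads(testpid) comes back non-empty after gdb reports the detach,
some thread still has a tracer, which would be consistent with the
"already traced" explanation for EPERM.  With threads this short-lived the
result is only a snapshot.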

I did try putting in sleep commands, in the hope that, if it was a race,
delaying the attach a bit more would give the detach time to really finish.
But that didn't make any difference.

I was able to get the expect script to issue the ps command on the fail so I
can go look at the processes running:

                   if {$eperm} {
                        xfail "$test (EPERM)"
                        # The attach failed.  No point in doing the rest        
                        # of the tests since we are not attached?  So           
                        # should we either 1) exit the test; or 2)              
                        # try again with a new testpid?                         
                        puts "CARLL, xfail EPERM, testpid $testpid"

                        gdb_test "detach" "Detaching from.*"

                        # try to figure out the state of the process            
                        set prompt {\$ $}
                        spawn bash
                        expect -re $prompt
                        send "ps -eaf \r"
                        expect {
                            "aa" {
                                send "CARLL, Output\r"
                            }
                            -re $prompt
                        }
                        expect eof

Based on the ps output, it does look like the process is still running.

With regards to my patch, you mentioned that you were concerned about leaving
the process running and then starting another one. Perhaps we should kill the
process that we couldn't attach to and then create a new one?  That way we
shouldn't be leaving anything running.  Specifically:

                -re "$gdb_prompt $" {
                    if {$eperm} {
                        xfail "$test (EPERM)"
                        # Kill the current process and start a new one    << NEW
                        kill_wait_spawned_process $test_spawn_id          << NEW

                        # The attach failed.  No point in doing the rest        
                        # of the tests since we are not attached?  So           
                        # should we either 1) exit the test; or 2)              
                        # try again with a new testpid?                         
                        puts "CARLL, xfail EPERM, testpid $testpid"

                        # Try a new process                                     
                        set test_spawn_id [spawn_wait_for_attach $binfile]
                        set testpid [spawn_id_get_pid $test_spawn_id]
                    } else {
                        pass $test
                    }
                }

We could just end the test as you suggested.  That is still another option.  

Thoughts on killing the previous process and then creating a new one?  Or would
you still prefer just ending the test?


* [Bug testsuite/31312] attach-many-short-lived-threads gives inconsistent results
  2024-01-29 18:06 [Bug testsuite/31312] New: attach-many-short-lived-threads gives inconsistent results cel at linux dot ibm.com
                   ` (17 preceding siblings ...)
  2024-03-15 16:41 ` cel at linux dot ibm.com
@ 2024-03-15 21:57 ` thiago.bauermann at linaro dot org
  2024-03-16  1:37 ` thiago.bauermann at linaro dot org
                   ` (11 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: thiago.bauermann at linaro dot org @ 2024-03-15 21:57 UTC (permalink / raw)
  To: gdb-prs

https://sourceware.org/bugzilla/show_bug.cgi?id=31312

Thiago Jung Bauermann <thiago.bauermann at linaro dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |thiago.bauermann at linaro dot org

--- Comment #16 from Thiago Jung Bauermann <thiago.bauermann at linaro dot org> ---
Hello,

(In reply to Tom Tromey from comment #12)
> > It then attempts to attach again.  At this point, the attach
> > times out and all subsequent gdb commands for the rest of the iterations
> > all time out.
> 
> It seems unusual to me that a single failure could somehow cause
> subsequent ones to fail.  Like, why would that be?

I also see this on aarch64-linux (sometimes), and I spent a bit of time
exploring the problem. I don't know yet what is going on, but I found two
interesting behaviours when trying to reproduce manually what the testcase is
doing¹:

1. Most of the time the attach fails with EPERM (which is the XFAIL case), but
occasionally GDB starts to use 100% of the CPU and never brings back the
prompt. At least in my case, this is why a single failure — e.g., "iter 8:
attach (timeout)" — causes all the subsequent ones to fail: GDB simply hangs
and the testcase can't make forward progress anymore.

2. Just now, the attach command did something surprising:

   (gdb) attach 2039552
   Attaching to process 2039552
   Cannot attach to lwp 2689792: Operation not permitted (1), process
   2689792 is already traced by process 2039527

   PID 2039552 is the testcase inferior, and 2039527 is GDB. GDB didn't
   report any success in attaching to the process.

I haven't dug deep enough to say anything about what exactly is going on
yet.

-- 
¹ that is, I'm running the attach-many-short-lived-threads in one terminal and
then repeatedly trying to attach to it from GDB in another terminal


* [Bug testsuite/31312] attach-many-short-lived-threads gives inconsistent results
  2024-01-29 18:06 [Bug testsuite/31312] New: attach-many-short-lived-threads gives inconsistent results cel at linux dot ibm.com
                   ` (18 preceding siblings ...)
  2024-03-15 21:57 ` thiago.bauermann at linaro dot org
@ 2024-03-16  1:37 ` thiago.bauermann at linaro dot org
  2024-03-16 17:42 ` tromey at sourceware dot org
                   ` (10 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: thiago.bauermann at linaro dot org @ 2024-03-16  1:37 UTC (permalink / raw)
  To: gdb-prs

https://sourceware.org/bugzilla/show_bug.cgi?id=31312

--- Comment #17 from Thiago Jung Bauermann <thiago.bauermann at linaro dot org> ---
Created attachment 15405
  --> https://sourceware.org/bugzilla/attachment.cgi?id=15405&action=edit
Shameless workaround.

I don't have any certainty yet, but I have a few suspicions, and the attached
workaround...

I believe (still have to confirm this) that when GDB is stuck using 100% of the
CPU, it's here in linux_proc_attach_tgid_threads ():

  /* Scan the task list for existing threads.  While we go through the
     threads, new threads may be spawned.  Cycle through the list of
     threads until we have done two iterations without finding new
     threads.  */
  for (iterations = 0; iterations < 20; iterations++)
    {
      struct dirent *dp;

      new_threads_found = 0;
      while ((dp = readdir (dir.get ())) != NULL)
        {
          unsigned long lwp;

          /* Fetch one lwp.  */
          lwp = strtoul (dp->d_name, NULL, 10);
          if (lwp != 0)
            {
              ptid_t ptid = ptid_t (pid, lwp);

              if (attach_lwp (ptid))
                new_threads_found = 1;
            }
        }

      if (new_threads_found)
        {
          /* Start over.  */
          iterations = -1;
        }

      rewinddir (dir.get ());
    }

In this case, the attach_lwp function pointer being called is
attach_proc_task_lwp_callback (), and the relevant part of it is:

  if (ptrace (PTRACE_ATTACH, lwpid, 0, 0) < 0)
    {
      int err = errno;

      /* Be quiet if we simply raced with the thread exiting.
         EPERM is returned if the thread's task still exists, and
         is marked as exited or zombie, as well as other
         conditions, so in that case, confirm the status in
         /proc/PID/status.  */
      if (err == ESRCH
          || (err == EPERM && linux_proc_pid_is_gone (lwpid)))
        {
          linux_nat_debug_printf
            ("Cannot attach to lwp %d: thread is gone (%d: %s)",
             lwpid, err, safe_strerror (err));
        }

So this is what I think is going on (again, I still need to confirm):

1. linux_proc_attach_tgid_threads () loops through tasks in /proc/PID/task,
calling attach_proc_task_lwp_callback () on each of them.

2. ptrace (PTRACE_ATTACH) returns -1 with errno = EPERM, causing
linux_proc_pid_is_gone () to get called.

3. linux_proc_pid_is_gone () opens /proc/LWP/status and sees that the thread
state is zombie or dead.

4. attach_proc_task_lwp_callback () returns 1, indicating that a new thread was
found.

5. linux_proc_attach_tgid_threads () sets new_threads_found = 1 and loops
again, finding the same thread in /proc/PID/task again because for some reason
the kernel isn't removing its proc entry any time soon.

6. GOTO 1.

So my suspicion is that what is confusing GDB is that the kernel (probably!
have to confirm...) is keeping the /proc entry for zombie and dead threads
around indefinitely.
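
The six steps above can be modelled in a few lines. This is only a simulation under stated assumptions, not GDB code: struct sim_task, sim_attach_lwp () and sim_scan () are hypothetical names, the task list is a plain array instead of /proc/PID/task, and max_passes is an artificial guard that stands in for the hang.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one entry under /proc/PID/task.  */
struct sim_task
{
  unsigned long lwp;
  bool zombie;     /* Task has exited but its /proc entry lingers.  */
  bool attached;   /* We already attached to it on an earlier pass.  */
};

/* Mimics the callback as described in steps 2-4: a zombie task is
   reported as "found" on every visit.  */
int
sim_attach_lwp (struct sim_task *t)
{
  if (t->zombie)
    return 1;      /* EPERM + "is gone" path, still reported as found.  */
  if (t->attached)
    return 0;      /* Nothing new here.  */
  t->attached = true;
  return 1;
}

/* Scan until QUIET_PASSES consecutive passes find nothing new.
   Returns the number of passes made, or -1 once MAX_PASSES is
   exceeded (the simulated hang).  */
int
sim_scan (struct sim_task *tasks, size_t n, int quiet_passes, int max_passes)
{
  int passes = 0;

  for (int iterations = 0; iterations < quiet_passes; iterations++)
    {
      int new_threads_found = 0;

      if (++passes > max_passes)
        return -1;

      for (size_t i = 0; i < n; i++)
        if (sim_attach_lwp (&tasks[i]))
          new_threads_found = 1;

      if (new_threads_found)
        iterations = -1;   /* Start over.  */
    }

  return passes;
}
```

With two live tasks and quiet_passes set to 2, sim_scan () finishes after three passes (one that attaches, two quiet ones). As soon as one task is marked as a zombie, every pass reports something new and only the max_passes guard ends the loop.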

Anyway, regarding the workaround: it's not very satisfying because increasing
the number of iterations in linux_proc_attach_tgid_threads () goes back to the
heuristic that Pedro's commit 8784d56326e7 ("Linux: on attach, attach to lwps
listed under /proc/$pid/task/") removed. Not increasing it makes GDB leave some
threads unattached and the inferior dies with a SIGTRAP due to the breakpoint
(which is exactly the scenario the testcase is designed to catch). Using 20
still triggers the problem relatively easily for me, after 100 tries of running
the testcase in a loop.


* [Bug testsuite/31312] attach-many-short-lived-threads gives inconsistent results
  2024-01-29 18:06 [Bug testsuite/31312] New: attach-many-short-lived-threads gives inconsistent results cel at linux dot ibm.com
                   ` (19 preceding siblings ...)
  2024-03-16  1:37 ` thiago.bauermann at linaro dot org
@ 2024-03-16 17:42 ` tromey at sourceware dot org
  2024-03-18 18:45 ` thiago.bauermann at linaro dot org
                   ` (9 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: tromey at sourceware dot org @ 2024-03-16 17:42 UTC (permalink / raw)
  To: gdb-prs

https://sourceware.org/bugzilla/show_bug.cgi?id=31312

--- Comment #18 from Tom Tromey <tromey at sourceware dot org> ---
> The gdb log says that it detached from the pid.  I don't find any way to verify that.  I don't see any gdb attached thread status command that would verify it.  

Yeah, what I would suggest is maybe modifying the test case to
invoke 'strace gdb ..' with some options to log the strace output
to a file; then examine the file to see if gdb correctly detaches
from all the threads it attached to.

That is, I'm wondering if there's some underlying bug that we haven't
properly recognized -- this would account for the weirdness where
one failed attach leaves gdb unable to attach in the future.  Thiago's
experiments here seem to indicate this could be the case.

If we are really convinced there is a kernel bug then I think re-trying
the test is not so important.  On the first failure it can just bail out.


* [Bug testsuite/31312] attach-many-short-lived-threads gives inconsistent results
  2024-01-29 18:06 [Bug testsuite/31312] New: attach-many-short-lived-threads gives inconsistent results cel at linux dot ibm.com
                   ` (20 preceding siblings ...)
  2024-03-16 17:42 ` tromey at sourceware dot org
@ 2024-03-18 18:45 ` thiago.bauermann at linaro dot org
  2024-03-19 15:14 ` cel at linux dot ibm.com
                   ` (8 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: thiago.bauermann at linaro dot org @ 2024-03-18 18:45 UTC (permalink / raw)
  To: gdb-prs

https://sourceware.org/bugzilla/show_bug.cgi?id=31312

Thiago Jung Bauermann <thiago.bauermann at linaro dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #15405|0                           |1
        is obsolete|                            |

--- Comment #19 from Thiago Jung Bauermann <thiago.bauermann at linaro dot org> ---
Created attachment 15415
  --> https://sourceware.org/bugzilla/attachment.cgi?id=15415&action=edit
Proposed fix. Needs to be cleaned up.

I investigated this problem some more, and I see three separate issues
revealed by attach-many-short-lived-threads.exp:

1. The issue I mentioned in comment #17 (which I have since confirmed is
   what is going on), where the linux_proc_attach_tgid_threads () never
   ends when there are zombie threads present in the inferior. Since
   attach-many-short-lived-threads.c constantly creates and finishes
   joinable threads, the chance of having zombie threads is high.

   From looking at the gdb.log files Carl provided, I believe he is
   seeing the same problem.

   The solution is to make GDB remember when it has already visited the
   /proc directory of a given LWP, and skip it in the following iterations.
   I implemented the attached patch to do that, and now I don't observe GDB
   hanging anymore in the aarch64-linux server in which I used to easily
   reproduce this problem. If Carl could test it on POWER10, it would be
   helpful. I'll clean up the code and post it on the mailing list.

2. Behaviour 2 which I described in comment #12. I'll repeat it here for
   completeness:

   (gdb) attach 2039552
   Attaching to process 2039552
   Cannot attach to lwp 2689792: Operation not permitted (1), process
   2689792 is already traced by process 2039527

   PID 2039552 is the testcase inferior, and 2039527 is GDB. GDB didn't
   report any success in attaching to the process.

   This is very rarely observed on my test system. I saw it only 3 times in
   thousands of testcase runs. I wasn't able to investigate it yet.

   I'll open a separate bugzilla about this.

3. This one isn't a bug, but an issue that arises from the way
   attach-many-short-lived-threads.c behaves: since it's constantly
   creating new threads it's impossible for GDB to know when it has
   attached to all of them so that it can finish the loop in
   linux_proc_attach_tgid_threads (). Because of this, even with the fix
   for issue #1 applied, the testcase fails once in a while — I left the
   test running in a loop overnight and it failed after about 2500
   iterations.

   The only way I can see to improve GDB's behaviour is to increase the
   number of iterations of the loop that checks for new threads. I suspect
   that the ability of the inferior to create new threads is proportional
   to the number of CPUs present in the system (my test machine has 160
   cores), so I will propose a patch that makes the number of iterations
   proportional to the number of CPUs.
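
The "remember which LWPs were already visited" idea from issue 1 can be sketched as follows. This is not the attached patch, just an illustration: struct seen_lwps, seen_before () and the fixed MAX_SEEN capacity are assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_SEEN 1024   /* Arbitrary capacity chosen for this sketch.  */

/* Hypothetical record of the LWP ids visited so far.  */
struct seen_lwps
{
  unsigned long lwp[MAX_SEEN];
  size_t count;
};

/* Return true if LWP was recorded by an earlier call.  Otherwise
   record it and return false.  When the table is full the LWP is not
   recorded, so it will keep being reported as unseen.  */
bool
seen_before (struct seen_lwps *seen, unsigned long lwp)
{
  for (size_t i = 0; i < seen->count; i++)
    if (seen->lwp[i] == lwp)
      return true;

  if (seen->count < MAX_SEEN)
    seen->lwp[seen->count++] = lwp;

  return false;
}
```

In the scan loop the callback would then be invoked only when seen_before () returns false, so a lingering zombie entry is looked at once and skipped afterwards. Since the kernel can reuse LWP numbers, keying on the number alone is presumably not enough for a real fix.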


* [Bug testsuite/31312] attach-many-short-lived-threads gives inconsistent results
  2024-01-29 18:06 [Bug testsuite/31312] New: attach-many-short-lived-threads gives inconsistent results cel at linux dot ibm.com
                   ` (21 preceding siblings ...)
  2024-03-18 18:45 ` thiago.bauermann at linaro dot org
@ 2024-03-19 15:14 ` cel at linux dot ibm.com
  2024-03-19 15:35 ` thiago.bauermann at linaro dot org
                   ` (7 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: cel at linux dot ibm.com @ 2024-03-19 15:14 UTC (permalink / raw)
  To: gdb-prs

https://sourceware.org/bugzilla/show_bug.cgi?id=31312

--- Comment #20 from Carl E Love <cel at linux dot ibm.com> ---
I tried the prototype patch from Thiago.  It seems to fix the hangs on detach. 
I ran the many-short-lived threads test 500 times.  I did have three runs that
encountered a new error I haven't seen before.  

FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 6: break at break_fn: 1
FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 6: break at break_fn: 2 (the program is no longer running)
FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 6: break at break_fn: 3 (the program is no longer running)
FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 6: reset timer in the inferior
FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 6: detach (the program is no longer running)

It seems the workload had finished before the expect script finished running. 
So we may want to address that as a separate patch later. 

Thiago's fix seems to work well on Power 10.

I did work on the strace as suggested.  I was trying to get strace to attach to
the gdb thread from the expect script.  Haven't got the script to get the
correct gdb PID yet.  I think I was trying to attach to the expect script which
strace fails to attach to.  I tried writing a script that I could run after the
workload started that would call ps and try to grep out the gdb process id and
attach to it but again I haven't got that working yet either.  

I will try and work on the strace thing some more but not sure if it is really
needed at this point given that Thiago seems to have figured out the issues.


* [Bug testsuite/31312] attach-many-short-lived-threads gives inconsistent results
  2024-01-29 18:06 [Bug testsuite/31312] New: attach-many-short-lived-threads gives inconsistent results cel at linux dot ibm.com
                   ` (22 preceding siblings ...)
  2024-03-19 15:14 ` cel at linux dot ibm.com
@ 2024-03-19 15:35 ` thiago.bauermann at linaro dot org
  2024-03-19 15:57 ` cel at linux dot ibm.com
                   ` (6 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: thiago.bauermann at linaro dot org @ 2024-03-19 15:35 UTC (permalink / raw)
  To: gdb-prs

https://sourceware.org/bugzilla/show_bug.cgi?id=31312

--- Comment #21 from Thiago Jung Bauermann <thiago.bauermann at linaro dot org> ---
(In reply to Carl E Love from comment #20)
> I tried the prototype patch from Thiago.  It seems to fix the hangs on
> detach.  I ran the many-short-lived threads test 500 times.  I did have
> three runs that encountered a new error I haven't seen before.  
> 
> FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 6: break at
> break_fn: 1
> FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 6: break at
> break_fn: 2 (the program is no longer running)
> FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 6: break at
> break_fn: 3 (the program is no longer running)
> FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 6: reset timer
> in the inferior
> FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 6: detach (the
> program is no longer running)
> 
> It seems the workload had finished before the expect script finished
> running.  So we may want to address that as a separate patch later. 

If you see "Program terminated with signal SIGTRAP, Trace/breakpoint trap." in
gdb.log, then it's issue 3 in comment #19. What happens is that two iterations
without seeing a new thread in linux_proc_attach_tgid_threads () isn't always
enough for GDB to attach to all inferior threads, and then an unattached thread
trips on the breakpoint instruction that GDB put in the inferior. I was also
able to reproduce it on two x86_64-linux machines after hundreds of runs of the
testcase.

> Thiago's fix seems to work well on Power 10.

That's great! Thank you for testing it.

> I did work on the strace as suggested.  I was trying to get strace to attach
> to the gdb thread from the expect script.  Haven't got the script to get the
> correct gdb PID yet.  I think I was trying to attach to the expect script
> which strace fails to attach to.  I tried writing a script that I could run
> after the workload started that would call ps and try to grep out the gdb
> process id and attach to it but again I haven't got that working yet either.
> 
> 
> I will try and work on the strace thing some more but not sure if it is
> really needed at this point given that Thiago seems to have figured out the
> issues.

It's not necessary for the issue you saw. It would probably be helpful in the
case of issue 2, but that one is hard to reproduce, and I haven't started
investigating it yet (also, I don't think I'll have time to dive into it in the
next couple of weeks).

I'll open separate bugzillas for these other issues to untangle the discussion.


* [Bug testsuite/31312] attach-many-short-lived-threads gives inconsistent results
  2024-01-29 18:06 [Bug testsuite/31312] New: attach-many-short-lived-threads gives inconsistent results cel at linux dot ibm.com
                   ` (23 preceding siblings ...)
  2024-03-19 15:35 ` thiago.bauermann at linaro dot org
@ 2024-03-19 15:57 ` cel at linux dot ibm.com
  2024-03-19 19:10 ` thiago.bauermann at linaro dot org
                   ` (5 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: cel at linux dot ibm.com @ 2024-03-19 15:57 UTC (permalink / raw)
  To: gdb-prs

https://sourceware.org/bugzilla/show_bug.cgi?id=31312

--- Comment #22 from Carl E Love <cel at linux dot ibm.com> ---
Thiago:

Yes, the log files where the failures "the program is no longer running" occur
have the line:

Program terminated with signal SIGTRAP, Trace/breakpoint trap.
The program no longer exists.

So yes, that does match issue 3, comment #19.

Fixing the detach issue would go a long way to making the test a lot more
reliable.  The SIGTRAP issue happens about 0.5% of the time.  I haven't seen
issue 2 yet, at least not that I can tell.  But based on what you said it is
really unlikely to hit.


* [Bug testsuite/31312] attach-many-short-lived-threads gives inconsistent results
  2024-01-29 18:06 [Bug testsuite/31312] New: attach-many-short-lived-threads gives inconsistent results cel at linux dot ibm.com
                   ` (24 preceding siblings ...)
  2024-03-19 15:57 ` cel at linux dot ibm.com
@ 2024-03-19 19:10 ` thiago.bauermann at linaro dot org
  2024-03-21 23:17 ` thiago.bauermann at linaro dot org
                   ` (4 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: thiago.bauermann at linaro dot org @ 2024-03-19 19:10 UTC (permalink / raw)
  To: gdb-prs

https://sourceware.org/bugzilla/show_bug.cgi?id=31312

--- Comment #23 from Thiago Jung Bauermann <thiago.bauermann at linaro dot org> ---
(In reply to Thiago Jung Bauermann from comment #19)
> 1. The issue I mentioned in comment #17 (which I have since confirmed is
>    what is going on), where the linux_proc_attach_tgid_threads () never
>    ends when there are zombie threads present in the inferior. Since
>    attach-many-short-lived-threads.c constantly creates and finishes
>    joinable threads, the chance of having zombie threads is high.
> 
>    From looking at the gdb.log files Carl provided, I believe he is
>    seeing the same problem.
> 
>    The solution is to make GDB remember when it has already visited the
>    /proc directory of a given LWP, and skip it in the following iterations.
>    I implemented the attached patch to do that, and now I don't observe GDB
>    hanging anymore in the aarch64-linux server in which I used to easily
>    reproduce this problem. If Carl could test it on POWER10, it would be
>    helpful. I'll clean up the code and post it on the mailing list.

From looking at the Power 10 gdb.log files attached to this bugzilla, and
also Carl's results with my proposed fix, I believe this bugzilla is
specifically about the issue described above.

> 2. Behaviour 2 which I described in comment #12. I'll repeat it here for

Sorry, I referenced the wrong comment. It's actually comment #16.

>    completeness:
> 
>    (gdb) attach 2039552
>    Attaching to process 2039552
>    Cannot attach to lwp 2689792: Operation not permitted (1), process
>    2689792 is already traced by process 2039527
> 
>    PID 2039552 is the testcase inferior, and 2039527 is GDB. GDB didn't
>    report any success in attaching to the process.
> 
>    This is very rarely observed on my test system. I saw it only 3 times in
>    thousands of testcase runs. I wasn't able to investigate it yet.
> 
>    I'll open a separate bugzilla about this.

I didn't find any existing bugzilla about this problem, so I opened
bug #31512 about it, and pasted there Tom Tromey's suggestion from
comment #18 about modifying the testcase to generate an strace log file
(thanks for the suggestion).

> 3. This one isn't a bug, but an issue that arises from the way
>    attach-many-short-lived-threads.c behaves: since it's constantly
>    creating new threads it's impossible for GDB to know when it has
>    attached to all of them so that it can finish the loop in
>    linux_proc_attach_tgid_threads (). Because of this, even with the fix
>    for issue #1 applied, the testcase fails once in a while — I left the
>    test running in a loop overnight and it failed after about 2500
>    iterations.

There is already bug #26286 about this issue, so I updated it with the
results reported here, and my understanding of the problem.

>    The only way I can see to improve GDB's behaviour is to increase the
>    number of iterations of the loop that checks for new threads. I suspect
>    that the ability of the inferior to create new threads is proportional
>    to the number of CPUs present in the system (my test machine has 160
>    cores), so I will propose a patch that makes the number of iterations
>    proportional to the number of CPUs.

As I mentioned in bug #26286, I've changed my mind about making the number
of iterations proportional to the number of CPUs, because on the machines I
have at hand, the one where it takes longest to reproduce the problem has
the most CPUs (160, vs 8 CPUs on the other machines). I'm not sure how to
move forward about this.

(In reply to Carl E Love from comment #22)
> Thiago:
> 
> Yes, the log files where the failures "the program is no longer running"
> occur have the line:
> 
> Program terminated with signal SIGTRAP, Trace/breakpoint trap.
> The program no longer exists.
> 
> So yes, that does match issue 3, comment #19.

Nice, thank you for confirming.

> Fixing the detach issue would go a long way to making the test a lot more
> reliable.

Just a minor correction, to avoid confusion: this GDB hang happens at
attach time and is not related to any previous detach command.

> The SIGTRAP issue happens about 0.5% of the time.

Yes, it's also not very common on my machines. Somewhat surprisingly, my
experience is that it's easier to reproduce on x86_64-linux than on
aarch64-linux.

> I haven't seen issue 2 yet, at least not that I can tell.  But based on
> what you said it is really unlikely to hit.

Yes, it's very uncommon. Though I did hit it or something like it on an
x86_64-linux machine just now (reported on bug #31512).


* [Bug testsuite/31312] attach-many-short-lived-threads gives inconsistent results
  2024-01-29 18:06 [Bug testsuite/31312] New: attach-many-short-lived-threads gives inconsistent results cel at linux dot ibm.com
                   ` (25 preceding siblings ...)
  2024-03-19 19:10 ` thiago.bauermann at linaro dot org
@ 2024-03-21 23:17 ` thiago.bauermann at linaro dot org
  2024-04-14 17:56 ` brobecker at gnat dot com
                   ` (3 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: thiago.bauermann at linaro dot org @ 2024-03-21 23:17 UTC (permalink / raw)
  To: gdb-prs

https://sourceware.org/bugzilla/show_bug.cgi?id=31312

--- Comment #24 from Thiago Jung Bauermann <thiago.bauermann at linaro dot org> ---
Hello,

I finally posted a patch series fixing the problem here:

https://inbox.sourceware.org/gdb-patches/20240321231149.519549-1-thiago.bauermann@linaro.org/

I hit a regression when remote debugging on armv8l-linux-gnueabihf as I
describe in the cover letter, which is why it took me a while.

Unfortunately I ran out of time to debug the issue for now, so I had to post
the series as an RFC.


* [Bug testsuite/31312] attach-many-short-lived-threads gives inconsistent results
  2024-01-29 18:06 [Bug testsuite/31312] New: attach-many-short-lived-threads gives inconsistent results cel at linux dot ibm.com
                   ` (26 preceding siblings ...)
  2024-03-21 23:17 ` thiago.bauermann at linaro dot org
@ 2024-04-14 17:56 ` brobecker at gnat dot com
  2024-04-16  4:56 ` thiago.bauermann at linaro dot org
                   ` (2 subsequent siblings)
  30 siblings, 0 replies; 32+ messages in thread
From: brobecker at gnat dot com @ 2024-04-14 17:56 UTC (permalink / raw)
  To: gdb-prs

https://sourceware.org/bugzilla/show_bug.cgi?id=31312

--- Comment #25 from Joel Brobecker <brobecker at gnat dot com> ---
Hi Thiago,

Would you be able to share an update on this PR? As I understand it,
Pedro suggested an alternative approach:

I.e., in gdb, make attach_proc_task_lwp_callback return false/0 here:

      if (ptrace (PTRACE_ATTACH, lwpid, 0, 0) < 0)
        {
          int err = errno;

          /* Be quiet if we simply raced with the thread exiting.
             EPERM is returned if the thread's task still exists, and
             is marked as exited or zombie, as well as other
             conditions, so in that case, confirm the status in
             /proc/PID/status.  */
          if (err == ESRCH
              || (err == EPERM && linux_proc_pid_is_gone (lwpid)))
            {
              linux_nat_debug_printf
                ("Cannot attach to lwp %d: thread is gone (%d: %s)",
                 lwpid, err, safe_strerror (err));

              return 0;        <<<< NEW RETURN
            }

Have you had a chance to try that, by any chance?

Thank you!


* [Bug testsuite/31312] attach-many-short-lived-threads gives inconsistent results
  2024-01-29 18:06 [Bug testsuite/31312] New: attach-many-short-lived-threads gives inconsistent results cel at linux dot ibm.com
                   ` (27 preceding siblings ...)
  2024-04-14 17:56 ` brobecker at gnat dot com
@ 2024-04-16  4:56 ` thiago.bauermann at linaro dot org
  2024-04-17 14:52 ` pedro at palves dot net
  2024-04-30  2:37 ` cvs-commit at gcc dot gnu.org
  30 siblings, 0 replies; 32+ messages in thread
From: thiago.bauermann at linaro dot org @ 2024-04-16  4:56 UTC (permalink / raw)
  To: gdb-prs

https://sourceware.org/bugzilla/show_bug.cgi?id=31312

--- Comment #26 from Thiago Jung Bauermann <thiago.bauermann at linaro dot org> ---
(In reply to Joel Brobecker from comment #25)
> Hi Thiago,

Hello Joel,

> Would you be able to share an update on this PR? As I understand it,
> Pedro suggested an alternative approach:
> 
> I.e., in gdb, make attach_proc_task_lwp_callback return false/0 here:

<snip>

> Have you had a chance to try that, by any chance?

Sorry for not coming back to this earlier. I had tried Pedro's approach (only
today I realized that I had accidentally sent my first response just to Pedro
back when he suggested it), but I needed to take some time to investigate why
it didn't work back then. I was finally able to understand it today, and I
explain it here:

https://inbox.sourceware.org/gdb-patches/87msptgbey.fsf@linaro.org/

TL;DR: there are two ways of solving the problem, neither of which is ideal
(because it's impossible for GDB to be certain when the inferior has stopped
creating new threads). So it's a matter of choosing between my patch series or
Pedro's suggestion.


* [Bug testsuite/31312] attach-many-short-lived-threads gives inconsistent results
  2024-01-29 18:06 [Bug testsuite/31312] New: attach-many-short-lived-threads gives inconsistent results cel at linux dot ibm.com
                   ` (28 preceding siblings ...)
  2024-04-16  4:56 ` thiago.bauermann at linaro dot org
@ 2024-04-17 14:52 ` pedro at palves dot net
  2024-04-30  2:37 ` cvs-commit at gcc dot gnu.org
  30 siblings, 0 replies; 32+ messages in thread
From: pedro at palves dot net @ 2024-04-17 14:52 UTC (permalink / raw)
  To: gdb-prs

https://sourceware.org/bugzilla/show_bug.cgi?id=31312

Pedro Alves <pedro at palves dot net> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |pedro at palves dot net

--- Comment #27 from Pedro Alves <pedro at palves dot net> ---
> I did reach out to the kernel community about this.  There are some timing issues 
> that can cause the kernel to legitimately return EPERM.  He pointed me to the ptrace 
> Linux man page.  Another possible cause for the EPERM is if ptrace is already 
> connected to the process.  I tried to determine if this was in fact the case.  
> Specifically if the detach hadn't completed yet but was not able to show that was the 
> failure case.  

I'm not convinced the kernel folks understood the issue completely.  The ptrace
man page doesn't talk about timing issues wrt EPERM.  Also, if the attach fails
with EPERM due to a timing issue, then sleeping a while and then trying again
should succeed, but that is not what you observed, IIUC.

> The gdb log says that it detached from the pid.  I don't find any way to verify that.  
> I don't see any gdb attached thread status command that would verify it. 

There isn't one.  You can however look at /proc/PID/status and check the State:
and TracerPid: lines.  TracerPid in particular, as it tells you the pid of
the ptracer, which should be either "0" if the process is not being traced
(debugged), or GDB's pid.
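
As a rough illustration of that check, here is a minimal Python sketch that
pulls the State: and TracerPid: fields out of /proc/PID/status.  The helper
names parse_status and tracer_of are made up for this example and are not
part of GDB or its testsuite; the sketch assumes a Linux /proc filesystem.

```python
from pathlib import Path


def parse_status(text):
    """Return (state, tracer_pid) parsed from /proc/PID/status contents."""
    fields = {}
    for line in text.splitlines():
        # Each line looks like "Key:<tab>value"; split on the first colon.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    # TracerPid is 0 when no ptracer is attached to the process.
    return fields.get("State"), int(fields.get("TracerPid", "0"))


def tracer_of(pid):
    """Hypothetical helper: read /proc/PID/status for a live process (Linux only)."""
    return parse_status(Path(f"/proc/{pid}/status").read_text())
```

Calling tracer_of(pid) right after GDB reports the detach should then give a
tracer pid of 0 if the detach really completed, or GDB's pid if it did not.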


* [Bug testsuite/31312] attach-many-short-lived-threads gives inconsistent results
  2024-01-29 18:06 [Bug testsuite/31312] New: attach-many-short-lived-threads gives inconsistent results cel at linux dot ibm.com
                   ` (29 preceding siblings ...)
  2024-04-17 14:52 ` pedro at palves dot net
@ 2024-04-30  2:37 ` cvs-commit at gcc dot gnu.org
  30 siblings, 0 replies; 32+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2024-04-30  2:37 UTC (permalink / raw)
  To: gdb-prs

https://sourceware.org/bugzilla/show_bug.cgi?id=31312

--- Comment #28 from Sourceware Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Thiago Bauermann
<bauermann@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=c930a077225ec042287379d8e49b4d547f97d1ba

commit c930a077225ec042287379d8e49b4d547f97d1ba
Author: Thiago Jung Bauermann <thiago.bauermann@linaro.org>
Date:   Sun Mar 17 02:40:05 2024 -0300

    gdb/nat/linux: Fix attaching to process when it has zombie threads

    When GDB attaches to a multi-threaded process, it calls
    linux_proc_attach_tgid_threads () to go through all threads found in
    /proc/PID/task/ and call attach_proc_task_lwp_callback () on each of
    them.  If it does that twice without the callback reporting that a new
    thread was found, then it considers that all inferior threads have been
    found and returns.

    The problem is that the callback considers any thread that it hasn't
    attached to yet as new.  This causes problems if the process has one or
    more zombie threads, because GDB can't attach to it and the loop will
    always "find" a new thread (the zombie one), and get stuck in an
    infinite loop.

    This is easy to trigger (at least on aarch64-linux and powerpc64le-linux)
    with the gdb.threads/attach-many-short-lived-threads.exp testcase, because
    its test program constantly creates and finishes joinable threads so the
    chance of having zombie threads is high.

    This problem causes the following failures:

    FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: attach (timeout)
    FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: no new threads (timeout)
    FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: set breakpoint always-inserted on (timeout)
    FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: break break_fn (timeout)
    FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: break at break_fn: 1 (timeout)
    FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: break at break_fn: 2 (timeout)
    FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: break at break_fn: 3 (timeout)
    FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: reset timer in the inferior (timeout)
    FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: print seconds_left (timeout)
    FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: detach (timeout)
    FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: set breakpoint always-inserted off (timeout)
    FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: delete all breakpoints, watchpoints, tracepoints, and catchpoints in delete_breakpoints (timeout)
    ERROR: breakpoints not deleted

    The iteration number is random, and all tests in the subsequent iterations
    fail too, because GDB is stuck in the attach command at the beginning of
    the iteration.

    The solution is to make linux_proc_attach_tgid_threads () remember when it
    has already processed a given LWP and skip it in the subsequent iterations.

    PR testsuite/31312
    Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=31312

    Reviewed-By: Luis Machado <luis.machado@arm.com>
    Approved-By: Pedro Alves <pedro@palves.net>
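
The loop described in the commit message above can be modelled with a small
Python simulation.  This is only a sketch of the idea, not the GDB code: the
function attach_all_threads, its parameters, and the max_passes safety cap
are all invented for illustration, and the real implementation lives in C++
in gdb/nat/.

```python
def attach_all_threads(list_lwps, try_attach, remember_seen=True, max_passes=100):
    """Scan the thread list until two consecutive passes find nothing new.

    list_lwps()     -> LWP ids currently listed (stand-in for /proc/PID/task/)
    try_attach(lwp) -> True if the attach succeeded, False otherwise (e.g. zombie)
    Returns the number of passes made, or None if max_passes was reached.
    """
    attached = set()
    seen = set()
    quiet_passes = 0
    for passes in range(1, max_passes + 1):
        found_new = False
        for lwp in list_lwps():
            if lwp in attached:
                continue
            # The fix: skip LWPs that were already processed in an earlier pass.
            if remember_seen and lwp in seen:
                continue
            seen.add(lwp)
            # Any thread not yet attached to counts as "new", even a zombie.
            found_new = True
            if try_attach(lwp):
                attached.add(lwp)
        quiet_passes = 0 if found_new else quiet_passes + 1
        if quiet_passes == 2:
            return passes
    return None  # never settled: models the hang seen in the testcase
```

With threads 1, 2 and 3 where 3 is a zombie that cannot be attached to, the
model with remember_seen=True settles after three passes, while
remember_seen=False keeps "finding" thread 3 and only stops at the cap.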

