[Bug gdb/31832] [gdb] FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 3: attach (timeout)


From: "thiago.bauermann at linaro dot org" <sourceware-bugzilla@sourceware.org>
To: gdb-prs@sourceware.org
Subject: [Bug gdb/31832] [gdb] FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 3: attach (timeout)
Date: Fri, 21 Jun 2024 01:21:34 +0000
Message-ID: <bug-31832-4717-nIdie5PrlW@http.sourceware.org/bugzilla/>
In-Reply-To: <bug-31832-4717@http.sourceware.org/bugzilla/>

https://sourceware.org/bugzilla/show_bug.cgi?id=31832

--- Comment #7 from Thiago Jung Bauermann <thiago.bauermann at linaro dot org> ---
Created attachment 15586
  --> https://sourceware.org/bugzilla/attachment.cgi?id=15586&action=edit
Tom's patch with statistics, plus a few more.

(In reply to Tom de Vries from comment #6)
> Created attachment 15582 [details]
> gdb.log (with statistics-gathering patch)
>
> I applied the following statistics-gathering patch:
> ...
> ...

Great idea. I applied your patch, with just some additional statistics:

- no_starttime, when starttime.has_value () is false (can be calculated
  from others, but I wanted to see it easily)
- no_new_thread_found, when attach_lwp (ptid) returns false
- one counter for every reason that starttime can't be obtained

I then ran it on an aarch64 machine with 160 cores.

> and ran the test-case.
>
> The first iteration gives us:
> ...
> (gdb) builtin_spawn
> /home/vries/gdb/build/gdb/testsuite/outputs/gdb.threads/attach-many-short-lived-threads/attach-many-short-lived-threads^M
> attach 1301317^M
> Attaching to program: /home/vries/gdb/build/gdb/testsuite/outputs/gdb.threads/attach-many-short-lived-threads/attach-many-short-lived-threads, process 1301317^M
> FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 1: attach (timeout)
> total_iterations: 2092^M

Wow. On my test aarch64 system, the highest I've seen is 177. Here are the
numbers for that run:

total_iterations: 177
dir_entries: 32930
no_lwp: 365
lookup: 21822
skipped: 21259
insert: 563
attach: 11306
start_over: 176
no_starttime: 10743
no_new_thread_found: 1
stat: cant_open_file: 10735
stat: empty_file: 8
stat: no_parens: 0
stat: no_separator: 0
stat: no_field_beginning: 0
stat: invalid_starttime: 0
stat: unexpected_chars: 0

My machine has many cores, but it's an older CPU model (Neoverse N1).
Comparing these numbers with yours shows that the POWER10 system can churn
out new threads much faster than mine (no surprise there).  My
understanding is that GDB is overwhelmed by the constant stream of newly
spawned threads and takes a while to attach to all of them.
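
As a sanity check, the counters from my run appear to be mutually
consistent. The sketch below is in Python rather than GDB's C++, and the
relations it tests are my reading of the counter names, not something
taken from the patch itself:

```python
# Counters copied from the 177-iteration run above.
stats = {
    "total_iterations": 177,
    "dir_entries": 32930,
    "no_lwp": 365,
    "lookup": 21822,
    "skipped": 21259,
    "insert": 563,
    "attach": 11306,
    "start_over": 176,
    "no_starttime": 10743,
    "cant_open_file": 10735,
    "empty_file": 8,
}


def check_consistency(s):
    """Test relations that are assumed (not confirmed) to hold between counters."""
    return {
        # Every entry with an LWP either reaches a cache lookup or has no starttime.
        "entries": s["dir_entries"] - s["no_lwp"] == s["lookup"] + s["no_starttime"],
        # Every lookup ends in either a skip (cache hit) or an insert (cache miss).
        "lookups": s["lookup"] == s["skipped"] + s["insert"],
        # An attach is attempted after an insert or when there is no starttime.
        "attaches": s["attach"] == s["insert"] + s["no_starttime"],
        # The per-reason counters add up to no_starttime.
        "reasons": s["no_starttime"] == s["cant_open_file"] + s["empty_file"],
    }


def share_without_lookup(s):
    """Fraction of directory entries that never reached a cache lookup."""
    return 1 - s["lookup"] / s["dir_entries"]


if __name__ == "__main__":
    print(check_consistency(stats))
    print(f"{share_without_lookup(stats):.1%}")
```

All four relations hold for this run, which suggests the counters are
being incremented where their names imply.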

As Pedro mentioned elsewhere¹, Linux doesn't provide a way for GDB to stop
all of a process's threads, or to make new ones spawn in a
"ptrace-stopped" state. Without such a mechanism, the only way I can see of
addressing this problem is to make GDB parallelize the job of attaching
to all inferior threads using its worker threads, i.e., fight fire with
fire. :)

That wouldn't be a trivial change though. IIUC it would mean that
different inferior threads would have different tracers (the various GDB
worker threads), and GDB would need to take care to use the correct worker
thread to send ptrace commands to each inferior thread.

Another approach would be to see if there's a way to make
attach_proc_task_lwp_callback () faster, but from reading the code it
doesn't look like there's anything too slow there, except perhaps the
call to linux_proc_pid_is_gone (), which reads /proc/$LWP/status. Even
that would only be a mitigation, though, since the fundamental limitation
would still be there.

Alternatively, considering that the testcase is contrived, could it
increase the timeout in proportion to the number of CPUs on the system?
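
One possible reading of that suggestion, sketched in Python rather than
in the testsuite's Tcl. The function name, the base timeout and the
reference core count are hypothetical placeholders, not values taken
from the testsuite:

```python
import os


def scaled_timeout(base_timeout, ncpus=None, reference_cpus=8):
    """Scale base_timeout linearly with the CPU count.

    reference_cpus is a hypothetical machine size for which base_timeout
    is assumed to be adequate; the result never drops below base_timeout.
    """
    if ncpus is None:
        # os.cpu_count() may return None, so fall back to a single CPU.
        ncpus = os.cpu_count() or 1
    factor = max(1.0, ncpus / reference_cpus)
    return base_timeout * factor


if __name__ == "__main__":
    print(scaled_timeout(60, ncpus=160))
```

Linear scaling is only a guess here; whether the attach time really
grows in proportion to the core count would need measuring.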

> dir_entries: 594518^M
> no_lwp: 4412^M
> lookup: 119037^M
> skipped: 118355^M
> insert: 682^M
> attach: 471751^M
> start_over: 2091^M
> Cannot attach to lwp 2340832: Operation not permitted (1)^M
> ...
>
> I'm not sure what this means, but I do notice the big difference between
> dir_entries and lookup.  So only 20% of the time we find the starttime and
> can use the cache.

I thought that not being able to read starttime from /proc meant that the
thread was gone. But the statistics I pasted above show that about 34% of
the time GDB didn't find the starttime, and it was still able to attach to
all but one of the new threads. My understanding is that there's a race
condition between GDB and the Linux kernel when reading the stat file for
a newly created thread.

This is harmless though: if starttime can't be obtained, GDB will try to
attach to the thread anyway.

On the bright side, this means that the problem isn't with the
std::unordered_set (as I was fearing could be the case). :)
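
The behaviour described above can be sketched as follows. This is a
simplified Python model of one pass over /proc/$PID/task, not GDB's
actual C++ code; the function name and the shape of its arguments are
assumptions made for illustration:

```python
def scan_task_dir(entries, visited, attach):
    """Model one pass over the task directory.

    entries: iterable of (lwp, starttime) pairs, with starttime set to
             None when it could not be read.
    visited: set of (lwp, starttime) pairs seen in earlier passes.
    attach:  callable taking an lwp, returning True if a new thread
             was attached.
    Returns the number of new threads attached in this pass.
    """
    new_threads = 0
    for lwp, starttime in entries:
        if starttime is not None:
            key = (lwp, starttime)
            if key in visited:
                # Same LWP with the same start time: already handled, skip.
                continue
            visited.add(key)
        # Without a starttime the cache can't be used, so try anyway.
        if attach(lwp):
            new_threads += 1
    return new_threads


if __name__ == "__main__":
    seen = set()
    first = scan_task_dir([(10, 500), (11, None)], seen, lambda lwp: True)
    second = scan_task_dir([(10, 500), (11, None)], seen, lambda lwp: False)
    print(first, second, seen)
```

In this model a missing starttime only costs an extra attach attempt per
pass, which matches the "harmless" observation: correctness doesn't
depend on the cache, only the amount of repeated work does.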

The statistics on why GDB can't get starttime are also interesting: at
least on my system, it turns out that almost all of the time it's because
GDB can't open /proc/$PID/task/$LWP/stat. The only other reason (on the
order of 1%-2% of the cases) is that the stat file is empty. No other
early return was taken in my experiments.
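
For illustration, here is a rough Python sketch of reading starttime
from a stat file while reporting why it failed. It is not GDB's parser:
the function names are made up, and the mapping of the reason labels
(borrowed from the counters above) onto specific checks is only an
approximation. It relies on starttime being field 22 of the stat file,
as documented in proc(5).

```python
def parse_starttime_text(text):
    """Return (starttime, None) on success or (None, reason) on failure."""
    if not text:
        return None, "empty_file"
    # The command name is in parentheses and may itself contain ')' or
    # spaces, so look for the last closing parenthesis.
    close = text.rfind(")")
    if close == -1:
        return None, "no_parens"
    fields = text[close + 1:].split()
    # The first field after the command name is field 3 (state), so
    # field 22 (starttime) is at index 19.
    if len(fields) < 20:
        return None, "no_field_beginning"
    try:
        return int(fields[19]), None
    except ValueError:
        return None, "invalid_starttime"


def parse_starttime(path):
    """Read a stat file and parse its starttime field."""
    try:
        with open(path, encoding="utf-8", errors="replace") as f:
            text = f.read()
    except OSError:
        # The thread may have exited, or may not be fully set up yet.
        return None, "cant_open_file"
    return parse_starttime_text(text)


if __name__ == "__main__":
    print(parse_starttime("/proc/self/task/0/stat"))
```

Returning the reason alongside the value makes it easy to keep one
counter per early return, as in the statistics above.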

--
¹ https://inbox.sourceware.org/gdb-patches/9680e3cf-b8ad-4329-a51c-2aafb98d9476@palves.net/


