[RFC 0/3] [gdb/dap] Fix issues triggered by gdb.dap/eof.exp

public inbox for gdb-patches@sourceware.org
 help / color / mirror / Atom feed

* [RFC 0/3] [gdb/dap] Fix issues triggered by gdb.dap/eof.exp
@ 2024-02-07  9:02 Tom de Vries
  2024-02-07  9:02 ` [RFC 1/3] [gdb/dap] Fix exit race Tom de Vries
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Tom de Vries @ 2024-02-07  9:02 UTC (permalink / raw)
  To: gdb-patches

When running test-case gdb.dap/eof.exp on aarch64-linux, sometimes I run into
a gdb segfault.

This series makes the test-case pass reliably.

The series consists of three patches.

The second one seems obvious to me, and doesn't depend on the other patches.

The first and last I'm not sure about, so this series is an RFC.

Tested on aarch64-linux.

Tested by running the dap test-cases 500 times in a row.

Tom de Vries (3):
  [gdb/dap] Fix exit race
  [gdb/dap] Catch and log exceptions in dap threads
  [gdb/dap] Ignore OSError on stream.flush in JSON writer

 gdb/python/lib/gdb/dap/io.py      |  6 ++++--
 gdb/python/lib/gdb/dap/startup.py | 20 ++++++++++++++++++--
 2 files changed, 22 insertions(+), 4 deletions(-)

base-commit: 0afc614c9938fbf5eda10a26c77d574c3c2f945a
-- 
2.35.3

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC 1/3] [gdb/dap] Fix exit race
  2024-02-07  9:02 [RFC 0/3] [gdb/dap] Fix issues triggered by gdb.dap/eof.exp Tom de Vries
@ 2024-02-07  9:02 ` Tom de Vries
  2024-02-07 16:01   ` Tom Tromey
  2024-02-07  9:02 ` [RFC 2/3] [gdb/dap] Catch and log exceptions in dap threads Tom de Vries
  2024-02-07  9:02 ` [RFC 3/3] [gdb/dap] Ignore OSError on stream.flush in JSON writer Tom de Vries
  2 siblings, 1 reply; 17+ messages in thread
From: Tom de Vries @ 2024-02-07  9:02 UTC (permalink / raw)
  To: gdb-patches

When running test-case gdb.dap/eof.exp, we're likely to get a coredump due to
a segfault in new_threadstate.

At the point of the core dump, the gdb main thread looks like:
...
 (gdb) bt
 #0  0x0000fffee30d2280 in __pthread_kill_implementation () from /lib64/libc.so.6
 #1  0x0000fffee3085800 [PAC] in raise () from /lib64/libc.so.6
 #2  0x00000000007b03e8 [PAC] in handle_fatal_signal (sig=11)
     at gdb/event-top.c:926
 #3  0x00000000007b0470 in handle_sigsegv (sig=11)
     at gdb/event-top.c:976
 #4  <signal handler called>
 #5  0x0000fffee3a4db14 in new_threadstate () from /lib64/libpython3.12.so.1.0
 #6  0x0000fffee3ab0548 [PAC] in PyGILState_Ensure () from /lib64/libpython3.12.so.1.0
 #7  0x0000000000a6d034 [PAC] in gdbpy_gil::gdbpy_gil (this=0xffffcb279738)
     at gdb/python/python-internal.h:787
 #8  0x0000000000ab87ac in gdbpy_event::~gdbpy_event (this=0xfffea8001ee0,
     __in_chrg=<optimized out>) at gdb/python/python.c:1051
 #9  0x0000000000ab9460 in std::_Function_base::_Base_manager<...>::_M_destroy
     (__victim=...) at /usr/include/c++/13/bits/std_function.h:175
 #10 0x0000000000ab92dc in std::_Function_base::_Base_manager<...>::_M_manager
     (__dest=..., __source=..., __op=std::__destroy_functor)
     at /usr/include/c++/13/bits/std_function.h:203
 #11 0x0000000000ab8f14 in std::_Function_handler<...>::_M_manager(...) (...)
     at /usr/include/c++/13/bits/std_function.h:282
 #12 0x000000000042dd9c in std::_Function_base::~_Function_base (this=0xfffea8001c10,
     __in_chrg=<optimized out>) at /usr/include/c++/13/bits/std_function.h:244
 #13 0x000000000042e654 in std::function<void ()>::~function() (this=0xfffea8001c10,
     __in_chrg=<optimized out>) at /usr/include/c++/13/bits/std_function.h:334
 #14 0x0000000000b68e60 in std::_Destroy<std::function<void ()> >(...) (...)
     at /usr/include/c++/13/bits/stl_construct.h:151
 #15 0x0000000000b68cd0 in std::_Destroy_aux<false>::__destroy<...>(...) (...)
     at /usr/include/c++/13/bits/stl_construct.h:163
 #16 0x0000000000b689d8 in std::_Destroy<...>(...) (...)
     at /usr/include/c++/13/bits/stl_construct.h:196
 #17 0x0000000000b68414 in std::_Destroy<...>(...) (...)
     at /usr/include/c++/13/bits/alloc_traits.h:948
 #18 std::vector<...>::~vector() (this=0x2a183c8 <runnables>)
     at /usr/include/c++/13/bits/stl_vector.h:732
 #19 0x0000fffee3088370 in __run_exit_handlers () from /lib64/libc.so.6
 #20 0x0000fffee3088450 [PAC] in exit () from /lib64/libc.so.6
 #21 0x0000000000c95600 [PAC] in quit_force (exit_arg=0x0, from_tty=0)
     at gdb/top.c:1822
 #22 0x0000000000609140 in quit_command (args=0x0, from_tty=0)
     at gdb/cli/cli-cmds.c:508
 #23 0x0000000000c926a4 in quit_cover () at gdb/top.c:300
 #24 0x00000000007b09d4 in async_disconnect (arg=0x0)
     at gdb/event-top.c:1230
 #25 0x0000000000548acc in invoke_async_signal_handlers ()
     at gdb/async-event.c:234
 #26 0x000000000157d2d4 in gdb_do_one_event (mstimeout=-1)
     at gdbsupport/event-loop.cc:199
 #27 0x0000000000943a84 in start_event_loop () at gdb/main.c:401
 #28 0x0000000000943bfc in captured_command_loop () at gdb/main.c:465
 #29 0x000000000094567c in captured_main (data=0xffffcb279d08)
     at gdb/main.c:1335
 #30 0x0000000000945700 in gdb_main (args=0xffffcb279d08)
     at gdb/main.c:1354
 #31 0x0000000000423ab4 in main (argc=14, argv=0xffffcb279e98)
     at gdb/gdb.c:39
...

The direct cause of the segfault is calling PyGILState_Ensure after
calling Py_Finalize.

AFAICT the problem is a race between the gdb main thread and DAP's JSON writer
thread.

On one side, we have the following events:
- DAP's JSON reader thread reads an EOF
- it lets DAP's JSON writer thread known by writing None into its queue
- DAP's JSON writer thread sees the None in its queue, and calls
  send_gdb("quit")
- a corresponding gdbpy_event is deposited in the runnables vector, to be
  run by the gdb main thread

On the other side, we have the following events:
- the gdb main thread receives a SIGHUP
- the corresponding handler calls quit_force, which calls do_final_cleanups
- one of the final cleanups is finalize_python, which calls Py_Finalize
- quit_force calls exit, which triggers the exit handlers
- one of the exit handlers is the destructor of the runnables vector
- destruction of the vector triggers destruction of the remaining element
- the remaining element is a gdbpy_event, and the destructor (indirectly)
  calls PyGILState_Ensure

My first attempt at fixing this was to write a finalize_runnables and call it
from quit_force, similar to finalize_values, to ensure that the gdbpy_event
destructor is called before Py_Finalize.  However, I still ran into the same
problem due to the gdbpy_event being posted after finalize_runnables was
called.  I managed to handle this by ignoring run_on_main_thread after
finalize_runnables, but it made me wonder if there's a better way.

Then I tried to simply remove send_gdb("quit"), and that worked as well, and
caused no regressions.  So, either this is the easiest way to address this, or
we need to add a test-case that regresses when we remove it.

This RFC uses the latter approach.  Perhaps the former is better, perhaps
something else is needed, I'm not sure.

Tested on aarch64-linux.

PR dap/31306
Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=31306
---
 gdb/python/lib/gdb/dap/io.py | 1 -
 1 file changed, 1 deletion(-)

diff --git a/gdb/python/lib/gdb/dap/io.py b/gdb/python/lib/gdb/dap/io.py
index 5149edae977..4edd504c727 100644
--- a/gdb/python/lib/gdb/dap/io.py
+++ b/gdb/python/lib/gdb/dap/io.py
@@ -68,7 +68,6 @@ def start_json_writer(stream, queue):
                 # This is an exit request.  The stream is already
                 # flushed, so all that's left to do is request an
                 # exit.
-                send_gdb("quit")
                 break
             obj["seq"] = seq
             seq = seq + 1
-- 
2.35.3


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC 2/3] [gdb/dap] Catch and log exceptions in dap threads
  2024-02-07  9:02 [RFC 0/3] [gdb/dap] Fix issues triggered by gdb.dap/eof.exp Tom de Vries
  2024-02-07  9:02 ` [RFC 1/3] [gdb/dap] Fix exit race Tom de Vries
@ 2024-02-07  9:02 ` Tom de Vries
  2024-02-07 15:52   ` Tom Tromey
  2024-02-07  9:02 ` [RFC 3/3] [gdb/dap] Ignore OSError on stream.flush in JSON writer Tom de Vries
  2 siblings, 1 reply; 17+ messages in thread
From: Tom de Vries @ 2024-02-07  9:02 UTC (permalink / raw)
  To: gdb-patches

When running test-case gdb.dap/eof.exp, it occasionally coredumps.

The thread triggering the coredump is:
...
 #0  0x0000ffff42bb2280 in __pthread_kill_implementation () from /lib64/libc.so.6
 #1  0x0000ffff42b65800 [PAC] in raise () from /lib64/libc.so.6
 #2  0x00000000007b03e8 [PAC] in handle_fatal_signal (sig=11)
     at gdb/event-top.c:926
 #3  0x00000000007b0470 in handle_sigsegv (sig=11)
     at gdb/event-top.c:976
 #4  <signal handler called>
 #5  0x0000000000606080 in cli_ui_out::do_message (this=0xffff2f7ed728, style=...,
     format=0xffff0c002af1 "%s", args=...) at gdb/cli-out.c:232
 #6  0x0000000000ce6358 in ui_out::call_do_message (this=0xffff2f7ed728, style=...,
     format=0xffff0c002af1 "%s") at gdb/ui-out.c:584
 #7  0x0000000000ce6610 in ui_out::vmessage (this=0xffff2f7ed728, in_style=...,
     format=0x16f93ea "", args=...) at gdb/ui-out.c:621
 #8  0x0000000000ce3a9c in ui_file::vprintf (this=0xfffffbea1b18, ...)
     at gdb/ui-file.c:74
 #9  0x0000000000d2b148 in gdb_vprintf (stream=0xfffffbea1b18, format=0x16f93e8 "%s",
     args=...) at gdb/utils.c:1898
 #10 0x0000000000d2b23c in gdb_printf (stream=0xfffffbea1b18, format=0x16f93e8 "%s")
     at gdb/utils.c:1913
 #11 0x0000000000ab5208 in gdbpy_write (self=0x33fe35d0, args=0x342ec280, kw=0x345c08b0)
     at gdb/python/python.c:1464
 #12 0x0000ffff434acedc in cfunction_call () from /lib64/libpython3.12.so.1.0
 #13 0x0000ffff4347c500 [PAC] in _PyObject_MakeTpCall ()
     from /lib64/libpython3.12.so.1.0
 #14 0x0000ffff43488b64 [PAC] in _PyEval_EvalFrameDefault ()
    from /lib64/libpython3.12.so.1.0
 #15 0x0000ffff434d8cd0 [PAC] in method_vectorcall () from /lib64/libpython3.12.so.1.0
 #16 0x0000ffff434b9824 [PAC] in PyObject_CallOneArg () from /lib64/libpython3.12.so.1.0
 #17 0x0000ffff43557674 [PAC] in PyFile_WriteObject () from /lib64/libpython3.12.so.1.0
 #18 0x0000ffff435577a0 [PAC] in PyFile_WriteString () from /lib64/libpython3.12.so.1.0
 #19 0x0000ffff43465354 [PAC] in thread_excepthook () from /lib64/libpython3.12.so.1.0
 #20 0x0000ffff434ac6e0 [PAC] in cfunction_vectorcall_O ()
    from /lib64/libpython3.12.so.1.0
 #21 0x0000ffff434a32d8 [PAC] in PyObject_Vectorcall () from /lib64/libpython3.12.so.1.0
 #22 0x0000ffff43488b64 [PAC] in _PyEval_EvalFrameDefault ()
    from /lib64/libpython3.12.so.1.0
 #23 0x0000ffff434d8d88 [PAC] in method_vectorcall () from /lib64/libpython3.12.so.1.0
 #24 0x0000ffff435e0ef4 [PAC] in thread_run () from /lib64/libpython3.12.so.1.0
 #25 0x0000ffff43591ec0 [PAC] in pythread_wrapper () from /lib64/libpython3.12.so.1.0
 #26 0x0000ffff42bb0584 [PAC] in start_thread () from /lib64/libc.so.6
 #27 0x0000ffff42c1fd4c [PAC] in thread_start () from /lib64/libc.so.6
...

The direct cause for the coredump seems to be that cli_ui_out::do_message
is trying to write to a stream variable which does not look sound:
...
(gdb) p *stream
$8 = {_vptr.ui_file = 0x0, m_applied_style = {m_foreground = {m_simple = true, {
        m_value = 0, {m_red = 0 '\000', m_green = 0 '\000', m_blue = 0 '\000'}}},
    m_background = {m_simple = 32, {m_value = 65535, {m_red = 255 '\377',
          m_green = 255 '\377', m_blue = 0 '\000'}}},
    m_intensity = (unknown: 0x438fe710), m_reverse = 255}}
...

The string that is being printed is:
...
(gdb) p str
$9 = "Exception in thread "
...
so AFAICT this is a DAP thread running into an exception and trying to print
it.

If we look at the state of gdb's main thread, we have:
...
 #0  0x0000ffff42bac914 in __futex_abstimed_wait_cancelable64 () from /lib64/libc.so.6
 #1  0x0000ffff42bafb44 [PAC] in pthread_cond_timedwait@@GLIBC_2.17 ()
    from /lib64/libc.so.6
 #2  0x0000ffff43466e9c [PAC] in take_gil () from /lib64/libpython3.12.so.1.0
 #3  0x0000ffff43484fe0 [PAC] in PyEval_RestoreThread ()
     from /lib64/libpython3.12.so.1.0
 #4  0x0000000000ab8698 [PAC] in gdbpy_allow_threads::~gdbpy_allow_threads (
     this=0xfffffbea1cf8, __in_chrg=<optimized out>)
     at gdb/python/python-internal.h:769
 #5  0x0000000000ab2fec in execute_gdb_command (self=0x33fe35d0, args=0x34297b60,
     kw=0x34553d20) at gdb/python/python.c:681
 #6  0x0000ffff434acedc in cfunction_call () from /lib64/libpython3.12.so.1.0
 #7  0x0000ffff4347c500 [PAC] in _PyObject_MakeTpCall ()
     from /lib64/libpython3.12.so.1.0
 #8  0x0000ffff43488b64 [PAC] in _PyEval_EvalFrameDefault ()
    from /lib64/libpython3.12.so.1.0
 #9  0x0000ffff4353bce8 [PAC] in _PyObject_VectorcallTstate.lto_priv.3 ()
    from /lib64/libpython3.12.so.1.0
 #10 0x0000000000ab87fc [PAC] in gdbpy_event::operator() (this=0xffff14005900)
     at gdb/python/python.c:1061
 #11 0x0000000000ab93e8 in std::__invoke_impl<void, gdbpy_event&> (__f=...)
     at /usr/include/c++/13/bits/invoke.h:61
 #12 0x0000000000ab9204 in std::__invoke_r<void, gdbpy_event&> (__fn=...)
     at /usr/include/c++/13/bits/invoke.h:111
 #13 0x0000000000ab8e90 in std::_Function_handler<..>::_M_invoke(...) (...)
     at /usr/include/c++/13/bits/std_function.h:290
 #14 0x000000000062e0d0 in std::function<void ()>::operator()() const (
     this=0xffff14005830) at /usr/include/c++/13/bits/std_function.h:591
 #15 0x0000000000b67f14 in run_events (error=0, client_data=0x0)
     at gdb/run-on-main-thread.c:76
 #16 0x000000000157e290 in handle_file_event (file_ptr=0x33dae3a0, ready_mask=1)
     at gdbsupport/event-loop.cc:573
 #17 0x000000000157e760 in gdb_wait_for_event (block=1)
     at gdbsupport/event-loop.cc:694
 #18 0x000000000157d464 in gdb_do_one_event (mstimeout=-1)
     at gdbsupport/event-loop.cc:264
 #19 0x0000000000943a84 in start_event_loop () at gdb/main.c:401
 #20 0x0000000000943bfc in captured_command_loop () at gdb/main.c:465
 #21 0x000000000094567c in captured_main (data=0xfffffbea23e8)
     at gdb/main.c:1335
 #22 0x0000000000945700 in gdb_main (args=0xfffffbea23e8)
     at gdb/main.c:1354
 #23 0x0000000000423ab4 in main (argc=14, argv=0xfffffbea2578)
     at gdb/gdb.c:39
...

AFAIU, there's a race between the two threads on gdb_stderr:
- the DAP thread samples the gdb_stderr value, and uses it a bit later to
  print to
- the gdb main thread changes the gdb_stderr value forth and back,
  using a temporary value for string capture purposes

The non-sound stream value is caused by gdb_stderr being sampled while
pointing to a str_file object, and used once the str_file object is already
destroyed.

The error here is that the DAP thread attempts to print to gdb_stderr.

Fix this by adding a thread_wrapper that:
- catches all exceptions and logs them to dap.log, and
- while we're at it, logs when exiting
and using the thread_wrapper for each DAP thread.

Tested on aarch64-linux.
---
 gdb/python/lib/gdb/dap/startup.py | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/gdb/python/lib/gdb/dap/startup.py b/gdb/python/lib/gdb/dap/startup.py
index 4f12e0b0376..7cd1ef2fba1 100644
--- a/gdb/python/lib/gdb/dap/startup.py
+++ b/gdb/python/lib/gdb/dap/startup.py
@@ -62,9 +62,25 @@ def start_thread(name, target, args=()):
     """Start a new thread, invoking TARGET with *ARGS there.
     This is a helper function that ensures that any GDB signals are
     correctly blocked."""
-    result = gdb.Thread(name=name, target=target, args=args, daemon=True)
-    result.start()
 
+    def thread_wrapper(*args):
+        thread_name = threading.current_thread().name
+        # Catch any exception, and log it.  If we let it escape here, it
+        # it'll be printed in gdb_stderr, which is not safe to access from
+        # anywhere but gdb's main thread.
+        try:
+            target(*args)
+        except Exception as err:
+            err_string = "%s, %s" % (err, type(err))
+            log(thread_name + ": caught exception: " + err_string)
+            log_stack()
+        finally:
+            # Log when a thread terminates.
+            log(thread_name + ": terminating")
+
+    result = gdb.Thread(name=name, target=thread_wrapper, args=args,
+                        daemon=True)
+    result.start()
 
 def start_dap(target):
     """Start the DAP thread and invoke TARGET there."""
-- 
2.35.3


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC 3/3] [gdb/dap] Ignore OSError on stream.flush in JSON writer
  2024-02-07  9:02 [RFC 0/3] [gdb/dap] Fix issues triggered by gdb.dap/eof.exp Tom de Vries
  2024-02-07  9:02 ` [RFC 1/3] [gdb/dap] Fix exit race Tom de Vries
  2024-02-07  9:02 ` [RFC 2/3] [gdb/dap] Catch and log exceptions in dap threads Tom de Vries
@ 2024-02-07  9:02 ` Tom de Vries
  2024-02-07 10:29   ` Tom de Vries
  2 siblings, 1 reply; 17+ messages in thread
From: Tom de Vries @ 2024-02-07  9:02 UTC (permalink / raw)
  To: gdb-patches

Due to the previous two fixes, I can now run test-case gdb.dap/eof.exp without
coredumps happening.

However, I do run into:
...
FAIL: gdb.dap/eof.exp: exceptions in log file
...
due to:
...
JSON writer: caught exception: err=OSError(5, 'Input/output error'),
  type(err)=<class 'OSError'>
Traceback (most recent call last):
  File "python/gdb/dap/startup.py", line 72, in thread_wrapper
    target(*args)
  File "python/gdb/dap/io.py", line 80, in _json_writer
    stream.flush()
OSError: [Errno 5] Input/output error
...

I can image that the stream.flush would throw an OSError if it races with the
gdb main thread handling a SIGHUP and finalizing python in the process.

Handle this is in the simplest way I can think of: ignore the OSError, and
terminate the DAP thread.

Tested on aarch64-linux.
---
 gdb/python/lib/gdb/dap/io.py | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/gdb/python/lib/gdb/dap/io.py b/gdb/python/lib/gdb/dap/io.py
index 4edd504c727..e5547bd1072 100644
--- a/gdb/python/lib/gdb/dap/io.py
+++ b/gdb/python/lib/gdb/dap/io.py
@@ -77,6 +77,9 @@ def start_json_writer(stream, queue):
             header_bytes = header.encode("ASCII")
             stream.write(header_bytes)
             stream.write(body_bytes)
-            stream.flush()
+            try:
+                stream.flush()
+            except OSError:
+                break
 
     start_thread("JSON writer", _json_writer)
-- 
2.35.3


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 3/3] [gdb/dap] Ignore OSError on stream.flush in JSON writer
  2024-02-07  9:02 ` [RFC 3/3] [gdb/dap] Ignore OSError on stream.flush in JSON writer Tom de Vries
@ 2024-02-07 10:29   ` Tom de Vries
  0 siblings, 0 replies; 17+ messages in thread
From: Tom de Vries @ 2024-02-07 10:29 UTC (permalink / raw)
  To: gdb-patches; +Cc: Tom Tromey

On 2/7/24 10:02, Tom de Vries wrote:
> Due to the previous two fixes, I can now run test-case gdb.dap/eof.exp without
> coredumps happening.
> 
> However, I do run into:
> ...
> FAIL: gdb.dap/eof.exp: exceptions in log file
> ...
> due to:
> ...
> JSON writer: caught exception: err=OSError(5, 'Input/output error'),
>    type(err)=<class 'OSError'>
> Traceback (most recent call last):
>    File "python/gdb/dap/startup.py", line 72, in thread_wrapper
>      target(*args)
>    File "python/gdb/dap/io.py", line 80, in _json_writer
>      stream.flush()
> OSError: [Errno 5] Input/output error
> ...
> 
> I can image that the stream.flush would throw an OSError if it races with the
> gdb main thread handling a SIGHUP and finalizing python in the process.
> 
> Handle this is in the simplest way I can think of: ignore the OSError, and
> terminate the DAP thread.

I just found that the JSON reader thread does:
...
     except OSError:
         # Reading can also possibly throw an exception.  Treat this as 

         # EOF. 

         log_stack(LogLevel.FULL)
         return None
...

If we however threat this is in the JSON writer thread the same way, we 
get a FAIL from dap_check_log_file, because it checks for ^Traceback, 
which is generated by the log_stack.

Thanks,
- Tom


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 2/3] [gdb/dap] Catch and log exceptions in dap threads
  2024-02-07  9:02 ` [RFC 2/3] [gdb/dap] Catch and log exceptions in dap threads Tom de Vries
@ 2024-02-07 15:52   ` Tom Tromey
  2024-02-12 15:15     ` Tom de Vries
  0 siblings, 1 reply; 17+ messages in thread
From: Tom Tromey @ 2024-02-07 15:52 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gdb-patches

>>>>> "Tom" == Tom de Vries <tdevries@suse.de> writes:

Thanks for the patch.

Tom> - the DAP thread samples the gdb_stderr value, and uses it a bit later to
Tom>   print to
Tom> - the gdb main thread changes the gdb_stderr value forth and back,
Tom>   using a temporary value for string capture purposes

I think the first bullet point here is incomplete.

Tom> +        # Catch any exception, and log it.  If we let it escape here, it
Tom> +        # it'll be printed in gdb_stderr, which is not safe to access from

Remove the "it" at the end of the first line here.

Approved-By: Tom Tromey <tom@tromey.com>

Tom

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/3] [gdb/dap] Fix exit race
  2024-02-07  9:02 ` [RFC 1/3] [gdb/dap] Fix exit race Tom de Vries
@ 2024-02-07 16:01   ` Tom Tromey
  2024-02-13 15:04     ` Tom de Vries
  0 siblings, 1 reply; 17+ messages in thread
From: Tom Tromey @ 2024-02-07 16:01 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gdb-patches

>>>>> "Tom" == Tom de Vries <tdevries@suse.de> writes:

Tom> Then I tried to simply remove send_gdb("quit"), and that worked as well, and
Tom> caused no regressions.  So, either this is the easiest way to address this, or
Tom> we need to add a test-case that regresses when we remove it.

Right now main_loop does:

        # Got the terminate request.  This is handled by the
        # JSON-writing thread, so that we can ensure that all
        # responses are flushed to the client before exiting.
        self.write_queue.put(None)

Like the comment says, this approach tries to ensure that all responses
have been flushed to the client before quitting.  However, I suppose
this approach does not really work, because main_loop then proceeds to
return.

One way to fix this would be to have start_json_writer return the thread
object, and then have main_loop join the thread to terminate after
writing the None.  This would mean making this particular thread
non-daemon though.

In this setup, removing the "quit" here is the right thing to do.

Tom

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 2/3] [gdb/dap] Catch and log exceptions in dap threads
  2024-02-07 15:52   ` Tom Tromey
@ 2024-02-12 15:15     ` Tom de Vries
  2024-02-12 17:35       ` Tom Tromey
  0 siblings, 1 reply; 17+ messages in thread
From: Tom de Vries @ 2024-02-12 15:15 UTC (permalink / raw)
  To: Tom Tromey; +Cc: gdb-patches

On 2/7/24 16:52, Tom Tromey wrote:
>>>>>> "Tom" == Tom de Vries <tdevries@suse.de> writes:
> 
> Thanks for the patch.
> 
> Tom> - the DAP thread samples the gdb_stderr value, and uses it a bit later to
> Tom>   print to
> Tom> - the gdb main thread changes the gdb_stderr value forth and back,
> Tom>   using a temporary value for string capture purposes
> 

Thanks for the review.

> I think the first bullet point here is incomplete.
> 

AFAICT, it's not, but ... not a native speaker.

Should try to I reformulate?

Thanks,
- Tom

> Tom> +        # Catch any exception, and log it.  If we let it escape here, it
> Tom> +        # it'll be printed in gdb_stderr, which is not safe to access from
> 
> Remove the "it" at the end of the first line here.
> 
> Approved-By: Tom Tromey <tom@tromey.com>
> 
> Tom


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 2/3] [gdb/dap] Catch and log exceptions in dap threads
  2024-02-12 15:15     ` Tom de Vries
@ 2024-02-12 17:35       ` Tom Tromey
  0 siblings, 0 replies; 17+ messages in thread
From: Tom Tromey @ 2024-02-12 17:35 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Tom Tromey, gdb-patches

>>>>> "Tom" == Tom de Vries <tdevries@suse.de> writes:

Tom> - the DAP thread samples the gdb_stderr value, and uses it a
Tom> bit later to print to

>> I think the first bullet point here is incomplete.

Tom> AFAICT, it's not, but ... not a native speaker.

Tom> Should try to I reformulate?

I re-read it a couple more times & it makes sense now.
I think it's fine, sorry about that.

Tom

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/3] [gdb/dap] Fix exit race
  2024-02-07 16:01   ` Tom Tromey
@ 2024-02-13 15:04     ` Tom de Vries
  2024-02-13 18:04       ` Tom Tromey
  0 siblings, 1 reply; 17+ messages in thread
From: Tom de Vries @ 2024-02-13 15:04 UTC (permalink / raw)
  To: Tom Tromey; +Cc: gdb-patches

On 2/7/24 17:01, Tom Tromey wrote:
>>>>>> "Tom" == Tom de Vries <tdevries@suse.de> writes:
> 
> Tom> Then I tried to simply remove send_gdb("quit"), and that worked as well, and
> Tom> caused no regressions.  So, either this is the easiest way to address this, or
> Tom> we need to add a test-case that regresses when we remove it.
> 
> Right now main_loop does:
> 
>          # Got the terminate request.  This is handled by the
>          # JSON-writing thread, so that we can ensure that all
>          # responses are flushed to the client before exiting.
>          self.write_queue.put(None)
> 
> Like the comment says, this approach tries to ensure that all responses
> have been flushed to the client before quitting.  However, I suppose
> this approach does not really work, because main_loop then proceeds to
> return.
> 
> One way to fix this would be to have start_json_writer return the thread
> object, and then have main_loop join the thread to terminate after
> writing the None.  This would mean making this particular thread
> non-daemon though.
> 

And as I understand it, the downside of that is that it could possibly 
hang the gdb process.

Btw, I think I found a simpler way of achieving this:
...
diff --git a/gdb/python/lib/gdb/dap/io.py b/gdb/python/lib/gdb/dap/io.py
index 5149edae977..491cf208ae3 100644
--- a/gdb/python/lib/gdb/dap/io.py
+++ b/gdb/python/lib/gdb/dap/io.py
@@ -65,10 +65,7 @@ def start_json_writer(stream, queue):
          while True:
              obj = queue.get()
              if obj is None:
-                # This is an exit request.  The stream is already
-                # flushed, so all that's left to do is request an
-                # exit.
-                send_gdb("quit")
+                queue.task_done()
                  break
              obj["seq"] = seq
              seq = seq + 1
@@ -79,5 +76,6 @@ def start_json_writer(stream, queue):
              stream.write(header_bytes)
              stream.write(body_bytes)
              stream.flush()
+            queue.task_done()

      start_thread("JSON writer", _json_writer)
diff --git a/gdb/python/lib/gdb/dap/server.py 
b/gdb/python/lib/gdb/dap/server.py
index 7cc5a4681ee..ca4465ff207 100644
--- a/gdb/python/lib/gdb/dap/server.py
+++ b/gdb/python/lib/gdb/dap/server.py
@@ -229,6 +229,8 @@ class Server:
          # JSON-writing thread, so that we can ensure that all
          # responses are flushed to the client before exiting.
          self.write_queue.put(None)
+        self.write_queue.join()
+        send_gdb("quit")

      @in_dap_thread
      def send_event_later(self, event, body=None):
...

Regardless, this doesn't address the root cause of the problem, the race 
and crash remain.

+> In this setup, removing the "quit" here is the right thing to do.

I've submitted an updated version (commit message is updated, code is 
the same) of the patch here ( 
https://sourceware.org/pipermail/gdb-patches/2024-February/206567.html ).

Thanks,
- Tom


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/3] [gdb/dap] Fix exit race
  2024-02-13 15:04     ` Tom de Vries
@ 2024-02-13 18:04       ` Tom Tromey
  2024-02-13 18:11         ` Tom Tromey
  2024-02-14 15:31         ` Tom de Vries
  0 siblings, 2 replies; 17+ messages in thread
From: Tom Tromey @ 2024-02-13 18:04 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Tom Tromey, gdb-patches

>>>>> "Tom" == Tom de Vries <tdevries@suse.de> writes:

>> One way to fix this would be to have start_json_writer return the
>> thread
>> object, and then have main_loop join the thread to terminate after
>> writing the None.  This would mean making this particular thread
>> non-daemon though.

Tom> And as I understand it, the downside of that is that it could possibly
Tom> hang the gdb process.

How so?

Tom> +                queue.task_done()

I think SimpleQueue doesn't have task_done, so you have to change this
code to use Queue and instead.

Tom>          # JSON-writing thread, so that we can ensure that all
Tom>          # responses are flushed to the client before exiting.
Tom>          self.write_queue.put(None)
Tom> +        self.write_queue.join()
Tom> +        send_gdb("quit")

I suspect this isn't needed.

Tom

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/3] [gdb/dap] Fix exit race
  2024-02-13 18:04       ` Tom Tromey
@ 2024-02-13 18:11         ` Tom Tromey
  2024-02-14 15:31         ` Tom de Vries
  1 sibling, 0 replies; 17+ messages in thread
From: Tom Tromey @ 2024-02-13 18:11 UTC (permalink / raw)
  To: Tom Tromey; +Cc: Tom de Vries, gdb-patches

>>>>> "Tom" == Tom Tromey <tom@tromey.com> writes:

> +        self.write_queue.join()
> +        send_gdb("quit")

Tom> I suspect this isn't needed.

Sorry -- here I specifically just mean the explicit quit.
However, it's also harmless.

Tom

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/3] [gdb/dap] Fix exit race
  2024-02-13 18:04       ` Tom Tromey
  2024-02-13 18:11         ` Tom Tromey
@ 2024-02-14 15:31         ` Tom de Vries
  2024-02-14 15:34           ` Tom Tromey
  1 sibling, 1 reply; 17+ messages in thread
From: Tom de Vries @ 2024-02-14 15:31 UTC (permalink / raw)
  To: Tom Tromey; +Cc: gdb-patches

On 2/13/24 19:04, Tom Tromey wrote:
>>>>>> "Tom" == Tom de Vries <tdevries@suse.de> writes:
> 
>>> One way to fix this would be to have start_json_writer return the
>>> thread
>>> object, and then have main_loop join the thread to terminate after
>>> writing the None.  This would mean making this particular thread
>>> non-daemon though.
> 
> Tom> And as I understand it, the downside of that is that it could possibly
> Tom> hang the gdb process.
> 
> How so?
> 

I don't have a concrete worry if that is what you mean.  I'm just trying 
to reverse-engineer the decision making the dap threads daemon, and this 
is what I could come up with.  I hope this answers your question.

> Tom> +                queue.task_done()
> 
> I think SimpleQueue doesn't have task_done, so you have to change this
> code to use Queue and instead.
> 

Ack, understood.

Anyway, I've now filed a separate PR for this issue, which I hope makes 
the discussion a bit clearer, so we have:
- a PR for the assertion failure (PR31306)
- a PR for ensuring responses are flushed to client before exiting
   (PR31380)

> Tom>          # JSON-writing thread, so that we can ensure that all
> Tom>          # responses are flushed to the client before exiting.
> Tom>          self.write_queue.put(None)
> Tom> +        self.write_queue.join()
> Tom> +        send_gdb("quit")
> 
> I suspect this isn't needed.

I assume you specifically mean the last line.  Removing that line is the 
proposed fix for PR31306, the patch quoted above is for PR31380, which 
we somehow ended up discussing in this thread on PR31306.

Thanks,
- Tom

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/3] [gdb/dap] Fix exit race
  2024-02-14 15:31         ` Tom de Vries
@ 2024-02-14 15:34           ` Tom Tromey
  2024-02-14 15:53             ` Tom de Vries
  0 siblings, 1 reply; 17+ messages in thread
From: Tom Tromey @ 2024-02-14 15:34 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Tom Tromey, gdb-patches

>>>>> "Tom" == Tom de Vries <tdevries@suse.de> writes:

Tom> Anyway, I've now filed a separate PR for this issue, which I hope
Tom> makes the discussion a bit clearer, so we have:
Tom> - a PR for the assertion failure (PR31306)
Tom> - a PR for ensuring responses are flushed to client before exiting
Tom>   (PR31380)

Ok.  FWIW I think the Queue idea seems totally fine, and combining these
patches seems natural to me.

Tom

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/3] [gdb/dap] Fix exit race
  2024-02-14 15:34           ` Tom Tromey
@ 2024-02-14 15:53             ` Tom de Vries
  2024-02-14 16:18               ` Tom Tromey
  0 siblings, 1 reply; 17+ messages in thread
From: Tom de Vries @ 2024-02-14 15:53 UTC (permalink / raw)
  To: Tom Tromey; +Cc: gdb-patches

On 2/14/24 16:34, Tom Tromey wrote:
>>>>>> "Tom" == Tom de Vries <tdevries@suse.de> writes:
> 
> Tom> Anyway, I've now filed a separate PR for this issue, which I hope
> Tom> makes the discussion a bit clearer, so we have:
> Tom> - a PR for the assertion failure (PR31306)
> Tom> - a PR for ensuring responses are flushed to client before exiting
> Tom>   (PR31380)
> 
> Ok.  FWIW I think the Queue idea seems totally fine, and combining these
> patches seems natural to me.

Sorry for going on this topic, but I'd like to understand why you think 
this.

 From my point of view, it's not a good idea to combine the patches 
because the PRs have distinct root causes, which makes the discussion 
about fixing them together confusing.

So, it's not that I _want_ to do this, it's that I'm convinced by the 
evidence I've seen that this is the natural and obvious thing to do.

Which might mean I'm misunderstanding or overlooking something.

I start to wonder if perhaps you're thinking of a different race, which 
is actually not discussed in the two PRs.

Thanks,
- Tom

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/3] [gdb/dap] Fix exit race
  2024-02-14 15:53             ` Tom de Vries
@ 2024-02-14 16:18               ` Tom Tromey
  2024-02-14 17:16                 ` Tom de Vries
  0 siblings, 1 reply; 17+ messages in thread
From: Tom Tromey @ 2024-02-14 16:18 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Tom Tromey, gdb-patches

>>>>> "Tom" == Tom de Vries <tdevries@suse.de> writes:

>> Ok.  FWIW I think the Queue idea seems totally fine, and combining
>> these
>> patches seems natural to me.

Tom> Sorry for going on this topic, but I'd like to understand why you
Tom> think this.

My thinking is that the code is written this way to solve the flushing
issue (perhaps not well).  Removing the quit fixes one issue, but it
risks introducing another.  I'm not really a stickler for this though.

Tom

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/3] [gdb/dap] Fix exit race
  2024-02-14 16:18               ` Tom Tromey
@ 2024-02-14 17:16                 ` Tom de Vries
  0 siblings, 0 replies; 17+ messages in thread
From: Tom de Vries @ 2024-02-14 17:16 UTC (permalink / raw)
  To: Tom Tromey; +Cc: gdb-patches

On 2/14/24 17:18, Tom Tromey wrote:
>>>>>> "Tom" == Tom de Vries <tdevries@suse.de> writes:
> 
>>> Ok.  FWIW I think the Queue idea seems totally fine, and combining
>>> these
>>> patches seems natural to me.
> 
> Tom> Sorry for going on this topic, but I'd like to understand why you
> Tom> think this.
> 
> My thinking is that the code is written this way to solve the flushing
> issue (perhaps not well).  Removing the quit fixes one issue, but it
> risks introducing another.  I'm not really a stickler for this though.

Thanks for the explanation.

OK, I think I start to understand your position.  You've written both 
pieces of code with the intent to address one problem, which is why 
you're thinking about it as a whole.

I haven't added the code, I'm just looking at the behaviour of two 
different races and am seeing two independent problems.

[ FWIW, I understood the send_gdb("quit") as a means to ensure that 
gdb's main thread doesn't keep hanging after the dap threads have 
exited, which AFAICT is a problem unrelated to the flushing race. ]

To follow up on the risk remark, I'm not sure which risk you're 
referring to.

I've documented one risk in the commit log:
...
     So, for the system I'm running this on, the send_gdb("quit") is
     actually not needed.

     I'm not sure if we support any systems where it's actually needed.
...
There's a risk that indeed there are such systems, but I figured we can 
deal with that when we encounter them.

If you mean the risk of not flushing in time, I don't think there's any 
relation with send_gdb("quit").  The problem with not flushing in time 
is there, and AFAIU the problem is not made any better or worse by 
removing the send_gdb("quit").

I'll proceed to commit the approved patch, since I'm keen on getting rid 
of the segfault.

As for the flushing race, as I've documented in the PR, the queue 
approach doesn't completely fix the race.  I wonder if using a 
gdb_exiting hook to wait for the maint dap thread to exit is the way to 
go there.

Thanks,
- Tom

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2024-02-14 17:15 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-07  9:02 [RFC 0/3] [gdb/dap] Fix issues triggered by gdb.dap/eof.exp Tom de Vries
2024-02-07  9:02 ` [RFC 1/3] [gdb/dap] Fix exit race Tom de Vries
2024-02-07 16:01   ` Tom Tromey
2024-02-13 15:04     ` Tom de Vries
2024-02-13 18:04       ` Tom Tromey
2024-02-13 18:11         ` Tom Tromey
2024-02-14 15:31         ` Tom de Vries
2024-02-14 15:34           ` Tom Tromey
2024-02-14 15:53             ` Tom de Vries
2024-02-14 16:18               ` Tom Tromey
2024-02-14 17:16                 ` Tom de Vries
2024-02-07  9:02 ` [RFC 2/3] [gdb/dap] Catch and log exceptions in dap threads Tom de Vries
2024-02-07 15:52   ` Tom Tromey
2024-02-12 15:15     ` Tom de Vries
2024-02-12 17:35       ` Tom Tromey
2024-02-07  9:02 ` [RFC 3/3] [gdb/dap] Ignore OSError on stream.flush in JSON writer Tom de Vries
2024-02-07 10:29   ` Tom de Vries

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).