double-fork issue on Windows on ARM64

public inbox for cygwin-developers@cygwin.com

* double-fork issue on Windows on ARM64
@ 2024-05-08 18:54 Jeremy Drake
  2024-05-20 23:58 ` Jeremy Drake
  0 siblings, 1 reply; 3+ messages in thread
From: Jeremy Drake @ 2024-05-08 18:54 UTC (permalink / raw)
  To: cygwin-developers; +Cc: Johannes.Schindelin

(this is the same issue discussed in
https://cygwin.com/pipermail/cygwin-patches/2024q1/012621.html)

On MSYS2, running on Windows on ARM64 only, we've been plagued by issues
with processes hanging up.  Usually pacman, when it is trying to validate
signatures with gpgme.  When a process is hung in this way, no debugger
seems to be able to attach properly.

After many months of off-and-on progress trying to debug this, we've
*finally* got an idea of what behavior is causing this, and a standalone
reproducer that runs on Cygwin.

> A common symptom is that the hanging process has a command-line that is
> identical to its parent process' command-line (indicating that it has
> been fork()ed), and anecdotally, the hang occurs when _exit() calls
> proc_terminate() which is then blocked by a call to TerminateThread()
> with an invalid thread handle (for more details, see
> https://github.com/msys2/msys2-autobuild/issues/62#issuecomment-1951796327).
>
> In my tests, I found that the hanging process is spawned from
> _gpgme_io_spawn() which lets the child process immediately spawn another
> child. That seems like a fantastic way to find timing-related bugs in
> the MSYS2/Cygwin runtime.
>
> As a work-around, it does seem to help if we avoid that double-fork.

That led me to make the attached reproducer, which is based on the code
from _gpgme_io_spawn.  I originally expected that this would require some
timing adjustment, hence the defines to change the binary and argument (I
expected to use /bin/sleep and different values).  It turns out, this
reproduces readily with /bin/true.

I build this with `gcc -ggdb -o testfork testfork.c`, and this reproduces:
* on a Raspberry Pi 4 running Windows 10, with an i686 MSYS2 runtime
* on a QC710 running Windows 11 23H2, with an x86_64 MSYS2 runtime (this
seems to reproduce it most readily).
* on a Hyper-V virtual machine on a Dev Kit 2023 running Windows 11 23H2,
with an x86_64 MSYS2 runtime or Cygwin 3.5.3.  This seems to require running
two instances of testfork.exe at the same time.

When attaching to the hung process, gdb shows:
(gdb) i thr
  Id   Target Id                Frame
  1    Thread 6516.0xbe8        error return
/cygdrive/d/a/scallywag/gdb/gdb-13.2-1.x86_64/src/gdb-13.2/gdb/windows-nat.c:748
was 31: A device attached to the system is not functioning.
0x0000000000000000 in ?? ()
  2    Thread 6516.0x1b28 "sig" 0x00007ff8051a8a64 in ?? ()
* 3    Thread 6516.0x12b4       0x00007ff8051b4374 in ?? ()

Let me know if I can provide any additional info, or anything else we can
try to help debug this.

[-- Attachment #2: Type: text/x-c; name=testfork.c, Size: 1045 bytes --]

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef BINARY
#define BINARY "/bin/true"
#endif

#ifndef ARG
#define ARG "0.1"
#endif

int main(int argc, char ** argv)
{
	while (1)
	{
		int pid;
		printf("Starting group of 100x " BINARY " " ARG "\n");
		for (int i = 0; i < 100; ++i)
		{
			pid = fork();
			if (pid == -1)
			{
				perror("fork error");
				return 1;
			}
			else if (pid == 0)
			{
				if ((pid = fork()) == 0)
				{
					char * const args[] = {BINARY, ARG, NULL};
					execv(BINARY, args);
					perror("execv failed");
					_exit(5);
				}
				if (pid == -1)
				{
					perror("inner fork error");
					_exit(1);
				}
				else
				{
					_exit(0);
				}
			}
			else
			{
				int status;
				if (waitpid(pid, &status, 0) == -1)
				{
					perror("waitpid error");
					return 2;
				}
				else if (status != 0)
				{
					fprintf(stderr, "subprocess exited non-zero: %d\n", status);
					return WEXITSTATUS(status);
				}
			}
		}
	}
	return 0;
}

* Re: double-fork issue on Windows on ARM64
  2024-05-08 18:54 double-fork issue on Windows on ARM64 Jeremy Drake
@ 2024-05-20 23:58 ` Jeremy Drake
  2024-05-22  1:01   ` Jeremy Drake
  0 siblings, 1 reply; 3+ messages in thread
From: Jeremy Drake @ 2024-05-20 23:58 UTC (permalink / raw)
  To: cygwin-developers; +Cc: Johannes.Schindelin

On Wed, 8 May 2024, Jeremy Drake wrote:

> (this is the same issue discussed in
> https://cygwin.com/pipermail/cygwin-patches/2024q1/012621.html)
>
> On MSYS2, running on Windows on ARM64 only, we've been plagued by issues
> with processes hanging up.  Usually pacman, when it is trying to validate
> signatures with gpgme.  When a process is hung in this way, no debugger
> seems to be able to attach properly.
>
> > anecdotally, the hang occurs when _exit() calls
> > proc_terminate() which is then blocked by a call to TerminateThread()
> > with an invalid thread handle (for more details, see
> > https://github.com/msys2/msys2-autobuild/issues/62#issuecomment-1951796327).


As a follow-up to this, that was from a proposed workaround of just
commenting out the double-fork behavior in gpgme.  After reading a comment
in the code and doing some research online, it seems the double-fork is an
accepted idiom on POSIX to avoid having to wait for the (grand)child
without creating zombie processes.  I was unable to see zombie processes
in ps or /proc/<pid>, but I did see extra cygpid.* entries in
/proc/sys/BaseNamedObjects/cygwin* which seem to be much the same thing.

Today, I was attempting to look at the TerminateThread situation.  The
call in question comes from the attempt to terminate the wait_thread of a
chld_procs entry.  I noticed elsewhere in cygwin code (flock.cc) that
CancelSynchronousIo was being called, and that stood out to me because
chances are that the wait thread (if running) is going to be blocked in
ReadFile.  I am testing with the following hack, and so far have not seen
a hang:

diff --git a/winsup/cygwin/sigproc.cc b/winsup/cygwin/sigproc.cc
index 86e4e607ab..020906d797 100644
--- a/winsup/cygwin/sigproc.cc
+++ b/winsup/cygwin/sigproc.cc
@@ -410,7 +410,7 @@ proc_terminate ()
 	  if (!have_execed || !have_execed_cygwin)
 	    chld_procs[i]->ppid = 1;
 	  if (chld_procs[i].wait_thread)
-	    chld_procs[i].wait_thread->terminate_thread ();
+	    CancelSynchronousIo (chld_procs[i].wait_thread->thread_handle ());
 	  /* Release memory associated with this process unless it is 'myself'.
 	     'myself' is only in the chld_procs table when we've execed.  We
 	     reach here when the next process has finished initializing but we


As a disclaimer, I am having a hard time wrapping my head around this
code, so I don't know what kind of side-effects this may have, but it does
seem to help the hang, without resulting in "zombie" cygpid entries.

(Note that I first tried
+	      if (CancelSynchronousIo (chld_procs[i].wait_thread->thread_handle ()))
+		chld_procs[i].wait_thread->detach ();
+	      else
+		chld_procs[i].wait_thread->terminate_thread ();
but that resulted in a (debuggable) hang in detach, because the
cygthread::stub was waiting for thread_sync, while cygthread::detach was
waiting for *this.  That appears to be because this is an auto-releasing
cygthread.  It kind of bothers me that there is no synchronization to be
sure the wait_thread is done shutting down before moving on in
proc_terminate, but I don't see an obvious way in the current structure).

* Re: double-fork issue on Windows on ARM64
  2024-05-20 23:58 ` Jeremy Drake
@ 2024-05-22  1:01   ` Jeremy Drake
  0 siblings, 0 replies; 3+ messages in thread
From: Jeremy Drake @ 2024-05-22  1:01 UTC (permalink / raw)
  To: cygwin-developers; +Cc: Johannes.Schindelin

On Mon, 20 May 2024, Jeremy Drake wrote:

> Today, I was attempting to look at the TerminateThread situation.  The
> call in question comes from the attempt to terminate the wait_thread of a
> chld_procs entry.  I noticed elsewhere in cygwin code (flock.cc) that
> CancelSynchronousIo was being called, and that stood out to me because
> chances are that the wait thread (if running) is going to be blocked in
> ReadFile.  I am testing with the following hack, and so far have not seen
> a hang

I left my reproducer running with this hack, and I did eventually get an
error exit from the intermediate subprocess, which seems to have been a
signal 11 (if I'm reading the status from waitpid correctly).
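
For reference, a raw waitpid status can be decoded with the standard
<sys/wait.h> macros rather than read by eye.  The following is a minimal
sketch and not part of the attached testfork.c; describe_wait_status is a
hypothetical helper name, and returning 128 + signal is only the common
shell convention, assumed here for illustration.

```c
#include <stdio.h>
#include <sys/wait.h>

/* Hypothetical helper: write a readable description of a waitpid() status
   into buf.  Returns the exit code for a normal exit, 128 + the signal
   number for a signal death (shell convention, an assumption here), and
   255 for anything else. */
int describe_wait_status(int status, char *buf, size_t len)
{
	if (WIFEXITED(status))
	{
		snprintf(buf, len, "exited with code %d", WEXITSTATUS(status));
		return WEXITSTATUS(status);
	}
	if (WIFSIGNALED(status))
	{
		snprintf(buf, len, "killed by signal %d", WTERMSIG(status));
		return 128 + WTERMSIG(status);
	}
	snprintf(buf, len, "unexpected wait status 0x%x", (unsigned) status);
	return 255;
}
```

In the reproducer this could stand in for the `status != 0` branch, so that
a child dying from SIGSEGV would be reported as "killed by signal 11"
instead of as a raw status value.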

What I noticed today is that in pinfo.cc, near the end of proc_waiter, it
sets vchild.wait_thread = NULL;.  If my reading of this is correct, that
does nothing useful, because vchild is a stack variable there and the
function returns soon after.  I think that what that was *intended* to do
was to NULL out the wait_thread pointer that would be checked in
proc_terminate, but there's no guarantee that the entry in chld_procs is in
the same place at the end of proc_waiter as it was at the start (so arg may
point to some other pinfo entirely).
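
To make the concern concrete, here is a simplified model in plain C.  It is
not Cygwin code: struct entry, waiter_clears_copy, waiter_clears_entry and
remove_entry are hypothetical names, and the swap-with-last removal is an
assumption about how a compacting table might behave, used only to show how
a saved pointer can end up naming a different entry.

```c
#include <stddef.h>

/* Hypothetical stand-in for a chld_procs entry; not the real pinfo class. */
struct entry
{
	int pid;
	void *wait_thread;
};

/* Models a waiter that works on a by-value copy of its entry: the store
   goes to the local copy and is lost when the function returns. */
void waiter_clears_copy(struct entry *arg)
{
	struct entry vchild = *arg;	/* copy living on the stack */
	vchild.wait_thread = NULL;	/* only the copy changes */
	(void) vchild;
}

/* Clearing through the pointer does update the table, but it hits the
   right child only if the entry has not moved since arg was taken. */
void waiter_clears_entry(struct entry *arg)
{
	arg->wait_thread = NULL;
}

/* Remove slot idx by moving the last entry into it (assumed behavior).
   Any pointer saved to slot idx now refers to a different child. */
void remove_entry(struct entry *table, int *count, int idx)
{
	table[idx] = table[--*count];
}
```

With a two-entry table, calling waiter_clears_copy on slot 0 leaves that
slot's wait_thread untouched, and after remove_entry on slot 0 a pointer
saved to that slot refers to what used to be the last entry.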

Does any of this make any sense, or am I barking up the wrong tree here?
