public inbox for cygwin@cygwin.com
 help / color / mirror / Atom feed
* killpg(pgid, 0) fails if the process is in the middle of spawnve()
@ 2022-05-18 11:19 Jun T
  2022-05-18 14:54 ` Corinna Vinschen
  0 siblings, 1 reply; 4+ messages in thread
From: Jun T @ 2022-05-18 11:19 UTC (permalink / raw)
  To: cygwin

Dear Cygwin developers,

It seems killpg(2) on Cygwin has a problem as described below.
Can this be (easily) fixed?

[1] The problem

killpg(pgid, 0) (or kill_pgrp(pgid, si_signo=0), in signal.cc)
fails (returns -1) even when there is a process in the process
group pgid, if the process is in the middle of spawnve().

[2] A problem of zsh on Cygwin that is caused by [1]

More than a year ago, a user of zsh on Cygwin/MSYS2 reported
to the zsh/workers mailing list:
https://www.zsh.org/mla/workers/2021/msg00060.html

As described in this post, it can sometimes (or frequently)
happen that a pipeline like 'ls | less' results in:

zsh% ls --color | less
zsh: done                    ls --color |
zsh: suspended (tty output)  less

How frequently you get this may depend on your hardware,
but if it happens you will find it quite annoying.

[3] How does [1] cause [2]?

According to the strace output, what is happening is as follows:

The main zsh (zsh0) fork() two subshells, zsh1 for ls and
zsh2 for less.
zsh1 becomes a process group leader (pid=pgid=101, for example),
gets tty (becomes foreground), and calls execve(ls).
zsh2 becomes a member of the process group pgid=101, and
calls execve(less).

When ls exits, zsh0 gets SIGCHLD, and in the signal handler
it calls killpg(101, 0) to see if there are any process
remaining in the process group 101. At this point zsh2/less
is still in the process group 101, so killpg(101, 0) should
succeed.

But when problem [2] happens, zsh2 has already called execve(less)
or spawnve(_P_OVERLAY,less), but spawnve() has not finished yet.

There are two Windows processes (zsh2 and less), but it _seems_
neither of them is included in the list of win-pids created by
    winpids pids ((DWORD) PID_MAP_RW);
at line 358 of signal.cc. So kill_pgrp() fails, and zsh0 thinks
that there is no foreground process remaining, and regains tty.

Later spawnve(less) completes, and less wants to write to stdout,
but it has not tty, and is stopped by SIGTTOU.

Is it possible to fix the problem [1], so that killpg(pgid, 0)
succeeds even when all the process(es) in the process group pgid
is/are in the middle of spawnve()?

--
Jun

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: killpg(pgid, 0) fails if the process is in the middle of spawnve()
  2022-05-18 11:19 killpg(pgid, 0) fails if the process is in the middle of spawnve() Jun T
@ 2022-05-18 14:54 ` Corinna Vinschen
  2022-05-19  4:51   ` Jun T
  0 siblings, 1 reply; 4+ messages in thread
From: Corinna Vinschen @ 2022-05-18 14:54 UTC (permalink / raw)
  To: Jun T; +Cc: cygwin

On May 18 20:19, Jun T wrote:
> It seems killpg(2) on Cygwin has a problem as described below.
> Can this be (easily) fixed?
> 
> [1] The problem
> 
> killpg(pgid, 0) (or kill_pgrp(pgid, si_signo=0), in signal.cc)
> fails (returns -1) even when there is a process in the process
> group pgid, if the process is in the middle of spawnve().
> 
> [2] A problem of zsh on Cygwin that is caused by [1]
> [...]
> zsh% ls --color | less
> zsh: done                    ls --color |
> zsh: suspended (tty output)  less
> 
> How frequently you get this may depend on your hardware,
> but if it happens you will find it quite annoying.
> 
> [3] How does [1] cause [2]?
> 
> According to the strace output, what is happening is as follows:
> 
> The main zsh (zsh0) fork() two subshells, zsh1 for ls and
> zsh2 for less.
> zsh1 becomes a process group leader (pid=pgid=101, for example),
> gets tty (becomes foreground), and calls execve(ls).
> zsh2 becomes a member of the process group pgid=101, and
> calls execve(less).
> 
> When ls exits, zsh0 gets SIGCHLD, and in the signal handler
> it calls killpg(101, 0) to see if there are any process
> remaining in the process group 101. At this point zsh2/less
> is still in the process group 101, so killpg(101, 0) should
> succeed.
> 
> But when problem [2] happens, zsh2 has already called execve(less)
> or spawnve(_P_OVERLAY,less), but spawnve() has not finished yet.
> 
> There are two Windows processes (zsh2 and less), but it _seems_
> neither of them is included in the list of win-pids created by
>     winpids pids ((DWORD) PID_MAP_RW);
> at line 358 of signal.cc. So kill_pgrp() fails, and zsh0 thinks
> that there is no foreground process remaining, and regains tty.
> [...]

I'm not 100% sure, but I think the core of the problem is spawn.cc.
Look into the mode == _P_OVERLAY (aka "exec'ing") starting at line 857.

At line 870, the handle to the own winpid object is closed in the
exec'ing process. However, the exec'ing process only creates the winpid
symlink for the exec'ed process if it's a non-Cygwin process. For a
Cygwin process, the assumption is that the exec'ed process creates its
own symlink (in pinfo::thisproc() in pinfo.cc). If the exec'ing process
calls NtClose on it's own winpid symlink, but the exec'ed process didn't
progress enough into initialization, there's a slim chance that neither
the exec'ing process, nor the exec'ed process has a winpid symlink
attached.

As for how to fix this... as a first rough idea I'd think that we
should always create the winpid symlink in spawn.cc, even for
exec'ed Cygwin processes.  We also need to make sure that we dup
the handle into the new process, and stop creating the winpid symlink
in exec'ed processes. Here's a Q&D patch:

diff --git a/winsup/cygwin/pinfo.cc b/winsup/cygwin/pinfo.cc
index 4a8daefd924e..f501c470f34c 100644
--- a/winsup/cygwin/pinfo.cc
+++ b/winsup/cygwin/pinfo.cc
@@ -55,10 +55,11 @@ void
 pinfo::thisproc (HANDLE h)
 {
   procinfo = NULL;
+  bool execed = !!h;
 
   DWORD flags = PID_IN_USE | PID_ACTIVE;
   /* Forked process or process started from non-Cygwin parent needs a pid. */
-  if (!h)
+  if (!execed)
     {
       cygheap->pid = create_cygwin_pid ();
       flags |= PID_NEW;
@@ -72,7 +73,8 @@ pinfo::thisproc (HANDLE h)
   procinfo->dwProcessId = myself_initial.dwProcessId;
   procinfo->sendsig = myself_initial.sendsig;
   wcscpy (procinfo->progname, myself_initial.progname);
-  create_winpid_symlink ();
+  if (!execed)
+    create_winpid_symlink ();
   procinfo->exec_sendsig = NULL;
   procinfo->exec_dwProcessId = 0;
   debug_printf ("myself dwProcessId %u", procinfo->dwProcessId);
diff --git a/winsup/cygwin/spawn.cc b/winsup/cygwin/spawn.cc
index 98b588698c5c..905d949fc5c4 100644
--- a/winsup/cygwin/spawn.cc
+++ b/winsup/cygwin/spawn.cc
@@ -860,13 +860,14 @@ child_info_spawn::worker (const char *prog_arg, const char *const *argv,
 	  strace.execing = 1;
 	  myself.hProcess = hExeced = pi.hProcess;
 	  HANDLE old_winpid_hdl = myself.shared_winpid_handle ();
-	  if (!real_path.iscygexec ())
-	    {
-	      /* If the child process is not a Cygwin process, we have to
-		 create a new winpid symlink on behalf of the child process
-		 not being able to do this by itself. */
-	      myself.create_winpid_symlink ();
-	    }
+	  /* We have to create a new winpid symlink on behalf of the child
+	     process. For Cygwin processes we also have to create a reference
+	     in the child. */
+	  myself.create_winpid_symlink ();
+	  if (real_path.iscygexec ())
+	    DuplicateHandle (GetCurrentProcess (),
+			     myself.shared_winpid_handle (),
+			     pi.hProcess, NULL, 0, 0, DUPLICATE_SAME_ACCESS);
 	  NtClose (old_winpid_hdl);
 	  real_path.get_wide_win32_path (myself->progname); // FIXME: race?
 	  sigproc_printf ("new process name %W", myself->progname);

However, this *might* have side effects, like having two processes with
the same pid in ps output or something like that.  The tricky thing here
is that creating the new winpid symlink and closing the old winpid
symlink should be atomic from the POV of pid collecting processes.

For a start, can you try the above patch?  I probably won't have the
time to come up with an atomic solution in the next 2 weeks, but in
general, the problem should be clear now.

If somebody else comes up with a neat atomic solution during my absence,
I'd be happy, too.


Corinna

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: killpg(pgid, 0) fails if the process is in the middle of spawnve()
  2022-05-18 14:54 ` Corinna Vinschen
@ 2022-05-19  4:51   ` Jun T
  2022-05-19  7:46     ` Corinna Vinschen
  0 siblings, 1 reply; 4+ messages in thread
From: Jun T @ 2022-05-19  4:51 UTC (permalink / raw)
  To: Lavrentiev, Anton (NIH/NLM/NCBI) [C] via Cygwin


> 2022/05/18 23:54, Corinna Vinschen <corinna-cygwin@cygwin.com> wrote:
> 
> For a start, can you try the above patch? 

Thanks for a quick response. The patch seems to solve the problem.
Of course there is a possibility that it happens with very low probability,
but I haven't get the problem by running 'cmd | less' many times, say
fifty times or so.

I didn't notice any bad side effects, and I guess the probability is
very low if it ever happens (so not easy to detect it).
# I tried running 'ls | less' several times while running 
#   while true; do ps; done
# in another terminal, but didn't get duplicated processes.


PS:
I'm new to building cygwin1.dll by myself, and have a few questions:
(1) After make, I've manually copied new-cygwin1.dll to /bin/cygwin1.dll.
I didn't copy/install any other files. Is this OK?
(2) What are cygwin0.dll and cygwin1.dbg? The latter is a new cygwin1.dll
with debug info?
(3) Is there a simple way to build only new-cygwin1.dll (and anything
required by it), without building (or trying to build) unnecessary
things (especially documents)?

--
Jun

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: killpg(pgid, 0) fails if the process is in the middle of spawnve()
  2022-05-19  4:51   ` Jun T
@ 2022-05-19  7:46     ` Corinna Vinschen
  0 siblings, 0 replies; 4+ messages in thread
From: Corinna Vinschen @ 2022-05-19  7:46 UTC (permalink / raw)
  To: Jun T; +Cc: cygwin

On May 19 13:51, Jun T wrote:
> 
> > 2022/05/18 23:54, Corinna Vinschen <corinna-cygwin@cygwin.com> wrote:
> > 
> > For a start, can you try the above patch? 
> 
> Thanks for a quick response. The patch seems to solve the problem.
> Of course there is a possibility that it happens with very low probability,
> but I haven't get the problem by running 'cmd | less' many times, say
> fifty times or so.
> 
> I didn't notice any bad side effects, and I guess the probability is
> very low if it ever happens (so not easy to detect it).
> # I tried running 'ls | less' several times while running 
> #   while true; do ps; done
> # in another terminal, but didn't get duplicated processes.

Great, thanks for testing. I'll push the patch as is for now.

> PS:
> I'm new to building cygwin1.dll by myself, and have a few questions:
> (1) After make, I've manually copied new-cygwin1.dll to /bin/cygwin1.dll.
> I didn't copy/install any other files. Is this OK?

Yes.

> (2) What are cygwin0.dll and cygwin1.dbg? The latter is a new cygwin1.dll
> with debug info?

cygwin0.dll is an intermediate stage during build, cygwin1.dbg is the
debuginfo file.  For quick debugging, you can just copy it alongside the
DLL into /bin.

> (3) Is there a simple way to build only new-cygwin1.dll (and anything
> required by it), without building (or trying to build) unnecessary
> things (especially documents)?

  configure --disable-doc

Everything else is built unconditionally.  If you tweak sources
inside ${srcdir}/winsup/cygwin only, it's usually sufficient to rebuild
inside ${builddir}/x86_64-pc-cygwin/winsup/cygwin only, unless you
change the layout/size of data structures. A build from toplevel
should be performed then.


Thanks,
Corinna

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2022-05-19  7:46 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-18 11:19 killpg(pgid, 0) fails if the process is in the middle of spawnve() Jun T
2022-05-18 14:54 ` Corinna Vinschen
2022-05-19  4:51   ` Jun T
2022-05-19  7:46     ` Corinna Vinschen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).