On Mon, Feb 20, 2017 at 11:54 PM, Mark Geisert wrote:
> Erik Bray wrote:
>>
>> On Mon, Feb 20, 2017 at 11:54 AM, Mark Geisert wrote:
>>>>
>>>> So my guess was that Cygwin might try to hold on to a handle to a
>>>> child process at least until it's been explicitly wait()ed. But that
>>>> does not seem to be the case after all.
>>>
>>> You might have missed a subtlety in what I said above. The Python
>>> interpreter itself is calling wait4() to reap your child process. Cygwin
>>> has told Python one of its children has died. You won't get the chance
>>> to wait() for it yourself. Cygwin *does* have a handle to the process,
>>> but it gets closed as part of Python calling wait4().
>>
>> To be clear, wait4() is not called from Python until the script
>> explicitly calls p.wait(). In other words, when I run this step by step
>> (e.g. in gdb) I don't see a wait4() call until the point where the
>> script explicitly calls p.wait(). I don't see any reason Python would
>> do this behind the scenes.
>
> You're right. I missed the wait in your script and ASSumed too much of the
> Python interpreter :-( .
>
>>>> Anyways, I think it would be nicer if /proc returned at least partial
>>>> information on zombie processes, rather than an error. I have a patch
>>>> to this effect for /proc/<pid>/stat, and will add a few more as well.
>>>> To me /proc/<pid>/stat was the most important because that's the
>>>> easiest way to check the process's state in the first place! Now I
>>>> also have to catch EINVAL and assume that means a zombie process.
>>>
>>> The file /proc/<pid>/stat is there until Cygwin finishes cleanup of the
>>> child due to Python having wait()ed for it. When you run your test
>>> script, pay attention to the process state character in those cases
>>> where you successfully read the stat file. It's often S (stopped, I
>>> think) or R (running) but I also see Z (zombie) sometimes.
>>> Your script is in a race with Cygwin, and you cannot guarantee you'll
>>> see a killed process's state before Cygwin cleans it up.
>>>
>>> One way around this *might* be to install a SIGCHLD handler in your
>>> Python script. If that's possible, that should tell you when your child
>>> exits.
>>
>> Perhaps the Python script is a red herring; I just wrote it to
>> demonstrate the problem. The difference depending on where I send
>> stdout is strange, but you're likely right that it just comes down to
>> subtle timing differences. Here's a C program that demonstrates the
>> same issue more reliably. Interestingly, it works when I run it under
>> strace (probably just because of the strace overhead) but not when I
>> run it normally.
>>
>> My point in all this is that I'm confused why Cygwin would give up its
>> handles to the Windows process before wait() has been called.
>>
>> (In fact, it's pretty confusing to have fopen() return EINVAL, which
>> according to [1] it should only do if the mode string is invalid.)
>>
>> Thanks,
>> Erik
>>
>> [1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/fopen.html
>
> O.K., you may be on to something amiss in the Cygwin DLL. Thanks for the
> STC in C; that'll help somebody looking further at this. I'm out of ideas.
> It might be possible to reduce strace overhead somewhat by selecting a
> smaller set of trace options than the default.

Note: My previous test program had a bug in do_child() (it did not
correctly terminate the argv array). The attached program fixes the bug.
I've also attached a (truncated) strace log from it.
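For readers without the attachment, here is a minimal C sketch of the kind of test discussed in this thread; it is not Erik's attached program. It assumes a POSIX system with waitid() and WNOWAIT and a Linux-style /proc/<pid>/stat whose third field is the state character, and the helper names stat_state_from_line(), read_proc_state() and kill_and_peek() are hypothetical. WNOWAIT lets the parent wait for the killed child to exit without reaping it, so the child stays a zombie while its stat file is read and the result no longer depends on how quickly the parent gets there. On Linux this should report Z; the thread reports that Cygwin can instead fail the fopen() with EINVAL. (In Linux's proc(5), S means sleeping rather than stopped; T is stopped.)

```c
/* Sketch only, not the attached test program. Assumes a POSIX system with
 * waitid()/WNOWAIT and a Linux-style /proc/<pid>/stat (state = 3rd field).
 * The helper names below are hypothetical. */
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Return the state character from one line of /proc/<pid>/stat, or '\0'.
 * The command name is wrapped in parentheses and may itself contain ") ",
 * so the state is taken from just after the LAST ')'. */
static char stat_state_from_line(const char *line)
{
    const char *rparen = strrchr(line, ')');
    if (rparen == NULL || rparen[1] != ' ' || rparen[2] == '\0')
        return '\0';
    return rparen[2];
}

/* Read the state of `pid`. Returns 0 on success; on failure returns -1
 * with errno from fopen() (the thread reports EINVAL on Cygwin) or EIO. */
static int read_proc_state(pid_t pid, char *state)
{
    char path[64], line[1024];
    assert(state != NULL);
    snprintf(path, sizeof path, "/proc/%ld/stat", (long)pid);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    char *got = fgets(line, sizeof line, f);
    fclose(f);
    char c = got ? stat_state_from_line(line) : '\0';
    if (c == '\0') {
        errno = EIO;
        return -1;
    }
    *state = c;
    return 0;
}

/* Fork a child, SIGKILL it, wait for it to exit WITHOUT reaping it
 * (WNOWAIT leaves it a zombie), read its state, then reap it.
 * Returns the state character, or '?' if it could not be read. */
static char kill_and_peek(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return '?';
    if (pid == 0) {
        pause();              /* child: sleep until killed */
        _exit(0);
    }
    kill(pid, SIGKILL);

    siginfo_t info;
    char state = '?';
    if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT) == 0 &&
        read_proc_state(pid, &state) != 0)
        fprintf(stderr, "/proc/%ld/stat: %s\n", (long)pid, strerror(errno));

    waitpid(pid, NULL, 0);    /* now reap the zombie */
    return state;
}
```

If the read fails, kill_and_peek() prints the errno message and returns '?', which makes it easy to see whether a platform reports EINVAL for a child that has exited but not yet been waited for.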
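Mark's SIGCHLD suggestion is aimed at the Python script, but the same idea in C looks roughly like the sketch below, assuming POSIX sigaction() and sigsuspend(); run_child_and_wait_for_sigchld() is a hypothetical name. The handler only sets a flag, and SIGCHLD is blocked before fork() so the notification cannot be lost between checking the flag and going to sleep. Because SA_NOCLDWAIT is not set, the child remains unreaped after the signal until waitpid() collects it.

```c
/* Sketch of the SIGCHLD-handler idea in C. Assumes POSIX sigaction() and
 * sigsuspend(); run_child_and_wait_for_sigchld() is a hypothetical name. */
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t child_exited = 0;

static void on_sigchld(int signo)
{
    (void)signo;
    child_exited = 1;          /* keep the handler async-signal-safe */
}

/* Fork a child that exits with `code`, sleep until SIGCHLD arrives, then
 * reap the child. Returns its exit status, or -1 on error. */
static int run_child_and_wait_for_sigchld(int code)
{
    assert(code >= 0 && code <= 255);

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;  /* no SA_NOCLDWAIT: the child must stay reapable */
    if (sigaction(SIGCHLD, &sa, NULL) != 0)
        return -1;

    /* Block SIGCHLD before fork() so the signal cannot arrive between
     * testing the flag and going to sleep. */
    sigset_t block, old, wait_mask;
    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    if (sigprocmask(SIG_BLOCK, &block, &old) != 0)
        return -1;
    wait_mask = old;
    sigdelset(&wait_mask, SIGCHLD);

    child_exited = 0;
    pid_t pid = fork();
    if (pid == 0)
        _exit(code);           /* child */
    if (pid > 0)
        while (!child_exited)
            sigsuspend(&wait_mask);  /* atomically unblock SIGCHLD and sleep */
    sigprocmask(SIG_SETMASK, &old, NULL);
    if (pid < 0)
        return -1;

    /* The child has exited but is not reaped yet; this is the window in
     * which its /proc/<pid> entry would be inspected. */
    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

For example, run_child_and_wait_for_sigchld(7) forks a child that exits with status 7 and returns 7 once the signal has arrived and the child has been reaped.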
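Erik's closing note concerns an argv array that was not properly terminated. The attached do_child() is not shown, so the function below is a hypothetical stand-in that borrows the name for illustration and assumes a sleep program on PATH. POSIX requires the argv array passed to the exec family to be terminated by a null pointer; without it, exec reads past the end of the array.

```c
/* Hypothetical stand-in for the thread's do_child() (the real one is in
 * the attachment and is not shown). Assumes a `sleep` program on PATH. */
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* exec `sleep <secs>`. execvp() scans argv until it finds a NULL pointer,
 * so the array must end with NULL; without it exec reads past the end of
 * the array, which is undefined behaviour. */
static void do_child(const char *secs)
{
    char *argv[] = { "sleep", (char *)secs, NULL };   /* NULL terminator */
    assert(argv[sizeof argv / sizeof argv[0] - 1] == NULL);
    execvp(argv[0], argv);
    _exit(127);                /* reached only if execvp() failed */
}

/* Run do_child() in a forked process; return its exit status or -1. */
static int run_do_child(const char *secs)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        do_child(secs);        /* never returns */
    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

The _exit(127) after execvp() follows the shell convention for "command could not be executed", so a failed exec is distinguishable from sleep's own exit status.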