* Deadlock of the process tree when running make
@ 2022-04-07 21:53 Alexey Izbyshev
2022-04-07 23:54 ` Brian Inglis
2022-04-09 10:17 ` Takashi Yano
0 siblings, 2 replies; 32+ messages in thread
From: Alexey Izbyshev @ 2022-04-07 21:53 UTC (permalink / raw)
To: cygwin
Hi,
I'm using 32-bit Cygwin 3.3.4 on 64-bit Windows 10 21H2. When running
parallel make (for testing my project), very rarely I get the whole
process tree hanging at some seemingly random point. An example of such
a tree:
make-+-make-+-bash---find
     |      |-bash---find
     |      |-bash---find
     |      |-bash---find
     |      |-bash---find
     |      `-bash---javac
     `-make-+-bash---bash---bash---readlink
            `-bash---bash---bash-+-grep
                                 `-grep
(In the above tree, javac is the zombie parent of a native javac, and
the latter doesn't exist at this point).
I got such a hang twice while running make in a loop for several days.
ProcessHacker shows that all leaf processes are single-threaded and are
stuck on WaitForSingleObject().
I've skimmed git log of cygwin-3_3-branch after cygwin-3_3_4-release,
but couldn't find anything that seems definitely related.
Has anybody seen something like this?
Is there any way I can get useful data for diagnosing this hang from the
process tree that I currently have hanging (I'm going to keep it for
now)? Otherwise, what would be the best strategy?
Thanks,
Alexey
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Deadlock of the process tree when running make
2022-04-07 21:53 Deadlock of the process tree when running make Alexey Izbyshev
@ 2022-04-07 23:54 ` Brian Inglis
2022-04-08 8:42 ` Alexey Izbyshev
2022-04-09 10:17 ` Takashi Yano
1 sibling, 1 reply; 32+ messages in thread
From: Brian Inglis @ 2022-04-07 23:54 UTC (permalink / raw)
To: cygwin; +Cc: Alexey Izbyshev
On 2022-04-07 15:53, Alexey Izbyshev wrote:
> I'm using 32-bit Cygwin 3.3.4 on 64-bit Windows 10 21H2. When running
> parallel make (for testing my project), very rarely I get the whole
> process tree hanging at some seemingly random point. An example of such
> a tree:
>
> make-+-make-+-bash---find
>      |      |-bash---find
>      |      |-bash---find
>      |      |-bash---find
>      |      |-bash---find
>      |      `-bash---javac
>      `-make-+-bash---bash---bash---readlink
>             `-bash---bash---bash-+-grep
>                                  `-grep
>
> (In the above tree, javac is the zombie parent of a native javac, and
> the latter doesn't exist at this point).
>
> I got such hang two times while running make in a loop for several days.
> ProcessHacker shows that all leaf processes are single-threaded and are
> stuck on WaitForSingleObject().
>
> I've skimmed git log of cygwin-3_3-branch after cygwin-3_3_4-release,
> but couldn't find anything that seems definitely related.
>
> Has anybody seen something like this?
>
> Is there any way I can get useful data for diagnosing this hang from the
> process tree that I currently have hanging (I'm going to keep it for
> now)? Otherwise, what would be the best strategy?
I've seen infinite loops with readlink in build scripts under Cygwin.
Seeing that readlink in a process tree makes me suspicious that
something in a shell script is looping because two paths never match or
always match under Cygwin.
Often there is one constant path and a varying path which is subjected
to readlink in a loop.
Under Cygwin, you may have to pass the first path through readlink and
compare that resulting path against the varying value.
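For illustration only, the comparison described above can be sketched in
Python. This is not taken from any build script in this thread, and the
helper name same_location is made up. The point is that both the
constant path and the varying path are canonicalized before they are
compared, so a symlinked spelling of the same directory still matches.

```python
import os
import tempfile


def same_location(fixed_path: str, varying_path: str) -> bool:
    """Compare two paths after resolving symlinks in both of them."""
    return os.path.realpath(fixed_path) == os.path.realpath(varying_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "real")
        os.mkdir(target)
        link = os.path.join(tmp, "alias")
        os.symlink(target, link)  # may need extra privileges on native Windows
        # The raw strings differ, but the canonical forms are the same.
        print(link == target, same_location(target, link))
```

A loop that canonicalizes only one side can keep comparing an
unresolved path against a resolved one and never terminate.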
--
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada
This email may be disturbing to some readers as it contains
too much technical detail. Reader discretion is advised.
[Data in binary units and prefixes, physical quantities in SI.]
* Re: Deadlock of the process tree when running make
2022-04-07 23:54 ` Brian Inglis
@ 2022-04-08 8:42 ` Alexey Izbyshev
2022-04-08 17:04 ` Brian Inglis
0 siblings, 1 reply; 32+ messages in thread
From: Alexey Izbyshev @ 2022-04-08 8:42 UTC (permalink / raw)
To: cygwin
On 2022-04-08 02:54, Brian Inglis wrote:
> I've seen infinite loops with readlink in build scripts under Cygwin.
> Seeing that readlink in a process tree makes me suspicious that
> something in a shell script is looping because two paths never match
> or always match under Cygwin.
> Often there is one constant path and a varying path which is subjected
> to readlink in a loop.
> Under Cygwin, you may have to pass the first path through readlink and
> compare that resulting path against the varying value.
Thanks, but I don't think I have such loops in this project. Also, other
processes hang in independent make jobs, so a hang around readlink
wouldn't explain that.
There is also an additional detail that I forgot to mention: in the
stack trace of all leaf processes as displayed by ProcessHacker, it
seems that the executable entry point is not reached yet. The only
non-Windows-DLL location is in cygwin1.dll, so I suspect that all
processes hang at early initialization in Cygwin's DLL entry point.
Thanks,
Alexey
* Re: Deadlock of the process tree when running make
2022-04-08 8:42 ` Alexey Izbyshev
@ 2022-04-08 17:04 ` Brian Inglis
2022-04-11 13:27 ` Alexey Izbyshev
0 siblings, 1 reply; 32+ messages in thread
From: Brian Inglis @ 2022-04-08 17:04 UTC (permalink / raw)
To: cygwin; +Cc: Alexey Izbyshev
On 2022-04-08 02:42, Alexey Izbyshev wrote:
> On 2022-04-08 02:54, Brian Inglis wrote:
>> I've seen infinite loops with readlink in build scripts under Cygwin.
>> Seeing that readlink in a process tree makes me suspicious that
>> something in a shell script is looping because two paths never match
>> or always match under Cygwin.
>> Often there is one constant path and a varying path which is subjected
>> to readlink in a loop.
>> Under Cygwin, you may have to pass the first path through readlink and
>> compare that resulting path against the varying value.
>
> Thanks, but I don't think I have such loops in this project. Also, other
> processes hang in independent make jobs, so a hang around readlink
> wouldn't explain that.
>
> There is also an additional detail that I forgot to mention: in the
> stack trace of all leaf processes as displayed by ProcessHacker, it
> seems that the executable entry point is not reached yet. The only
> non-Windows-DLL location is in cygwin1.dll, so I suspect that all
> processes hang at early initialization in Cygwin's DLL entry point.
That sounds like BLODA interference from AntiVirus programs:
https://cygwin.com/faq/faq.html#faq.using.bloda
and can also happen if you use Windows AD, and your users have a lot of
rights, and a slow server, firewall filtering, or network link, but
known issues were fixed a few releases ago.
Do you know how much address space the Cygwin DLLs use, and how much
memory all the running processes use? Run rebase -is to see whether you
could be out of address space for Cygwin and its DLLs, and how much is
left for processes.
Do you have a decent amount of free memory on your system while running,
with Windows paging space allocated to back it up (twice the memory size
in total), and do you have multiple drives to spread it across?
Check your system memory and paging activity while those processes are
running.
Could you try installing Cygwin64 packages and running those instead of
Cygwin32 (recommended as Cygwin32 support will be dropped next release)
as there is more address space available as well as usable memory for
processes?
--
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada
This email may be disturbing to some readers as it contains
too much technical detail. Reader discretion is advised.
[Data in binary units and prefixes, physical quantities in SI.]
* Re: Deadlock of the process tree when running make
2022-04-07 21:53 Deadlock of the process tree when running make Alexey Izbyshev
2022-04-07 23:54 ` Brian Inglis
@ 2022-04-09 10:17 ` Takashi Yano
2022-04-09 11:00 ` Alexey Izbyshev
1 sibling, 1 reply; 32+ messages in thread
From: Takashi Yano @ 2022-04-09 10:17 UTC (permalink / raw)
To: cygwin; +Cc: Alexey Izbyshev
On Fri, 08 Apr 2022 00:53:31 +0300
Alexey Izbyshev wrote:
> Hi,
>
> I'm using 32-bit Cygwin 3.3.4 on 64-bit Windows 10 21H2. When running
> parallel make (for testing my project), very rarely I get the whole
> process tree hanging at some seemingly random point. An example of such
> a tree:
>
> make-+-make-+-bash---find
>      |      |-bash---find
>      |      |-bash---find
>      |      |-bash---find
>      |      |-bash---find
>      |      `-bash---javac
>      `-make-+-bash---bash---bash---readlink
>             `-bash---bash---bash-+-grep
>                                  `-grep
>
> (In the above tree, javac is the zombie parent of a native javac, and
> the latter doesn't exist at this point).
>
> I got such hang two times while running make in a loop for several days.
> ProcessHacker shows that all leaf processes are single-threaded and are
> stuck on WaitForSingleObject().
>
> I've skimmed git log of cygwin-3_3-branch after cygwin-3_3_4-release,
> but couldn't find anything that seems definitely related.
>
> Has anybody seen something like this?
>
> Is there any way I can get useful data for diagnosing this hang from the
> process tree that I currently have hanging (I'm going to keep it for
> now)? Otherwise, what would be the best strategy?
Attaching gdb to the hanging process and dumping the stack of each
thread with the 'bt' command may reveal more detail.
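As a rough sketch of that suggestion, a small Python helper can build a
non-interactive gdb command line. The options -batch, -p and -ex are
standard gdb options, and "thread apply all bt" runs 'bt' in every
thread. The helper names are hypothetical, and whether the PID passed
in should be the Cygwin PID or the Windows PID is left open here.

```python
import subprocess


def gdb_backtrace_command(pid: int) -> list[str]:
    """Build a gdb invocation that dumps every thread's stack and exits."""
    return [
        "gdb",
        "-batch",                      # run the commands below, then exit
        "-p", str(pid),                # attach to the running process
        "-ex", "thread apply all bt",  # 'bt' for each thread
    ]


def dump_backtraces(pid: int) -> str:
    """Run gdb and return what it printed (requires gdb on PATH)."""
    result = subprocess.run(
        gdb_backtrace_command(pid),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

Attaching a debugger briefly stops the target, which matters if the
hung state has to be preserved untouched.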
--
Takashi Yano <takashi.yano@nifty.ne.jp>
* Re: Deadlock of the process tree when running make
2022-04-09 10:17 ` Takashi Yano
@ 2022-04-09 11:00 ` Alexey Izbyshev
2022-04-09 11:02 ` Alexey Izbyshev
0 siblings, 1 reply; 32+ messages in thread
From: Alexey Izbyshev @ 2022-04-09 11:00 UTC (permalink / raw)
To: Takashi Yano; +Cc: cygwin
On 2022-04-09 13:17, Takashi Yano wrote:
> Attaching gdb to the hanging process and dumping stack by 'bt'
> command for each thread may diagnose more detail.
I decided to simply look at the assembly at the point shown in the
ProcessHacker stack trace (cygwin1.dll!feinitialise+0x5ecab) to avoid
disturbing the process with gdb. It's clear that the hang is in
fhandler_pty_slave::reset_switch_to_pcon() at [1]. I've checked that
there were some changes in that function since 3.3.4. Could they fix
this deadlock?
[1]
https://cygwin.com/git/?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/fhandler_tty.cc;h=7bef6958c106c5e78cc90e014081022fd3a205bc;hb=cygwin-3_3_4-release
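A likely reason the trace names feinitialise with such a large offset is
that a tool without debug symbols labels an address with the nearest
preceding exported symbol, even when the address really belongs to an
unexported function far behind it. The Python sketch below shows that
lookup. The export table and all its offsets are entirely hypothetical
and are not taken from cygwin1.dll.

```python
import bisect

# Hypothetical export table: (offset from module base, name), sorted by offset.
EXPORTS = [
    (0x1000, "dll_entry"),
    (0x2000, "feinitialise"),
    (0x90000, "fork"),
]


def nearest_export(offset: int) -> str:
    """Label an offset with the closest export at or below it."""
    offsets = [start for start, _ in EXPORTS]
    index = bisect.bisect_right(offsets, offset) - 1
    if index < 0:
        raise ValueError("offset lies before the first export")
    start, name = EXPORTS[index]
    return f"{name}+{offset - start:#x}"


if __name__ == "__main__":
    # An address deep inside some unexported function still gets this label.
    print(nearest_export(0x2000 + 0x5ECAB))
```

With such a label, reading the disassembly at the address is a
reasonable way to identify the actual function.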
Thanks,
Alexey
* Re: Deadlock of the process tree when running make
2022-04-09 11:00 ` Alexey Izbyshev
@ 2022-04-09 11:02 ` Alexey Izbyshev
2022-04-09 11:46 ` Takashi Yano
0 siblings, 1 reply; 32+ messages in thread
From: Alexey Izbyshev @ 2022-04-09 11:02 UTC (permalink / raw)
To: Takashi Yano; +Cc: cygwin
On 2022-04-09 14:00, Alexey Izbyshev wrote:
> On 2022-04-09 13:17, Takashi Yano wrote:
>
>> Attaching gdb to the hanging process and dumping stack by 'bt'
>> command for each thread may diagnose more detail.
>
> I decided to simply look at assembly at the point shown in
> ProcessHacker stack trace (cygwin1.dll!feinitialise+0x5ecab) to avoid
> disturbing the process by gdb. And it's clear that the hang is in
> fhandler_pty_slave::reset_switch_to_pcon() at [1]. I've checked that
> there were some changes in that function since 3.3.4. Could they fix
> this deadlock?
>
> [1]
> https://cygwin.com/git/?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/fhandler_tty.cc;h=7bef6958c106c5e78cc90e014081022fd3a205bc;hb=cygwin-3_3_4-release
>
I missed the line number in the link above:
https://cygwin.com/git?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/fhandler_tty.cc;h=7bef6958c106c5e78cc90e014081022fd3a205bc;hb=cygwin-3_3_4-release#l1199
> Thanks,
> Alexey
* Re: Deadlock of the process tree when running make
2022-04-09 11:02 ` Alexey Izbyshev
@ 2022-04-09 11:46 ` Takashi Yano
2022-04-09 16:07 ` Alexey Izbyshev
0 siblings, 1 reply; 32+ messages in thread
From: Takashi Yano @ 2022-04-09 11:46 UTC (permalink / raw)
To: cygwin; +Cc: Alexey Izbyshev
On Sat, 09 Apr 2022 14:02:38 +0300
Alexey Izbyshev wrote:
> On 2022-04-09 14:00, Alexey Izbyshev wrote:
> > On 2022-04-09 13:17, Takashi Yano wrote:
> >
> >> Attaching gdb to the hanging process and dumping stack by 'bt'
> >> command for each thread may diagnose more detail.
> >
> > I decided to simply look at assembly at the point shown in
> > ProcessHacker stack trace (cygwin1.dll!feinitialise+0x5ecab) to avoid
> > disturbing the process by gdb. And it's clear that the hang is in
> > fhandler_pty_slave::reset_switch_to_pcon() at [1]. I've checked that
> > there were some changes in that function since 3.3.4. Could they fix
> > this deadlock?
> >
> > [1]
> > https://cygwin.com/git/?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/fhandler_tty.cc;h=7bef6958c106c5e78cc90e014081022fd3a205bc;hb=cygwin-3_3_4-release
> >
>
> Missed the line in the link above:
> https://cygwin.com/git?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/fhandler_tty.cc;h=7bef6958c106c5e78cc90e014081022fd3a205bc;hb=cygwin-3_3_4-release#l1199
Thanks for finding that. It would be very helpful if you could find
the other process that holds pcon_mutex and where it is stopped.
--
Takashi Yano <takashi.yano@nifty.ne.jp>
* Re: Deadlock of the process tree when running make
2022-04-09 11:46 ` Takashi Yano
@ 2022-04-09 16:07 ` Alexey Izbyshev
2022-04-09 16:57 ` Takashi Yano
0 siblings, 1 reply; 32+ messages in thread
From: Alexey Izbyshev @ 2022-04-09 16:07 UTC (permalink / raw)
To: Takashi Yano; +Cc: cygwin
On 2022-04-09 14:46, Takashi Yano wrote:
> On Sat, 09 Apr 2022 14:02:38 +0300
> Alexey Izbyshev wrote:
>>
>> Missed the line in the link above:
>> https://cygwin.com/git?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/fhandler_tty.cc;h=7bef6958c106c5e78cc90e014081022fd3a205bc;hb=cygwin-3_3_4-release#l1199
>
> Thanks for finding that. It would be very helpful if you could find
> the other process that holds pcon_mutex and where it is stopped.
ProcessHacker showed that the owner of the pcon mutex is bash.exe with
(Windows) PID 6276. However, Cygwin ps doesn't list such a process. Its
parent, however, has a Cygwin PID 37961 and is in the hanging tree:
make(32651)-+-make(32656)-+-bash(37296)---find(38057)
            |             |-bash(37632)---find(38061)
            |             |-bash(37415)---find(38064)
            |             |-bash(37852)---find(38062)
            |             |-bash(37896)---find(38063)
            |             `-bash(37961)---javac(38032)
            `-make(32657)-+-bash(38025)---bash(38054)---bash(38055)---readlink(38056)
                          `-bash(37722)---bash(37825)---bash(38058)-+-grep(38060)
                                                                    `-grep(38059)
Since javac(38032) is a zombie, my guess is that missing bash.exe (win
6276) is an intermediate process that Cygwin created when bash(37961)
forked to run javac.
bash.exe (win 6276) has two threads. The first one is blocked at
ClosePseudoConsole() (which according to stack trace eventually calls
NtWaitForSingleObject()) [1] and the second one is at [2].
[1]
https://cygwin.com/git?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/fhandler_tty.cc;h=7bef6958c106c5e78cc90e014081022fd3a205bc;hb=cygwin-3_3_4-release#l3615
[2]
https://cygwin.com/git?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/sigproc.cc;h=02d875a7fc947d628ca933690ed43ef03d767d53;hb=cygwin-3_3_4-release#l1359
Hope this is helpful,
Alexey
* Re: Deadlock of the process tree when running make
2022-04-09 16:07 ` Alexey Izbyshev
@ 2022-04-09 16:57 ` Takashi Yano
2022-04-09 17:23 ` Alexey Izbyshev
0 siblings, 1 reply; 32+ messages in thread
From: Takashi Yano @ 2022-04-09 16:57 UTC (permalink / raw)
To: cygwin; +Cc: Alexey Izbyshev
On Sat, 09 Apr 2022 19:07:08 +0300
Alexey Izbyshev wrote:
> On 2022-04-09 14:46, Takashi Yano wrote:
> > On Sat, 09 Apr 2022 14:02:38 +0300
> > Alexey Izbyshev wrote:
> >>
> >> Missed the line in the link above:
> >> https://cygwin.com/git?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/fhandler_tty.cc;h=7bef6958c106c5e78cc90e014081022fd3a205bc;hb=cygwin-3_3_4-release#l1199
> >
> > Thanks for finding that. It would be very helpful if you could find
> > the other process that holds pcon_mutex and where it is stopped.
>
> ProcessHacker showed that the owner of the pcon mutex is bash.exe with
> (Windows) PID 6276. However, Cygwin ps doesn't list such a process. Its
> parent, however, has a Cygwin PID 37961 and is in the hanging tree:
>
> make(32651)-+-make(32656)-+-bash(37296)---find(38057)
>             |             |-bash(37632)---find(38061)
>             |             |-bash(37415)---find(38064)
>             |             |-bash(37852)---find(38062)
>             |             |-bash(37896)---find(38063)
>             |             `-bash(37961)---javac(38032)
>             `-make(32657)-+-bash(38025)---bash(38054)---bash(38055)---readlink(38056)
>                           `-bash(37722)---bash(37825)---bash(38058)-+-grep(38060)
>                                                                     `-grep(38059)
>
> Since javac(38032) is a zombie, my guess is that missing bash.exe (win
> 6276) is an intermediate process that Cygwin created when bash(37961)
> forked to run javac.
>
> bash.exe (win 6276) has two threads. The first one is blocked at
> ClosePseudoConsole() (which according to stack trace eventually calls
> NtWaitForSingleObject()) [1] and the second one is at [2].
>
> [1]
> https://cygwin.com/git?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/fhandler_tty.cc;h=7bef6958c106c5e78cc90e014081022fd3a205bc;hb=cygwin-3_3_4-release#l3615
>
> [2]
> https://cygwin.com/git?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/sigproc.cc;h=02d875a7fc947d628ca933690ed43ef03d767d53;hb=cygwin-3_3_4-release#l1359
>
> Hope this is helpful,
Thank you very much for the information. Can you check if
the thread pty_master_fwd_thread() in root mintty is still
alive?
--
Takashi Yano <takashi.yano@nifty.ne.jp>
* Re: Deadlock of the process tree when running make
2022-04-09 16:57 ` Takashi Yano
@ 2022-04-09 17:23 ` Alexey Izbyshev
2022-04-09 17:54 ` Takashi Yano
2022-04-11 5:23 ` Jeremy Drake
0 siblings, 2 replies; 32+ messages in thread
From: Alexey Izbyshev @ 2022-04-09 17:23 UTC (permalink / raw)
To: Takashi Yano; +Cc: cygwin
On 2022-04-09 19:57, Takashi Yano wrote:
> Thank you very much for the information. Can you check if
> the thread pty_master_fwd_thread() in root mintty is still
> alive?
I don't have mintty because "make" is run via an SSH session. I suppose
I should look into sshd in this case? I've checked an sshd process that
is the parent of this session, and yes, one of its threads is blocked at
https://cygwin.com/git?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/fhandler_tty.cc;h=7bef6958c106c5e78cc90e014081022fd3a205bc;hb=cygwin-3_3_4-release#l2710.
Thanks,
Alexey
* Re: Deadlock of the process tree when running make
2022-04-09 17:23 ` Alexey Izbyshev
@ 2022-04-09 17:54 ` Takashi Yano
2022-04-09 19:35 ` Alexey Izbyshev
2022-04-11 5:23 ` Jeremy Drake
1 sibling, 1 reply; 32+ messages in thread
From: Takashi Yano @ 2022-04-09 17:54 UTC (permalink / raw)
To: cygwin; +Cc: Alexey Izbyshev
On Sat, 09 Apr 2022 20:23:06 +0300
Alexey Izbyshev wrote:
> On 2022-04-09 19:57, Takashi Yano wrote:
> > Thank you very much for the information. Can you check if
> > the thread pty_master_fwd_thread() in root mintty is still
> > alive?
>
> I don't have mintty because "make" is run via an SSH session. I suppose
> I should look into sshd in this case? I've checked an sshd process that
> is the parent of this session, and yes, one of its threads is blocked at
> https://cygwin.com/git?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/fhandler_tty.cc;h=7bef6958c106c5e78cc90e014081022fd3a205bc;hb=cygwin-3_3_4-release#l2710.
Thanks for checking. This seems to be normal, so I cannot
understand why the ClosePseudoConsole() call is blocked...
Microsoft's documentation describes the conditions under which
ClosePseudoConsole() blocks:
https://docs.microsoft.com/en-us/windows/console/closepseudoconsole
However, the thread above is draining the channel.
--
Takashi Yano <takashi.yano@nifty.ne.jp>
* Re: Deadlock of the process tree when running make
2022-04-09 17:54 ` Takashi Yano
@ 2022-04-09 19:35 ` Alexey Izbyshev
2022-04-09 20:26 ` Alexey Izbyshev
0 siblings, 1 reply; 32+ messages in thread
From: Alexey Izbyshev @ 2022-04-09 19:35 UTC (permalink / raw)
To: Takashi Yano; +Cc: cygwin
On 2022-04-09 20:54, Takashi Yano wrote:
> Thanks for checking. This seems to be normal. Then, I cannot
> understand why the ClosePseudoConsole() call is blocked...
>
> The document by Microsoft mentions the blocking conditions of
> ClosePseudoConsole():
> https://docs.microsoft.com/en-us/windows/console/closepseudoconsole
> however, the thread above is draining the channel.
I've decided to check what object ClosePseudoConsole() waits for. The
wait happens inside unexported KERNELBASE!_ClosePseudoConsoleMembers
function. Here is the relevant part:
76589fb5 8b4e08          mov     ecx,dword ptr [esi+8]
76589fb8 e8c2fdffff      call    KERNELBASE!_HandleIsValid (76589d7f)
76589fbd 84c0            test    al,al
76589fbf 7456            je      KERNELBASE!_ClosePseudoConsoleMembers+0x89 (7658a017)
76589fc1 8d45fc          lea     eax,[ebp-4]
76589fc4 895dfc          mov     dword ptr [ebp-4],ebx
76589fc7 50              push    eax
76589fc8 51              push    ecx
76589fc9 e8c23ef5ff      call    KERNELBASE!GetExitCodeProcess (764dde90)
76589fce 85c0            test    eax,eax
76589fd0 7414            je      KERNELBASE!_ClosePseudoConsoleMembers+0x58 (76589fe6)
76589fd2 817dfc03010000  cmp     dword ptr [ebp-4],103h
76589fd9 750b            jne     KERNELBASE!_ClosePseudoConsoleMembers+0x58 (76589fe6)
76589fdb 53              push    ebx
76589fdc 6aff            push    0FFFFFFFFh
76589fde ff7608          push    dword ptr [esi+8]
76589fe1 e8ba74f6ff      call    KERNELBASE!WaitForSingleObjectEx (764f14a0)
"esi" is the argument of ClosePseudoConsole(), so the first mov
dereferences it with an offset and loads a process handle. Then, if this
handle is valid, it calls GetExitCodeProcess(), and if it succeeds and
returns STILL_ACTIVE, it waits for that process.
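The control flow just described can be modelled in a few lines of
Python. This is only a reading aid for the disassembly, not Microsoft's
source code: the function name and the three callbacks are hypothetical
stand-ins for _HandleIsValid, GetExitCodeProcess and
WaitForSingleObjectEx. The constants are the ones visible in the
listing: 103h is STILL_ACTIVE and 0FFFFFFFFh is an infinite timeout.

```python
STILL_ACTIVE = 0x103   # value compared against the exit code at 76589fd2
INFINITE = 0xFFFFFFFF  # timeout pushed at 76589fdc


def close_members_wait_step(process_handle, handle_is_valid, get_exit_code, wait_for):
    """Model of the quoted fragment; returns True if it goes on to wait.

    handle_is_valid(handle) -> bool
    get_exit_code(handle)   -> (succeeded, exit_code)
    wait_for(handle, timeout) blocks until the process exits
    """
    if not handle_is_valid(process_handle):         # test al,al / je +0x89
        return False
    succeeded, exit_code = get_exit_code(process_handle)
    if not succeeded or exit_code != STILL_ACTIVE:  # je / jne +0x58
        return False
    wait_for(process_handle, INFINITE)              # never returns if the
    return True                                     # process never exits
```

Read this way, the call can only hang when the process behind the handle
at offset 8 is still running and does not terminate.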
I've checked that the hanging bash process has only 3 process handles:
for itself, for the dead javac, and for conhost.exe. So obviously it
waits for the latter to terminate. (After I did all this, I realized
there was a much easier way to get this result via the "Analyze wait
chain" feature of Task Manager).
Unfortunately, I don't know anything about Windows consoles, but just in
case I also checked what 5 threads of conhost.exe are waiting for:
1. Tries to enter a critical section (Task Manager claims it waits for
thread 4, so probably the latter owns it).
2. Waits on a handle for "pty1-from-master-nat" named pipe.
3. Waits for an anonymous event.
4. Waits on a handle for "\Device\ConDrv" (in DeviceIoControl()).
5. Blocked in GetMessageW().
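To keep the findings so far in one place, the wait chain reported in
this thread can be written down as a small Python table and walked
programmatically. The edges are taken from the messages above (leaf
processes blocked on the pcon mutex, the mutex owned by bash.exe with
Windows PID 6276, that bash waiting for conhost.exe). The function name
and the labels are made up for this sketch, and why conhost.exe itself
does not exit is still an open question at this point.

```python
# Each entry maps a waiter to what it is blocked on (None = not known to wait).
WAITS_ON = {
    "leaf processes": "pcon_mutex",
    "pcon_mutex": "bash.exe (Windows PID 6276)",
    "bash.exe (Windows PID 6276)": "conhost.exe",
    "conhost.exe": None,
}


def wait_chain(start, waits_on):
    """Follow 'blocked on' edges from start until they end or loop back."""
    chain = [start]
    seen = {start}
    while (blocker := waits_on.get(chain[-1])) is not None:
        chain.append(blocker)
        if blocker in seen:  # a cycle would be a classic deadlock
            break
        seen.add(blocker)
    return chain


if __name__ == "__main__":
    print(" -> ".join(wait_chain("leaf processes", WAITS_ON)))
```

In this table the chain ends at conhost.exe rather than looping back,
so the tree is stuck behind one process that fails to terminate.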
It's also worth noting that this conhost.exe seems to be the only one
related to the Cygwin process tree (as well as the only related
non-Cygwin process). All other conhost.exe processes were created before
I started my stress test.
My guess is that this conhost.exe was created for a native app started
from a Cygwin process. Could it be some race condition/bug that
prevented conhost.exe from terminating once the native process (probably
javac?) died?
Alexey
* Re: Deadlock of the process tree when running make
2022-04-09 19:35 ` Alexey Izbyshev
@ 2022-04-09 20:26 ` Alexey Izbyshev
2022-04-10 7:34 ` Takashi Yano
0 siblings, 1 reply; 32+ messages in thread
From: Alexey Izbyshev @ 2022-04-09 20:26 UTC (permalink / raw)
To: Takashi Yano; +Cc: cygwin
On 2022-04-09 22:35, Alexey Izbyshev wrote:
> On 2022-04-09 20:54, Takashi Yano wrote:
>> Thanks for checking. This seems to be normal. Then, I cannot
>> understand why the ClosePseudoConsole() call is blocked...
>>
>> The document by Microsoft mentions the blocking conditions of
>> ClosePseudoConsole():
>> https://docs.microsoft.com/en-us/windows/console/closepseudoconsole
>> however, the thread above is draining the channel.
>
> I've decided to check what object ClosePseudoConsole() waits for. The
> wait happens inside unexported KERNELBASE!_ClosePseudoConsoleMembers
> function. Here is the relevant part:
>
> 76589fb5 8b4e08          mov     ecx,dword ptr [esi+8]
> 76589fb8 e8c2fdffff      call    KERNELBASE!_HandleIsValid (76589d7f)
> 76589fbd 84c0            test    al,al
> 76589fbf 7456            je      KERNELBASE!_ClosePseudoConsoleMembers+0x89 (7658a017)
> 76589fc1 8d45fc          lea     eax,[ebp-4]
> 76589fc4 895dfc          mov     dword ptr [ebp-4],ebx
> 76589fc7 50              push    eax
> 76589fc8 51              push    ecx
> 76589fc9 e8c23ef5ff      call    KERNELBASE!GetExitCodeProcess (764dde90)
> 76589fce 85c0            test    eax,eax
> 76589fd0 7414            je      KERNELBASE!_ClosePseudoConsoleMembers+0x58 (76589fe6)
> 76589fd2 817dfc03010000  cmp     dword ptr [ebp-4],103h
> 76589fd9 750b            jne     KERNELBASE!_ClosePseudoConsoleMembers+0x58 (76589fe6)
> 76589fdb 53              push    ebx
> 76589fdc 6aff            push    0FFFFFFFFh
> 76589fde ff7608          push    dword ptr [esi+8]
> 76589fe1 e8ba74f6ff      call    KERNELBASE!WaitForSingleObjectEx (764f14a0)
>
> "esi" is the argument of ClosePseudoConsole(), so the first mov
> dereferences it with an offset and loads a process handle. Then, if
> this handle is valid, it calls GetExitCodeProcess(), and if it
> succeeds and returns STILL_ACTIVE, it waits for that process.
>
> I've checked that the hanging bash process has only 3 process handles:
> for itself, for the dead javac, and for conhost.exe. So obviously it
> waits for the latter to terminate. (After I did all this, I realized
> there was a much easier way to get this result via the "Analyze wait
> chain" feature of Task Manager).
>
> Unfortunately, I don't know anything about Windows consoles, but just
> in case I also checked what 5 threads of conhost.exe are waiting for:
>
> 1. Tries to enter a critical section (Task Manager claims it waits for
> thread 4, so probably the latter owns it).
> 2. Waits on a handle for "pty1-from-master-nat" named pipe.
> 3. Waits for an anonymous event.
> 4. Waits on a handle for "\Device\ConDrv" (in DeviceIoControl()).
> 5. Blocked in GetMessageW().
>
> It's also worth noting that this conhost.exe seems to be the only one
> related to the Cygwin process tree (as well as the only related
> non-Cygwin process). All other conhost.exe processes were created
> before I started my stress test.
>
> My guess is that this conhost.exe was created for a native app started
> from a Cygwin process. Could it be some race condition/bug that
> prevented conhost.exe from terminating once the native process
> (probably javac?) died?
>
A few more things that might be important:
* Clarification: thread 2 of conhost.exe waits in KernelBase!ReadFile().
* In the assembly part I omitted, before waiting on the conhost process,
_ClosePseudoConsoleMembers() closes the handle obtained from "dword ptr
[esi]", i.e. "hWritePipe" member of HPCON_INTERNAL struct.
Alexey
* Re: Deadlock of the process tree when running make
2022-04-09 20:26 ` Alexey Izbyshev
@ 2022-04-10 7:34 ` Takashi Yano
2022-04-10 12:13 ` Alexey Izbyshev
0 siblings, 1 reply; 32+ messages in thread
From: Takashi Yano @ 2022-04-10 7:34 UTC (permalink / raw)
To: cygwin; +Cc: Alexey Izbyshev
On Sat, 09 Apr 2022 23:26:51 +0300
Alexey Izbyshev wrote:
> On 2022-04-09 22:35, Alexey Izbyshev wrote:
> > On 2022-04-09 20:54, Takashi Yano wrote:
> >> Thanks for checking. This seems to be normal. Then, I cannot
> >> understand why the ClosePseudoConsole() call is blocked...
> >>
> >> The document by Microsoft mentions the blocking conditions of
> >> ClosePseudoConsole():
> >> https://docs.microsoft.com/en-us/windows/console/closepseudoconsole
> >> however, the thread above is draining the channel.
> >
> > I've decided to check what object ClosePseudoConsole() waits for. The
> > wait happens inside unexported KERNELBASE!_ClosePseudoConsoleMembers
> > function. Here is the relevant part:
> >
> > 76589fb5 8b4e08          mov     ecx,dword ptr [esi+8]
> > 76589fb8 e8c2fdffff      call    KERNELBASE!_HandleIsValid (76589d7f)
> > 76589fbd 84c0            test    al,al
> > 76589fbf 7456            je      KERNELBASE!_ClosePseudoConsoleMembers+0x89 (7658a017)
> > 76589fc1 8d45fc          lea     eax,[ebp-4]
> > 76589fc4 895dfc          mov     dword ptr [ebp-4],ebx
> > 76589fc7 50              push    eax
> > 76589fc8 51              push    ecx
> > 76589fc9 e8c23ef5ff      call    KERNELBASE!GetExitCodeProcess (764dde90)
> > 76589fce 85c0            test    eax,eax
> > 76589fd0 7414            je      KERNELBASE!_ClosePseudoConsoleMembers+0x58 (76589fe6)
> > 76589fd2 817dfc03010000  cmp     dword ptr [ebp-4],103h
> > 76589fd9 750b            jne     KERNELBASE!_ClosePseudoConsoleMembers+0x58 (76589fe6)
> > 76589fdb 53              push    ebx
> > 76589fdc 6aff            push    0FFFFFFFFh
> > 76589fde ff7608          push    dword ptr [esi+8]
> > 76589fe1 e8ba74f6ff      call    KERNELBASE!WaitForSingleObjectEx (764f14a0)
> >
> > "esi" is the argument of ClosePseudoConsole(), so the first mov
> > dereferences it with an offset and loads a process handle. Then, if
> > this handle is valid, it calls GetExitCodeProcess(), and if it
> > succeeds and returns STILL_ACTIVE, it waits for that process.
> >
> > I've checked that the hanging bash process has only 3 process handles:
> > for itself, for the dead javac, and for conhost.exe. So obviously it
> > waits for the latter to terminate. (After I did all this, I realized
> > there was a much easier way to get this result via the "Analyze wait
> > chain" feature of Task Manager).
> >
> > Unfortunately, I don't know anything about Windows consoles, but just
> > in case I also checked what 5 threads of conhost.exe are waiting for:
> >
> > 1. Tries to enter a critical section (Task Manager claims it waits for
> > thread 4, so probably the latter owns it).
> > 2. Waits on a handle for "pty1-from-master-nat" named pipe.
> > 3. Waits for an anonymous event.
> > 4. Waits on a handle for "\Device\ConDrv" (in DeviceIoControl()).
> > 5. Blocked in GetMessageW().
> >
> > It's also worth noting that this conhost.exe seems to be the only one
> > related to the Cygwin process tree (as well as the only related
> > non-Cygwin process). All other conhost.exe processes were created
> > before I started my stress test.
> >
> > My guess is that this conhost.exe was created for a native app started
> > from a Cygwin process. Could it be some race condition/bug that
> > prevented conhost.exe from terminating once the native process
> > (probably javac?) died?
> >
> A few more things that might be important:
>
> * Clarification: thread 2 of conhost.exe waits in KernelBase!ReadFile().
>
> * In the assembly part I omitted, before waiting on the conhost process,
> _ClosePseudoConsoleMembers() closes the handle obtained from "dword ptr
> [esi]", i.e. "hWritePipe" member of HPCON_INTERNAL struct.
Thanks for investigating. In the normal case, conhost.exe terminates
when hWritePipe is closed.
Possibly, hWritePipe has an incorrect handle value.
--
Takashi Yano <takashi.yano@nifty.ne.jp>
* Re: Deadlock of the process tree when running make
2022-04-10 7:34 ` Takashi Yano
@ 2022-04-10 12:13 ` Alexey Izbyshev
2022-04-10 20:49 ` Alexey Izbyshev
0 siblings, 1 reply; 32+ messages in thread
From: Alexey Izbyshev @ 2022-04-10 12:13 UTC (permalink / raw)
To: Takashi Yano; +Cc: cygwin
On 2022-04-10 10:34, Takashi Yano wrote:
> On Sat, 09 Apr 2022 23:26:51 +0300
> Thanks for investigating. In the normal case, conhost.exe is terminated
> when hWritePipe is closed.
Thanks for confirming.
>
> Possibly, the hWritePipe has incorrect handle value.
I've verified that the handle was correct by attaching via gdb to the
hanging bash and checking that hWritePipe field is now zeroed (which
happens only in the branch where _HandleIsValid returns true and
hWritePipe is closed).
I've found something interesting though. I've modeled a similar
situation on another machine:
1. I've run a native process via bash.
2. I've attached to bash via gdb and set a breakpoint on
ClosePseudoConsole().
3. I've killed the native process.
4. The breakpoint was hit, and I looked at hWritePipe value.
ProcessHacker shows it as "Unnamed file: \FileSystem\Npfs". Both bash
and conhost had a single handle with that name, and after I forcibly
closed it in the bash process (while it was still suspended by gdb),
conhost.exe indeed died.
Then I looked at the original hanging tree and found that the hanging
bash.exe still has a single handle displayed as "Unnamed file:
\FileSystem\Npfs". I don't know how to check what kernel object it
refers to, but at least its access rights are the same as for hWritePipe
that I've seen on another machine, and its handle count is 1. So could
it be another copy of hWritePipe, e.g. due to some handle leak?
I don't know how to verify whether this suspicious handle in bash.exe is
paired with "Unnamed file: \FileSystem\Npfs" in conhost.exe, other than
by forcibly closing it. If I close it and conhost.exe dies, it will
confirm "the extra handle" theory, but will also prevent further
investigation with the hanging tree. Do you have any advice?
Thanks,
Alexey
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Deadlock of the process tree when running make
2022-04-10 12:13 ` Alexey Izbyshev
@ 2022-04-10 20:49 ` Alexey Izbyshev
2022-04-11 8:35 ` Takashi Yano
0 siblings, 1 reply; 32+ messages in thread
From: Alexey Izbyshev @ 2022-04-10 20:49 UTC (permalink / raw)
To: Takashi Yano; +Cc: cygwin
On 2022-04-10 15:13, Alexey Izbyshev wrote:
> On 2022-04-10 10:34, Takashi Yano wrote:
>> On Sat, 09 Apr 2022 23:26:51 +0300
>> Thanks for investigating. In the normal case, conhost.exe is
>> terminated
>> when hWritePipe is closed.
>
> Thanks for confirming.
>
>>
>> Possibly, the hWritePipe has an incorrect handle value.
>
> I've verified that the handle was correct by attaching via gdb to the
> hanging bash and checking that hWritePipe field is now zeroed (which
> happens only in the branch where _HandleIsValid returns true and
> hWritePipe is closed).
>
> I've found something interesting though. I've modeled a similar
> situation on another machine:
>
> 1. I've run a native process via bash.
> 2. I've attached to bash via gdb and set a breakpoint on
> ClosePseudoConsole().
> 3. I've killed the native process.
> 4. The breakpoint was hit, and I looked at hWritePipe value.
>
> ProcessHacker shows it as "Unnamed file: \FileSystem\Npfs". Both bash
> and conhost had a single handle with such name, and after I've
> forcibly closed it in the bash process (while it was still suspended
> by gdb), conhost.exe indeed died.
>
> Then I looked at the original hanging tree and found that the hanging
> bash.exe still has a single handle displayed as "Unnamed file:
> \FileSystem\Npfs". I don't know how to check what kernel object it
> refers to, but at least its access rights are the same as for
> hWritePipe that I've seen on another machine, and its handle count is
> 1. So could it be another copy of hWritePipe, e.g. due to some handle
> leak?
>
> I don't know how to verify whether this suspicious handle in bash.exe
> is paired with "Unnamed file: \FileSystem\Npfs" in conhost.exe, other
> than by forcibly closing it. If I close it and conhost.exe dies, it
> will confirm "the extra handle" theory, but will also prevent further
> investigation with the hanging tree. Do you have any advice?
>
I've found something that looked strange to me by checking handles in
the hanging process tree: the hanging conhost.exe and the hanging
bash.exe belong to different tests. Each test is a separate shell script
in a separate make recipe, so it looks like conhost.exe was created by
one test (which is still hanging at a later point in its script, trying
to run grep), but then bash.exe belonging to another test somehow got a
pseudoconsole referring to this conhost.exe and now hangs trying to
close it. So it looks like Cygwin migrated the pseudoconsole between
processes, and indeed fhandler_pty_slave::close_pseudoconsole() contains
something looking like migration logic. And this logic contains the
following call:
    DuplicateHandle (GetCurrentProcess (),
                     ttyp->h_pcon_write_pipe,
                     new_owner, &new_write_pipe,
                     0, TRUE, DUPLICATE_SAME_ACCESS);
Is it safe to create an *inheritable* handle in another process here?
Could it be that the target process spawns a child at the wrong moment
(e.g. before it even knows about the newly created handle), and that
handle unintentionally leaks into the child, triggering the hang
afterwards?
Similarly suspicious code is also in
fhandler_pty_common::resize_pseudo_console():
    DuplicateHandle (pcon_owner, get_ttyp ()->h_pcon_write_pipe,
                     GetCurrentProcess (), &hpcon_local.hWritePipe,
                     0, TRUE, DUPLICATE_SAME_ACCESS);
    ResizePseudoConsole ((HPCON) &hpcon_local, size);
    CloseHandle (pcon_owner);
    CloseHandle (hpcon_local.hWritePipe);
If another thread spawns a child using
CreateProcess(bInheritHandles=TRUE) between DuplicateHandle() and
CloseHandle(hpcon_local.hWritePipe), the handle will leak into the
child.
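To make the concern above concrete, here is a toy model in portable C++ (not the Win32 API; Process, Handle, PipeWriteEnd and every function name are made up for illustration). It assumes the pipe reader only sees EOF once every handle to the write end is closed, and that spawning with handle inheritance copies each inheritable handle into the child. Under those assumptions, an inheritable duplicate that exists while a child is spawned keeps the write end open even after the owner closes its own copy:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Toy model only: this is NOT the Win32 API. Every name below is hypothetical.

// The write end of a pipe; it stays open while any handle refers to it.
struct PipeWriteEnd {
    int open_handles = 0;
    bool reader_sees_eof() const { return open_handles == 0; }
};

struct Handle {
    std::shared_ptr<PipeWriteEnd> pipe;
    bool inheritable;
};

struct Process {
    std::vector<Handle> handles;

    // Models DuplicateHandle() targeting this process.
    std::size_t add_handle(const std::shared_ptr<PipeWriteEnd>& pipe,
                           bool inheritable) {
        ++pipe->open_handles;
        handles.push_back(Handle{pipe, inheritable});
        return handles.size() - 1;
    }

    // Models CloseHandle().
    void close_handle(std::size_t index) {
        --handles[index].pipe->open_handles;
        handles.erase(handles.begin() + static_cast<std::ptrdiff_t>(index));
    }

    // Models CreateProcess(bInheritHandles=TRUE): the child gets a copy of
    // every inheritable handle, whether or not the parent intended it.
    Process spawn_child() const {
        Process child;
        for (const Handle& h : handles)
            if (h.inheritable)
                child.add_handle(h.pipe, true);
        return child;
    }
};

// A child is spawned while the owner still holds the duplicated handle;
// the owner then closes it. Returns whether the reader would see EOF.
bool reader_unblocks_after_close(bool inheritable) {
    auto pipe = std::make_shared<PipeWriteEnd>();
    Process owner;
    std::size_t h = owner.add_handle(pipe, inheritable);
    Process child = owner.spawn_child();  // the unlucky timing
    owner.close_handle(h);
    return pipe->reader_sees_eof();       // the child never closes its copy
}
```

In this model reader_unblocks_after_close(false) returns true and reader_unblocks_after_close(true) returns false, which is why a non-inheritable duplicate looks like the safer choice. Whether this is what actually happened in the hanging tree is not established by the model.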
Sorry if this is a false lead, I haven't tried to really understand the
pseudoconsole-related code yet.
Thanks,
Alexey
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Deadlock of the process tree when running make
2022-04-09 17:23 ` Alexey Izbyshev
2022-04-09 17:54 ` Takashi Yano
@ 2022-04-11 5:23 ` Jeremy Drake
2022-04-11 8:36 ` Takashi Yano
2022-04-11 15:28 ` Alexey Izbyshev
1 sibling, 2 replies; 32+ messages in thread
From: Jeremy Drake @ 2022-04-11 5:23 UTC (permalink / raw)
To: cygwin
On Sat, 9 Apr 2022, Alexey Izbyshev wrote:
> I don't have mintty because "make" is run via an SSH session. I suppose
> I should look into sshd in this case?
Sshd wouldn't happen to be running as a service, would it?
https://cygwin.com/pipermail/cygwin-patches/2022q2/011867.html
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Deadlock of the process tree when running make
2022-04-10 20:49 ` Alexey Izbyshev
@ 2022-04-11 8:35 ` Takashi Yano
2022-04-11 10:10 ` Alexey Izbyshev
0 siblings, 1 reply; 32+ messages in thread
From: Takashi Yano @ 2022-04-11 8:35 UTC (permalink / raw)
To: cygwin; +Cc: Alexey Izbyshev
On Sun, 10 Apr 2022 23:49:29 +0300
Alexey Izbyshev wrote:
> On 2022-04-10 15:13, Alexey Izbyshev wrote:
> > On 2022-04-10 10:34, Takashi Yano wrote:
> >> On Sat, 09 Apr 2022 23:26:51 +0300
> >> Thanks for investigating. In the normal case, conhost.exe is
> >> terminated
> >> when hWritePipe is closed.
> >
> > Thanks for confirming.
> >
> >>
> >> Possibly, the hWritePipe has an incorrect handle value.
> >
> > I've verified that the handle was correct by attaching via gdb to the
> > hanging bash and checking that hWritePipe field is now zeroed (which
> > happens only in the branch where _HandleIsValid returns true and
> > hWritePipe is closed).
> >
> > I've found something interesting though. I've modeled a similar
> > situation on another machine:
> >
> > 1. I've run a native process via bash.
> > 2. I've attached to bash via gdb and set a breakpoint on
> > ClosePseudoConsole().
> > 3. I've killed the native process.
> > 4. The breakpoint was hit, and I looked at hWritePipe value.
> >
> > ProcessHacker shows it as "Unnamed file: \FileSystem\Npfs". Both bash
> > and conhost had a single handle with such name, and after I've
> > forcibly closed it in the bash process (while it was still suspended
> > by gdb), conhost.exe indeed died.
> >
> > Then I looked at the original hanging tree and found that the hanging
> > bash.exe still has a single handle displayed as "Unnamed file:
> > \FileSystem\Npfs". I don't know how to check what kernel object it
> > refers to, but at least its access rights are the same as for
> > hWritePipe that I've seen on another machine, and its handle count is
> > 1. So could it be another copy of hWritePipe, e.g. due to some handle
> > leak?
> >
> > I don't know how to verify whether this suspicious handle in bash.exe
> > is paired with "Unnamed file: \FileSystem\Npfs" in conhost.exe, other
> > than by forcibly closing it. If I close it and conhost.exe dies, it
> > will confirm "the extra handle" theory, but will also prevent further
> > investigation with the hanging tree. Do you have any advice?
> >
> I've found something that looked strange to me by checking handles in
> the hanging process tree: the hanging conhost.exe and the hanging
> bash.exe belong to different tests. Each test is a separate shell script
> in a separate make recipe, so it looks like conhost.exe was created by
> one test (which is still hanging at a later point in its script, trying
> to run grep), but then bash.exe belonging to another test somehow got a
> pseudoconsole referring to this conhost.exe and now hangs trying to
> close it. So it looks like Cygwin migrated the pseudoconsole between
> processes, and indeed fhandler_pty_slave::close_pseudoconsole() contains
> something looking like migration logic. And this logic contains the
> following call:
>
>     DuplicateHandle (GetCurrentProcess (),
>                      ttyp->h_pcon_write_pipe,
>                      new_owner, &new_write_pipe,
>                      0, TRUE, DUPLICATE_SAME_ACCESS);
>
> Is it safe to create an *inheritable* handle in another process here?
> Could it be that the target process spawns a child at the wrong moment
> (e.g. before it even knows about the newly created handle), and that
> handle unintentionally leaks into the child, triggering the hang
> afterwards?
Thanks for finding that! As you pointed out, hWritePipe should not
be inheritable. That might be the cause.
A countermeasure version is available at the following location:
https://tyan0.yr32.net/cygwin/x86/test/cygwin1-20220411.dll.xz
https://tyan0.yr32.net/cygwin/x86_64/test/cygwin1-20220411.dll.xz
Could you please test? To keep the hanging tree, please install
Cygwin in another directory, and replace cygwin1.dll with the
countermeasure version.
If you want to set up another sshd, please use a command such as:
    ssh-host-config --name cygsshd2 --port 2222
To remove the sshd installed using the above command:
    cygrunsrv -E cygsshd2
    cygrunsrv -R cygsshd2
--
Takashi Yano <takashi.yano@nifty.ne.jp>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Deadlock of the process tree when running make
2022-04-11 5:23 ` Jeremy Drake
@ 2022-04-11 8:36 ` Takashi Yano
2022-04-11 15:28 ` Alexey Izbyshev
1 sibling, 0 replies; 32+ messages in thread
From: Takashi Yano @ 2022-04-11 8:36 UTC (permalink / raw)
To: cygwin; +Cc: Jeremy Drake
On Sun, 10 Apr 2022 22:23:06 -0700 (PDT)
Jeremy Drake wrote:
> On Sat, 9 Apr 2022, Alexey Izbyshev wrote:
>
> > I don't have mintty because "make" is run via an SSH session. I suppose
> > I should look into sshd in this case?
>
> Sshd wouldn't happen to be running as a service, would it?
>
> https://cygwin.com/pipermail/cygwin-patches/2022q2/011867.html
sshd itself is running as service, however, the user session
is not. So, this is another issue.
--
Takashi Yano <takashi.yano@nifty.ne.jp>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Deadlock of the process tree when running make
2022-04-11 8:35 ` Takashi Yano
@ 2022-04-11 10:10 ` Alexey Izbyshev
2022-04-13 16:48 ` Alexey Izbyshev
0 siblings, 1 reply; 32+ messages in thread
From: Alexey Izbyshev @ 2022-04-11 10:10 UTC (permalink / raw)
To: Takashi Yano; +Cc: cygwin
On 2022-04-11 11:35, Takashi Yano wrote:
> On Sun, 10 Apr 2022 23:49:29 +0300
> Alexey Izbyshev wrote:
>>
>> Is it safe to create an *inheritable* handle in another process here?
>> Could it be that the target process spawns a child at the wrong moment
>> (e.g. before it even knows about the newly created handle), and that
>> handle unintentionally leaks into the child, triggering the hang
>> afterwards?
>
> Thanks for finding that! As you pointed out, hWritePipe should not
> be inheritable. That might be the cause.
>
> A countermeasure version is available at the following location:
> https://tyan0.yr32.net/cygwin/x86/test/cygwin1-20220411.dll.xz
> https://tyan0.yr32.net/cygwin/x86_64/test/cygwin1-20220411.dll.xz
>
> Could you please test? To keep the hanging tree, please install
> Cygwin in another directory, and replace cygwin1.dll with the
> countermeasure version.
>
Thank you for providing the binaries! I've started testing in a separate
cygwin installation on the same machine, as you suggested. The hang
previously took many hours to reproduce, so I'll keep tests running for
a while and then report back.
Alexey
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Deadlock of the process tree when running make
2022-04-08 17:04 ` Brian Inglis
@ 2022-04-11 13:27 ` Alexey Izbyshev
0 siblings, 0 replies; 32+ messages in thread
From: Alexey Izbyshev @ 2022-04-11 13:27 UTC (permalink / raw)
To: cygwin
On 2022-04-08 20:04, Brian Inglis wrote:
> On 2022-04-08 02:42, Alexey Izbyshev wrote:
>> There is also an additional detail that I forgot to mention: in the
>> stack trace of all leaf processes as displayed by ProcessHacker, it
>> seems that the executable entry point is not reached yet. The only
>> non-Windows-DLL location is in cygwin1.dll, so I suspect that all
>> processes hang at early initialization in Cygwin's DLL entry point.
>
> That sounds like BLODA interference from AntiVirus programs:
>
> https://cygwin.com/faq/faq.html#faq.using.bloda
>
> and can also happen if you use Windows AD, and your users have a lot
> of rights, and a slow server, firewall filtering, or network link, but
> known issues were fixed a few releases ago.
>
It seems that a potential cause of the hang has been identified in
another discussion thread, and I'm now testing a patched version
provided by Takashi Yano.
Anyway, thanks for your time! And just in case the identified cause
turns out to be wrong, I'm answering your questions below.
We don't use any third-party AV products on this machine (it's an
internal box used only for CI), and we've disabled Real-Time Protection
in the Windows AV (it causes a terrible performance degradation,
something like 1.5-2 times).
> Any idea how much address space is used by Cygwin DLLs, and memory by
> all the processes running: run rebase -is to see if you could be out
> of address space for Cygwin and DLLs, and how much is left for
> processes?
>
> Do you have a decent amount of memory free on your system while
> running, and Windows paging space allocated to back it up - total
> twice memory, and do you have multiple drives to spread it across?
> Check your system memory and paging activity while those processes are
> running.
>
The peak memory consumption of our tests never exceeds 30% of RAM. Also,
Cygwin is used (almost) exclusively for the test harness (the actual
software under testing is native), and there are no heavyweight
processes in it, mostly just make, bash and some coreutils. So I don't
think we could hit address space issues even on 32-bit Cygwin.
> Could you try installing Cygwin64 packages and running those instead
> of Cygwin32 (recommended as Cygwin32 support will be dropped next
> release) as there is more address space available as well as usable
> memory for processes?
We test both 32-bit and 64-bit builds of our software, and a couple of
tests need to run (Cygwin) make under debugging. Because a 32-bit
process can't debug a 64-bit one, we simply use 32-bit Cygwin for both
cases. But if the need to reproduce under 64-bit Cygwin arises, I can simply
exclude the problematic tests (they're unlikely to be relevant to the
hang).
Alexey
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Re: Deadlock of the process tree when running make
2022-04-11 5:23 ` Jeremy Drake
2022-04-11 8:36 ` Takashi Yano
@ 2022-04-11 15:28 ` Alexey Izbyshev
2022-04-11 17:02 ` Jeremy Drake
1 sibling, 1 reply; 32+ messages in thread
From: Alexey Izbyshev @ 2022-04-11 15:28 UTC (permalink / raw)
To: cygwin; +Cc: cygwin
On 2022-04-11 08:23, Jeremy Drake wrote:
> On Sat, 9 Apr 2022, Alexey Izbyshev wrote:
>> I don't have mintty because "make" is run via an SSH session. I
>> suppose
>> I should look into sshd in this case?
> Sshd wouldn't happen to be running as a service, would it?
> https://cygwin.com/pipermail/cygwin-patches/2022q2/011867.html
(I've noticed your message in the mailing list archive, please add me to
CC on replying, I'm not subscribed)
Yes, sshd is running as a service, but I'm not sure that patch is
relevant. In my case, the problematic pipe that the hanging conhost.exe
is waiting on is probably created for that specific conhost.exe process
within the process tree rooted at "make", which runs as an ordinary
user. Also, wouldn't the hang be deterministic if the problem were in
the pipe ownership?
Alexey
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Re: Deadlock of the process tree when running make
2022-04-11 15:28 ` Alexey Izbyshev
@ 2022-04-11 17:02 ` Jeremy Drake
0 siblings, 0 replies; 32+ messages in thread
From: Jeremy Drake @ 2022-04-11 17:02 UTC (permalink / raw)
To: Alexey Izbyshev; +Cc: cygwin
On Mon, 11 Apr 2022, Alexey Izbyshev wrote:
> Yes, sshd is running as a service, but I'm not sure that patch is relevant. In
> my case, the problematic pipe that the hanging conhost.exe is waiting on is
> probably created for that specific conhost.exe process within the process tree
> rooted at "make", which runs as an ordinary user. Also, wouldn't the hang be
> deterministic if the problem were in the pipe ownership?
Yes it would. I just noticed some of the evidence pointing that way - a
presumably native javac.exe, an anonymous "named pipe" handle, and then
when I saw sshd involved the last piece required for that scenario -
running as a service. But Takashi's reply sounds like sshd drops the
well-known service sid when it switches to the logged-on user's token
anyway.
This is both good and bad, I guess. Bad because your problem may not be
solved yet (though maybe with the latest test dll, fingers crossed!).
Good because there's a mystery hang that's been plaguing me when running
(under emulation) on Windows on ARM64 that the circumstances of that
environment has made virtually impossible to debug, and every commit that
mentions fixing a deadlock gives me new hope that that will be the fix
that makes it go away.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Deadlock of the process tree when running make
2022-04-11 10:10 ` Alexey Izbyshev
@ 2022-04-13 16:48 ` Alexey Izbyshev
2022-04-13 17:22 ` Takashi Yano
2022-04-13 23:17 ` Alexey Izbyshev
0 siblings, 2 replies; 32+ messages in thread
From: Alexey Izbyshev @ 2022-04-13 16:48 UTC (permalink / raw)
To: Takashi Yano; +Cc: cygwin
On 2022-04-11 13:10, Alexey Izbyshev wrote:
> On 2022-04-11 11:35, Takashi Yano wrote:
>> On Sun, 10 Apr 2022 23:49:29 +0300
>> A countermeasure version is available at the following location:
>> https://tyan0.yr32.net/cygwin/x86/test/cygwin1-20220411.dll.xz
>> https://tyan0.yr32.net/cygwin/x86_64/test/cygwin1-20220411.dll.xz
>>
>> Could you please test? To keep the hanging tree, please install
>> Cygwin in another directory, and replace cygwin1.dll with the
>> countermeasure version.
>>
> Thank you for providing the binaries! I've started testing in a
> separate cygwin installation on the same machine, as you suggested.
> The hang previously took many hours to reproduce, so I'll keep tests
> running for a while and then report back.
>
The good news is that the tests have been running for two days so far
without any cygwin-related issues, so the patched version doesn't seem
to introduce new issues.
The bad news is my theory about the suspicious "Unnamed file:
\FileSystem\Npfs" in the hanging bash.exe being a leak seems to be
wrong. I've closed that handle, but conhost.exe hasn't unblocked. All of
its threads are doing the same things as before:
1. Tries to enter a critical section. (Task Manager claims it waits for
thread 4, so probably the latter owns it).
2. ReadFile("pty1-from-master-nat" named pipe)
3. Waits for an anonymous event.
4. Waits on a handle for "\Device\ConDrv" (in DeviceIoControl()).
5. Blocked in GetMessageW().
I've created a model situation with bash.exe stopped at a breakpoint in
ClosePseudoConsole() on another machine again, and it seems that the
last time I missed that bash.exe contains *two* handles for (different)
"Unnamed file: \FileSystem\Npfs" here too, so it seems to be normal.
What's probably not normal is the behavior of the hanging conhost.exe.
I've compared the points where conhost.exe is blocked, and all but one
of the threads in the model case are doing the same things as in the
hanging case, but the remaining thread is blocked in
ReadFile("\Device\NamedPipe\") (i.e. the read end of "hWritePipe" of
pcon) instead of trying to enter a critical section like thread 1 above.
So now I'm starting to suspect that it's a conhost.exe bug rather than
a Cygwin bug.
I'll try to poke around the hanging conhost.exe some more, and may also
try to create a faster reproducer.
Thanks for your help so far,
Alexey
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Deadlock of the process tree when running make
2022-04-13 16:48 ` Alexey Izbyshev
@ 2022-04-13 17:22 ` Takashi Yano
2022-04-13 17:27 ` Alexey Izbyshev
2022-04-13 23:17 ` Alexey Izbyshev
1 sibling, 1 reply; 32+ messages in thread
From: Takashi Yano @ 2022-04-13 17:22 UTC (permalink / raw)
To: cygwin; +Cc: Alexey Izbyshev
On Wed, 13 Apr 2022 19:48:04 +0300
Alexey Izbyshev wrote:
> On 2022-04-11 13:10, Alexey Izbyshev wrote:
> > On 2022-04-11 11:35, Takashi Yano wrote:
> >> On Sun, 10 Apr 2022 23:49:29 +0300
> >> A countermeasure version is available at the following location:
> >> https://tyan0.yr32.net/cygwin/x86/test/cygwin1-20220411.dll.xz
> >> https://tyan0.yr32.net/cygwin/x86_64/test/cygwin1-20220411.dll.xz
> >>
> >> Could you please test? To keep the hanging tree, please install
> >> Cygwin in another directory, and replace cygwin1.dll with the
> >> countermeasure version.
> >>
> > Thank you for providing the binaries! I've started testing in a
> > separate cygwin installation on the same machine, as you suggested.
> > The hang previously took many hours to reproduce, so I'll keep tests
> > running for a while and then report back.
> >
> The good news is that the tests have been running for two days so far
> without any cygwin-related issues, so the patched version doesn't seem
> to introduce new issues.
>
> The bad news is my theory about the suspicious "Unnamed file:
> \FileSystem\Npfs" in the hanging bash.exe being a leak seems to be
> wrong. I've closed that handle, but conhost.exe hasn't unblocked. All of
> its threads are doing the same things as before:
>
> 1. Tries to enter a critical section. (Task Manager claims it waits for
> thread 4, so probably the latter owns it).
> 2. ReadFile("pty1-from-master-nat" named pipe)
> 3. Waits for an anonymous event.
> 4. Waits on a handle for "\Device\ConDrv" (in DeviceIoControl()).
> 5. Blocked in GetMessageW().
>
> I've created a model situation with bash.exe stopped at a breakpoint in
> ClosePseudoConsole() on another machine again, and it seems that the
> last time I missed that bash.exe contains *two* handles for (different)
> "Unnamed file: \FileSystem\Npfs" here too, so it seems to be normal.
>
> What's probably not normal is the behavior of the hanging conhost.exe.
> I've compared the points where conhost.exe is blocked, and all but one
> of the threads in the model case are doing the same things as in the
> hanging case, but the remaining thread is blocked in
> ReadFile("\Device\NamedPipe\") (i.e. the read end of "hWritePipe" of
> pcon) instead of trying to enter a critical section like thread 1 above.
> So now I'm starting to suspect that it's a conhost.exe bug rather than
> a Cygwin bug.
>
> I'll try to poke around the hanging conhost.exe some more, and may also
> try to create a faster reproducer.
Thanks for testing.
Question is:
Is the issue reproduced using new cygwin1.dll? Or is it still
running without the issue so far?
--
Takashi Yano <takashi.yano@nifty.ne.jp>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Deadlock of the process tree when running make
2022-04-13 17:22 ` Takashi Yano
@ 2022-04-13 17:27 ` Alexey Izbyshev
0 siblings, 0 replies; 32+ messages in thread
From: Alexey Izbyshev @ 2022-04-13 17:27 UTC (permalink / raw)
To: Takashi Yano; +Cc: cygwin
On 2022-04-13 20:22, Takashi Yano wrote:
> On Wed, 13 Apr 2022 19:48:04 +0300
> Alexey Izbyshev wrote:
>> On 2022-04-11 13:10, Alexey Izbyshev wrote:
>> The good news is that the tests have been running for two days so far
>> without any cygwin-related issues, so the patched version doesn't seem
>> to introduce new issues.
>>
> Thanks for testing.
>
> Question is:
> Is the issue reproduced using new cygwin1.dll? Or is it still
> running without the issue so far?
It's still running without any issues with the new DLL. The experiment
with closing a handle was done with the old hanging tree.
Alexey
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Deadlock of the process tree when running make
2022-04-13 16:48 ` Alexey Izbyshev
2022-04-13 17:22 ` Takashi Yano
@ 2022-04-13 23:17 ` Alexey Izbyshev
2022-04-16 9:39 ` Takashi Yano
1 sibling, 1 reply; 32+ messages in thread
From: Alexey Izbyshev @ 2022-04-13 23:17 UTC (permalink / raw)
To: Takashi Yano; +Cc: cygwin
On 2022-04-13 19:48, Alexey Izbyshev wrote:
> On 2022-04-11 13:10, Alexey Izbyshev wrote:
> What's probably not normal is the behavior of the hanging conhost.exe.
> I've compared the points where conhost.exe is blocked, and all but one
> of the threads in the model case are doing the same things as in the
> hanging case, but the remaining thread is blocked in
> ReadFile("\Device\NamedPipe\") (i.e. the read end of "hWritePipe" of
> pcon) instead of trying to enter a critical section like thread 1
> above. So now I'm starting to suspect that it's a conhost.exe bug
> rather than a Cygwin bug.
>
> I'll try to poke around the hanging conhost.exe some more, and may
> also try to create a faster reproducer.
>
I've studied conhost.exe hang, and it indeed looks like it's buggy.
TLDR: https://github.com/microsoft/terminal/pull/12181
The full story:
I dumped conhost.exe, opened the dump in windbg and looked at the stack
trace of the hanging thread:
ntdll!NtWaitForAlertByThreadId+0x14
ntdll!RtlpWaitOnAddressWithTimeout+0x81
ntdll!RtlpWaitOnAddress+0xae
ntdll!RtlpWaitOnCriticalSection+0xfd
ntdll!RtlpEnterCriticalSectionContended+0x1c4
ntdll!RtlEnterCriticalSection+0x42
conhost!Microsoft::Console::Render::Renderer::_PaintFrameForEngine+0x54
conhost!Microsoft::Console::Render::Renderer::TriggerTeardown+0x19e60
conhost!Microsoft::Console::Interactivity::ServiceLocator::RundownAndExit+0x21
conhost!Microsoft::Console::PtySignalInputThread::_GetData+0x65
conhost!Microsoft::Console::PtySignalInputThread::_InputThread+0x25
kernel32!BaseThreadInitThunk+0x14
ntdll!RtlUserThreadStart+0x21
By looking at assembly, I've found that it hangs *after* ReadFile() on
the pipe completes, so the problem is definitely not a leak of
hWritePipe in bash.exe or elsewhere.
Using the function names, I've found this issue:
https://github.com/microsoft/terminal/issues/1810.
This is a different one, but the discussion and the patch show that
synchronization on startup/shutdown is a disaster.
Then I looked at the code and identified that the hang happens while
attempting to lock the console at [1]. After studying how this lock is
used in other parts of the code, I noticed that
PtySignalInputThread::_Shutdown() (which is further up in the call stack
of the hanging function) uses ProcessCtrlEvents() incorrectly, because
the latter unconditionally unlocks the console, but the lock is never
taken by this thread at this point. Then I looked at a more recent
version of the code and discovered the patch to _Shutdown() which I
referenced above.
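The misuse described above can be sketched with a toy model (portable C++, not the conhost source; ConsoleLock and the function names are hypothetical and only echo the names in this thread). It assumes, as described, that ProcessCtrlEvents() expects the lock to be held on entry and releases it before returning, and that the patch adds a LockConsole() call before it. The Win32 documentation for LeaveCriticalSection() warns that releasing a critical section the calling thread does not own may cause a thread using EnterCriticalSection() to wait indefinitely. The model only tracks the imbalance and does not reproduce the hang itself:

```cpp
#include <cassert>

// Toy model only: NOT the conhost source. All names are hypothetical and
// merely echo the functions mentioned in this thread.

// Tracks how many times the console lock is currently held.
struct ConsoleLock {
    int depth = 0;
    void lock()   { ++depth; }
    void unlock() { --depth; }
};

// Mirrors the described contract: the caller must hold the lock on entry,
// and the function releases it before returning.
void process_ctrl_events(ConsoleLock& console) {
    console.unlock();
}

// Unpatched shutdown path: calls process_ctrl_events() without ever
// taking the lock, so the release is unbalanced.
void shutdown_unpatched(ConsoleLock& console) {
    process_ctrl_events(console);
}

// Patched shutdown path: takes the lock first, so the release inside
// process_ctrl_events() balances it.
void shutdown_patched(ConsoleLock& console) {
    console.lock();
    process_ctrl_events(console);
}

// Runs one shutdown variant on a fresh lock and reports the final depth.
int depth_after(void (*shutdown)(ConsoleLock&)) {
    ConsoleLock console;
    shutdown(console);
    return console.depth;
}
```

Here depth_after(shutdown_patched) is 0, while depth_after(shutdown_unpatched) is -1, i.e. the lock has been released once more than it was acquired before the rundown code tries to take it again.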
I've also verified that assembly of _Shutdown() (which is inlined into
PtySignalInputThread::_GetData()) corresponds to the unpatched version
(i.e. without LockConsole() call):
call conhost!CloseConsoleProcessState (00007ff6`22e7013c)
call conhost!ProcessCtrlEvents (00007ff6`22e262a0)
mov ecx,6Dh
call conhost!Microsoft::Console::Interactivity::ServiceLocator::RundownAndExit (00007ff6`22e3c730)
I'm not sure why this bug is not triggered more frequently, but one
possible reason, as indicated by comment [2], is that the bad path is
only taken if there are live clients after ClosePseudoConsole() is
called, which is probably rare.
A potential workaround on Cygwin side would be to ensure that the
pseudoconsole doesn't have clients before calling ClosePseudoConsole(),
but I don't know whether it's possible.
[1]
https://github.com/microsoft/terminal/blob/9b92986b49bed8cc41fde4d6ef080921c41e6d9e/src/renderer/base/renderer.cpp#L75
[2]
https://github.com/microsoft/terminal/blob/9b92986b49bed8cc41fde4d6ef080921c41e6d9e/src/host/PtySignalInputThread.cpp#L205
Alexey
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Deadlock of the process tree when running make
2022-04-13 23:17 ` Alexey Izbyshev
@ 2022-04-16 9:39 ` Takashi Yano
2022-04-16 13:21 ` Alexey Izbyshev
0 siblings, 1 reply; 32+ messages in thread
From: Takashi Yano @ 2022-04-16 9:39 UTC (permalink / raw)
To: cygwin; +Cc: Alexey Izbyshev
On Thu, 14 Apr 2022 02:17:38 +0300
Alexey Izbyshev wrote:
> On 2022-04-13 19:48, Alexey Izbyshev wrote:
> > On 2022-04-11 13:10, Alexey Izbyshev wrote:
> > What's probably not normal is the behavior of the hanging conhost.exe.
> > I've compared the points where conhost.exe is blocked, and all but one
> > of the threads in the model case are doing the same things as in the
> > hanging case, but the remaining thread is blocked in
> > ReadFile("\Device\NamedPipe\") (i.e. the read end of "hWritePipe" of
> > pcon) instead of trying to enter a critical section like thread 1
> > above. So now I'm starting to suspect that it's a conhost.exe bug
> > rather than a Cygwin bug.
> >
> > I'll try to poke around the hanging conhost.exe some more, and may
> > also try to create a faster reproducer.
> >
> I've studied conhost.exe hang, and it indeed looks like it's buggy.
>
> TLDR: https://github.com/microsoft/terminal/pull/12181
>
> The full story:
>
> I dumped conhost.exe, opened the dump in windbg and looked at the stack
> trace of the hanging thread:
>
> ntdll!NtWaitForAlertByThreadId+0x14
> ntdll!RtlpWaitOnAddressWithTimeout+0x81
> ntdll!RtlpWaitOnAddress+0xae
> ntdll!RtlpWaitOnCriticalSection+0xfd
> ntdll!RtlpEnterCriticalSectionContended+0x1c4
> ntdll!RtlEnterCriticalSection+0x42
> conhost!Microsoft::Console::Render::Renderer::_PaintFrameForEngine+0x54
> conhost!Microsoft::Console::Render::Renderer::TriggerTeardown+0x19e60
> conhost!Microsoft::Console::Interactivity::ServiceLocator::RundownAndExit+0x21
> conhost!Microsoft::Console::PtySignalInputThread::_GetData+0x65
> conhost!Microsoft::Console::PtySignalInputThread::_InputThread+0x25
> kernel32!BaseThreadInitThunk+0x14
> ntdll!RtlUserThreadStart+0x21
>
> By looking at assembly, I've found that it hangs *after* ReadFile() on
> the pipe completes, so the problem is definitely not a leak of
> hWritePipe in bash.exe or elsewhere.
>
> Using the function names, I've found this issue:
> https://github.com/microsoft/terminal/issues/1810.
>
> This is a different one, but the discussion and the patch show that
> synchronization on startup/shutdown is a disaster.
>
> Then I looked at the code and identified that the hang happens while
> attempting to lock the console at [1]. After studying how this lock is
> used in other parts of the code, I noticed that
> PtySignalInputThread::_Shutdown() (which is further up in the call stack
> of the hanging function) uses ProcessCtrlEvents() incorrectly, because
> the latter unconditionally unlocks the console, but the lock is never
> taken by this thread at this point. Then I looked at a more recent
> version of the code and discovered the patch to _Shutdown() which I
> referenced above.
>
> I've also verified that assembly of _Shutdown() (which is inlined into
> PtySignalInputThread::_GetData()) corresponds to the unpatched version
> (i.e. without LockConsole() call):
>
> call conhost!CloseConsoleProcessState (00007ff6`22e7013c)
> call conhost!ProcessCtrlEvents (00007ff6`22e262a0)
> mov ecx,6Dh
> call conhost!Microsoft::Console::Interactivity::ServiceLocator::RundownAndExit (00007ff6`22e3c730)
>
> I'm not sure why this bug is not triggered more frequently, but one
> possible reason, as indicated by comment [2], is that the bad path is
> only taken if there are live clients after ClosePseudoConsole() is
> called, which is probably rare.
>
> A potential workaround on the Cygwin side would be to ensure that the
> pseudoconsole doesn't have clients before calling ClosePseudoConsole(),
> but I don't know whether it's possible.
I am not sure yet what is essential, but the current code closes the
pseudo console only if no other process is attached to it. I wonder why
javac.exe remains as a zombie. The parent bash.exe calls
ClosePseudoConsole() when the child non-Cygwin app terminates, i.e.,
after WaitForSingleObject() on the child process handle returns.
https://www.cygwin.com/git/?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/spawn.cc;h=81dba5a941e919ea2514013069aef22c6fad8004;hb=7ac0767053e278f0ce9811bf6f77278bd2f49c20#l1009
What does "zombie" mean here? Is it listed in the process list of
ProcessHacker? I still suspect that the zombie javac.exe holds
the hWritePipe handle leaked from the parent bash.exe.
> [1]
> https://github.com/microsoft/terminal/blob/9b92986b49bed8cc41fde4d6ef080921c41e6d9e/src/renderer/base/renderer.cpp#L75
>
> [2]
> https://github.com/microsoft/terminal/blob/9b92986b49bed8cc41fde4d6ef080921c41e6d9e/src/host/PtySignalInputThread.cpp#L205
--
Takashi Yano <takashi.yano@nifty.ne.jp>
* Re: Deadlock of the process tree when running make
2022-04-16 9:39 ` Takashi Yano
@ 2022-04-16 13:21 ` Alexey Izbyshev
2022-04-27 11:22 ` Takashi Yano
0 siblings, 1 reply; 32+ messages in thread
From: Alexey Izbyshev @ 2022-04-16 13:21 UTC (permalink / raw)
To: Takashi Yano; +Cc: cygwin
On 2022-04-16 12:39, Takashi Yano wrote:
> I am not sure yet what is essential, but the current code closes the
> pseudo console only if no other process is attached to it. I wonder why
> javac.exe remains as a zombie. The parent bash.exe calls
> ClosePseudoConsole() when the child non-Cygwin app terminates, i.e.,
> after WaitForSingleObject() on the child process handle returns.
> https://www.cygwin.com/git/?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/spawn.cc;h=81dba5a941e919ea2514013069aef22c6fad8004;hb=7ac0767053e278f0ce9811bf6f77278bd2f49c20#l1009
>
> What does "zombie" mean here? Is it listed in the process list of
> ProcessHacker? I still suspect that the zombie javac.exe holds
> the hWritePipe handle leaked from the parent bash.exe.
>
By "zombie" I meant the same thing as in the Linux kernel: a data
structure that remains after a process terminated, but hasn't been
waited for yet (I don't know how this is implemented in Cygwin). So
there is no javac.exe process in ProcessHacker, but "ps" and similar
tools in Cygwin still list "javac".
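As a rough illustration of that definition (a sketch for a generic POSIX
system, not of Cygwin's internals; the function names are made up for this
example), a child that has exited but has not been waited for still
occupies its PID until the parent calls waitpid():

```python
import os


def pid_exists(pid):
    """Return True if a signal could be delivered to pid (zombies count)."""
    try:
        os.kill(pid, 0)  # signal 0 only performs the existence check
    except ProcessLookupError:
        return False
    return True


def demo_zombie():
    """Fork a child that exits at once; report PID visibility before/after reaping."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(r)
        os._exit(0)  # child: exit immediately, which closes its copy of w
    os.close(w)
    os.read(r, 1)  # returns b"" once the child's write end is closed
    os.close(r)
    seen_before_wait = pid_exists(pid)  # child not reaped yet
    os.waitpid(pid, 0)                  # reap the child
    seen_after_wait = pid_exists(pid)
    return seen_before_wait, seen_after_wait
```

The PID stays visible until the waitpid() call, which matches "ps" still
listing a process that no longer exists as a running program.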
I'm now trying to create a small reproducer that I can share, and I've
had a first small success this night: I could get a very similar hang
with a simple Makefile and a script with Cygwin 3.3.4. Here is the tree:
make(14479)-+-bash(14484)---bash(14611)
            |-bash(14515)---bash(14618)
            |-bash(14491)---bash(14500)---bash(14612)
            |-bash(14501)---bash(14510)---bash(14605)
            |-bash(14505)---bash(14607)
            |-bash(14494)---bash(14617)
            |-bash(14506)---bash(14513)---bash(14610)
            |-bash(14512)---bash(14518)---bash(14615)
            |-bash(14486)---bash(14495)---bash(14606)
            |-bash(14483)---bash(14490)---bash(14609)
            |-bash(14509)---bash(14614)
            |-bash(14489)---bash(14608)
            |-bash(14499)---bash(14613)
            |-bash(14481)---bash(14485)---python(14588)
            |-bash(14496)---bash(14504)---bash(14616)
            `-bash(14482)---bash(14604)
"python" is a zombie, just as "javac" is in the original case. There is
again a single "conhost.exe", and all 5 of its threads are doing
the same things as in the original case (including the signal pipe
thread trying to EnterCriticalSection()). The only difference is that
the leaf bash.exe processes are trying to acquire the pcon mutex at a
different point [1], but I guess this difference is not important.
I'll try this reproducer with your patched DLL as well as on another
machine and share it in case of success.
Thanks,
Alexey
[1]
https://www.cygwin.com/git?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/spawn.cc;h=81dba5a941e919ea2514013069aef22c6fad8004;hb=cygwin-3_3_4-release#l697
* Re: Deadlock of the process tree when running make
2022-04-16 13:21 ` Alexey Izbyshev
@ 2022-04-27 11:22 ` Takashi Yano
2022-04-27 12:19 ` Alexey Izbyshev
0 siblings, 1 reply; 32+ messages in thread
From: Takashi Yano @ 2022-04-27 11:22 UTC (permalink / raw)
To: cygwin; +Cc: Alexey Izbyshev
Hi Alexey,
On Sat, 16 Apr 2022 16:21:34 +0300
Alexey Izbyshev wrote:
> On 2022-04-16 12:39, Takashi Yano wrote:
> > I am not sure yet what is essential, but the current code closes the
> > pseudo console only if no other process is attached to it. I wonder why
> > javac.exe remains as a zombie. The parent bash.exe calls
> > ClosePseudoConsole() when the child non-Cygwin app terminates, i.e.,
> > after WaitForSingleObject() on the child process handle returns.
> > https://www.cygwin.com/git/?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/spawn.cc;h=81dba5a941e919ea2514013069aef22c6fad8004;hb=7ac0767053e278f0ce9811bf6f77278bd2f49c20#l1009
> >
> > What does "zombie" mean here? Is it listed in the process list of
> > ProcessHacker? I still suspect that the zombie javac.exe holds
> > the hWritePipe handle leaked from the parent bash.exe.
> >
> By "zombie" I meant the same thing as in the Linux kernel: a data
> structure that remains after a process terminated, but hasn't been
> waited for yet (I don't know how this is implemented in Cygwin). So
> there is no javac.exe process in ProcessHacker, but "ps" and similar
> tools in Cygwin still list "javac".
>
> I'm now trying to create a small reproducer that I can share, and I've
> had a first small success this night: I could get a very similar hang
> with a simple Makefile and a script with Cygwin 3.3.4. Here is the tree:
>
> make(14479)-+-bash(14484)---bash(14611)
>             |-bash(14515)---bash(14618)
>             |-bash(14491)---bash(14500)---bash(14612)
>             |-bash(14501)---bash(14510)---bash(14605)
>             |-bash(14505)---bash(14607)
>             |-bash(14494)---bash(14617)
>             |-bash(14506)---bash(14513)---bash(14610)
>             |-bash(14512)---bash(14518)---bash(14615)
>             |-bash(14486)---bash(14495)---bash(14606)
>             |-bash(14483)---bash(14490)---bash(14609)
>             |-bash(14509)---bash(14614)
>             |-bash(14489)---bash(14608)
>             |-bash(14499)---bash(14613)
>             |-bash(14481)---bash(14485)---python(14588)
>             |-bash(14496)---bash(14504)---bash(14616)
>             `-bash(14482)---bash(14604)
>
>
> "python" is a zombie, just as "javac" is in the original case. There is
> also a single "conhost.exe" again, and all of its 5 threads are doing
> the same things as in the original case (including the signal pipe
> thread trying to EnterCriticalSection()). The only difference is that
> leaf bash.exe are trying to acquire pcon mutex at a different point [1],
> but I guess this difference is not important.
>
> I'll try this reproducer with your patched DLL as well as on another
> machine and share it in case of success.
>
> Thanks,
> Alexey
>
> [1]
> https://www.cygwin.com/git?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/spawn.cc;h=81dba5a941e919ea2514013069aef22c6fad8004;hb=cygwin-3_3_4-release#l697
Is there any progress on this?
--
Takashi Yano <takashi.yano@nifty.ne.jp>
* Re: Deadlock of the process tree when running make
2022-04-27 11:22 ` Takashi Yano
@ 2022-04-27 12:19 ` Alexey Izbyshev
0 siblings, 0 replies; 32+ messages in thread
From: Alexey Izbyshev @ 2022-04-27 12:19 UTC (permalink / raw)
To: Takashi Yano; +Cc: cygwin
Hi, Takashi,
On 2022-04-27 14:22, Takashi Yano wrote:
> Hi Alexey,
>
> On Sat, 16 Apr 2022 16:21:34 +0300
> Alexey Izbyshev wrote:
>> On 2022-04-16 12:39, Takashi Yano wrote:
>> > I am not sure yet what is essential, but the current code closes the
>> > pseudo console only if no other process is attached to it. I wonder why
>> > javac.exe remains as a zombie. The parent bash.exe calls
>> > ClosePseudoConsole() when the child non-Cygwin app terminates, i.e.,
>> > after WaitForSingleObject() on the child process handle returns.
>> > https://www.cygwin.com/git/?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/spawn.cc;h=81dba5a941e919ea2514013069aef22c6fad8004;hb=7ac0767053e278f0ce9811bf6f77278bd2f49c20#l1009
>> >
>> > What does "zombie" mean here? Is it listed in the process list of
>> > ProcessHacker? I still suspect that the zombie javac.exe holds
>> > the hWritePipe handle leaked from the parent bash.exe.
>> >
>> By "zombie" I meant the same thing as in the Linux kernel: a data
>> structure that remains after a process terminated, but hasn't been
>> waited for yet (I don't know how this is implemented in Cygwin). So
>> there is no javac.exe process in ProcessHacker, but "ps" and similar
>> tools in Cygwin still list "javac".
>>
>> I'm now trying to create a small reproducer that I can share, and I've
>> had a first small success this night: I could get a very similar hang
>> with a simple Makefile and a script with Cygwin 3.3.4. Here is the
>> tree:
>>
>> make(14479)-+-bash(14484)---bash(14611)
>>             |-bash(14515)---bash(14618)
>>             |-bash(14491)---bash(14500)---bash(14612)
>>             |-bash(14501)---bash(14510)---bash(14605)
>>             |-bash(14505)---bash(14607)
>>             |-bash(14494)---bash(14617)
>>             |-bash(14506)---bash(14513)---bash(14610)
>>             |-bash(14512)---bash(14518)---bash(14615)
>>             |-bash(14486)---bash(14495)---bash(14606)
>>             |-bash(14483)---bash(14490)---bash(14609)
>>             |-bash(14509)---bash(14614)
>>             |-bash(14489)---bash(14608)
>>             |-bash(14499)---bash(14613)
>>             |-bash(14481)---bash(14485)---python(14588)
>>             |-bash(14496)---bash(14504)---bash(14616)
>>             `-bash(14482)---bash(14604)
>>
>>
>> "python" is a zombie, just as "javac" is in the original case. There
>> is
>> also a single "conhost.exe" again, and all of its 5 threads are doing
>> the same things as in the original case (including the signal pipe
>> thread trying to EnterCriticalSection()). The only difference is that
>> leaf bash.exe are trying to acquire pcon mutex at a different point
>> [1],
>> but I guess this difference is not important.
>>
>> I'll try this reproducer with your patched DLL as well as on another
>> machine and share it in case of success.
>>
>> Thanks,
>> Alexey
>>
>> [1]
>> https://www.cygwin.com/git?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/spawn.cc;h=81dba5a941e919ea2514013069aef22c6fad8004;hb=cygwin-3_3_4-release#l697
>
> Is there any progress on this?
During the last week I reproduced the hang multiple times on a vanilla
Cygwin 3.3.4 with a small test. In one case, the hanging state is even
minimal, i.e. there is only a bash.exe waiting in ClosePseudoConsole()
after its native child terminated, plus a conhost.exe, with no other
processes trying to acquire the pcon mutex. The conhost.exe signal-pipe
thread is also blocked at the same EnterCriticalSection() call in all cases.
However, I couldn't reproduce the hang with your patched DLL [1] while
running the same test for multiple days. I can't explain how your change of
handle inheritability can affect the double-unlock bug in conhost.exe
that I referenced earlier, so either I'm missing something or I've been
very unlucky with reproducing. I was going to try to investigate
conhost.exe logic and state more (in particular, why one of its threads
still reads from "\Device\ConDrv" after all console clients detached)
and then reply to you, but I haven't been able to do it yet.
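To make the "unlock without a matching lock" pattern concrete, here is a
toy model only. It assumes a simple counter-based, non-recursive lock where
-1 means "free"; the real critical section internals in ntdll differ, and
the class and function names are invented for the example. Microsoft's
documentation for LeaveCriticalSection() does warn that releasing a
critical section the thread does not own may cause another thread using
EnterCriticalSection() to wait indefinitely.

```python
class ToyCriticalSection:
    """Toy counter-based lock; NOT the real ntdll data layout."""

    def __init__(self):
        self.lock_count = -1  # -1 means "free" in this model

    def try_enter(self):
        """Return True if acquired, False where a real Enter would block."""
        self.lock_count += 1
        if self.lock_count == 0:
            return True
        self.lock_count -= 1  # undo; a blocking Enter would wait here
        return False

    def leave(self):
        """Release the lock; like the real API, ownership is not checked."""
        self.lock_count -= 1


def stray_leave_blocks_next_enter():
    cs = ToyCriticalSection()
    cs.leave()             # unlock with no matching lock (the bug pattern)
    return cs.try_enter()  # nobody holds the lock, yet acquisition fails
```

In this model a single stray leave() is enough for the next acquisition to
fail although nobody holds the lock, which is the same shape as a thread
stuck in EnterCriticalSection() with no visible owner.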
If you want to try to reproduce the hang yourself with 3.3.4, here is
one of the small tests that I used (it looks strange because it is the
result of minimizing other code):
$ cat Makefile
T := $(shell echo {1..16})

all: $(T)

$(T):
	@./test.sh $@

$ cat test.sh
#!/bin/bash

set -eu
(
    for ((i = 0; i < 10; i++)); do
        python -c ""
    done
)

$ while make -j16; do echo $((i++)); done
The test can still take multiple hours to hang on my machine.
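Since a hang can take hours to show up, a watchdog around the loop may be
handy. This is a minimal sketch, assuming a hang can be approximated as
"one run exceeds a timeout"; run_until_hang and any timeout value are
hypothetical and were not used for the results above. It deliberately does
not kill the timed-out run, so the process tree stays available for
inspection.

```python
import subprocess


def run_until_hang(cmd, timeout_s, max_iters):
    """Run cmd repeatedly, like `while cmd; do ...; done`.

    Returns (iteration, process) for the first run that does not finish
    within timeout_s, or None if all max_iters runs complete successfully.
    Raises CalledProcessError if a run exits with a non-zero status.
    """
    for i in range(max_iters):
        proc = subprocess.Popen(cmd)
        try:
            rc = proc.wait(timeout=timeout_s)
        except subprocess.TimeoutExpired:
            # Suspected hang: leave the process running for debugging.
            return i, proc
        if rc != 0:
            raise subprocess.CalledProcessError(rc, cmd)
    return None
```

For the test above this could be called as
run_until_hang(["make", "-j16"], timeout_s=600, max_iters=100000), where
600 seconds is just a guess at "much longer than a normal run".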
If I get any new interesting data, I'll share it.
Thank you,
Alexey
[1] https://tyan0.yr32.net/cygwin/x86/test/cygwin1-20220418.dll.xz
Thread overview: 32+ messages
2022-04-07 21:53 Deadlock of the process tree when running make Alexey Izbyshev
2022-04-07 23:54 ` Brian Inglis
2022-04-08 8:42 ` Alexey Izbyshev
2022-04-08 17:04 ` Brian Inglis
2022-04-11 13:27 ` Alexey Izbyshev
2022-04-09 10:17 ` Takashi Yano
2022-04-09 11:00 ` Alexey Izbyshev
2022-04-09 11:02 ` Alexey Izbyshev
2022-04-09 11:46 ` Takashi Yano
2022-04-09 16:07 ` Alexey Izbyshev
2022-04-09 16:57 ` Takashi Yano
2022-04-09 17:23 ` Alexey Izbyshev
2022-04-09 17:54 ` Takashi Yano
2022-04-09 19:35 ` Alexey Izbyshev
2022-04-09 20:26 ` Alexey Izbyshev
2022-04-10 7:34 ` Takashi Yano
2022-04-10 12:13 ` Alexey Izbyshev
2022-04-10 20:49 ` Alexey Izbyshev
2022-04-11 8:35 ` Takashi Yano
2022-04-11 10:10 ` Alexey Izbyshev
2022-04-13 16:48 ` Alexey Izbyshev
2022-04-13 17:22 ` Takashi Yano
2022-04-13 17:27 ` Alexey Izbyshev
2022-04-13 23:17 ` Alexey Izbyshev
2022-04-16 9:39 ` Takashi Yano
2022-04-16 13:21 ` Alexey Izbyshev
2022-04-27 11:22 ` Takashi Yano
2022-04-27 12:19 ` Alexey Izbyshev
2022-04-11 5:23 ` Jeremy Drake
2022-04-11 8:36 ` Takashi Yano
2022-04-11 15:28 ` Alexey Izbyshev
2022-04-11 17:02 ` Jeremy Drake