public inbox for cygwin@cygwin.com
* Deadlock of the process tree when running make
@ 2022-04-07 21:53 Alexey Izbyshev
  2022-04-07 23:54 ` Brian Inglis
  2022-04-09 10:17 ` Takashi Yano
  0 siblings, 2 replies; 32+ messages in thread
From: Alexey Izbyshev @ 2022-04-07 21:53 UTC (permalink / raw)
  To: cygwin

Hi,

I'm using 32-bit Cygwin 3.3.4 on 64-bit Windows 10 21H2. When running 
parallel make (for testing my project), very rarely I get the whole 
process tree hanging at some seemingly random point. An example of such 
a tree:

make-+-make-+-bash---find
     |      |-bash---find
     |      |-bash---find
     |      |-bash---find
     |      |-bash---find
     |      `-bash---javac
     `-make-+-bash---bash---bash---readlink
            `-bash---bash---bash-+-grep
                                 `-grep

(In the above tree, javac is the zombie parent of a native javac, and 
the latter doesn't exist at this point).

I got such a hang twice while running make in a loop for several days. 
ProcessHacker shows that all leaf processes are single-threaded and are 
stuck on WaitForSingleObject().

I've skimmed git log of cygwin-3_3-branch after cygwin-3_3_4-release, 
but couldn't find anything that seems definitely related.

Has anybody seen something like this?

Is there any way I can get useful data for diagnosing this hang from the 
process tree that I currently have hanging (I'm going to keep it for 
now)? Otherwise, what would be the best strategy?

Thanks,
Alexey

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Deadlock of the process tree when running make
  2022-04-07 21:53 Deadlock of the process tree when running make Alexey Izbyshev
@ 2022-04-07 23:54 ` Brian Inglis
  2022-04-08  8:42   ` Alexey Izbyshev
  2022-04-09 10:17 ` Takashi Yano
  1 sibling, 1 reply; 32+ messages in thread
From: Brian Inglis @ 2022-04-07 23:54 UTC (permalink / raw)
  To: cygwin; +Cc: Alexey Izbyshev

On 2022-04-07 15:53, Alexey Izbyshev wrote:
> I'm using 32-bit Cygwin 3.3.4 on 64-bit Windows 10 21H2. When running 
> parallel make (for testing my project), very rarely I get the whole 
> process tree hanging at some seemingly random point. An example of such 
> a tree:
> 
> make-+-make-+-bash---find
>       |      |-bash---find
>       |      |-bash---find
>       |      |-bash---find
>       |      |-bash---find
>       |      `-bash---javac
>       `-make-+-bash---bash---bash---readlink
>              `-bash---bash---bash-+-grep
>                                   `-grep
> 
> (In the above tree, javac is the zombie parent of a native javac, and 
> the latter doesn't exist at this point).
> 
> I got such a hang twice while running make in a loop for several days. 
> ProcessHacker shows that all leaf processes are single-threaded and are 
> stuck on WaitForSingleObject().
> 
> I've skimmed git log of cygwin-3_3-branch after cygwin-3_3_4-release, 
> but couldn't find anything that seems definitely related.
> 
> Has anybody seen something like this?
> 
> Is there any way I can get useful data for diagnosing this hang from the 
> process tree that I currently have hanging (I'm going to keep it for 
> now)? Otherwise, what would be the best strategy?

I've seen infinite loops with readlink in build scripts under Cygwin.
Seeing that readlink in a process tree makes me suspicious that 
something in a shell script is looping because two paths never match or 
always match under Cygwin.
Often there is one constant path and a varying path which is subjected 
to readlink in a loop.
Under Cygwin, you may have to pass the first path through readlink and 
compare that resulting path against the varying value.
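
The comparison suggested above can be sketched in Python, assuming that 
os.path.realpath() is an acceptable stand-in for readlink-style 
canonicalization. The helper name same_location is hypothetical and not 
taken from any build script in this thread; the point is only that both 
the constant and the varying path are resolved before they are compared.

```python
import os


def same_location(constant_path, varying_path):
    """Compare two paths after resolving symlinks on BOTH sides.

    Resolving only the varying path can make the comparison never
    (or always) succeed when the constant path itself goes through
    a symlink.
    """
    return os.path.realpath(constant_path) == os.path.realpath(varying_path)


# Usage: "." and the current directory should resolve to the same place.
print(same_location(".", os.getcwd()))
```

A loop that walks up or along a path would call same_location() as its 
termination test instead of comparing the raw strings.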

-- 
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada

This email may be disturbing to some readers as it contains
too much technical detail. Reader discretion is advised.
[Data in binary units and prefixes, physical quantities in SI.]


* Re: Deadlock of the process tree when running make
  2022-04-07 23:54 ` Brian Inglis
@ 2022-04-08  8:42   ` Alexey Izbyshev
  2022-04-08 17:04     ` Brian Inglis
  0 siblings, 1 reply; 32+ messages in thread
From: Alexey Izbyshev @ 2022-04-08  8:42 UTC (permalink / raw)
  To: cygwin

On 2022-04-08 02:54, Brian Inglis wrote:
> I've seen infinite loops with readlink in build scripts under Cygwin.
> Seeing that readlink in a process tree makes me suspicious that
> something in a shell script is looping because two paths never match
> or always match under Cygwin.
> Often there is one constant path and a varying path which is subjected
> to readlink in a loop.
> Under Cygwin, you may have to pass the first path through readlink and
> compare that resulting path against the varying value.

Thanks, but I don't think I have such loops in this project. Also, other 
processes hang in independent make jobs, so a hang around readlink 
wouldn't explain that.

There is also an additional detail that I forgot to mention: in the 
stack trace of all leaf processes as displayed by ProcessHacker, it 
seems that the executable entry point is not reached yet. The only 
non-Windows-DLL location is in cygwin1.dll, so I suspect that all 
processes hang at early initialization in Cygwin's DLL entry point.

Thanks,
Alexey


* Re: Deadlock of the process tree when running make
  2022-04-08  8:42   ` Alexey Izbyshev
@ 2022-04-08 17:04     ` Brian Inglis
  2022-04-11 13:27       ` Alexey Izbyshev
  0 siblings, 1 reply; 32+ messages in thread
From: Brian Inglis @ 2022-04-08 17:04 UTC (permalink / raw)
  To: cygwin; +Cc: Alexey Izbyshev


On 2022-04-08 02:42, Alexey Izbyshev wrote:
> On 2022-04-08 02:54, Brian Inglis wrote:
>> I've seen infinite loops with readlink in build scripts under Cygwin.
>> Seeing that readlink in a process tree makes me suspicious that
>> something in a shell script is looping because two paths never match
>> or always match under Cygwin.
>> Often there is one constant path and a varying path which is subjected
>> to readlink in a loop.
>> Under Cygwin, you may have to pass the first path through readlink and
>> compare that resulting path against the varying value.
> 
> Thanks, but I don't think I have such loops in this project. Also, other 
> processes hang in independent make jobs, so a hang around readlink 
> wouldn't explain that.
> 
> There is also an additional detail that I forgot to mention: in the 
> stack trace of all leaf processes as displayed by ProcessHacker, it 
> seems that the executable entry point is not reached yet. The only 
> non-Windows-DLL location is in cygwin1.dll, so I suspect that all 
> processes hang at early initialization in Cygwin's DLL entry point.

That sounds like BLODA interference from AntiVirus programs:

	https://cygwin.com/faq/faq.html#faq.using.bloda

and can also happen if you use Windows AD, and your users have a lot of 
rights, and a slow server, firewall filtering, or network link, but 
known issues were fixed a few releases ago.

Do you know how much address space the Cygwin DLLs use, and how much 
memory all the running processes use? Run "rebase -is" to see whether 
you could be out of address space for Cygwin and DLLs, and how much is 
left for processes.

Do you have a decent amount of memory free on your system while running, 
and Windows paging space allocated to back it up (in total, twice the 
memory)? Do you have multiple drives to spread it across?
Check your system memory and paging activity while those processes are 
running.

Could you try installing Cygwin64 packages and running those instead of 
Cygwin32 (recommended as Cygwin32 support will be dropped next release) 
as there is more address space available as well as usable memory for 
processes?

-- 
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada

This email may be disturbing to some readers as it contains
too much technical detail. Reader discretion is advised.
[Data in binary units and prefixes, physical quantities in SI.]


* Re: Deadlock of the process tree when running make
  2022-04-07 21:53 Deadlock of the process tree when running make Alexey Izbyshev
  2022-04-07 23:54 ` Brian Inglis
@ 2022-04-09 10:17 ` Takashi Yano
  2022-04-09 11:00   ` Alexey Izbyshev
  1 sibling, 1 reply; 32+ messages in thread
From: Takashi Yano @ 2022-04-09 10:17 UTC (permalink / raw)
  To: cygwin; +Cc: Alexey Izbyshev

On Fri, 08 Apr 2022 00:53:31 +0300
Alexey Izbyshev wrote:
> Hi,
> 
> I'm using 32-bit Cygwin 3.3.4 on 64-bit Windows 10 21H2. When running 
> parallel make (for testing my project), very rarely I get the whole 
> process tree hanging at some seemingly random point. An example of such 
> a tree:
> 
> make-+-make-+-bash---find
>       |      |-bash---find
>       |      |-bash---find
>       |      |-bash---find
>       |      |-bash---find
>       |      `-bash---javac
>       `-make-+-bash---bash---bash---readlink
>              `-bash---bash---bash-+-grep
>                                   `-grep
> 
> (In the above tree, javac is the zombie parent of a native javac, and 
> the latter doesn't exist at this point).
> 
> I got such a hang twice while running make in a loop for several days. 
> ProcessHacker shows that all leaf processes are single-threaded and are 
> stuck on WaitForSingleObject().
> 
> I've skimmed git log of cygwin-3_3-branch after cygwin-3_3_4-release, 
> but couldn't find anything that seems definitely related.
> 
> Has anybody seen something like this?
> 
> Is there any way I can get useful data for diagnosing this hang from the 
> process tree that I currently have hanging (I'm going to keep it for 
> now)? Otherwise, what would be the best strategy?

Attaching gdb to the hanging process and dumping the stack of each
thread with the 'bt' command may reveal more detail.
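
A minimal Python sketch of that suggestion, assuming gdb is on PATH and 
that the PID is given in the form gdb expects on the platform. The 
helper names gdb_backtrace_command and dump_backtraces are hypothetical; 
-p, -batch and -ex are standard gdb command-line options, and "thread 
apply all bt" runs 'bt' for every thread.

```python
import subprocess


def gdb_backtrace_command(pid):
    """Build a gdb command line that attaches to pid, prints a
    backtrace for every thread, and exits (which detaches)."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def dump_backtraces(pid):
    """Run gdb against pid and return whatever it printed on stdout.

    Attaching a debugger briefly stops the target process, so this
    does disturb the hanging process to some degree.
    """
    result = subprocess.run(
        gdb_backtrace_command(pid),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

Calling dump_backtraces() once per leaf process would collect the stacks 
of the whole hanging tree in one pass.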

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>


* Re: Deadlock of the process tree when running make
  2022-04-09 10:17 ` Takashi Yano
@ 2022-04-09 11:00   ` Alexey Izbyshev
  2022-04-09 11:02     ` Alexey Izbyshev
  0 siblings, 1 reply; 32+ messages in thread
From: Alexey Izbyshev @ 2022-04-09 11:00 UTC (permalink / raw)
  To: Takashi Yano; +Cc: cygwin

On 2022-04-09 13:17, Takashi Yano wrote:

> Attaching gdb to the hanging process and dumping the stack of each
> thread with the 'bt' command may reveal more detail.

I decided to simply look at assembly at the point shown in ProcessHacker 
stack trace (cygwin1.dll!feinitialise+0x5ecab) to avoid disturbing the 
process by gdb. And it's clear that the hang is in 
fhandler_pty_slave::reset_switch_to_pcon() at [1]. I've checked that 
there were some changes in that function since 3.3.4. Could they fix 
this deadlock?

[1] 
https://cygwin.com/git/?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/fhandler_tty.cc;h=7bef6958c106c5e78cc90e014081022fd3a205bc;hb=cygwin-3_3_4-release

Thanks,
Alexey


* Re: Deadlock of the process tree when running make
  2022-04-09 11:00   ` Alexey Izbyshev
@ 2022-04-09 11:02     ` Alexey Izbyshev
  2022-04-09 11:46       ` Takashi Yano
  0 siblings, 1 reply; 32+ messages in thread
From: Alexey Izbyshev @ 2022-04-09 11:02 UTC (permalink / raw)
  To: Takashi Yano; +Cc: cygwin

On 2022-04-09 14:00, Alexey Izbyshev wrote:
> On 2022-04-09 13:17, Takashi Yano wrote:
> 
>> Attaching gdb to the hanging process and dumping the stack of each
>> thread with the 'bt' command may reveal more detail.
> 
> I decided to simply look at assembly at the point shown in
> ProcessHacker stack trace (cygwin1.dll!feinitialise+0x5ecab) to avoid
> disturbing the process by gdb. And it's clear that the hang is in
> fhandler_pty_slave::reset_switch_to_pcon() at [1]. I've checked that
> there were some changes in that function since 3.3.4. Could they fix
> this deadlock?
> 
> [1] 
> https://cygwin.com/git/?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/fhandler_tty.cc;h=7bef6958c106c5e78cc90e014081022fd3a205bc;hb=cygwin-3_3_4-release
> 

Missed the line in the link above: 
https://cygwin.com/git?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/fhandler_tty.cc;h=7bef6958c106c5e78cc90e014081022fd3a205bc;hb=cygwin-3_3_4-release#l1199

> Thanks,
> Alexey


* Re: Deadlock of the process tree when running make
  2022-04-09 11:02     ` Alexey Izbyshev
@ 2022-04-09 11:46       ` Takashi Yano
  2022-04-09 16:07         ` Alexey Izbyshev
  0 siblings, 1 reply; 32+ messages in thread
From: Takashi Yano @ 2022-04-09 11:46 UTC (permalink / raw)
  To: cygwin; +Cc: Alexey Izbyshev

On Sat, 09 Apr 2022 14:02:38 +0300
Alexey Izbyshev wrote:
> On 2022-04-09 14:00, Alexey Izbyshev wrote:
> > On 2022-04-09 13:17, Takashi Yano wrote:
> > 
> >> Attaching gdb to the hanging process and dumping the stack of each
> >> thread with the 'bt' command may reveal more detail.
> > 
> > I decided to simply look at assembly at the point shown in
> > ProcessHacker stack trace (cygwin1.dll!feinitialise+0x5ecab) to avoid
> > disturbing the process by gdb. And it's clear that the hang is in
> > fhandler_pty_slave::reset_switch_to_pcon() at [1]. I've checked that
> > there were some changes in that function since 3.3.4. Could they fix
> > this deadlock?
> > 
> > [1] 
> > https://cygwin.com/git/?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/fhandler_tty.cc;h=7bef6958c106c5e78cc90e014081022fd3a205bc;hb=cygwin-3_3_4-release
> > 
> 
> Missed the line in the link above: 
> https://cygwin.com/git?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/fhandler_tty.cc;h=7bef6958c106c5e78cc90e014081022fd3a205bc;hb=cygwin-3_3_4-release#l1199

Thanks for finding that. It would be very helpful if you could
find another process which holds pcon_mutex and where it is stopping.

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>


* Re: Deadlock of the process tree when running make
  2022-04-09 11:46       ` Takashi Yano
@ 2022-04-09 16:07         ` Alexey Izbyshev
  2022-04-09 16:57           ` Takashi Yano
  0 siblings, 1 reply; 32+ messages in thread
From: Alexey Izbyshev @ 2022-04-09 16:07 UTC (permalink / raw)
  To: Takashi Yano; +Cc: cygwin

On 2022-04-09 14:46, Takashi Yano wrote:
> On Sat, 09 Apr 2022 14:02:38 +0300
> Alexey Izbyshev wrote:
>> 
>> Missed the line in the link above:
>> https://cygwin.com/git?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/fhandler_tty.cc;h=7bef6958c106c5e78cc90e014081022fd3a205bc;hb=cygwin-3_3_4-release#l1199
> 
> Thanks for finding that. It would be very helpful if you could
> find another process which holds pcon_mutex and where it is stopping.

ProcessHacker showed that the owner of the pcon mutex is bash.exe with 
(Windows) PID 6276. However, Cygwin ps doesn't list such a process. Its 
parent, however, has a Cygwin PID 37961 and is in the hanging tree:

make(32651)-+-make(32656)-+-bash(37296)---find(38057)
            |             |-bash(37632)---find(38061)
            |             |-bash(37415)---find(38064)
            |             |-bash(37852)---find(38062)
            |             |-bash(37896)---find(38063)
            |             `-bash(37961)---javac(38032)
            `-make(32657)-+-bash(38025)---bash(38054)---bash(38055)---readlink(38056)
                          `-bash(37722)---bash(37825)---bash(38058)-+-grep(38060)
                                                                    `-grep(38059)

Since javac(38032) is a zombie, my guess is that missing bash.exe (win 
6276) is an intermediate process that Cygwin created when bash(37961) 
forked to run javac.

bash.exe (win 6276) has two threads. The first one is blocked at 
ClosePseudoConsole() (which according to stack trace eventually calls 
NtWaitForSingleObject()) [1] and the second one is at [2].

[1] 
https://cygwin.com/git?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/fhandler_tty.cc;h=7bef6958c106c5e78cc90e014081022fd3a205bc;hb=cygwin-3_3_4-release#l3615

[2] 
https://cygwin.com/git?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/sigproc.cc;h=02d875a7fc947d628ca933690ed43ef03d767d53;hb=cygwin-3_3_4-release#l1359

Hope this is helpful,
Alexey


* Re: Deadlock of the process tree when running make
  2022-04-09 16:07         ` Alexey Izbyshev
@ 2022-04-09 16:57           ` Takashi Yano
  2022-04-09 17:23             ` Alexey Izbyshev
  0 siblings, 1 reply; 32+ messages in thread
From: Takashi Yano @ 2022-04-09 16:57 UTC (permalink / raw)
  To: cygwin; +Cc: Alexey Izbyshev

On Sat, 09 Apr 2022 19:07:08 +0300
Alexey Izbyshev wrote:
> On 2022-04-09 14:46, Takashi Yano wrote:
> > On Sat, 09 Apr 2022 14:02:38 +0300
> > Alexey Izbyshev wrote:
> >> 
> >> Missed the line in the link above:
> >> https://cygwin.com/git?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/fhandler_tty.cc;h=7bef6958c106c5e78cc90e014081022fd3a205bc;hb=cygwin-3_3_4-release#l1199
> > 
> > Thanks for finding that. It would be very helpful if you could
> > find another process which holds pcon_mutex and where it is stopping.
> 
> ProcessHacker showed that the owner of the pcon mutex is bash.exe with 
> (Windows) PID 6276. However, Cygwin ps doesn't list such a process. Its 
> parent, however, has a Cygwin PID 37961 and is in the hanging tree:
> 
> make(32651)-+-make(32656)-+-bash(37296)---find(38057)
>             |             |-bash(37632)---find(38061)
>             |             |-bash(37415)---find(38064)
>             |             |-bash(37852)---find(38062)
>             |             |-bash(37896)---find(38063)
>             |             `-bash(37961)---javac(38032)
>             `-make(32657)-+-bash(38025)---bash(38054)---bash(38055)---readlink(38056)
>                           `-bash(37722)---bash(37825)---bash(38058)-+-grep(38060)
>                                                                     `-grep(38059)
> 
> Since javac(38032) is a zombie, my guess is that missing bash.exe (win 
> 6276) is an intermediate process that Cygwin created when bash(37961) 
> forked to run javac.
> 
> bash.exe (win 6276) has two threads. The first one is blocked at 
> ClosePseudoConsole() (which according to stack trace eventually calls 
> NtWaitForSingleObject()) [1] and the second one is at [2].
> 
> [1] 
> https://cygwin.com/git?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/fhandler_tty.cc;h=7bef6958c106c5e78cc90e014081022fd3a205bc;hb=cygwin-3_3_4-release#l3615
> 
> [2] 
> https://cygwin.com/git?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/sigproc.cc;h=02d875a7fc947d628ca933690ed43ef03d767d53;hb=cygwin-3_3_4-release#l1359
> 
> Hope this is helpful,

Thank you very much for the information. Can you check if
the thread pty_master_fwd_thread() in root mintty is still
alive?

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>


* Re: Deadlock of the process tree when running make
  2022-04-09 16:57           ` Takashi Yano
@ 2022-04-09 17:23             ` Alexey Izbyshev
  2022-04-09 17:54               ` Takashi Yano
  2022-04-11  5:23               ` Jeremy Drake
  0 siblings, 2 replies; 32+ messages in thread
From: Alexey Izbyshev @ 2022-04-09 17:23 UTC (permalink / raw)
  To: Takashi Yano; +Cc: cygwin

On 2022-04-09 19:57, Takashi Yano wrote:
> Thank you very much for the information. Can you check if
> the thread pty_master_fwd_thread() in root mintty is still
> alive?

I don't have mintty because "make" is run via an SSH session. I suppose 
I should look into sshd in this case? I've checked an sshd process that 
is the parent of this session, and yes, one of its threads is blocked at 
https://cygwin.com/git?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/fhandler_tty.cc;h=7bef6958c106c5e78cc90e014081022fd3a205bc;hb=cygwin-3_3_4-release#l2710.

Thanks,
Alexey


* Re: Deadlock of the process tree when running make
  2022-04-09 17:23             ` Alexey Izbyshev
@ 2022-04-09 17:54               ` Takashi Yano
  2022-04-09 19:35                 ` Alexey Izbyshev
  2022-04-11  5:23               ` Jeremy Drake
  1 sibling, 1 reply; 32+ messages in thread
From: Takashi Yano @ 2022-04-09 17:54 UTC (permalink / raw)
  To: cygwin; +Cc: Alexey Izbyshev

On Sat, 09 Apr 2022 20:23:06 +0300
Alexey Izbyshev wrote:
> On 2022-04-09 19:57, Takashi Yano wrote:
> > Thank you very much for the information. Can you check if
> > the thread pty_master_fwd_thread() in root mintty is still
> > alive?
> 
> I don't have mintty because "make" is run via an SSH session. I suppose 
> I should look into sshd in this case? I've checked an sshd process that 
> is the parent of this session, and yes, one of its threads is blocked at 
> https://cygwin.com/git?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/fhandler_tty.cc;h=7bef6958c106c5e78cc90e014081022fd3a205bc;hb=cygwin-3_3_4-release#l2710.

Thanks for checking. This seems to be normal. Then, I cannot
understand why the ClosePseudoConsole() call is blocked...

The document by Microsoft mentions the blocking conditions of
ClosePseudoConsole():
https://docs.microsoft.com/en-us/windows/console/closepseudoconsole
however, the thread above is draining the channel.

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>


* Re: Deadlock of the process tree when running make
  2022-04-09 17:54               ` Takashi Yano
@ 2022-04-09 19:35                 ` Alexey Izbyshev
  2022-04-09 20:26                   ` Alexey Izbyshev
  0 siblings, 1 reply; 32+ messages in thread
From: Alexey Izbyshev @ 2022-04-09 19:35 UTC (permalink / raw)
  To: Takashi Yano; +Cc: cygwin

On 2022-04-09 20:54, Takashi Yano wrote:
> Thanks for checking. This seems to be normal. Then, I cannot
> understand why the ClosePseudoConsole() call is blocked...
> 
> The document by Microsoft mentions the blocking conditions of
> ClosePseudoConsole():
> https://docs.microsoft.com/en-us/windows/console/closepseudoconsole
> however, the thread above is draining the channel.

I've decided to check what object ClosePseudoConsole() waits for. The 
wait happens inside unexported KERNELBASE!_ClosePseudoConsoleMembers 
function. Here is the relevant part:

76589fb5 8b4e08          mov     ecx,dword ptr [esi+8]
76589fb8 e8c2fdffff      call    KERNELBASE!_HandleIsValid (76589d7f)
76589fbd 84c0            test    al,al
76589fbf 7456            je      KERNELBASE!_ClosePseudoConsoleMembers+0x89 (7658a017)
76589fc1 8d45fc          lea     eax,[ebp-4]
76589fc4 895dfc          mov     dword ptr [ebp-4],ebx
76589fc7 50              push    eax
76589fc8 51              push    ecx
76589fc9 e8c23ef5ff      call    KERNELBASE!GetExitCodeProcess (764dde90)
76589fce 85c0            test    eax,eax
76589fd0 7414            je      KERNELBASE!_ClosePseudoConsoleMembers+0x58 (76589fe6)
76589fd2 817dfc03010000  cmp     dword ptr [ebp-4],103h
76589fd9 750b            jne     KERNELBASE!_ClosePseudoConsoleMembers+0x58 (76589fe6)
76589fdb 53              push    ebx
76589fdc 6aff            push    0FFFFFFFFh
76589fde ff7608          push    dword ptr [esi+8]
76589fe1 e8ba74f6ff      call    KERNELBASE!WaitForSingleObjectEx (764f14a0)

"esi" is the argument of ClosePseudoConsole(), so the first mov 
dereferences it with an offset and loads a process handle. Then, if this 
handle is valid, it calls GetExitCodeProcess(), and if it succeeds and 
returns STILL_ACTIVE, it waits for that process.
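
The control flow described above can be modeled in a few lines of 
Python. This is only a sketch of the decision logic as read off the 
disassembly, not the actual KERNELBASE code: the function name 
close_wait_decision and its parameters are hypothetical stand-ins for 
the results of _HandleIsValid() and GetExitCodeProcess().

```python
STILL_ACTIVE = 0x103    # the value compared against (cmp ..., 103h)
INFINITE = 0xFFFFFFFF   # the timeout pushed before WaitForSingleObjectEx


def close_wait_decision(handle_is_valid, get_exit_code_ok, exit_code):
    """Return the wait timeout used on the process handle, or None
    when the wait is skipped.

    handle_is_valid  -- result of the _HandleIsValid() check
    get_exit_code_ok -- True if GetExitCodeProcess() succeeded
    exit_code        -- exit code it reported for the process
    """
    if not handle_is_valid:
        return None          # je ...+0x89: nothing to wait for
    if not get_exit_code_ok:
        return None          # je ...+0x58: query failed, skip the wait
    if exit_code != STILL_ACTIVE:
        return None          # jne ...+0x58: process already exited
    return INFINITE          # wait with no timeout until it exits


# Usage: a live process behind a valid handle means an unbounded wait.
print(close_wait_decision(True, True, STILL_ACTIVE) == INFINITE)
```

The only path that blocks is the last one, so a process that never 
exits keeps the caller waiting forever.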

I've checked that the hanging bash process has only 3 process handles: 
for itself, for the dead javac, and for conhost.exe. So obviously it 
waits for the latter to terminate. (After I did all this, I realized 
there was a much easier way to get this result via the "Analyze wait 
chain" feature of Task Manager).

Unfortunately, I don't know anything about Windows consoles, but just in 
case I also checked what 5 threads of conhost.exe are waiting for:

1. Tries to enter a critical section (Task Manager claims it waits for 
thread 4, so probably the latter owns it).
2. Waits on a handle for "pty1-from-master-nat" named pipe.
3. Waits for an anonymous event.
4. Waits on a handle for "\Device\ConDrv" (in DeviceIoControl()).
5. Blocked in GetMessageW().
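
The wait relationships reported so far in this thread can be written 
down as a small wait-for graph and followed to the blocker at the end. 
This is a hand-built model, not output from any tool: the WAITS_FOR 
table and the follow_chain helper are hypothetical, and the edge from 
conhost.exe thread 1 to thread 4 rests only on what Task Manager 
claims.

```python
# Each key waits for (or is held by) its value, per the findings above.
WAITS_FOR = {
    "leaf process": "pcon_mutex",
    "pcon_mutex": "bash.exe (win 6276)",
    "bash.exe (win 6276)": "conhost.exe",
    "conhost.exe thread 1": "conhost.exe thread 4",
    "conhost.exe thread 4": "\\Device\\ConDrv",
}


def follow_chain(start, waits_for):
    """Follow wait-for edges from start until nothing further is known.

    Stops early if a node repeats, which would indicate a cycle.
    """
    chain = [start]
    seen = {start}
    while chain[-1] in waits_for:
        nxt = waits_for[chain[-1]]
        chain.append(nxt)
        if nxt in seen:
            break
        seen.add(nxt)
    return chain


print(" -> ".join(follow_chain("leaf process", WAITS_FOR)))
```

In this model every leaf process ends at conhost.exe, which matches the 
observation that the whole tree is stuck behind one process.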

It's also worth noting that this conhost.exe seems to be the only one 
related to the Cygwin process tree (as well as the only related 
non-Cygwin process). All other conhost.exe processes were created before 
I started my stress test.

My guess is that this conhost.exe was created for a native app started 
from a Cygwin process. Could it be some race condition/bug that 
prevented conhost.exe from terminating once the native process (probably 
javac?) died?

Alexey


* Re: Deadlock of the process tree when running make
  2022-04-09 19:35                 ` Alexey Izbyshev
@ 2022-04-09 20:26                   ` Alexey Izbyshev
  2022-04-10  7:34                     ` Takashi Yano
  0 siblings, 1 reply; 32+ messages in thread
From: Alexey Izbyshev @ 2022-04-09 20:26 UTC (permalink / raw)
  To: Takashi Yano; +Cc: cygwin

On 2022-04-09 22:35, Alexey Izbyshev wrote:
> On 2022-04-09 20:54, Takashi Yano wrote:
>> Thanks for checking. This seems to be normal. Then, I cannot
>> understand why the ClosePseudoConsole() call is blocked...
>> 
>> The document by Microsoft mentions the blocking conditions of
>> ClosePseudoConsole():
>> https://docs.microsoft.com/en-us/windows/console/closepseudoconsole
>> however, the thread above is draining the channel.
> 
> I've decided to check what object ClosePseudoConsole() waits for. The
> wait happens inside unexported KERNELBASE!_ClosePseudoConsoleMembers
> function. Here is the relevant part:
> 
> 76589fb5 8b4e08          mov     ecx,dword ptr [esi+8]
> 76589fb8 e8c2fdffff      call    KERNELBASE!_HandleIsValid (76589d7f)
> 76589fbd 84c0            test    al,al
> 76589fbf 7456            je      KERNELBASE!_ClosePseudoConsoleMembers+0x89 (7658a017)
> 76589fc1 8d45fc          lea     eax,[ebp-4]
> 76589fc4 895dfc          mov     dword ptr [ebp-4],ebx
> 76589fc7 50              push    eax
> 76589fc8 51              push    ecx
> 76589fc9 e8c23ef5ff      call    KERNELBASE!GetExitCodeProcess (764dde90)
> 76589fce 85c0            test    eax,eax
> 76589fd0 7414            je      KERNELBASE!_ClosePseudoConsoleMembers+0x58 (76589fe6)
> 76589fd2 817dfc03010000  cmp     dword ptr [ebp-4],103h
> 76589fd9 750b            jne     KERNELBASE!_ClosePseudoConsoleMembers+0x58 (76589fe6)
> 76589fdb 53              push    ebx
> 76589fdc 6aff            push    0FFFFFFFFh
> 76589fde ff7608          push    dword ptr [esi+8]
> 76589fe1 e8ba74f6ff      call    KERNELBASE!WaitForSingleObjectEx (764f14a0)
> 
> "esi" is the argument of ClosePseudoConsole(), so the first mov
> dereferences it with an offset and loads a process handle. Then, if
> this handle is valid, it calls GetExitCodeProcess(), and if it
> succeeds and returns STILL_ACTIVE, it waits for that process.
> 
> I've checked that the hanging bash process has only 3 process handles:
> for itself, for the dead javac, and for conhost.exe. So obviously it
> waits for the latter to terminate. (After I did all this, I realized
> there was a much easier way to get this result via the "Analyze wait
> chain" feature of Task Manager).
> 
> Unfortunately, I don't know anything about Windows consoles, but just
> in case I also checked what 5 threads of conhost.exe are waiting for:
> 
> 1. Tries to enter a critical section (Task Manager claims it waits for
> thread 4, so probably the latter owns it).
> 2. Waits on a handle for "pty1-from-master-nat" named pipe.
> 3. Waits for an anonymous event.
> 4. Waits on a handle for "\Device\ConDrv" (in DeviceIoControl()).
> 5. Blocked in GetMessageW().
> 
> It's also worth noting that this conhost.exe seems to be the only one
> related to the Cygwin process tree (as well as the only related
> non-Cygwin process). All other conhost.exe processes were created
> before I started my stress test.
> 
> My guess is that this conhost.exe was created for a native app started
> from a Cygwin process. Could it be some race condition/bug that
> prevented conhost.exe from terminating once the native process
> (probably javac?) died?
> 
A few more things that might be important:

* Clarification: thread 2 of conhost.exe waits in KernelBase!ReadFile().

* In the assembly part I omitted, before waiting on the conhost process, 
_ClosePseudoConsoleMembers() closes the handle obtained from "dword ptr 
[esi]", i.e. "hWritePipe" member of HPCON_INTERNAL struct.
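
The two offsets mentioned in this thread ([esi] and [esi+8]) fit a 
structure of three 32-bit handles. The ctypes sketch below only 
illustrates that layout for a 32-bit process: hWritePipe is the member 
name given above, while the class name and the names unknown_at_4 and 
process_handle are hypothetical placeholders.

```python
import ctypes


class HpconInternal32(ctypes.Structure):
    """Layout sketch for a 32-bit process; handles are 4 bytes each."""
    _fields_ = [
        ("hWritePipe", ctypes.c_uint32),      # [esi]: closed first
        ("unknown_at_4", ctypes.c_uint32),    # [esi+4]: not discussed here
        ("process_handle", ctypes.c_uint32),  # [esi+8]: waited on afterwards
    ]


# Usage: print the byte offsets of the two members seen in the disassembly.
print(HpconInternal32.hWritePipe.offset, HpconInternal32.process_handle.offset)
```

With this layout, the "mov ecx,dword ptr [esi+8]" instruction loads the 
third member, which is the handle the wait is performed on.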

Alexey


* Re: Deadlock of the process tree when running make
  2022-04-09 20:26                   ` Alexey Izbyshev
@ 2022-04-10  7:34                     ` Takashi Yano
  2022-04-10 12:13                       ` Alexey Izbyshev
  0 siblings, 1 reply; 32+ messages in thread
From: Takashi Yano @ 2022-04-10  7:34 UTC (permalink / raw)
  To: cygwin; +Cc: Alexey Izbyshev

On Sat, 09 Apr 2022 23:26:51 +0300
Alexey Izbyshev wrote:
> On 2022-04-09 22:35, Alexey Izbyshev wrote:
> > On 2022-04-09 20:54, Takashi Yano wrote:
> >> Thanks for checking. This seems to be normal. Then, I cannot
> >> understand why the ClosePseudoConsole() call is blocked...
> >> 
> >> The document by Microsoft mentions the blocking conditions of
> >> ClosePseudoConsole():
> >> https://docs.microsoft.com/en-us/windows/console/closepseudoconsole
> >> however, the thread above is draining the channel.
> > 
> > I've decided to check what object ClosePseudoConsole() waits for. The
> > wait happens inside unexported KERNELBASE!_ClosePseudoConsoleMembers
> > function. Here is the relevant part:
> > 
> > 76589fb5 8b4e08          mov     ecx,dword ptr [esi+8]
> > 76589fb8 e8c2fdffff      call    KERNELBASE!_HandleIsValid (76589d7f)
> > 76589fbd 84c0            test    al,al
> > 76589fbf 7456            je      KERNELBASE!_ClosePseudoConsoleMembers+0x89 (7658a017)
> > 76589fc1 8d45fc          lea     eax,[ebp-4]
> > 76589fc4 895dfc          mov     dword ptr [ebp-4],ebx
> > 76589fc7 50              push    eax
> > 76589fc8 51              push    ecx
> > 76589fc9 e8c23ef5ff      call    KERNELBASE!GetExitCodeProcess (764dde90)
> > 76589fce 85c0            test    eax,eax
> > 76589fd0 7414            je      KERNELBASE!_ClosePseudoConsoleMembers+0x58 (76589fe6)
> > 76589fd2 817dfc03010000  cmp     dword ptr [ebp-4],103h
> > 76589fd9 750b            jne     KERNELBASE!_ClosePseudoConsoleMembers+0x58 (76589fe6)
> > 76589fdb 53              push    ebx
> > 76589fdc 6aff            push    0FFFFFFFFh
> > 76589fde ff7608          push    dword ptr [esi+8]
> > 76589fe1 e8ba74f6ff      call    KERNELBASE!WaitForSingleObjectEx (764f14a0)
> > 
> > "esi" is the argument of ClosePseudoConsole(), so the first mov
> > dereferences it with an offset and loads a process handle. Then, if
> > this handle is valid, it calls GetExitCodeProcess(), and if it
> > succeeds and returns STILL_ACTIVE, it waits for that process.
> > 
> > I've checked that the hanging bash process has only 3 process handles:
> > for itself, for the dead javac, and for conhost.exe. So obviously it
> > waits for the latter to terminate. (After I did all this, I realized
> > there was a much easier way to get this result via the "Analyze wait
> > chain" feature of Task Manager).
> > 
> > Unfortunately, I don't know anything about Windows consoles, but just
> > in case I also checked what 5 threads of conhost.exe are waiting for:
> > 
> > 1. Tries to enter a critical section (Task Manager claims it waits for
> > thread 4, so probably the latter owns it).
> > 2. Waits on a handle for "pty1-from-master-nat" named pipe.
> > 3. Waits for an anonymous event.
> > 4. Waits on a handle for "\Device\ConDrv" (in DeviceIoControl()).
> > 5. Blocked in GetMessageW().
> > 
> > It's also worth noting that this conhost.exe seems to be the only one
> > related to the Cygwin process tree (as well as the only related
> > non-Cygwin process). All other conhost.exe processes were created
> > before I started my stress test.
> > 
> > My guess is that this conhost.exe was created for a native app started
> > from a Cygwin process. Could it be some race condition/bug that
> > prevented conhost.exe from terminating once the native process
> > (probably javac?) died?
> > 
> A few more things that might be important:
> 
> * Clarification: thread 2 of conhost.exe waits in KernelBase!ReadFile().
> 
> * In the assembly part I omitted, before waiting on the conhost process, 
> _ClosePseudoConsoleMembers() closes the handle obtained from "dword ptr 
> [esi]", i.e. the "hWritePipe" member of the HPCON_INTERNAL struct.

Thanks for investigating. In the normal case, conhost.exe is terminated
when hWritePipe is closed.

Possibly, hWritePipe has an incorrect handle value.
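
The normal shutdown order described in this thread (close the write pipe
first, then wait for the host process if it is still active) can be
modeled portably. What follows is only a rough sketch: a Python child
process stands in for conhost.exe, and the names close_and_wait and
CHILD_SCRIPT are hypothetical. It is not Cygwin or Windows code.

```python
import subprocess
import sys

# Stand-in for conhost.exe: a child that exits once its input pipe hits EOF.
CHILD_SCRIPT = "import sys; sys.stdin.buffer.read()"


def close_and_wait(timeout=30.0):
    """Close the write end of the child's pipe, then wait for the child.

    Mirrors the order seen in the disassembly: the pipe handle is closed
    first, and only then is the process handle waited on.
    """
    child = subprocess.Popen([sys.executable, "-c", CHILD_SCRIPT],
                             stdin=subprocess.PIPE)
    # Analogous to GetExitCodeProcess() reporting STILL_ACTIVE (0x103).
    still_active = child.poll() is None
    # Analogous to closing hWritePipe: the child sees EOF and exits.
    child.stdin.close()
    # Analogous to the final wait on the process handle.
    exit_code = child.wait(timeout=timeout)
    return still_active, exit_code


if __name__ == "__main__":
    print(close_and_wait())
```

If the child never reacted to the closed pipe, the final wait would block
(here it would time out), which corresponds to the hang seen in bash.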

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Deadlock of the process tree when running make
  2022-04-10  7:34                     ` Takashi Yano
@ 2022-04-10 12:13                       ` Alexey Izbyshev
  2022-04-10 20:49                         ` Alexey Izbyshev
  0 siblings, 1 reply; 32+ messages in thread
From: Alexey Izbyshev @ 2022-04-10 12:13 UTC (permalink / raw)
  To: Takashi Yano; +Cc: cygwin

On 2022-04-10 10:34, Takashi Yano wrote:
> On Sat, 09 Apr 2022 23:26:51 +0300
> Thanks for investigating. In the normal case, conhost.exe is terminated
> when hWritePipe is closed.

Thanks for confirming.

> 
> Possibly, hWritePipe has an incorrect handle value.

I've verified that the handle was correct by attaching gdb to the
hanging bash and checking that the hWritePipe field is now zeroed (which
happens only in the branch where _HandleIsValid returns true and
hWritePipe is closed).

I've found something interesting though. I've modeled a similar 
situation on another machine:

1. I've run a native process via bash.
2. I've attached to bash via gdb and set a breakpoint on 
ClosePseudoConsole().
3. I've killed the native process.
4. The breakpoint was hit, and I looked at the hWritePipe value.

ProcessHacker shows it as "Unnamed file: \FileSystem\Npfs". Both bash
and conhost had a single handle with that name, and after I forcibly
closed it in the bash process (while it was still suspended by gdb),
conhost.exe indeed died.

Then I looked at the original hanging tree and found that the hanging
bash.exe still has a single handle displayed as "Unnamed file:
\FileSystem\Npfs". I don't know how to check what kernel object it
refers to, but at least its access rights are the same as those of the
hWritePipe I saw on the other machine, and its handle count is 1. So
could it be another copy of hWritePipe, e.g. due to some handle leak?

I don't know how to verify whether this suspicious handle in bash.exe is 
paired with "Unnamed file: \FileSystem\Npfs" in conhost.exe, other than 
by forcibly closing it. If I close it and conhost.exe dies, it will 
confirm "the extra handle" theory, but will also prevent further 
investigation with the hanging tree. Do you have any advice?

Thanks,
Alexey

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Deadlock of the process tree when running make
  2022-04-10 12:13                       ` Alexey Izbyshev
@ 2022-04-10 20:49                         ` Alexey Izbyshev
  2022-04-11  8:35                           ` Takashi Yano
  0 siblings, 1 reply; 32+ messages in thread
From: Alexey Izbyshev @ 2022-04-10 20:49 UTC (permalink / raw)
  To: Takashi Yano; +Cc: cygwin

On 2022-04-10 15:13, Alexey Izbyshev wrote:
> On 2022-04-10 10:34, Takashi Yano wrote:
>> On Sat, 09 Apr 2022 23:26:51 +0300
>> Thanks for investigating. In the normal case, conhost.exe is 
>> terminated
>> when hWritePipe is closed.
> 
> Thanks for confirming.
> 
>> 
>> Possibly, hWritePipe has an incorrect handle value.
> 
> I've verified that the handle was correct by attaching gdb to the
> hanging bash and checking that the hWritePipe field is now zeroed
> (which happens only in the branch where _HandleIsValid returns true
> and hWritePipe is closed).
> 
> I've found something interesting though. I've modeled a similar
> situation on another machine:
> 
> 1. I've run a native process via bash.
> 2. I've attached to bash via gdb and set a breakpoint on 
> ClosePseudoConsole().
> 3. I've killed the native process.
> 4. The breakpoint was hit, and I looked at the hWritePipe value.
> 
> ProcessHacker shows it as "Unnamed file: \FileSystem\Npfs". Both bash
> and conhost had a single handle with that name, and after I forcibly
> closed it in the bash process (while it was still suspended by gdb),
> conhost.exe indeed died.
> 
> Then I looked at the original hanging tree and found that the hanging
> bash.exe still has a single handle displayed as "Unnamed file:
> \FileSystem\Npfs". I don't know how to check what kernel object it
> refers to, but at least its access rights are the same as those of the
> hWritePipe I saw on the other machine, and its handle count is 1. So
> could it be another copy of hWritePipe, e.g. due to some handle leak?
> 
> I don't know how to verify whether this suspicious handle in bash.exe
> is paired with "Unnamed file: \FileSystem\Npfs" in conhost.exe, other
> than by forcibly closing it. If I close it and conhost.exe dies, it
> will confirm "the extra handle" theory, but will also prevent further
> investigation with the hanging tree. Do you have any advice?
> 
I've found something that looked strange to me by checking handles in 
the hanging process tree: the hanging conhost.exe and the hanging 
bash.exe belong to different tests. Each test is a separate shell script 
in a separate make recipe, so it looks like conhost.exe was created by 
one test (which is still hanging at a later point in its script, trying 
to run grep), but then bash.exe belonging to another test somehow got a 
pseudoconsole referring to this conhost.exe and now hangs trying to 
close it. So it looks like Cygwin migrated the pseudoconsole between
processes, and indeed fhandler_pty_slave::close_pseudoconsole() contains
something that looks like migration logic. This logic contains the
following call:

DuplicateHandle (GetCurrentProcess (),
                  ttyp->h_pcon_write_pipe,
                  new_owner, &new_write_pipe,
                  0, TRUE, DUPLICATE_SAME_ACCESS);

Is it safe to create an *inheritable* handle in another process here? 
Could it be that the target process spawns a child at the wrong moment 
(e.g. before it even knows about the newly created handle), and that 
handle unintentionally leaks into the child, triggering the hang 
afterwards?

Similarly suspicious code is also in
fhandler_pty_common::resize_pseudo_console():

   DuplicateHandle (pcon_owner, get_ttyp ()->h_pcon_write_pipe,
                    GetCurrentProcess (), &hpcon_local.hWritePipe,
                    0, TRUE, DUPLICATE_SAME_ACCESS);
   ResizePseudoConsole ((HPCON) &hpcon_local, size);
   CloseHandle (pcon_owner);
   CloseHandle (hpcon_local.hWritePipe);

If another thread spawns a child using 
CreateProcess(bInheritHandles=TRUE) between DuplicateHandle() and 
CloseHandle(hpcon_local.hWritePipe), the handle will leak into the 
child.
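
To illustrate why a leaked copy of a pipe's write end matters, here is a
minimal POSIX analogue (an assumption on my part: it uses os.pipe() and
os.dup() instead of Windows handles and CreateProcess() inheritance, and
the function names are made up for the example). The reader does not see
end-of-file until every copy of the write end is closed, which is the
situation a handle leaking into a child would create.

```python
import os


def reader_sees_eof(read_fd):
    """Return True if a non-blocking read reports EOF, False if it would block."""
    try:
        return os.read(read_fd, 1) == b""
    except BlockingIOError:
        return False  # some write end is still open somewhere


def demo_leaked_write_end():
    read_fd, write_fd = os.pipe()
    os.set_blocking(read_fd, False)

    # Stand-in for the copy that an unrelated child inherits by accident.
    leaked_fd = os.dup(write_fd)
    # The owner closes "its" write end, expecting the reader to see EOF.
    os.close(write_fd)

    before = reader_sees_eof(read_fd)  # the leaked copy keeps the pipe open
    os.close(leaked_fd)                # the stray copy finally goes away
    after = reader_sees_eof(read_fd)
    os.close(read_fd)
    return before, after


if __name__ == "__main__":
    print(demo_leaked_write_end())
```

On a POSIX system this should report (False, True): no EOF while the
stray copy exists, EOF once it is closed. Marking the duplicated handle
non-inheritable avoids creating such stray copies in the first place.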

Sorry if this is a false lead, I haven't tried to really understand the 
pseudoconsole-related code yet.

Thanks,
Alexey

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Deadlock of the process tree when running make
  2022-04-09 17:23             ` Alexey Izbyshev
  2022-04-09 17:54               ` Takashi Yano
@ 2022-04-11  5:23               ` Jeremy Drake
  2022-04-11  8:36                 ` Takashi Yano
  2022-04-11 15:28                 ` Alexey Izbyshev
  1 sibling, 2 replies; 32+ messages in thread
From: Jeremy Drake @ 2022-04-11  5:23 UTC (permalink / raw)
  To: cygwin

On Sat, 9 Apr 2022, Alexey Izbyshev wrote:

> I don't have mintty because "make" is run via an SSH session. I suppose
> I should look into sshd in this case?

Sshd wouldn't happen to be running as a service, would it?

https://cygwin.com/pipermail/cygwin-patches/2022q2/011867.html


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Deadlock of the process tree when running make
  2022-04-10 20:49                         ` Alexey Izbyshev
@ 2022-04-11  8:35                           ` Takashi Yano
  2022-04-11 10:10                             ` Alexey Izbyshev
  0 siblings, 1 reply; 32+ messages in thread
From: Takashi Yano @ 2022-04-11  8:35 UTC (permalink / raw)
  To: cygwin; +Cc: Alexey Izbyshev

On Sun, 10 Apr 2022 23:49:29 +0300
Alexey Izbyshev wrote:
> On 2022-04-10 15:13, Alexey Izbyshev wrote:
> > On 2022-04-10 10:34, Takashi Yano wrote:
> >> On Sat, 09 Apr 2022 23:26:51 +0300
> >> Thanks for investigating. In the normal case, conhost.exe is 
> >> terminated
> >> when hWritePipe is closed.
> > 
> > Thanks for confirming.
> > 
> >> 
> >> Possibly, hWritePipe has an incorrect handle value.
> > 
> > I've verified that the handle was correct by attaching gdb to the
> > hanging bash and checking that the hWritePipe field is now zeroed
> > (which happens only in the branch where _HandleIsValid returns true
> > and hWritePipe is closed).
> > 
> > I've found something interesting though. I've modeled a similar
> > situation on another machine:
> > 
> > 1. I've run a native process via bash.
> > 2. I've attached to bash via gdb and set a breakpoint on 
> > ClosePseudoConsole().
> > 3. I've killed the native process.
> > 4. The breakpoint was hit, and I looked at the hWritePipe value.
> > 
> > ProcessHacker shows it as "Unnamed file: \FileSystem\Npfs". Both bash
> > and conhost had a single handle with that name, and after I forcibly
> > closed it in the bash process (while it was still suspended by gdb),
> > conhost.exe indeed died.
> > 
> > Then I looked at the original hanging tree and found that the hanging
> > bash.exe still has a single handle displayed as "Unnamed file:
> > \FileSystem\Npfs". I don't know how to check what kernel object it
> > refers to, but at least its access rights are the same as those of the
> > hWritePipe I saw on the other machine, and its handle count is 1. So
> > could it be another copy of hWritePipe, e.g. due to some handle leak?
> > 
> > I don't know how to verify whether this suspicious handle in bash.exe
> > is paired with "Unnamed file: \FileSystem\Npfs" in conhost.exe, other
> > than by forcibly closing it. If I close it and conhost.exe dies, it
> > will confirm "the extra handle" theory, but will also prevent further
> > investigation with the hanging tree. Do you have any advice?
> > 
> I've found something that looked strange to me by checking handles in 
> the hanging process tree: the hanging conhost.exe and the hanging 
> bash.exe belong to different tests. Each test is a separate shell script 
> in a separate make recipe, so it looks like conhost.exe was created by 
> one test (which is still hanging at a later point in its script, trying 
> to run grep), but then bash.exe belonging to another test somehow got a 
> pseudoconsole referring to this conhost.exe and now hangs trying to 
> close it. So it looks like Cygwin migrated the pseudoconsole between 
> processes, and indeed fhandler_pty_slave::close_pseudoconsole() contains 
> something that looks like migration logic. This logic contains the 
> following call:
> 
> DuplicateHandle (GetCurrentProcess (),
>                   ttyp->h_pcon_write_pipe,
>                   new_owner, &new_write_pipe,
>                   0, TRUE, DUPLICATE_SAME_ACCESS);
> 
> Is it safe to create an *inheritable* handle in another process here? 
> Could it be that the target process spawns a child at the wrong moment 
> (e.g. before it even knows about the newly created handle), and that 
> handle unintentionally leaks into the child, triggering the hang 
> afterwards?

Thanks for finding that! As you pointed out, hWritePipe should not
be inheritable. That might be the cause.

A countermeasure version is available at the following location:
https://tyan0.yr32.net/cygwin/x86/test/cygwin1-20220411.dll.xz
https://tyan0.yr32.net/cygwin/x86_64/test/cygwin1-20220411.dll.xz

Could you please test? To keep the hanging tree, please install
Cygwin in another directory, and replace cygwin1.dll with the
countermeasure version.

If you want to set up another sshd, please use a command such as:
ssh-host-config --name cygsshd2 --port 2222

To remove the sshd installed by the above command:
cygrunsrv -E cygsshd2
cygrunsrv -R cygsshd2

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Deadlock of the process tree when running make
  2022-04-11  5:23               ` Jeremy Drake
@ 2022-04-11  8:36                 ` Takashi Yano
  2022-04-11 15:28                 ` Alexey Izbyshev
  1 sibling, 0 replies; 32+ messages in thread
From: Takashi Yano @ 2022-04-11  8:36 UTC (permalink / raw)
  To: cygwin; +Cc: Jeremy Drake

On Sun, 10 Apr 2022 22:23:06 -0700 (PDT)
Jeremy Drake wrote:
> On Sat, 9 Apr 2022, Alexey Izbyshev wrote:
> 
> > I don't have mintty because "make" is run via an SSH session. I suppose
> > I should look into sshd in this case?
> 
> Sshd wouldn't happen to be running as a service, would it?
> 
> https://cygwin.com/pipermail/cygwin-patches/2022q2/011867.html

sshd itself is running as a service; however, the user session
is not. So this is a different issue.

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Deadlock of the process tree when running make
  2022-04-11  8:35                           ` Takashi Yano
@ 2022-04-11 10:10                             ` Alexey Izbyshev
  2022-04-13 16:48                               ` Alexey Izbyshev
  0 siblings, 1 reply; 32+ messages in thread
From: Alexey Izbyshev @ 2022-04-11 10:10 UTC (permalink / raw)
  To: Takashi Yano; +Cc: cygwin

On 2022-04-11 11:35, Takashi Yano wrote:
> On Sun, 10 Apr 2022 23:49:29 +0300
> Alexey Izbyshev wrote:
>> 
>> Is it safe to create an *inheritable* handle in another process here?
>> Could it be that the target process spawns a child at the wrong moment
>> (e.g. before it even knows about the newly created handle), and that
>> handle unintentionally leaks into the child, triggering the hang
>> afterwards?
> 
> Thanks for finding that! As you pointed out, hWritePipe should not
> be inheritable. That might be the cause.
> 
> A countermeasure version is available at the following location:
> https://tyan0.yr32.net/cygwin/x86/test/cygwin1-20220411.dll.xz
> https://tyan0.yr32.net/cygwin/x86_64/test/cygwin1-20220411.dll.xz
> 
> Could you please test? To keep the hanging tree, please install
> Cygwin in another directory, and replace cygwin1.dll with the
> countermeasure version.
> 
Thank you for providing the binaries! I've started testing in a separate 
cygwin installation on the same machine, as you suggested. The hang 
previously took many hours to reproduce, so I'll keep tests running for 
a while and then report back.

Alexey

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Deadlock of the process tree when running make
  2022-04-08 17:04     ` Brian Inglis
@ 2022-04-11 13:27       ` Alexey Izbyshev
  0 siblings, 0 replies; 32+ messages in thread
From: Alexey Izbyshev @ 2022-04-11 13:27 UTC (permalink / raw)
  To: cygwin

On 2022-04-08 20:04, Brian Inglis wrote:
> On 2022-04-08 02:42, Alexey Izbyshev wrote:
>> There is also an additional detail that I forgot to mention: in the 
>> stack trace of all leaf processes as displayed by ProcessHacker, it 
>> seems that the executable entry point is not reached yet. The only 
>> non-Windows-DLL location is in cygwin1.dll, so I suspect that all 
>> processes hang at early initialization in Cygwin's DLL entry point.
> 
> That sounds like BLODA interference from AntiVirus programs:
> 
> 	https://cygwin.com/faq/faq.html#faq.using.bloda
> 
> and can also happen if you use Windows AD, and your users have a lot
> of rights, and a slow server, firewall filtering, or network link, but
> known issues were fixed a few releases ago.
> 

It seems that a potential cause of the hang has been identified in 
another discussion thread, and I'm now testing a patched version 
provided by Takashi Yano.

Anyway, thanks for your time! And just in case the identified cause 
turns out to be wrong, I'm answering your questions below.

We don't use any third-party AV products on this machine (it's an 
internal box used only for CI), and we've disabled Real-Time Protection 
in the Windows AV (it causes a terrible performance degradation, a
slowdown of something like 1.5-2 times).

> Any idea how much address space is used by Cygwin DLLs, and memory by
> all the processes running: run rebase -is to see if you could be out
> of address space for Cygwin and DLLs, and how much is left for
> processes?
> 
> Do you have a decent amount of memory free on your system while
> running, and Windows paging space allocated to back it up - total
> twice memory, and do you have multiple drives to spread it across?
> Check your system memory and paging activity while those processes are 
> running.
> 
The peak memory consumption of our tests never exceeds 30% of RAM. Also, 
Cygwin is used (almost) exclusively for the test harness (the actual 
software under testing is native), and there are no heavyweight 
processes in it, mostly just make, bash and some coreutils. So I don't 
think we could hit address space issues even on 32-bit Cygwin.

> Could you try installing Cygwin64 packages and running those instead
> of Cygwin32 (recommended as Cygwin32 support will be dropped next
> release) as there is more address space available as well as usable
> memory for processes?

We test both 32-bit and 64-bit builds of our software, and a couple of 
tests need to run (Cygwin) make under debugging. Because a 32-bit 
process can't debug a 64-bit one, we simply use 32-bit Cygwin for both 
cases. But if the need to reproduce under 64-bit Cygwin arises, I can
simply exclude the problematic tests (they're unlikely to be relevant to
the hang).

Alexey

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Re: Deadlock of the process tree when running make
  2022-04-11  5:23               ` Jeremy Drake
  2022-04-11  8:36                 ` Takashi Yano
@ 2022-04-11 15:28                 ` Alexey Izbyshev
  2022-04-11 17:02                   ` Jeremy Drake
  1 sibling, 1 reply; 32+ messages in thread
From: Alexey Izbyshev @ 2022-04-11 15:28 UTC (permalink / raw)
  To: cygwin; +Cc: cygwin

On 2022-04-11 08:23, Jeremy Drake wrote:
> On Sat, 9 Apr 2022, Alexey Izbyshev wrote:
>> I don't have mintty because "make" is run via an SSH session. I 
>> suppose
>> I should look into sshd in this case?

> Sshd wouldn't happen to be running as a service, would it?

> https://cygwin.com/pipermail/cygwin-patches/2022q2/011867.html

(I noticed your message in the mailing list archive. Please CC me when
replying, as I'm not subscribed.)

Yes, sshd is running as a service, but I'm not sure that patch is 
relevant. In my case, the problematic pipe that the hanging conhost.exe 
is waiting on is probably created for that specific conhost.exe process 
within the process tree rooted at "make", which runs as an ordinary 
user. Also, wouldn't the hang be deterministic if the problem were in 
the pipe ownership?

Alexey




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Re: Deadlock of the process tree when running make
  2022-04-11 15:28                 ` Alexey Izbyshev
@ 2022-04-11 17:02                   ` Jeremy Drake
  0 siblings, 0 replies; 32+ messages in thread
From: Jeremy Drake @ 2022-04-11 17:02 UTC (permalink / raw)
  To: Alexey Izbyshev; +Cc: cygwin

On Mon, 11 Apr 2022, Alexey Izbyshev wrote:

> Yes, sshd is running as a service, but I'm not sure that patch is relevant. In
> my case, the problematic pipe that the hanging conhost.exe is waiting on is
> probably created for that specific conhost.exe process within the process tree
> rooted at "make", which runs as an ordinary user. Also, wouldn't the hang be
> deterministic if the problem were in the pipe ownership?

Yes it would.  I just noticed some of the evidence pointing that way - a
presumably native javac.exe, an anonymous "named pipe" handle, and then
when I saw sshd involved the last piece required for that scenario -
running as a service.  But Takashi's reply sounds like sshd drops the
well-known service sid when it switches to the logged-on user's token
anyway.

This is both good and bad, I guess.  Bad because your problem may not be
solved yet (though maybe with the latest test dll, fingers crossed!).
Good because there's a mystery hang that's been plaguing me when running
(under emulation) on Windows on ARM64 that the circumstances of that
environment have made virtually impossible to debug, and every commit that
mentions fixing a deadlock gives me new hope that that will be the fix
that makes it go away.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Deadlock of the process tree when running make
  2022-04-11 10:10                             ` Alexey Izbyshev
@ 2022-04-13 16:48                               ` Alexey Izbyshev
  2022-04-13 17:22                                 ` Takashi Yano
  2022-04-13 23:17                                 ` Alexey Izbyshev
  0 siblings, 2 replies; 32+ messages in thread
From: Alexey Izbyshev @ 2022-04-13 16:48 UTC (permalink / raw)
  To: Takashi Yano; +Cc: cygwin

On 2022-04-11 13:10, Alexey Izbyshev wrote:
> On 2022-04-11 11:35, Takashi Yano wrote:
>> On Sun, 10 Apr 2022 23:49:29 +0300
>> A countermeasure version is available at the following location:
>> https://tyan0.yr32.net/cygwin/x86/test/cygwin1-20220411.dll.xz
>> https://tyan0.yr32.net/cygwin/x86_64/test/cygwin1-20220411.dll.xz
>> 
>> Could you please test? To keep the hanging tree, please install
>> Cygwin in another directory, and replace cygwin1.dll with the
>> countermeasure version.
>> 
> Thank you for providing the binaries! I've started testing in a
> separate cygwin installation on the same machine, as you suggested.
> The hang previously took many hours to reproduce, so I'll keep tests
> running for a while and then report back.
> 
The good news is that the tests have been running for two days so far 
without any cygwin-related issues, so the patched version doesn't seem 
to introduce new issues.

The bad news is that my theory that the suspicious "Unnamed file:
\FileSystem\Npfs" handle in the hanging bash.exe is a leak seems to be
wrong. I've closed that handle, but conhost.exe hasn't unblocked. All of
its threads are doing the same things as before:

1. Tries to enter a critical section. (Task Manager claims it waits for
thread 4, so probably the latter owns it).
2. ReadFile("pty1-from-master-nat" named pipe)
3. Waits for an anonymous event.
4. Waits on a handle for "\Device\ConDrv" (in DeviceIoControl()).
5. Blocked in GetMessageW().

I've again created a model situation on another machine, with bash.exe
stopped at a breakpoint in ClosePseudoConsole(), and it seems that last
time I missed that bash.exe contains *two* handles for (different)
"Unnamed file: \FileSystem\Npfs" objects there too, so that seems to be
normal.

What's probably not normal is the behavior of the hanging conhost.exe.
I've compared the points where conhost.exe is blocked, and all but one
of the threads in the model case are doing the same things as in the
hanging case, but the remaining thread is blocked in
ReadFile("\Device\NamedPipe\") (i.e. the read end of "hWritePipe" of
pcon) instead of trying to enter a critical section like thread 1 above.
So now I'm starting to doubt that it's a Cygwin bug rather than some
conhost.exe bug.

I'll try to poke around the hanging conhost.exe some more, and may also
try to create a faster reproducer.

Thanks for your help so far,
Alexey

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Deadlock of the process tree when running make
  2022-04-13 16:48                               ` Alexey Izbyshev
@ 2022-04-13 17:22                                 ` Takashi Yano
  2022-04-13 17:27                                   ` Alexey Izbyshev
  2022-04-13 23:17                                 ` Alexey Izbyshev
  1 sibling, 1 reply; 32+ messages in thread
From: Takashi Yano @ 2022-04-13 17:22 UTC (permalink / raw)
  To: cygwin; +Cc: Alexey Izbyshev

On Wed, 13 Apr 2022 19:48:04 +0300
Alexey Izbyshev wrote:
> On 2022-04-11 13:10, Alexey Izbyshev wrote:
> > On 2022-04-11 11:35, Takashi Yano wrote:
> >> On Sun, 10 Apr 2022 23:49:29 +0300
> >> A countermeasure version is available at the following location:
> >> https://tyan0.yr32.net/cygwin/x86/test/cygwin1-20220411.dll.xz
> >> https://tyan0.yr32.net/cygwin/x86_64/test/cygwin1-20220411.dll.xz
> >> 
> >> Could you please test? To keep the hanging tree, please install
> >> Cygwin in another directory, and replace cygwin1.dll with the
> >> countermeasure version.
> >> 
> > Thank you for providing the binaries! I've started testing in a
> > separate cygwin installation on the same machine, as you suggested.
> > The hang previously took many hours to reproduce, so I'll keep tests
> > running for a while and then report back.
> > 
> The good news is that the tests have been running for two days so far 
> without any cygwin-related issues, so the patched version doesn't seem 
> to introduce new issues.
> 
> The bad news is that my theory that the suspicious "Unnamed file: 
> \FileSystem\Npfs" handle in the hanging bash.exe is a leak seems to be 
> wrong. I've closed that handle, but conhost.exe hasn't unblocked. All of 
> its threads are doing the same things as before:
> 
> 1. Tries to enter a critical section. (Task Manager claims it waits for
> thread 4, so probably the latter owns it).
> 2. ReadFile("pty1-from-master-nat" named pipe)
> 3. Waits for an anonymous event.
> 4. Waits on a handle for "\Device\ConDrv" (in DeviceIoControl()).
> 5. Blocked in GetMessageW().
> 
> I've again created a model situation on another machine, with bash.exe 
> stopped at a breakpoint in ClosePseudoConsole(), and it seems that last 
> time I missed that bash.exe contains *two* handles for (different) 
> "Unnamed file: \FileSystem\Npfs" objects there too, so that seems to be 
> normal.
> 
> What's probably not normal is the behavior of the hanging conhost.exe. 
> I've compared the points where conhost.exe is blocked, and all but one 
> of the threads in the model case are doing the same things as in the 
> hanging case, but the remaining thread is blocked in 
> ReadFile("\Device\NamedPipe\") (i.e. the read end of "hWritePipe" of 
> pcon) instead of trying to enter a critical section like thread 1 above. 
> So now I'm starting to doubt that it's a Cygwin bug rather than some 
> conhost.exe bug.
> 
> I'll try to poke around the hanging conhost.exe some more, and may also 
> try to create a faster reproducer.

Thanks for testing.

The question is:
Has the issue been reproduced with the new cygwin1.dll? Or is it still
running without the issue so far?

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Deadlock of the process tree when running make
  2022-04-13 17:22                                 ` Takashi Yano
@ 2022-04-13 17:27                                   ` Alexey Izbyshev
  0 siblings, 0 replies; 32+ messages in thread
From: Alexey Izbyshev @ 2022-04-13 17:27 UTC (permalink / raw)
  To: Takashi Yano; +Cc: cygwin

On 2022-04-13 20:22, Takashi Yano wrote:
> On Wed, 13 Apr 2022 19:48:04 +0300
> Alexey Izbyshev wrote:
>> On 2022-04-11 13:10, Alexey Izbyshev wrote:
>> The good news is that the tests have been running for two days so far
>> without any cygwin-related issues, so the patched version doesn't seem
>> to introduce new issues.
>> 
> Thanks for testing.
> 
> The question is:
> Has the issue been reproduced with the new cygwin1.dll? Or is it still
> running without the issue so far?

It's still running without any issues with the new DLL. The experiment 
with closing a handle was done with the old hanging tree.

Alexey

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Deadlock of the process tree when running make
  2022-04-13 16:48                               ` Alexey Izbyshev
  2022-04-13 17:22                                 ` Takashi Yano
@ 2022-04-13 23:17                                 ` Alexey Izbyshev
  2022-04-16  9:39                                   ` Takashi Yano
  1 sibling, 1 reply; 32+ messages in thread
From: Alexey Izbyshev @ 2022-04-13 23:17 UTC (permalink / raw)
  To: Takashi Yano; +Cc: cygwin

On 2022-04-13 19:48, Alexey Izbyshev wrote:
> On 2022-04-11 13:10, Alexey Izbyshev wrote:
> What's probably not normal is the behavior of the hanging conhost.exe.
> I've compared the points where conhost.exe is blocked, and all but one
> of the threads in the model case are doing the same things as in the
> hanging case, but the remaining thread is blocked in
> ReadFile("\Device\NamedPipe\") (i.e. the read end of "hWritePipe" of
> pcon) instead of trying to enter a critical section like thread 1
> above. So now I'm starting to doubt that it's a Cygwin bug rather than
> some conhost.exe bug.
> 
> I'll try to poke around the hanging conhost.exe some more, and may
> also try to create a faster reproducer.
> 
I've studied the conhost.exe hang, and conhost.exe does indeed look buggy.

TLDR: https://github.com/microsoft/terminal/pull/12181

The full story:

I dumped conhost.exe, opened the dump in windbg and looked at the stack 
trace of the hanging thread:

ntdll!NtWaitForAlertByThreadId+0x14
ntdll!RtlpWaitOnAddressWithTimeout+0x81
ntdll!RtlpWaitOnAddress+0xae
ntdll!RtlpWaitOnCriticalSection+0xfd
ntdll!RtlpEnterCriticalSectionContended+0x1c4
ntdll!RtlEnterCriticalSection+0x42
conhost!Microsoft::Console::Render::Renderer::_PaintFrameForEngine+0x54
conhost!Microsoft::Console::Render::Renderer::TriggerTeardown+0x19e60
conhost!Microsoft::Console::Interactivity::ServiceLocator::RundownAndExit+0x21
conhost!Microsoft::Console::PtySignalInputThread::_GetData+0x65
conhost!Microsoft::Console::PtySignalInputThread::_InputThread+0x25
kernel32!BaseThreadInitThunk+0x14
ntdll!RtlUserThreadStart+0x21

By looking at the assembly, I've found that it hangs *after* ReadFile()
on the pipe completes, so the problem is definitely not a leak of
hWritePipe in bash.exe or elsewhere.

Using the function names, I've found this issue: 
https://github.com/microsoft/terminal/issues/1810.

This is a different one, but the discussion and the patch show that
synchronization on startup/shutdown is a disaster.

Then I looked at the code and found that the hang happens while
attempting to lock the console at [1]. After studying how this lock is
used in other parts of the code, I noticed that
PtySignalInputThread::_Shutdown() (which is further up in the call stack
of the hanging function) uses ProcessCtrlEvents() incorrectly, because
the latter unconditionally unlocks the console, but the lock is never
taken by this thread at this point. Then I looked at a more recent
version of the code and discovered the patch to _Shutdown() which I
referenced above.
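
As a rough illustration of why an unbalanced unlock can make a later lock
attempt hang, here is a deliberately simplified toy model. It is not the
real ntdll critical section or conhost code; the class and function names
are hypothetical, and whether the real console lock fails in exactly this
way is an assumption.

```python
class ToyCriticalSection:
    """Simplified, single-threaded lock model; NOT the real implementation.

    lock_count is -1 when the lock is free. try_enter() increments it and
    succeeds only if the result is 0. leave() decrements it without
    checking whether the caller actually holds the lock.
    """

    def __init__(self):
        self.lock_count = -1

    def try_enter(self):
        self.lock_count += 1
        if self.lock_count == 0:
            return True
        self.lock_count -= 1  # undo; a blocking enter would wait here instead
        return False

    def leave(self):
        self.lock_count -= 1  # no ownership check, like an unconditional unlock


def shutdown_without_lock(console_lock):
    # Models the unpatched path: an unlock with no matching lock before it.
    console_lock.leave()


def shutdown_with_lock(console_lock):
    # Models the patched path: take the lock first so the unlock is balanced.
    if not console_lock.try_enter():
        raise RuntimeError("lock unexpectedly busy")
    console_lock.leave()
```

In this toy, a fresh lock that goes through shutdown_without_lock() can
never be entered again, because its counter no longer passes through the
"free" value, while shutdown_with_lock() leaves it usable.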

I've also verified that the assembly of _Shutdown() (which is inlined
into PtySignalInputThread::_GetData()) corresponds to the unpatched
version (i.e. without the LockConsole() call):

call    conhost!CloseConsoleProcessState (00007ff6`22e7013c)
call    conhost!ProcessCtrlEvents (00007ff6`22e262a0)
mov     ecx,6Dh
call    conhost!Microsoft::Console::Interactivity::ServiceLocator::RundownAndExit (00007ff6`22e3c730)

I'm not sure why this bug is not triggered more frequently, but one 
possible reason, as indicated by comment [2], is that the bad path is 
only taken if there are live clients after ClosePseudoConsole() is 
called, which is probably rare.

A potential workaround on Cygwin side would be to ensure that the 
pseudoconsole doesn't have clients before calling ClosePseudoConsole(), 
but I don't know whether it's possible.

[1] 
https://github.com/microsoft/terminal/blob/9b92986b49bed8cc41fde4d6ef080921c41e6d9e/src/renderer/base/renderer.cpp#L75

[2] 
https://github.com/microsoft/terminal/blob/9b92986b49bed8cc41fde4d6ef080921c41e6d9e/src/host/PtySignalInputThread.cpp#L205

Alexey


* Re: Deadlock of the process tree when running make
  2022-04-13 23:17                                 ` Alexey Izbyshev
@ 2022-04-16  9:39                                   ` Takashi Yano
  2022-04-16 13:21                                     ` Alexey Izbyshev
  0 siblings, 1 reply; 32+ messages in thread
From: Takashi Yano @ 2022-04-16  9:39 UTC (permalink / raw)
  To: cygwin; +Cc: Alexey Izbyshev

On Thu, 14 Apr 2022 02:17:38 +0300
Alexey Izbyshev wrote:
> On 2022-04-13 19:48, Alexey Izbyshev wrote:
> > On 2022-04-11 13:10, Alexey Izbyshev wrote:
> > What's probably not normal is the behavior of the hanging conhost.exe.
> > I've compared the points where conhost.exe is blocked, and all but one
> > threads in the model case are doing the same things as in the hanging
> > case, but the remaining thread is blocked in
> > ReadFile("\Device\NamedPipe\") (i.e. the read end of "hWritePipe" of
> > pcon) instead of trying to enter a critical section like thread 1
> > above. So now I'm starting to doubt that it's a cygwin bug and not
> > some conhost.exe bug.
> > 
> > I'll try to poke around the hanging conhost.exe some more, and also
> > may be will try to create a faster reproducer.
> > 
> I've studied conhost.exe hang, and it indeed looks like it's buggy.
> 
> TLDR: https://github.com/microsoft/terminal/pull/12181
> 
> The full story:
> 
> I dumped conhost.exe, opened the dump in windbg and looked at the stack 
> trace of the hanging thread:
> 
> ntdll!NtWaitForAlertByThreadId+0x14
> ntdll!RtlpWaitOnAddressWithTimeout+0x81
> ntdll!RtlpWaitOnAddress+0xae
> ntdll!RtlpWaitOnCriticalSection+0xfd
> ntdll!RtlpEnterCriticalSectionContended+0x1c4
> ntdll!RtlEnterCriticalSection+0x42
> conhost!Microsoft::Console::Render::Renderer::_PaintFrameForEngine+0x54
> conhost!Microsoft::Console::Render::Renderer::TriggerTeardown+0x19e60
> conhost!Microsoft::Console::Interactivity::ServiceLocator::RundownAndExit+0x21
> conhost!Microsoft::Console::PtySignalInputThread::_GetData+0x65
> conhost!Microsoft::Console::PtySignalInputThread::_InputThread+0x25
> kernel32!BaseThreadInitThunk+0x14
> ntdll!RtlUserThreadStart+0x21
> 
> By looking at assembly, I've found that it hangs *after* ReadFile() on 
> the pipe completes, so the problem is definitely not a leak of 
> hWritePipe in bash.exe or elsewhere.
> 
> Using the function names, I've found this issue: 
> https://github.com/microsoft/terminal/issues/1810.
> 
> This is a different one, but the discussion and the patch show that 
> synchronization on startup/shutdown is a disaster.
> 
> Then I looked at the code and identified that the hang happens while 
> attempting to lock the console at [1]. After studying how this lock is 
> used in other parts of the code, I noticed that 
> PtySignalInputThread::_Shutdown() (which is further up in the call stack 
> of the hanging function) uses ProcessCtrlEvents() incorrectly, because 
> the latter unconditionally unlocks the console, but the lock is never 
> taken by this thread at this point. Then I looked at a more recent 
> version of the code and discovered the patch to _Shutdown() which I 
> referenced above.
> 
> I've also verified that assembly of _Shutdown() (which is inlined into 
> PtySignalInputThread::_GetData()) corresponds to the unpatched version 
> (i.e. without LockConsole() call):
> 
> call    conhost!CloseConsoleProcessState (00007ff6`22e7013c)
> call    conhost!ProcessCtrlEvents (00007ff6`22e262a0)
> mov     ecx,6Dh
> call    conhost!Microsoft::Console::Interactivity::ServiceLocator::RundownAndExit (00007ff6`22e3c730)
> 
> I'm not sure why this bug is not triggered more frequently, but one 
> possible reason, as indicated by comment [2], is that the bad path is 
> only taken if there are live clients after ClosePseudoConsole() is 
> called, which is probably rare.
> 
> A potential workaround on Cygwin side would be to ensure that the 
> pseudoconsole doesn't have clients before calling ClosePseudoConsole(), 
> but I don't know whether it's possible.

I am not sure yet what is essential, but the current code closes
pseudo console only if there is no other process which is attaching
to the pseudo console. I wonder why javac.exe is remaining as
zombie. The parent bash.exe calls ClosePseudoConsole() when
child non-cygwin app is terminated, i.e., after WaitForSingleObject()
for child process handle returns.
https://www.cygwin.com/git/?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/spawn.cc;h=81dba5a941e919ea2514013069aef22c6fad8004;hb=7ac0767053e278f0ce9811bf6f77278bd2f49c20#l1009

What does the "zombie" mean? Is it listed in the process list of
ProcessHacker? I still suspect that the zombie javac.exe holds
the hWritePipe handle leaked from parent bash.exe.

> [1] 
> https://github.com/microsoft/terminal/blob/9b92986b49bed8cc41fde4d6ef080921c41e6d9e/src/renderer/base/renderer.cpp#L75
> 
> [2] 
> https://github.com/microsoft/terminal/blob/9b92986b49bed8cc41fde4d6ef080921c41e6d9e/src/host/PtySignalInputThread.cpp#L205


-- 
Takashi Yano <takashi.yano@nifty.ne.jp>


* Re: Deadlock of the process tree when running make
  2022-04-16  9:39                                   ` Takashi Yano
@ 2022-04-16 13:21                                     ` Alexey Izbyshev
  2022-04-27 11:22                                       ` Takashi Yano
  0 siblings, 1 reply; 32+ messages in thread
From: Alexey Izbyshev @ 2022-04-16 13:21 UTC (permalink / raw)
  To: Takashi Yano; +Cc: cygwin

On 2022-04-16 12:39, Takashi Yano wrote:
> I am not sure yet what is essential, but the current code closes
> pseudo console only if there is no other process which is attaching
> to the pseudo console. I wonder why javac.exe is remaining as
> zombie. The parent bash.exe calls ClosePseudoConsole() when
> child non-cygwin app is terminated, i.e., after WaitForSingleObject()
> for child process handle returns.
> https://www.cygwin.com/git/?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/spawn.cc;h=81dba5a941e919ea2514013069aef22c6fad8004;hb=7ac0767053e278f0ce9811bf6f77278bd2f49c20#l1009
> 
> What does the "zombie" mean? Is it listed in the process list of
> ProcessHacker? I still suspect that the zombie javac.exe holds
> the  hWritePipe handle leaked from parent bash.exe.
> 
By "zombie" I meant the same thing as in the Linux kernel: a data 
structure that remains after a process terminated, but hasn't been 
waited for yet (I don't know how this is implemented in Cygwin). So 
there is no javac.exe process in ProcessHacker, but "ps" and similar 
tools in Cygwin still list "javac".
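
For readers unfamiliar with the term, here is a minimal POSIX sketch in Python of that state. It only illustrates the general Unix notion on a POSIX system; how Cygwin implements it internally is not shown. The helper names (make_zombie, still_listed, reap) and the exit code 7 are made up for illustration.

```python
import os


def make_zombie():
    """Fork a child that exits at once; return its PID while it is a zombie."""
    pid = os.fork()
    if pid == 0:
        os._exit(7)  # child: terminate immediately with a recognizable code
    # WNOWAIT blocks until the child has exited but does NOT reap it,
    # so the child stays behind as a zombie entry.
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    return pid


def still_listed(pid):
    """True if the kernel still has a process entry for `pid`."""
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is sent
        return True
    except ProcessLookupError:
        return False


def reap(pid):
    """Wait for the child, which removes the zombie; return its exit code."""
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)
```

Between make_zombie() and reap(), the process has finished running but still appears in process listings, which matches "ps" showing "javac" with no javac.exe behind it.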

I'm now trying to create a small reproducer that I can share, and I've 
had a first small success this night: I could get a very similar hang 
with a simple Makefile and a script with Cygwin 3.3.4. Here is the tree:

make(14479)-+-bash(14484)---bash(14611)
             |-bash(14515)---bash(14618)
             |-bash(14491)---bash(14500)---bash(14612)
             |-bash(14501)---bash(14510)---bash(14605)
             |-bash(14505)---bash(14607)
             |-bash(14494)---bash(14617)
             |-bash(14506)---bash(14513)---bash(14610)
             |-bash(14512)---bash(14518)---bash(14615)
             |-bash(14486)---bash(14495)---bash(14606)
             |-bash(14483)---bash(14490)---bash(14609)
             |-bash(14509)---bash(14614)
             |-bash(14489)---bash(14608)
             |-bash(14499)---bash(14613)
             |-bash(14481)---bash(14485)---python(14588)
             |-bash(14496)---bash(14504)---bash(14616)
             `-bash(14482)---bash(14604)


"python" is a zombie, just as "javac" is in the original case. There is 
also a single "conhost.exe" again, and all of its 5 threads are doing 
the same things as in the original case (including the signal pipe 
thread trying to EnterCriticalSection()). The only difference is that 
leaf bash.exe are trying to acquire pcon mutex at a different point [1], 
but I guess this difference is not important.

I'll try this reproducer with your patched DLL as well as on another 
machine and share it in case of success.

Thanks,
Alexey

[1] 
https://www.cygwin.com/git?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/spawn.cc;h=81dba5a941e919ea2514013069aef22c6fad8004;hb=cygwin-3_3_4-release#l697


* Re: Deadlock of the process tree when running make
  2022-04-16 13:21                                     ` Alexey Izbyshev
@ 2022-04-27 11:22                                       ` Takashi Yano
  2022-04-27 12:19                                         ` Alexey Izbyshev
  0 siblings, 1 reply; 32+ messages in thread
From: Takashi Yano @ 2022-04-27 11:22 UTC (permalink / raw)
  To: cygwin; +Cc: Alexey Izbyshev

Hi Alexey,

On Sat, 16 Apr 2022 16:21:34 +0300
Alexey Izbyshev wrote:
> On 2022-04-16 12:39, Takashi Yano wrote:
> > I am not sure yet what is essential, but the current code closes
> > pseudo console only if there is no other process which is attaching
> > to the pseudo console. I wonder why javac.exe is remaining as
> > zombie. The parent bash.exe calls ClosePseudoConsole() when
> > child non-cygwin app is terminated, i.e., after WaitForSingleObject()
> > for child process handle returns.
> > https://www.cygwin.com/git/?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/spawn.cc;h=81dba5a941e919ea2514013069aef22c6fad8004;hb=7ac0767053e278f0ce9811bf6f77278bd2f49c20#l1009
> > 
> > What does the "zombie" mean? Is it listed in the process list of
> > ProcessHacker? I still suspect that the zombie javac.exe holds
> > the  hWritePipe handle leaked from parent bash.exe.
> > 
> By "zombie" I meant the same thing as in the Linux kernel: a data 
> structure that remains after a process terminated, but hasn't been 
> waited for yet (I don't know how this is implemented in Cygwin). So 
> there is no javac.exe process in ProcessHacker, but "ps" and similar 
> tools in Cygwin still list "javac".
> 
> I'm now trying to create a small reproducer that I can share, and I've 
> had a first small success this night: I could get a very similar hang 
> with a simple Makefile and a script with Cygwin 3.3.4. Here is the tree:
> 
> make(14479)-+-bash(14484)---bash(14611)
>              |-bash(14515)---bash(14618)
>              |-bash(14491)---bash(14500)---bash(14612)
>              |-bash(14501)---bash(14510)---bash(14605)
>              |-bash(14505)---bash(14607)
>              |-bash(14494)---bash(14617)
>              |-bash(14506)---bash(14513)---bash(14610)
>              |-bash(14512)---bash(14518)---bash(14615)
>              |-bash(14486)---bash(14495)---bash(14606)
>              |-bash(14483)---bash(14490)---bash(14609)
>              |-bash(14509)---bash(14614)
>              |-bash(14489)---bash(14608)
>              |-bash(14499)---bash(14613)
>              |-bash(14481)---bash(14485)---python(14588)
>              |-bash(14496)---bash(14504)---bash(14616)
>              `-bash(14482)---bash(14604)
> 
> 
> "python" is a zombie, just as "javac" is in the original case. There is 
> also a single "conhost.exe" again, and all of its 5 threads are doing 
> the same things as in the original case (including the signal pipe 
> thread trying to EnterCriticalSection()). The only difference is that 
> leaf bash.exe are trying to acquire pcon mutex at a different point [1], 
> but I guess this difference is not important.
> 
> I'll try this reproducer with your patched DLL as well as on another 
> machine and share it in case of success.
> 
> Thanks,
> Alexey
> 
> [1] 
> https://www.cygwin.com/git?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/spawn.cc;h=81dba5a941e919ea2514013069aef22c6fad8004;hb=cygwin-3_3_4-release#l697

Is there any progress on this?

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>


* Re: Deadlock of the process tree when running make
  2022-04-27 11:22                                       ` Takashi Yano
@ 2022-04-27 12:19                                         ` Alexey Izbyshev
  0 siblings, 0 replies; 32+ messages in thread
From: Alexey Izbyshev @ 2022-04-27 12:19 UTC (permalink / raw)
  To: Takashi Yano; +Cc: cygwin

Hi, Takashi,

On 2022-04-27 14:22, Takashi Yano wrote:
> Hi Alexey,
> 
> On Sat, 16 Apr 2022 16:21:34 +0300
> Alexey Izbyshev wrote:
>> On 2022-04-16 12:39, Takashi Yano wrote:
>> > I am not sure yet what is essential, but the current code closes
>> > pseudo console only if there is no other process which is attaching
>> > to the pseudo console. I wonder why javac.exe is remaining as
>> > zombie. The parent bash.exe calls ClosePseudoConsole() when
>> > child non-cygwin app is terminated, i.e., after WaitForSingleObject()
>> > for child process handle returns.
>> > https://www.cygwin.com/git/?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/spawn.cc;h=81dba5a941e919ea2514013069aef22c6fad8004;hb=7ac0767053e278f0ce9811bf6f77278bd2f49c20#l1009
>> >
>> > What does the "zombie" mean? Is it listed in the process list of
>> > ProcessHacker? I still suspect that the zombie javac.exe holds
>> > the  hWritePipe handle leaked from parent bash.exe.
>> >
>> By "zombie" I meant the same thing as in the Linux kernel: a data
>> structure that remains after a process terminated, but hasn't been
>> waited for yet (I don't know how this is implemented in Cygwin). So
>> there is no javac.exe process in ProcessHacker, but "ps" and similar
>> tools in Cygwin still list "javac".
>> 
>> I'm now trying to create a small reproducer that I can share, and I've
>> had a first small success this night: I could get a very similar hang
>> with a simple Makefile and a script with Cygwin 3.3.4. Here is the 
>> tree:
>> 
>> make(14479)-+-bash(14484)---bash(14611)
>>              |-bash(14515)---bash(14618)
>>              |-bash(14491)---bash(14500)---bash(14612)
>>              |-bash(14501)---bash(14510)---bash(14605)
>>              |-bash(14505)---bash(14607)
>>              |-bash(14494)---bash(14617)
>>              |-bash(14506)---bash(14513)---bash(14610)
>>              |-bash(14512)---bash(14518)---bash(14615)
>>              |-bash(14486)---bash(14495)---bash(14606)
>>              |-bash(14483)---bash(14490)---bash(14609)
>>              |-bash(14509)---bash(14614)
>>              |-bash(14489)---bash(14608)
>>              |-bash(14499)---bash(14613)
>>              |-bash(14481)---bash(14485)---python(14588)
>>              |-bash(14496)---bash(14504)---bash(14616)
>>              `-bash(14482)---bash(14604)
>> 
>> 
>> "python" is a zombie, just as "javac" is in the original case. There 
>> is
>> also a single "conhost.exe" again, and all of its 5 threads are doing
>> the same things as in the original case (including the signal pipe
>> thread trying to EnterCriticalSection()). The only difference is that
>> leaf bash.exe are trying to acquire pcon mutex at a different point 
>> [1],
>> but I guess this difference is not important.
>> 
>> I'll try this reproducer with your patched DLL as well as on another
>> machine and share it in case of success.
>> 
>> Thanks,
>> Alexey
>> 
>> [1]
>> https://www.cygwin.com/git?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/spawn.cc;h=81dba5a941e919ea2514013069aef22c6fad8004;hb=cygwin-3_3_4-release#l697
> 
> Is there any progress on this?

During the last week I reproduced the hang on a vanilla 3.3.4 Cygwin 
with a small test multiple times. In one case, the hanging state is even 
minimal, i.e. there is only a bash.exe waiting in ClosePseudoConsole() 
after its native child terminated and a conhost.exe, but no other 
processes trying to acquire pcon mutex. Conhost.exe signal-pipe thread 
is also blocked at the same EnterCriticalSection() call in all cases.

However, I couldn't reproduce the hang with your patched DLL[1] with the 
same test running for multiple days. I can't explain how your change of 
handle inheritability can affect the double-unlock bug in conhost.exe 
that I referenced earlier, so either I'm missing something or I've been 
very unlucky with reproducing. I was going to try to investigate 
conhost.exe logic and state more (in particular, why one of its threads 
still reads from "\Device\ConDrv" after all console clients detached) 
and then reply to you, but I haven't been able to do it yet.

If you want to try to reproduce the hang yourself with 3.3.4, here is 
one of the small tests that I used (it looks strange because it's the 
result of minimizing other code):

$ cat Makefile
T := $(shell echo {1..16})

all: $(T)

$(T):
	@./test.sh $@

$ cat test.sh
#!/bin/bash
set -eu

(
   for ((i = 0; i < 10; i++)); do
     python -c ""
   done
)

$ while make -j16; do echo $((i++)); done

The test can still take multiple hours to hang on my machine.
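
Since a hang may only show up after hours, an unattended run can use a watchdog around the loop. Below is a minimal sketch in Python, not something I used for the runs above; run_until_hang and its parameters are hypothetical names, and the timeout must be chosen well above the normal duration of one iteration. On timeout the process is deliberately left alive so that the hanging tree can be inspected.

```python
import subprocess
import sys


def run_until_hang(cmd, timeout, max_iters):
    """Run `cmd` repeatedly; return (iteration, process) on a suspected hang.

    Returns (None, None) if `cmd` fails or max_iters runs finish in time.
    """
    for i in range(max_iters):
        proc = subprocess.Popen(cmd)
        try:
            rc = proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            # Do not kill anything: the caller may want to attach a debugger.
            print(f"iteration {i}: no exit after {timeout}s", file=sys.stderr)
            return i, proc
        if rc != 0:
            break  # same stop condition as `while make -j16; do ...; done`
    return None, None
```

For example, run_until_hang(["make", "-j16"], timeout=600, max_iters=100000) would mirror the shell loop above, with 600 seconds as an assumed upper bound for one make run.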

If I get any new interesting data, I'll share it.

Thank you,
Alexey

[1] https://tyan0.yr32.net/cygwin/x86/test/cygwin1-20220418.dll.xz


end of thread, other threads:[~2022-04-27 12:19 UTC | newest]

Thread overview: 32+ messages
2022-04-07 21:53 Deadlock of the process tree when running make Alexey Izbyshev
2022-04-07 23:54 ` Brian Inglis
2022-04-08  8:42   ` Alexey Izbyshev
2022-04-08 17:04     ` Brian Inglis
2022-04-11 13:27       ` Alexey Izbyshev
2022-04-09 10:17 ` Takashi Yano
2022-04-09 11:00   ` Alexey Izbyshev
2022-04-09 11:02     ` Alexey Izbyshev
2022-04-09 11:46       ` Takashi Yano
2022-04-09 16:07         ` Alexey Izbyshev
2022-04-09 16:57           ` Takashi Yano
2022-04-09 17:23             ` Alexey Izbyshev
2022-04-09 17:54               ` Takashi Yano
2022-04-09 19:35                 ` Alexey Izbyshev
2022-04-09 20:26                   ` Alexey Izbyshev
2022-04-10  7:34                     ` Takashi Yano
2022-04-10 12:13                       ` Alexey Izbyshev
2022-04-10 20:49                         ` Alexey Izbyshev
2022-04-11  8:35                           ` Takashi Yano
2022-04-11 10:10                             ` Alexey Izbyshev
2022-04-13 16:48                               ` Alexey Izbyshev
2022-04-13 17:22                                 ` Takashi Yano
2022-04-13 17:27                                   ` Alexey Izbyshev
2022-04-13 23:17                                 ` Alexey Izbyshev
2022-04-16  9:39                                   ` Takashi Yano
2022-04-16 13:21                                     ` Alexey Izbyshev
2022-04-27 11:22                                       ` Takashi Yano
2022-04-27 12:19                                         ` Alexey Izbyshev
2022-04-11  5:23               ` Jeremy Drake
2022-04-11  8:36                 ` Takashi Yano
2022-04-11 15:28                 ` Alexey Izbyshev
2022-04-11 17:02                   ` Jeremy Drake
