From: Alexey Izbyshev
To: Takashi Yano
Cc: cygwin@cygwin.com
Date: Thu, 14 Apr 2022 02:17:38 +0300
Subject: Re: Deadlock of the process tree when running make

On 2022-04-13 19:48, Alexey Izbyshev wrote:
> On 2022-04-11 13:10, Alexey Izbyshev wrote:
> What's probably not normal is the behavior of the hanging conhost.exe.
> I've compared the points where conhost.exe is blocked, and all but one
> threads in the model case are doing the same things as in the hanging
> case, but the remaining thread is blocked in
> ReadFile("\Device\NamedPipe\") (i.e. the read end of "hWritePipe" of
> pcon) instead of trying to enter a critical section like thread 1
> above. So now I'm starting to doubt that it's a cygwin bug and not
> some conhost.exe bug.
>
> I'll try to poke around the hanging conhost.exe some more, and also
> may be will try to create a faster reproducer.
>
I've studied the conhost.exe hang, and it indeed looks like conhost.exe
is buggy.

TLDR: https://github.com/microsoft/terminal/pull/12181

The full story:

I dumped conhost.exe, opened the dump in windbg and looked at the stack
trace of the hanging thread:

ntdll!NtWaitForAlertByThreadId+0x14
ntdll!RtlpWaitOnAddressWithTimeout+0x81
ntdll!RtlpWaitOnAddress+0xae
ntdll!RtlpWaitOnCriticalSection+0xfd
ntdll!RtlpEnterCriticalSectionContended+0x1c4
ntdll!RtlEnterCriticalSection+0x42
conhost!Microsoft::Console::Render::Renderer::_PaintFrameForEngine+0x54
conhost!Microsoft::Console::Render::Renderer::TriggerTeardown+0x19e60
conhost!Microsoft::Console::Interactivity::ServiceLocator::RundownAndExit+0x21
conhost!Microsoft::Console::PtySignalInputThread::_GetData+0x65
conhost!Microsoft::Console::PtySignalInputThread::_InputThread+0x25
kernel32!BaseThreadInitThunk+0x14
ntdll!RtlUserThreadStart+0x21

By looking at the assembly, I've found that the thread hangs *after*
ReadFile() on the pipe completes, so the problem is definitely not a
leak of hWritePipe in bash.exe or elsewhere.

Using the function names, I've found this issue:
https://github.com/microsoft/terminal/issues/1810. This is a different
one, but the discussion and the patch show that synchronization on
startup/shutdown is a disaster.

Then I looked at the code and identified that the hang happens while
attempting to lock the console at [1]. After studying how this lock is
used in other parts of the code, I noticed that
PtySignalInputThread::_Shutdown() (which is further up in the call
stack of the hanging function) uses ProcessCtrlEvents() incorrectly,
because the latter unconditionally unlocks the console, but the lock is
never taken by this thread at this point. Then I looked at a more
recent version of the code and discovered the patch to _Shutdown()
which I referenced above (the pull request in the TLDR). I've also
verified that the assembly of _Shutdown() (which is inlined into
PtySignalInputThread::_GetData()) corresponds to the unpatched version
(i.e. without the LockConsole() call):

call conhost!CloseConsoleProcessState (00007ff6`22e7013c)
call conhost!ProcessCtrlEvents (00007ff6`22e262a0)
mov ecx,6Dh
call conhost!Microsoft::Console::Interactivity::ServiceLocator::RundownAndExit (00007ff6`22e3c730)

I'm not sure why this bug is not triggered more frequently, but one
possible reason, as indicated by comment [2], is that the bad path is
only taken if there are live clients after ClosePseudoConsole() is
called, which is probably rare.

A potential workaround on the Cygwin side would be to ensure that the
pseudoconsole doesn't have clients before calling ClosePseudoConsole(),
but I don't know whether it's possible.

[1] https://github.com/microsoft/terminal/blob/9b92986b49bed8cc41fde4d6ef080921c41e6d9e/src/renderer/base/renderer.cpp#L75
[2] https://github.com/microsoft/terminal/blob/9b92986b49bed8cc41fde4d6ef080921c41e6d9e/src/host/PtySignalInputThread.cpp#L205

Alexey
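The lock misuse described above (a callee that releases a lock the
caller never took) can be shown in miniature. This is only a toy sketch
in Python, not conhost's C++ code. The names console_lock,
process_ctrl_events, shutdown_unpatched and shutdown_patched are
hypothetical stand-ins chosen for illustration. One important
difference is assumed here: Python's threading.Lock reports an
unbalanced release with RuntimeError, whereas the Win32 documentation
for LeaveCriticalSection says that leaving a critical section you do
not own is an error that may cause another thread calling
EnterCriticalSection to wait indefinitely, which matches the stack
trace above.

```python
import threading

# Hypothetical stand-in for the console lock; not conhost's real type.
console_lock = threading.Lock()


def process_ctrl_events():
    """Toy callee: assumes the caller already holds console_lock and
    releases it on the caller's behalf before returning."""
    console_lock.release()


def shutdown_unpatched():
    """Buggy order: calls the callee without taking the lock first."""
    process_ctrl_events()


def shutdown_patched():
    """Fixed order: take the lock so the callee's release is balanced."""
    console_lock.acquire()
    process_ctrl_events()


if __name__ == "__main__":
    try:
        shutdown_unpatched()
    except RuntimeError as exc:
        # Python detects the unbalanced release; a Win32 critical
        # section would not raise here.
        print("unpatched order failed:", exc)

    shutdown_patched()
    print("lock still held after patched order:", console_lock.locked())
```

The design point is that the contract "caller must hold the lock" lives
only in the callee's implementation, so every call site has to honor
it. The patched order mirrors the idea of adding a LockConsole() call
before ProcessCtrlEvents().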