From: Takashi Yano
To: cygwin@cygwin.com
Cc: Alexey Izbyshev
Subject: Re: Deadlock of the process tree when running make
Date: Sat, 16 Apr 2022 18:39:10 +0900
Message-Id: <20220416183910.b532b2cc95725b508bfd0991@nifty.ne.jp>

On Thu, 14 Apr 2022 02:17:38 +0300
Alexey Izbyshev wrote:

> On 2022-04-13 19:48, Alexey Izbyshev wrote:
> > On 2022-04-11 13:10, Alexey Izbyshev wrote:
> > What's probably not normal is the behavior of the hanging conhost.exe.
> > I've compared the points where conhost.exe is blocked, and all but one
> > threads in the model case are doing the same things as in the hanging
> > case, but the remaining thread is blocked in
> > ReadFile("\Device\NamedPipe\") (i.e. the read end of "hWritePipe" of
> > pcon) instead of trying to enter a critical section like thread 1
> > above. So now I'm starting to doubt that it's a cygwin bug and not
> > some conhost.exe bug.
> >
> > I'll try to poke around the hanging conhost.exe some more, and also
> > may be will try to create a faster reproducer.
> >
> I've studied conhost.exe hang, and it indeed looks like it's buggy.
>
> TLDR: https://github.com/microsoft/terminal/pull/12181
>
> The full story:
>
> I dumped conhost.exe, opened the dump in windbg and looked at the stack
> trace of the hanging thread:
>
> ntdll!NtWaitForAlertByThreadId+0x14
> ntdll!RtlpWaitOnAddressWithTimeout+0x81
> ntdll!RtlpWaitOnAddress+0xae
> ntdll!RtlpWaitOnCriticalSection+0xfd
> ntdll!RtlpEnterCriticalSectionContended+0x1c4
> ntdll!RtlEnterCriticalSection+0x42
> conhost!Microsoft::Console::Render::Renderer::_PaintFrameForEngine+0x54
> conhost!Microsoft::Console::Render::Renderer::TriggerTeardown+0x19e60
> conhost!Microsoft::Console::Interactivity::ServiceLocator::RundownAndExit+0x21
> conhost!Microsoft::Console::PtySignalInputThread::_GetData+0x65
> conhost!Microsoft::Console::PtySignalInputThread::_InputThread+0x25
> kernel32!BaseThreadInitThunk+0x14
> ntdll!RtlUserThreadStart+0x21
>
> By looking at assembly, I've found that it hangs *after* ReadFile() on
> the pipe completes, so the problem is definitely not a leak of
> hWritePipe in bash.exe or elsewhere.
>
> Using the function names, I've found this issue:
> https://github.com/microsoft/terminal/issues/1810.
>
> This is a different one, but the discussion and the patch shows that
> synchronization on startup/shutdown is a disaster.
>
> Then I looked at the code and identified that hang happens while
> attempting to lock the console at [1]. After studying how this lock is
> used in other parts of the code, I noticed that
> PtySignalInputThread::_Shutdown() (which is further up in the call stack
> of the hanging function) uses ProcessCtrlEvents() incorrectly, because
> the latter unconditionally unlocks the console, but the lock is never
> taken by this thread at this point. Then I looked at a more recent
> version of the code and discovered the patch to _Shutdown() which I
> referenced above.
>
> I've also verified that assembly of _Shutdown() (which is inlined into
> PtySignalInputThread::_GetData()) corresponds to the unpatched version
> (i.e.
> without LockConsole() call):
>
> call conhost!CloseConsoleProcessState (00007ff6`22e7013c)
> call conhost!ProcessCtrlEvents (00007ff6`22e262a0)
> mov ecx,6Dh
> call conhost!Microsoft::Console::Interactivity::ServiceLocator::RundownAndExit (00007ff6`22e3c730)
>
> I'm not sure why this bug is not triggered more frequently, but one
> possible reason, as indicated by comment [2], is that the bad path is
> only taken if there are live clients after ClosePseudoConsole() is
> called, which is probably rare.
>
> A potential workaround on Cygwin side would be to ensure that the
> pseudoconsole doesn't have clients before calling ClosePseudoConsole(),
> but I don't know whether it's possible.

I am not yet sure what the essential problem is, but the current code
closes the pseudo console only if no other process is attached to it.
I wonder why javac.exe remains as a zombie.

The parent bash.exe calls ClosePseudoConsole() when the child non-cygwin
app has terminated, i.e., after WaitForSingleObject() on the child
process handle returns.
https://www.cygwin.com/git/?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/spawn.cc;h=81dba5a941e919ea2514013069aef22c6fad8004;hb=7ac0767053e278f0ce9811bf6f77278bd2f49c20#l1009

What does "zombie" mean here? Is it listed in the process list of
ProcessHacker?

I still suspect that the zombie javac.exe holds the hWritePipe handle
leaked from the parent bash.exe.

> [1]
> https://github.com/microsoft/terminal/blob/9b92986b49bed8cc41fde4d6ef080921c41e6d9e/src/renderer/base/renderer.cpp#L75
>
> [2]
> https://github.com/microsoft/terminal/blob/9b92986b49bed8cc41fde4d6ef080921c41e6d9e/src/host/PtySignalInputThread.cpp#L205

-- 
Takashi Yano