From: Tom Honermann
Date: Thu, 14 Nov 2013 04:02:00 -0000
Subject: Re: Intermittent failures retrieving process exit codes
Message-ID: <52844B2E.5050902@coverity.com>
In-Reply-To: <50D401EF.9040705@coverity.com>
References: <50C2498C.2000003@coverity.com> <50C276AC.9090301@mailme.ath.cx> <50D401EF.9040705@coverity.com>
Mail-Followup-To: cygwin@cygwin.com

On 12/21/2012 01:30 AM, Tom Honermann wrote:
> I spent most of the week debugging this issue. This appears to be a
> defect in Windows. I can reproduce the issue without Cygwin. I can't
> rule out other third party kernel mode software possibly contributing to
> the issue. A simple change to Cygwin works around the problem for me.
>
> I don't know which Windows releases are affected by this. I've only
> reproduced the problem (outside of Cygwin) with Wow64 processes running
> on 64-bit Windows 7. I haven't yet tried elsewhere.
>
> The problem appears to be a race condition involving concurrent calls to
> TerminateProcess() and ExitThread(). The example code below minimally
> mimics the threads created and exit process/thread calls that are
> performed when running Cygwin's false.exe. The primary thread exits the
> process via TerminateProcess() ala pinfo::exit() in
> winsup/cygwin/pinfo.cc. The secondary thread exits itself via
> ExitThread() ala Cygwin's signal processing thread function, wait_sig(),
> in winsup/cygwin/sigproc.cc.
>
> When the race condition results in the undesirable outcome, the exit
> code for the process is set to the exit code for the secondary thread's
> call to ExitThread(). I can only speculate at this point, but my guess
> is that the TerminateProcess() code disassociates the calling thread
> from the process before other threads are stopped such that
> ExitThread(), concurrently running in another thread, may determine that
> the calling thread is the last thread of the process and overwrite the
> process exit code.
>
> The issue also reproduces if ExitProcess() is called in place of
> TerminateProcess(). The test case below only uses TerminateProcess()
> because that is what Cygwin does.
>
> Source code to reproduce the issue follows. Again, Cygwin is not
> required to reproduce the problem. For my own testing, I compiled the
> code using Microsoft's Visual Studio 2010 x86 compiler with the command
> 'cl /Fetest-exit-code.exe test-exit-code.cpp'
>
> test-exit-code.cpp:
>
> #include <windows.h>
> #include <stdio.h>
> #include <stdlib.h>
>
> DWORD WINAPI SecondaryThread(
>     LPVOID lpParameter)
> {
>     Sleep(1);
>     ExitThread(2);
> }
>
> int main() {
>     HANDLE hSecondaryThread = CreateThread(
>         NULL,             // lpThreadAttributes
>         0,                // dwStackSize
>         SecondaryThread,  // lpStartAddress
>         (LPVOID)0,        // lpParameter
>         0,                // dwCreationFlags
>         NULL);            // lpThreadId
>     if (!hSecondaryThread) {
>         fprintf(stderr, "CreateThread failed. GLE=%lu\n",
>             (unsigned long)GetLastError());
>         exit(127);
>     }
>
>     Sleep(1);
>
>     if (!TerminateProcess(GetCurrentProcess(), 1)) {
>         fprintf(stderr, "TerminateProcess failed. GLE=%lu\n",
>             (unsigned long)GetLastError());
>         exit(127);
>     }
>
>     return 0;
> }
>
>
> To run the test, a simple .bat file is used:
>
> test.bat:
>
> @echo off
> setlocal
>
> :loop
> echo test...
> test-exit-code.exe
> if %ERRORLEVEL% NEQ 1 (
>     echo test-exit-code.exe returned %ERRORLEVEL%
>     exit /B 1
> )
> goto loop
>
>
> test.bat should run indefinitely.
> The amount of time it takes to fail
> on my machine (64-bit Windows 7 running in a VMware Workstation 8 VM
> under Kubuntu 12.04 on a Lenovo T420 Intel i7-2640M 2 processor laptop)
> varies considerably. I had one run fail in less than 10 iterations, but
> most of the time it has taken upwards of 5 minutes to get a failure.
>
> The workaround I implemented within Cygwin was simple and sloppy. I
> added a call to Sleep(1000) immediately before the call to ExitThread()
> in wait_sig() in winsup/cygwin/sigproc.cc. Since this thread (probably)
> doesn't exit until the process is exiting anyway, the call to Sleep()
> does not adversely affect shutdown. The thread just gets terminated
> while in the call to Sleep() instead of exiting before the process is
> terminated or getting terminated while still in the call to
> ExitThread(). A better solution might be to avoid the thread exiting at
> all (so long as it can't get terminated while holding critical
> resources), or to have the process exiting thread wait on it. Neither
> of these is ideal. Orderly shutdown of multi-threaded processes is
> really hard to do correctly on Windows.
>
> Since the exit code for the signal processing thread is not used, having
> the wait_sig() thread (and any other threads that could potentially
> concurrently exit with another thread) exit with a special status value
> such as STATUS_THREAD_IS_TERMINATING (0xC000004BL) would enable
> diagnosis of this issue as any process exit code matching this would be
> a likely indicator that this issue was encountered.
>
> As is, when this race condition results in the undesirable outcome,
> since the signal processing thread exits with a status of 0, the exit
> status of the process is 0. This explains why false.exe works so well
> to reproduce the issue. It would be impossible to produce a negative
> test using true.exe.
>
> Tom.

Time passes...

I worked with some former colleagues to report this issue to Microsoft.
Windows 8.1 and Windows Server 2012 R2 contain a fix that addresses the
test case above. A hotfix has been made available for Windows 7 SP1 and
Windows Server 2008 R2. Should anyone desire a hotfix for other
versions of Windows, it will be necessary to open a case with Microsoft
to request it.

http://support.microsoft.com/kb/2875501

Tom.
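
For systems without the fix, the diagnostic idea from the quoted message (have helper threads exit with STATUS_THREAD_IS_TERMINATING so that a leaked thread exit code is recognizable) could be checked on the parent side roughly as follows. This is only a sketch under assumptions: classify_exit_code and ExitCodeVerdict are hypothetical names, not Cygwin or Windows APIs, and it assumes the child's helper threads have already been changed to exit with that status. The constant is repeated locally so the snippet compiles on any platform.

```cpp
#include <cstdint>

// NTSTATUS value named in the quoted message (STATUS_THREAD_IS_TERMINATING).
// Repeated here as a plain constant so the sketch does not need Windows headers.
constexpr std::uint32_t kStatusThreadIsTerminating = 0xC000004Bu;

// Hypothetical verdicts for an observed process exit code.
enum class ExitCodeVerdict { Expected, LikelyExitRace, OtherFailure };

// Hypothetical helper: classify the exit code a parent observed for a child.
// Assumes the child's helper threads exit with kStatusThreadIsTerminating,
// so seeing that value as the *process* exit code suggests a thread exit
// code overwrote the one passed to TerminateProcess()/ExitProcess().
constexpr ExitCodeVerdict classify_exit_code(std::uint32_t observed,
                                             std::uint32_t expected) {
    return observed == kStatusThreadIsTerminating ? ExitCodeVerdict::LikelyExitRace
         : observed == expected                   ? ExitCodeVerdict::Expected
                                                  : ExitCodeVerdict::OtherFailure;
}
```

A test harness for the case above might pass the value obtained from GetExitCodeProcess() together with the expected code 1; a LikelyExitRace verdict would then point at this race rather than at a genuine exit status of the program.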