From: Corinna Vinschen
To: cygwin@cygwin.com
Reply-To: cygwin@cygwin.com
Date: Thu, 20 Nov 2014 16:22:00 -0000
Subject: Re: Instability with signals and threads
Message-ID: <20141120162210.GZ3810@calimero.vinschen.de>
In-Reply-To: <20141120100001.GL3810@calimero.vinschen.de>

On Nov 20 11:00, Corinna Vinschen wrote:
> Hi Mikulas,
>
> On Nov 19 17:42, Mikulas Patocka wrote:
> > Hi
> >
> > I have a program that sets a repetitive timer with setitimer and spawns
> > several threads.
> >
> > The program is very unstable on cygwin, it locks up in few minutes.
> >
> > The bug manifests itself in the following way: the signal thread calls
> > cygheap->find_tls to find a thread to deliver the signal to. find_tls
> > generates an exception when scanning the threadlist, jumps to the __except
> > block and calls threadlist[idx]->remove(INFINITE).
> >
> > The method threadlist[idx]->remove is called with invalid "this" pointer
> > (sometimes it is zero, sometimes it points to unmapped memory), generates
> > another exception on "initialized = 0" line and becomes stuck on this
> > assignment.
>
> Now that you mention it, it makes sense.  The exception gets triggered
> by accessing an invalid member of threadlist.  Using the very same
> member in another method call looks.... borderline, to say the least.
>
> > I found out that when I modify the remove_tls method so that it always
> > acquires the lock and removes the thread from the threadlist (change
> > "tls_sentry here(wait)" to "tls_sentry here(INFINITE)"), the bug goes away
> > and the multithreaded program is stable.
>
> [Noted your augmenting comment preceding the testcase in your other mail]
>
> > Alternatively - the crash can be fixed if we change "_my_tls.remove (0)"
> > to "_my_tls.remove (INFINITE)" in thread_wrapper (though, there is another
> > _my_tls.remove (0) call in dll_entry in winsup/cygwin/init.cc and it could
> > trigger the same crash)
>
> I don't think so.  In dll_init, the call is done inside a DLL_THREAD_DETACH
> for this very thread, so &_my_tls is still a valid pointer.

Never mind that.  I can fix your testcase by calling _my_tls.remove with
INFINITE as parameter in both places.  If I drop one of them, your
testcase will invariably fail at some point.  With both INFINITE params
in place, your testcase has now been running for half an hour without
problems.

Thinking about it, the fact that _cygtls::remove allows a non-INFINITE
wait is rather strange, isn't it?
When remove_tls is called with a 0 wait, the function can return silently
without actually having removed the thread from the list.  This is bound
to go downhill at some point and looks like a kludge to me, meant to
circumvent some potential hang in another situation...  I'm not exactly
sure if that works as intended.

I will apply this patch and create a new Cygwin snapshot on
https://cygwin.com/snapshots/ in a couple of minutes.  I'd appreciate it
if you and others would give it an exhaustive test.  New spurious hangs or
SEGVs in situations which have worked fine so far would be good
indicators of another problem in the code.

Other than that, there's certainly some room for improvement.  Calling
threadlist[idx]->remove from the find_tls exception handler looks
extremely hairy to me.  I wonder if that should be called at all at this
point, or whether there should rather be some "simplified" removal
operation which doesn't require the _cygtls pointer.  If the thread
doesn't exist anymore, neither does its _cygtls area.

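To make the concern about the 0 wait concrete, here is a minimal sketch.
It is not the Cygwin code: ThreadList, FakeTls, kInfinite, with_lock and
run_demo are hypothetical names made up for illustration, and a
std::timed_mutex merely stands in for whatever lock tls_sentry takes.  It
only shows the general pattern: a removal that gives up on a busy lock
leaves the pointer on the list, while a blocking removal does not.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <future>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a per-thread area; NOT Cygwin's _cygtls.
struct FakeTls { int id; };

// Hypothetical registry of live threads, guarded by a timed lock.
class ThreadList {
  std::timed_mutex lock_;
  std::vector<FakeTls *> entries_;

public:
  static constexpr int kInfinite = -1;  // placeholder for INFINITE

  void add(FakeTls *t) {
    std::lock_guard<std::timed_mutex> guard(lock_);
    entries_.push_back(t);
  }

  // Returns true only if the lock was taken and the entry erased.
  // wait_ms == 0 tries once; kInfinite blocks until the lock is free.
  bool remove(FakeTls *t, int wait_ms) {
    std::unique_lock<std::timed_mutex> guard(lock_, std::defer_lock);
    if (wait_ms == kInfinite)
      guard.lock();
    else if (!guard.try_lock_for(std::chrono::milliseconds(wait_ms)))
      return false;  // gave up: the pointer is still on the list
    entries_.erase(std::remove(entries_.begin(), entries_.end(), t),
                   entries_.end());
    return true;
  }

  bool contains(FakeTls *t) {
    std::lock_guard<std::timed_mutex> guard(lock_);
    return std::find(entries_.begin(), entries_.end(), t) != entries_.end();
  }

  // Runs f while holding the lock, to imitate a thread scanning the list.
  template <class F> void with_lock(F f) {
    std::lock_guard<std::timed_mutex> guard(lock_);
    f();
  }
};

struct DemoResult {
  bool zero_wait_removed;
  bool still_listed_after_zero_wait;
  bool infinite_removed;
  bool listed_at_end;
};

DemoResult run_demo() {
  ThreadList list;
  FakeTls tls{1};
  list.add(&tls);

  std::promise<void> locked, release;
  std::future<void> locked_f = locked.get_future();
  std::future<void> release_f = release.get_future();

  // The "scanner" holds the list lock until it is told to let go.
  std::thread scanner([&] {
    list.with_lock([&] {
      locked.set_value();
      release_f.wait();
    });
  });
  locked_f.wait();

  DemoResult r{};
  r.zero_wait_removed = list.remove(&tls, 0);  // lock is busy here
  release.set_value();
  scanner.join();
  r.still_listed_after_zero_wait = list.contains(&tls);
  r.infinite_removed = list.remove(&tls, ThreadList::kInfinite);
  r.listed_at_end = list.contains(&tls);
  return r;
}
```

In this sketch the caller of remove with a 0 wait would have to check the
return value and retry.  If it ignores the result and goes on to free the
per-thread area, the list keeps a stale pointer, which is the kind of
state a later scan would trip over.
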
Thanks,
Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat