public inbox for cygwin@cygwin.com
 help / color / mirror / Atom feed
* Instability with signals and threads
@ 2014-11-19 16:43 Mikulas Patocka
  2014-11-19 17:30 ` Mikulas Patocka
  2014-11-20 10:00 ` Corinna Vinschen
  0 siblings, 2 replies; 18+ messages in thread
From: Mikulas Patocka @ 2014-11-19 16:43 UTC (permalink / raw)
  To: cygwin

Hi

I have a program that sets a repetitive timer with setitimer and spawns 
several threads.

The program is very unstable on cygwin, it locks up in few minutes.

The bug manifests itself in the following way: the signal thread calls 
cygheap->find_tls to find a thread to deliver the signal to. find_tls 
generates an exception when scanning the threadlist, jumps to the __except 
block and calls threadlist[idx]->remove(INFINITE).

The method threadlist[idx]->remove is called with invalid "this" pointer 
(sometimes it is zero, sometimes it points to unmapped memory), generates 
another exception on "initialized = 0" line and becomes stuck on this 
assignment.

I found out that when I modify the remove_tls method so that it always 
acquires the lock and removes the thread from the threadlist (change 
"tls_sentry here(wait)" to "tls_sentry here(INFINITE)"), the bug goes away 
and the multithreaded program is stable.

Alternativelly - the crash can be fixed if we change "_my_tls.remove (0)" 
to "_my_tls.remove (INFINITE)" in thread_wrapper (though, there is another 
_my_tls.remove (0) call in dll_entry in winsup/cygwin/init.cc and it could 
trigger the same crash)


I'd like to ask - what's the reason for not waiting for the lock in 
remove_tls? If the lock is already locked, remove_tls does nothing - but 
the _cygtls structure is freed anyway, so that there is dangling pointer 
no the thread list. Do you think that we can drop this "wait" argument and 
always wait for the lock in remove_tls?



Another possible bug - when find_tls exits, it drops the tls_sentry lock 
and returns the pointer to _cygtls. What happens if the thread owning the 
tls exits at this point? It seems that there is nothing that prevents it 
from exiting and that the caller of find_tls (sigpacket::process) will 
work with a pointer to invalid thread. It seems that we need to add some 
reference count to _cygtls to prevent it from disappearing while we are 
trying to send a signal to it. (or keep tls_sentry::lock locked until 
sigpacket::process is done with the signal, though I don't know if keeping 
the lock for so long won't cause deadlocks).

Mikulas

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Instability with signals and threads
@ 2014-11-20 20:24 Mikulas Patocka
  2014-11-21  4:24 ` Mikulas Patocka
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Mikulas Patocka @ 2014-11-20 20:24 UTC (permalink / raw)
  To: Corinna Vinschen; +Cc: cygwin

> Never mind that.  I can fix your testcase by calling _my_tls.remove with 
> INFINITE as parameter in both places.  If I drop one of them, your 
> testcase will invariable fail at one point.  With both INFINITE params 
> in place, your testcase is now running half an hour without problems.

For me, this change doesn't fix the testcase, it just reduces the 
probability that it hangs.

With this change, the testcase still locks up, but with a different 
stacktrace:
thread1:
        Sleep
        _yield
        pthread::create
        sigdelayed ??
        _cygwin_exit_return ??
        _cygtls::call2

thread2:
        SetEvent
        muto::release
        init_cygheap::find_tls
        _cygtls::init_thread

thread3:
        WriteFile
        sig_send
        timer_thread
        cygthread::callfunc
        cygthread::stub
        _cygtls::call2

thread4:
        VirtualFree 
        thread_wrapper

thread5:
        only ntdll stuff

So, apparently, there is another bug, where thread->cygtls isn't being set 
and pthread::create loops endlessly calling yield.

> Thinking about it, the fact that _cygtls::remove allows to apply a 
> non-INFINITE wait is rather strange, isn't it?  Calling remove_tls with 
> a 0 wait, it allows to return the function silently, without actually 
> having removed the thread from the list.  This is bound to go downhill 
> at one point and looks like a kludge to me to circumvent some potential 
> hang in another situation...

Looking at CVS history, the "wait" argument was added to cygtls.cc version 
1.2 with a comment: "Add a 'wait' argument to control how long we wait for 
a lock before removing." There is no explanation why is it needed.

> I'm not exactly sure if that works as intended.  I will apply this patch 
> and create a new Cygwin snapshot on https://cygwin.com/snapshots/ in a 
> couple of minutes.  I'd appreciate if you and others would give it an 
> exhaustive test.  New spurious hangs or SEGVs in other situations which 
> so far worked fine would be good indicators for another problem in the 
> code.

Yes, I think it's correct to remove the wait argument.

> Other than that, there's certainly some room for improvement.  Calling 
> threadlist[idx]->remove from the find_tls exception handler looks 
> extremly hairy to me.  I wonder if that should be called at all at this 
> point, or if there shouldn't be better some "simplified" removal 
> operation which doesn't require the _cygtls pointer.  If the thread 
> doesn't exist anymore, so does its _cygtls area.

I suggest to remove that exception handler at all. This thing can't ever 
work reliably - it could reduce probability of crashes but not eliminate 
them. Even if we handled the page fault correctly - what happens if some 
other thread allocates a different object at the location that belonged to 
the tls before? - then find_tls thinks that this different object is tls 
and corrupts it.

I suggest to remove the exception handler and if it results in any 
regressions, fix them properly.

Mikulas

> Thanks, Corinna -- 
> Corinna Vinschen

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2014-12-04  9:42 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-11-19 16:43 Instability with signals and threads Mikulas Patocka
2014-11-19 17:30 ` Mikulas Patocka
2014-11-20 10:00 ` Corinna Vinschen
2014-11-20 16:22   ` Corinna Vinschen
2014-11-20 20:24 Mikulas Patocka
2014-11-21  4:24 ` Mikulas Patocka
2014-11-21 10:28 ` Corinna Vinschen
2014-11-21 10:38   ` Corinna Vinschen
2014-11-21 14:43 ` Mikulas Patocka
2014-11-21 14:46   ` Mikulas Patocka
2014-11-21 14:47   ` Corinna Vinschen
2014-11-21 16:30     ` Marco Atzeri
2014-11-21 16:37       ` Corinna Vinschen
2014-11-21 20:36         ` Yaakov Selkowitz
2014-11-21 20:44           ` Corinna Vinschen
2014-11-28 21:13             ` Yaakov Selkowitz
2014-11-28 22:35     ` Corinna Vinschen
2014-12-04  9:42       ` Corinna Vinschen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).