Re: Instability with signals and threads

public inbox for cygwin@cygwin.com
 help / color / mirror / Atom feed

* Re: Instability with signals and threads
@ 2014-11-20 20:24 Mikulas Patocka
  2014-11-21  4:24 ` Mikulas Patocka
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Mikulas Patocka @ 2014-11-20 20:24 UTC (permalink / raw)
  To: Corinna Vinschen; +Cc: cygwin

> Never mind that.  I can fix your testcase by calling _my_tls.remove with 
> INFINITE as parameter in both places.  If I drop one of them, your 
> testcase will invariable fail at one point.  With both INFINITE params 
> in place, your testcase is now running half an hour without problems.

For me, this change doesn't fix the testcase, it just reduces the 
probability that it hangs.

With this change, the testcase still locks up, but with a different 
stacktrace:
thread1:
        Sleep
        _yield
        pthread::create
        sigdelayed ??
        _cygwin_exit_return ??
        _cygtls::call2

thread2:
        SetEvent
        muto::release
        init_cygheap::find_tls
        _cygtls::init_thread

thread3:
        WriteFile
        sig_send
        timer_thread
        cygthread::callfunc
        cygthread::stub
        _cygtls::call2

thread4:
        VirtualFree 
        thread_wrapper

thread5:
        only ntdll stuff

So, apparently, there is another bug, where thread->cygtls isn't being set 
and pthread::create loops endlessly calling yield.

> Thinking about it, the fact that _cygtls::remove allows to apply a 
> non-INFINITE wait is rather strange, isn't it?  Calling remove_tls with 
> a 0 wait, it allows to return the function silently, without actually 
> having removed the thread from the list.  This is bound to go downhill 
> at one point and looks like a kludge to me to circumvent some potential 
> hang in another situation...

Looking at CVS history, the "wait" argument was added to cygtls.cc version 
1.2 with a comment: "Add a 'wait' argument to control how long we wait for 
a lock before removing." There is no explanation why is it needed.

> I'm not exactly sure if that works as intended.  I will apply this patch 
> and create a new Cygwin snapshot on https://cygwin.com/snapshots/ in a 
> couple of minutes.  I'd appreciate if you and others would give it an 
> exhaustive test.  New spurious hangs or SEGVs in other situations which 
> so far worked fine would be good indicators for another problem in the 
> code.

Yes, I think it's correct to remove the wait argument.

> Other than that, there's certainly some room for improvement.  Calling 
> threadlist[idx]->remove from the find_tls exception handler looks 
> extremly hairy to me.  I wonder if that should be called at all at this 
> point, or if there shouldn't be better some "simplified" removal 
> operation which doesn't require the _cygtls pointer.  If the thread 
> doesn't exist anymore, so does its _cygtls area.

I suggest to remove that exception handler at all. This thing can't ever 
work reliably - it could reduce probability of crashes but not eliminate 
them. Even if we handled the page fault correctly - what happens if some 
other thread allocates a different object at the location that belonged to 
the tls before? - then find_tls thinks that this different object is tls 
and corrupts it.

I suggest to remove the exception handler and if it results in any 
regressions, fix them properly.

Mikulas

> Thanks, Corinna -- 
> Corinna Vinschen

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Instability with signals and threads
  2014-11-20 20:24 Instability with signals and threads Mikulas Patocka
@ 2014-11-21  4:24 ` Mikulas Patocka
  2014-11-21 10:28 ` Corinna Vinschen
  2014-11-21 14:43 ` Mikulas Patocka
  2 siblings, 0 replies; 18+ messages in thread
From: Mikulas Patocka @ 2014-11-21  4:24 UTC (permalink / raw)
  To: Corinna Vinschen; +Cc: cygwin

> Never mind that.  I can fix your testcase by calling _my_tls.remove with 
> INFINITE as parameter in both places.  If I drop one of them, your 
> testcase will invariable fail at one point.  With both INFINITE params 
> in place, your testcase is now running half an hour without problems.

For me, this change doesn't fix the testcase, it just reduces the 
probability that it hangs.

With this change, the testcase still locks up, but with a different 
stacktrace:
thread1:
        Sleep
        _yield
        pthread::create
        sigdelayed ??
        _cygwin_exit_return ??
        _cygtls::call2

thread2:
        SetEvent
        muto::release
        init_cygheap::find_tls
        _cygtls::init_thread

thread3:
        WriteFile
        sig_send
        timer_thread
        cygthread::callfunc
        cygthread::stub
        _cygtls::call2

thread4:
        VirtualFree 
        thread_wrapper

thread5:
        only ntdll stuff

So, apparently, there is another bug, where thread->cygtls isn't being set 
and pthread::create loops endlessly calling yield.

> Thinking about it, the fact that _cygtls::remove allows to apply a 
> non-INFINITE wait is rather strange, isn't it?  Calling remove_tls with 
> a 0 wait, it allows to return the function silently, without actually 
> having removed the thread from the list.  This is bound to go downhill 
> at one point and looks like a kludge to me to circumvent some potential 
> hang in another situation...

Looking at CVS history, the "wait" argument was added to cygtls.cc version 
1.2 with a comment: "Add a 'wait' argument to control how long we wait for 
a lock before removing." There is no explanation why is it needed.

> I'm not exactly sure if that works as intended.  I will apply this patch 
> and create a new Cygwin snapshot on https://cygwin.com/snapshots/ in a 
> couple of minutes.  I'd appreciate if you and others would give it an 
> exhaustive test.  New spurious hangs or SEGVs in other situations which 
> so far worked fine would be good indicators for another problem in the 
> code.

Yes, I think it's correct to remove the wait argument.

> Other than that, there's certainly some room for improvement.  Calling 
> threadlist[idx]->remove from the find_tls exception handler looks 
> extremly hairy to me.  I wonder if that should be called at all at this 
> point, or if there shouldn't be better some "simplified" removal 
> operation which doesn't require the _cygtls pointer.  If the thread 
> doesn't exist anymore, so does its _cygtls area.

I suggest to remove that exception handler at all. This thing can't ever 
work reliably - it could reduce probability of crashes but not eliminate 
them. Even if we handled the page fault correctly - what happens if some 
other thread allocates a different object at the location that belonged to 
the tls before? - then find_tls thinks that this different object is tls 
and corrupts it.

I suggest to remove the exception handler and if it results in any 
regressions, fix them properly.

Mikulas

> Thanks, Corinna -- 
> Corinna Vinschen

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Instability with signals and threads
  2014-11-20 20:24 Instability with signals and threads Mikulas Patocka
  2014-11-21  4:24 ` Mikulas Patocka
@ 2014-11-21 10:28 ` Corinna Vinschen
  2014-11-21 10:38   ` Corinna Vinschen
  2014-11-21 14:43 ` Mikulas Patocka
  2 siblings, 1 reply; 18+ messages in thread
From: Corinna Vinschen @ 2014-11-21 10:28 UTC (permalink / raw)
  To: cygwin

[-- Attachment #1: Type: text/plain, Size: 4478 bytes --]

On Nov 20 21:22, Mikulas Patocka wrote:
> > Never mind that.  I can fix your testcase by calling _my_tls.remove with 
> > INFINITE as parameter in both places.  If I drop one of them, your 
> > testcase will invariable fail at one point.  With both INFINITE params 
> > in place, your testcase is now running half an hour without problems.
> 
> For me, this change doesn't fix the testcase, it just reduces the 
> probability that it hangs.
> 
> With this change, the testcase still locks up, but with a different 
> stacktrace:
> thread1:
>         Sleep
>         _yield
>         pthread::create
>         sigdelayed ??
>         _cygwin_exit_return ??
>         _cygtls::call2
> 
> thread2:
>         SetEvent
>         muto::release
>         init_cygheap::find_tls
>         _cygtls::init_thread
> 
> thread3:
>         WriteFile
>         sig_send
>         timer_thread
>         cygthread::callfunc
>         cygthread::stub
>         _cygtls::call2
> 
> thread4:
>         VirtualFree 
>         thread_wrapper
> 
> thread5:
>         only ntdll stuff

Do you use a DLL built with optimization by any chance?  I wouldn't
take the backtraces too serious in that case.  For debugging it helps
a lot to use a Cygwin DLL built without -O2.  Btw., are you testing
on 32 or 64 bit?  I'm testing on 64 bit.

I can't reproduce your backtrace, but I can reproduce another one, which
is related to thread_exit.  At one point after a couple thousand runs
through your testcase I have a variable number of threads hanging in
thread_exit, and a timer thread which is unable to send its signal.  the
other threads all hang in thread_exit, waiting for a muto which is taken
by a thread which doesn't exist anymore.  That's a very serious downside
of the muto implementation not being able to recognize being abandoned.
I wonder if that shouldn't be using a real OS mutex.

As a sidenote, the snapshot doesn't work well in other scenarios, too,
apparently.  Yaakov reported hangs in KDE :(

> > Thinking about it, the fact that _cygtls::remove allows to apply a 
> > non-INFINITE wait is rather strange, isn't it?  Calling remove_tls with 
> > a 0 wait, it allows to return the function silently, without actually 
> > having removed the thread from the list.  This is bound to go downhill 
> > at one point and looks like a kludge to me to circumvent some potential 
> > hang in another situation...
> 
> Looking at CVS history, the "wait" argument was added to cygtls.cc version 
> 1.2 with a comment: "Add a 'wait' argument to control how long we wait for 
  ^^^
  Wow.  So that's really old, more than 10 years.

> > Other than that, there's certainly some room for improvement.  Calling 
> > threadlist[idx]->remove from the find_tls exception handler looks 
> > extremly hairy to me.  I wonder if that should be called at all at this 
> > point, or if there shouldn't be better some "simplified" removal 
> > operation which doesn't require the _cygtls pointer.  If the thread 
> > doesn't exist anymore, so does its _cygtls area.
> 
> I suggest to remove that exception handler at all. This thing can't ever 
> work reliably - it could reduce probability of crashes but not eliminate 
> them. Even if we handled the page fault correctly - what happens if some 
> other thread allocates a different object at the location that belonged to 
> the tls before? - then find_tls thinks that this different object is tls 
> and corrupts it.

My point exactly.  AFAICS, the problem is that the cygtls area of a
thread is on the thread's own stack.  While this looks neat in the first
place, and works fine in most scenarios, the problem is that it gets
destroyed by the OS as soon as the thread exits.  So there's a chance
that another thread using the cygtls area of this thread (the signal
thread for instance) may end up with pointers into nirvana or, as you
point out, space taken for completely different tasks.

In the short term it's impossible to fix this thoroughly I guess,
because this requires a very careful overhaul of the cygtls handling.
What we need is a cygtls area which is created at thread start, but
which can be locked in memory as long as it's required by any thread.
Some synchronization is required.

Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Instability with signals and threads
  2014-11-21 10:28 ` Corinna Vinschen
@ 2014-11-21 10:38   ` Corinna Vinschen
  0 siblings, 0 replies; 18+ messages in thread
From: Corinna Vinschen @ 2014-11-21 10:38 UTC (permalink / raw)
  To: cygwin

[-- Attachment #1: Type: text/plain, Size: 816 bytes --]

On Nov 21 11:15, Corinna Vinschen wrote:
> I can't reproduce your backtrace, but I can reproduce another one, which
> is related to thread_exit.  At one point after a couple thousand runs
> through your testcase I have a variable number of threads hanging in
> thread_exit, and a timer thread which is unable to send its signal.  the
> other threads all hang in thread_exit, waiting for a muto which is taken
> by a thread which doesn't exist anymore.  That's a very serious downside
> of the muto implementation not being able to recognize being abandoned.
> I wonder if that shouldn't be using a real OS mutex.

s/thread_exit/exit_thread/g


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Instability with signals and threads
  2014-11-20 20:24 Instability with signals and threads Mikulas Patocka
  2014-11-21  4:24 ` Mikulas Patocka
  2014-11-21 10:28 ` Corinna Vinschen
@ 2014-11-21 14:43 ` Mikulas Patocka
  2014-11-21 14:46   ` Mikulas Patocka
  2014-11-21 14:47   ` Corinna Vinschen
  2 siblings, 2 replies; 18+ messages in thread
From: Mikulas Patocka @ 2014-11-21 14:43 UTC (permalink / raw)
  To: Corinna Vinschen; +Cc: cygwin

> Do you use a DLL built with optimization by any chance?  I wouldn't take 
> the backtraces too serious in that case.  For debugging it helps a lot 
> to use a Cygwin DLL built without -O2.

I use optimization. The stacktrace may contain some other functions that 
already finished but left the address on the stack.

> Btw., are you testing on 32 or 64 bit?

On 32-bit. The rebuild of cygwin1.dll requires large number of packages to 
create the documentation (including tex and java) and I haven't bloated 
the 64-bit cygwin installation with them yet. I wish it were possible to 
build the library without documentation and without such big dependecies.

> I'm testing on 64 bit. I can't reproduce your backtrace, but I can 
> reproduce another one, which is related to thread_exit.  At one point 
> after a couple thousand runs through your testcase I have a variable 
> number of threads hanging in thread_exit, and a timer thread which is 
> unable to send its signal.  the other threads all hang in thread_exit, 
> waiting for a muto which is taken by a thread which doesn't exist 
> anymore.

So you can - just for debugging - add a counter to thread local storage 
that is incremented when muto is taken and decremented when muto is 
released. If the thread exists, test the counter, if it is non-zero, print 
the backtrace or attach the debugger.

> That's a very serious downside of the muto implementation not 
> being able to recognize being abandoned. I wonder if that shouldn't be 
> using a real OS mutex.

That would hide the problem that a thread is exiting with locked muto, but 
not fix it.

> As a sidenote, the snapshot doesn't work well in 
> other scenarios, too, apparently.  Yaakov reported hangs in KDE :(

Mikulas

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Instability with signals and threads
  2014-11-21 14:43 ` Mikulas Patocka
@ 2014-11-21 14:46   ` Mikulas Patocka
  2014-11-21 14:47   ` Corinna Vinschen
  1 sibling, 0 replies; 18+ messages in thread
From: Mikulas Patocka @ 2014-11-21 14:46 UTC (permalink / raw)
  To: Corinna Vinschen; +Cc: cygwin

> Do you use a DLL built with optimization by any chance?  I wouldn't take 
> the backtraces too serious in that case.  For debugging it helps a lot 
> to use a Cygwin DLL built without -O2.

I use optimization. The stacktrace may contain some other functions that 
already finished but left the address on the stack.

> Btw., are you testing on 32 or 64 bit?

On 32-bit. The rebuild of cygwin1.dll requires large number of packages to 
create the documentation (including tex and java) and I haven't bloated 
the 64-bit cygwin installation with them yet. I wish it were possible to 
build the library without documentation and without such big dependecies.

> I'm testing on 64 bit. I can't reproduce your backtrace, but I can 
> reproduce another one, which is related to thread_exit.  At one point 
> after a couple thousand runs through your testcase I have a variable 
> number of threads hanging in thread_exit, and a timer thread which is 
> unable to send its signal.  the other threads all hang in thread_exit, 
> waiting for a muto which is taken by a thread which doesn't exist 
> anymore.

So you can - just for debugging - add a counter to thread local storage 
that is incremented when muto is taken and decremented when muto is 
released. If the thread exists, test the counter, if it is non-zero, print 
the backtrace or attach the debugger.

> That's a very serious downside of the muto implementation not 
> being able to recognize being abandoned. I wonder if that shouldn't be 
> using a real OS mutex.

That would hide the problem that a thread is exiting with locked muto, but 
not fix it.

> As a sidenote, the snapshot doesn't work well in 
> other scenarios, too, apparently.  Yaakov reported hangs in KDE :(

Mikulas

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Instability with signals and threads
  2014-11-21 14:43 ` Mikulas Patocka
  2014-11-21 14:46   ` Mikulas Patocka
@ 2014-11-21 14:47   ` Corinna Vinschen
  2014-11-21 16:30     ` Marco Atzeri
  2014-11-28 22:35     ` Corinna Vinschen
  1 sibling, 2 replies; 18+ messages in thread
From: Corinna Vinschen @ 2014-11-21 14:47 UTC (permalink / raw)
  To: cygwin

[-- Attachment #1: Type: text/plain, Size: 3361 bytes --]

[Please don't CC me, just send mail to the list.  Thank you]

On Nov 21 15:11, Mikulas Patocka wrote:
> > Do you use a DLL built with optimization by any chance?  I wouldn't take 
> > the backtraces too serious in that case.  For debugging it helps a lot 
> > to use a Cygwin DLL built without -O2.
> 
> I use optimization. The stacktrace may contain some other functions that 
> already finished but left the address on the stack.

There may also be functions completely missing on the stack.  I still
suggest to build with -g only.

> > Btw., are you testing on 32 or 64 bit?
> 
> On 32-bit. The rebuild of cygwin1.dll requires large number of packages to 
> create the documentation (including tex and java) and I haven't bloated 

Java?!?

> the 64-bit cygwin installation with them yet. I wish it were possible to 
> build the library without documentation and without such big dependecies.

You don't have to build the docs to build the DLL.  The make process
continues even if building the docs fails.

> > I'm testing on 64 bit. I can't reproduce your backtrace, but I can 
> > reproduce another one, which is related to thread_exit.  At one point 
> > after a couple thousand runs through your testcase I have a variable 
> > number of threads hanging in thread_exit, and a timer thread which is 
> > unable to send its signal.  the other threads all hang in thread_exit, 
> > waiting for a muto which is taken by a thread which doesn't exist 
> > anymore.
> 
> So you can - just for debugging - add a counter to thread local storage 
> that is incremented when muto is taken and decremented when muto is 
> released. If the thread exists, test the counter, if it is non-zero, print 
> the backtrace or attach the debugger.

For instance.

> > That's a very serious downside of the muto implementation not 
> > being able to recognize being abandoned. I wonder if that shouldn't be 
> > using a real OS mutex.
> 
> That would hide the problem that a thread is exiting with locked muto, but 
> not fix it.

See exit_thread and cgf-000017 in DevNotes.  This setup deliberately
locks the muto and then calls ExitThread.  The signal handler is
supposed to unlock the muto when the __SIGTHREADEXIT signal comes in,
but then it happens that it doesn't for some reason.  It seems the
problem here is that the SIGALRM is filling up the signal pipe so
the __SIGTHREADEXIT signal is not actually delivered.  I have a local
workaround, but it seems to open a can of worms.

I'm going to take a step back for now, and reevaluate what happens
before trying to apply even more hacks.  Ultimately the problem is that
the cygtls area is accessed from other threads (mainly the signal
thread) without locking, and worse, that the lock for the cygtls area is
a member of _cygtls itself.  The latter needs certainly a patch, and I'm
contemplating to extend cygheap::threadlist to become a per-thread
structure containing the _cygtls pointer, the thread ID, the main thread
HANDLE, and the tls muto.  This should allow to serialize access to the
cygtls area in a way which avoids the aforementioned problems without
a complete redesign.

Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Instability with signals and threads
  2014-11-21 14:47   ` Corinna Vinschen
@ 2014-11-21 16:30     ` Marco Atzeri
  2014-11-21 16:37       ` Corinna Vinschen
  2014-11-28 22:35     ` Corinna Vinschen
  1 sibling, 1 reply; 18+ messages in thread
From: Marco Atzeri @ 2014-11-21 16:30 UTC (permalink / raw)
  To: cygwin

On 11/21/2014 3:43 PM, Corinna Vinschen wrote:
>
>

>> On 32-bit. The rebuild of cygwin1.dll requires large number of packages to
>> create the documentation (including tex and java) and I haven't bloated
>
> Java?!?

make[3]: [cygwin-ug-net/cygwin-ug-net.pdf] Error 127 (ignored)
xsltproc --xinclude ../../../../src_new/winsup/doc/fo.xsl 
../../../../src_new/winsup/doc/cygwin-api.xml | fop -q -fo - 
cygwin-api/cygwin-api.pdf
/bin/sh: fop: command not found
Making portrait pages on USletter paper (8.5inx11in)
Makefile:64: recipe for target 'cygwin-api/cygwin-api.pdf' failed

FOP is a Java application....grrr...
It will be nice to not use it for building cygwin docs.




Regards
Marco

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Instability with signals and threads
  2014-11-21 16:30     ` Marco Atzeri
@ 2014-11-21 16:37       ` Corinna Vinschen
  2014-11-21 20:36         ` Yaakov Selkowitz
  0 siblings, 1 reply; 18+ messages in thread
From: Corinna Vinschen @ 2014-11-21 16:37 UTC (permalink / raw)
  To: cygwin

[-- Attachment #1: Type: text/plain, Size: 1328 bytes --]

On Nov 21 16:50, Marco Atzeri wrote:
> On 11/21/2014 3:43 PM, Corinna Vinschen wrote:
> >
> >
> 
> >>On 32-bit. The rebuild of cygwin1.dll requires large number of packages to
> >>create the documentation (including tex and java) and I haven't bloated
> >
> >Java?!?
> 
> make[3]: [cygwin-ug-net/cygwin-ug-net.pdf] Error 127 (ignored)
> xsltproc --xinclude ../../../../src_new/winsup/doc/fo.xsl
> ../../../../src_new/winsup/doc/cygwin-api.xml | fop -q -fo -
> cygwin-api/cygwin-api.pdf
> /bin/sh: fop: command not found
> Making portrait pages on USletter paper (8.5inx11in)
> Makefile:64: recipe for target 'cygwin-api/cygwin-api.pdf' failed
> 
> FOP is a Java application....grrr...

Oh, right.

> It will be nice to not use it for building cygwin docs.

The html docs are not built with fop, only the pdf.  And even then,
make from the parent dir continues to build everything else, even
if make in the doc dir fails:

  $(SUBDIRS):
	@${MAKE} -C $@ all || ([ "$@" == doc ] && echo "*** error ignored")

If you have another way to create pdf files from xml input, which works
on Cygwin and Linux, please feel free to provide one.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Instability with signals and threads
  2014-11-21 16:37       ` Corinna Vinschen
@ 2014-11-21 20:36         ` Yaakov Selkowitz
  2014-11-21 20:44           ` Corinna Vinschen
  0 siblings, 1 reply; 18+ messages in thread
From: Yaakov Selkowitz @ 2014-11-21 20:36 UTC (permalink / raw)
  To: cygwin

On 2014-11-21 10:06, Corinna Vinschen wrote:
> On Nov 21 16:50, Marco Atzeri wrote:
>> On 11/21/2014 3:43 PM, Corinna Vinschen wrote:
>>>> On 32-bit. The rebuild of cygwin1.dll requires large number of packages to
>>>> create the documentation (including tex and java) and I haven't bloated
>>>
>>> Java?!?
>>
>> FOP is a Java application....grrr...
>
> Oh, right.
>
>> It will be nice to not use it for building cygwin docs.
>
> The html docs are not built with fop, only the pdf.  And even then,
> make from the parent dir continues to build everything else, even
> if make in the doc dir fails:
>
>    $(SUBDIRS):
> 	@${MAKE} -C $@ all || ([ "$@" == doc ] && echo "*** error ignored")
>
> If you have another way to create pdf files from xml input, which works
> on Cygwin and Linux, please feel free to provide one.

https://cygwin.com/ml/cygwin-developers/2013-07/msg00008.html

Which resulted in:

https://sourceware.org/viewvc/src/winsup/doc/Makefile.in?r1=1.33&r2=1.34

Has this bug been fixed yet in TeX Live?  I had no problems building the 
PDFs after reverting this commit.  I'll have to try on Fedora 20 and 21, 
unless someone else can get to it first.


Yaakov


--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Instability with signals and threads
  2014-11-21 20:36         ` Yaakov Selkowitz
@ 2014-11-21 20:44           ` Corinna Vinschen
  2014-11-28 21:13             ` Yaakov Selkowitz
  0 siblings, 1 reply; 18+ messages in thread
From: Corinna Vinschen @ 2014-11-21 20:44 UTC (permalink / raw)
  To: cygwin

[-- Attachment #1: Type: text/plain, Size: 1759 bytes --]

On Nov 21 14:06, Yaakov Selkowitz wrote:
> On 2014-11-21 10:06, Corinna Vinschen wrote:
> >On Nov 21 16:50, Marco Atzeri wrote:
> >>On 11/21/2014 3:43 PM, Corinna Vinschen wrote:
> >>>>On 32-bit. The rebuild of cygwin1.dll requires large number of packages to
> >>>>create the documentation (including tex and java) and I haven't bloated
> >>>
> >>>Java?!?
> >>
> >>FOP is a Java application....grrr...
> >
> >Oh, right.
> >
> >>It will be nice to not use it for building cygwin docs.
> >
> >The html docs are not built with fop, only the pdf.  And even then,
> >make from the parent dir continues to build everything else, even
> >if make in the doc dir fails:
> >
> >   $(SUBDIRS):
> >	@${MAKE} -C $@ all || ([ "$@" == doc ] && echo "*** error ignored")
> >
> >If you have another way to create pdf files from xml input, which works
> >on Cygwin and Linux, please feel free to provide one.
> 
> https://cygwin.com/ml/cygwin-developers/2013-07/msg00008.html
> 
> Which resulted in:
> 
> https://sourceware.org/viewvc/src/winsup/doc/Makefile.in?r1=1.33&r2=1.34
> 
> Has this bug been fixed yet in TeX Live?  I had no problems building the
> PDFs after reverting this commit.  I'll have to try on Fedora 20 and 21,
> unless someone else can get to it first.

Again, if you have another way to create pdf files from xml input, which
works on Cygwin and Linux, please feel free to provide one.  I had no
problems creating pdfs on Linux using fop, so I had no incentive to
switch.  Just provide a patch to cygwin-patches.  If it works it's as
good as applied.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Instability with signals and threads
  2014-11-21 20:44           ` Corinna Vinschen
@ 2014-11-28 21:13             ` Yaakov Selkowitz
  0 siblings, 0 replies; 18+ messages in thread
From: Yaakov Selkowitz @ 2014-11-28 21:13 UTC (permalink / raw)
  To: cygwin

On 2014-11-21 14:36, Corinna Vinschen wrote:
> On Nov 21 14:06, Yaakov Selkowitz wrote:
>> On 2014-11-21 10:06, Corinna Vinschen wrote:
>>> On Nov 21 16:50, Marco Atzeri wrote:
>>>> On 11/21/2014 3:43 PM, Corinna Vinschen wrote:
>>>>>> On 32-bit. The rebuild of cygwin1.dll requires large number of packages to
>>>>>> create the documentation (including tex and java) and I haven't bloated
>>>>>
>>>>> Java?!?
>>>>
>>>> FOP is a Java application....grrr...
>>>
>>> Oh, right.
>>>
>>>> It will be nice to not use it for building cygwin docs.
>>>
>>> The html docs are not built with fop, only the pdf.  And even then,
>>> make from the parent dir continues to build everything else, even
>>> if make in the doc dir fails:
>>>
>>>    $(SUBDIRS):
>>> 	@${MAKE} -C $@ all || ([ "$@" == doc ] && echo "*** error ignored")
>>>
>>> If you have another way to create pdf files from xml input, which works
>>> on Cygwin and Linux, please feel free to provide one.
>>
>> https://cygwin.com/ml/cygwin-developers/2013-07/msg00008.html
>>
>> Which resulted in:
>>
>> https://sourceware.org/viewvc/src/winsup/doc/Makefile.in?r1=1.33&r2=1.34
>>
>> Has this bug been fixed yet in TeX Live?  I had no problems building the
>> PDFs after reverting this commit.  I'll have to try on Fedora 20 and 21,
>> unless someone else can get to it first.
>
> Again, if you have another way to create pdf files from xml input, which
> works on Cygwin and Linux, please feel free to provide one.  I had no
> problems creating pdfs on Linux using fop, so I had no incentive to
> switch.  Just provide a patch to cygwin-patches.  If it works it's as
> good as applied.

For the archives:

https://cygwin.com/ml/cygwin-patches/2014-q4/msg00016.html

--
Yaakov


--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Instability with signals and threads
  2014-11-21 14:47   ` Corinna Vinschen
  2014-11-21 16:30     ` Marco Atzeri
@ 2014-11-28 22:35     ` Corinna Vinschen
  2014-12-04  9:42       ` Corinna Vinschen
  1 sibling, 1 reply; 18+ messages in thread
From: Corinna Vinschen @ 2014-11-28 22:35 UTC (permalink / raw)
  To: cygwin

[-- Attachment #1: Type: text/plain, Size: 1216 bytes --]

On Nov 21 15:43, Corinna Vinschen wrote:
> I'm going to take a step back for now, and reevaluate what happens
> before trying to apply even more hacks.  Ultimately the problem is that
> the cygtls area is accessed from other threads (mainly the signal
> thread) without locking, and worse, that the lock for the cygtls area is
> a member of _cygtls itself.  The latter needs certainly a patch, and I'm
> contemplating to extend cygheap::threadlist to become a per-thread
> structure containing the _cygtls pointer, the thread ID, the main thread
> HANDLE, and the tls muto.  This should allow to serialize access to the
> cygtls area in a way which avoids the aforementioned problems without
> a complete redesign.

I applied a patch which is supposed to fix this problem.  Your STC
works for me and everything else I tried, including X and Emacs in X11
GUI mode still work.

Please try the latest snapshot from https://cygwin.com/snapshots/

I'm not expecting that my first cut works OOTB, so I'd be glad for
more testing.


Thanks,
Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Instability with signals and threads
  2014-11-28 22:35     ` Corinna Vinschen
@ 2014-12-04  9:42       ` Corinna Vinschen
  0 siblings, 0 replies; 18+ messages in thread
From: Corinna Vinschen @ 2014-12-04  9:42 UTC (permalink / raw)
  To: cygwin

[-- Attachment #1: Type: text/plain, Size: 1316 bytes --]

Mikulas, ping?

On Nov 28 22:12, Corinna Vinschen wrote:
> On Nov 21 15:43, Corinna Vinschen wrote:
> > I'm going to take a step back for now, and reevaluate what happens
> > before trying to apply even more hacks.  Ultimately the problem is that
> > the cygtls area is accessed from other threads (mainly the signal
> > thread) without locking, and worse, that the lock for the cygtls area is
> > a member of _cygtls itself.  The latter needs certainly a patch, and I'm
> > contemplating to extend cygheap::threadlist to become a per-thread
> > structure containing the _cygtls pointer, the thread ID, the main thread
> > HANDLE, and the tls muto.  This should allow to serialize access to the
> > cygtls area in a way which avoids the aforementioned problems without
> > a complete redesign.
> 
> I applied a patch which is supposed to fix this problem.  Your STC
> works for me and everything else I tried, including X and Emacs in X11
> GUI mode still work.
> 
> Please try the latest snapshot from https://cygwin.com/snapshots/
> 
> I'm not expecting that my first cut works OOTB, so I'd be glad for
> more testing.


Thanks,
Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Instability with signals and threads
  2014-11-20 10:00 ` Corinna Vinschen
@ 2014-11-20 16:22   ` Corinna Vinschen
  0 siblings, 0 replies; 18+ messages in thread
From: Corinna Vinschen @ 2014-11-20 16:22 UTC (permalink / raw)
  To: cygwin

[-- Attachment #1: Type: text/plain, Size: 3476 bytes --]

On Nov 20 11:00, Corinna Vinschen wrote:
> Hi Mikulas,
> 
> On Nov 19 17:42, Mikulas Patocka wrote:
> > Hi
> > 
> > I have a program that sets a repetitive timer with setitimer and spawns 
> > several threads.
> > 
> > The program is very unstable on cygwin, it locks up in few minutes.
> > 
> > The bug manifests itself in the following way: the signal thread calls 
> > cygheap->find_tls to find a thread to deliver the signal to. find_tls 
> > generates an exception when scanning the threadlist, jumps to the __except 
> > block and calls threadlist[idx]->remove(INFINITE).
> > 
> > The method threadlist[idx]->remove is called with invalid "this" pointer 
> > (sometimes it is zero, sometimes it points to unmapped memory), generates 
> > another exception on "initialized = 0" line and becomes stuck on this 
> > assignment.
> 
> Now that you mention it, it makes sense.  The exception gets triggered
> by accessing an invalid member of threadlist.  Using the very same
> member in another method call looks.... borderline, to say the least.
> 
> > I found out that when I modify the remove_tls method so that it always 
> > acquires the lock and removes the thread from the threadlist (change 
> > "tls_sentry here(wait)" to "tls_sentry here(INFINITE)"), the bug goes away 
> > and the multithreaded program is stable.
> 
> [Noted your augmenting comment preceeding the testcase in your other mail]
> 
> > Alternativelly - the crash can be fixed if we change "_my_tls.remove (0)" 
> > to "_my_tls.remove (INFINITE)" in thread_wrapper (though, there is another 
> > _my_tls.remove (0) call in dll_entry in winsup/cygwin/init.cc and it could 
> > trigger the same crash)
> 
> I don't think so.  In dll_init, the call is done inside a DLL_THREAD_DETACH
> for this very thread, so &_my_tls is still a valid pointer.

Never mind that.  I can fix your testcase by calling _my_tls.remove with
INFINITE as parameter in both places.  If I drop one of them, your
testcase will invariable fail at one point.  With both INFINITE params
in place, your testcase is now running half an hour without problems.

Thinking about it, the fact that _cygtls::remove allows to apply
a non-INFINITE wait is rather strange, isn't it?  Calling remove_tls
with a 0 wait, it allows to return the function silently, without
actually having removed the thread from the list.  This is bound to
go downhill at one point and looks like a kludge to me to circumvent
some potential hang in another situation...

I'm not exactly sure if that works as intended.  I will apply this patch
and create a new Cygwin snapshot on https://cygwin.com/snapshots/ in a
couple of minutes.  I'd appreciate if you and others would give it an
exhaustive test.  New spurious hangs or SEGVs in other situations which
so far worked fine would be good indicators for another problem in the
code.

Other than that, there's certainly some room for improvement.  Calling
threadlist[idx]->remove from the find_tls exception handler looks
extremly hairy to me.  I wonder if that should be called at all at this
point, or if there shouldn't be better some "simplified" removal
operation which doesn't require the _cygtls pointer.  If the thread
doesn't exist anymore, so does its _cygtls area.

Thanks,
Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Instability with signals and threads
  2014-11-19 16:43 Mikulas Patocka
  2014-11-19 17:30 ` Mikulas Patocka
@ 2014-11-20 10:00 ` Corinna Vinschen
  2014-11-20 16:22   ` Corinna Vinschen
  1 sibling, 1 reply; 18+ messages in thread
From: Corinna Vinschen @ 2014-11-20 10:00 UTC (permalink / raw)
  To: cygwin

[-- Attachment #1: Type: text/plain, Size: 3583 bytes --]

Hi Mikulas,

On Nov 19 17:42, Mikulas Patocka wrote:
> Hi
> 
> I have a program that sets a repetitive timer with setitimer and spawns 
> several threads.
> 
> The program is very unstable on cygwin, it locks up in few minutes.
> 
> The bug manifests itself in the following way: the signal thread calls 
> cygheap->find_tls to find a thread to deliver the signal to. find_tls 
> generates an exception when scanning the threadlist, jumps to the __except 
> block and calls threadlist[idx]->remove(INFINITE).
> 
> The method threadlist[idx]->remove is called with invalid "this" pointer 
> (sometimes it is zero, sometimes it points to unmapped memory), generates 
> another exception on "initialized = 0" line and becomes stuck on this 
> assignment.

Now that you mention it, it makes sense.  The exception gets triggered
by accessing an invalid member of threadlist.  Using the very same
member in another method call looks.... borderline, to say the least.

> I found out that when I modify the remove_tls method so that it always 
> acquires the lock and removes the thread from the threadlist (change 
> "tls_sentry here(wait)" to "tls_sentry here(INFINITE)"), the bug goes away 
> and the multithreaded program is stable.

[Noted your augmenting comment preceeding the testcase in your other mail]

> Alternativelly - the crash can be fixed if we change "_my_tls.remove (0)" 
> to "_my_tls.remove (INFINITE)" in thread_wrapper (though, there is another 
> _my_tls.remove (0) call in dll_entry in winsup/cygwin/init.cc and it could 
> trigger the same crash)

I don't think so.  In dll_init, the call is done inside a DLL_THREAD_DETACH
for this very thread, so &_my_tls is still a valid pointer.

> I'd like to ask - what's the reason for not waiting for the lock in 
> remove_tls? If the lock is already locked, remove_tls does nothing - but 
> the _cygtls structure is freed anyway, so that there is dangling pointer 
> no the thread list. Do you think that we can drop this "wait" argument and 
> always wait for the lock in remove_tls?

Unfortunately I can't tell you about the reasoning behind this.  The guy
who created this code has left the project, so we just have to work with
the status quo.

> Another possible bug - when find_tls exits, it drops the tls_sentry lock 
> and returns the pointer to _cygtls. What happens if the thread owning the 
> tls exits at this point? It seems that there is nothing that prevents it 
> from exiting and that the caller of find_tls (sigpacket::process) will 
> work with a pointer to invalid thread. It seems that we need to add some 
> reference count to _cygtls to prevent it from disappearing while we are 
> trying to send a signal to it. (or keep tls_sentry::lock locked until 
> sigpacket::process is done with the signal, though I don't know if keeping 
> the lock for so long won't cause deadlocks).

I'm going to investigate this in the next couple of days.  However,
since you're obviously willing and able to debug this situation
yourself, I would very much appreciate your further input.  If you have
any fun to provide patches to Cygwin, please feel free to read
https://cygwin.com/contrib.html and follow up on the cygwin-developer
and cygwin-patches mailing lists.  The copyright assignment is still
required for non-trivial patches (~10 lines rule).  I hope that's ok.


Thanks a lot,
Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Instability with signals and threads
  2014-11-19 16:43 Mikulas Patocka
@ 2014-11-19 17:30 ` Mikulas Patocka
  2014-11-20 10:00 ` Corinna Vinschen
  1 sibling, 0 replies; 18+ messages in thread
From: Mikulas Patocka @ 2014-11-19 17:30 UTC (permalink / raw)
  To: cygwin



On Wed, 19 Nov 2014, Mikulas Patocka wrote:

> Hi
> 
> I have a program that sets a repetitive timer with setitimer and spawns 
> several threads.
> 
> The program is very unstable on cygwin, it locks up in few minutes.

This is a simplified example that triggers the lockup very quicky.

When I change remove_tls so that it always acquires the lock, it reduces 
the probability of the lockup, but doesn't eliminate it entirely - so 
there must be some other bug.

Mikulas



#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <signal.h>
#include <sys/time.h>
#include <pthread.h>

static volatile sig_atomic_t events;

static void signal_handler(int sig)
{
	events++;
}

static void *thread_func(void *ptr)
{
	return NULL;
}

#define N_THREADS	12

static pthread_t threads[N_THREADS];

int main(void)
{
	int i, r;
	struct sigaction sa;
	struct itimerval timer;

	memset(&sa, 0, sizeof sa);
	sa.sa_handler = signal_handler;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags |= SA_RESTART;
	r = sigaction(SIGALRM, &sa, NULL);
	if (r) perror("sigaction"), exit(1);

	timer.it_interval.tv_sec = 0;
	timer.it_interval.tv_usec = 1000;
	timer.it_value = timer.it_interval;
	r = setitimer(ITIMER_REAL, &timer, NULL);
	if (r) perror("setitimer"), exit(1);

	while (1) {
		for (i = 0; i < N_THREADS; i++) {
			r = pthread_create(&threads[i], NULL, thread_func, NULL);
			if (r) fprintf(stderr, "pthread_create: %s", strerror(r));
		}
		for (i = 0; i < N_THREADS; i++) {
			r = pthread_join(threads[i], NULL);
			if (r) fprintf(stderr, "pthread_create: %s", strerror(r));
		}
		printf("events: %d\n", events);
	}
}

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Instability with signals and threads
@ 2014-11-19 16:43 Mikulas Patocka
  2014-11-19 17:30 ` Mikulas Patocka
  2014-11-20 10:00 ` Corinna Vinschen
  0 siblings, 2 replies; 18+ messages in thread
From: Mikulas Patocka @ 2014-11-19 16:43 UTC (permalink / raw)
  To: cygwin

Hi

I have a program that sets a repetitive timer with setitimer and spawns 
several threads.

The program is very unstable on cygwin, it locks up in few minutes.

The bug manifests itself in the following way: the signal thread calls 
cygheap->find_tls to find a thread to deliver the signal to. find_tls 
generates an exception when scanning the threadlist, jumps to the __except 
block and calls threadlist[idx]->remove(INFINITE).

The method threadlist[idx]->remove is called with invalid "this" pointer 
(sometimes it is zero, sometimes it points to unmapped memory), generates 
another exception on "initialized = 0" line and becomes stuck on this 
assignment.

I found out that when I modify the remove_tls method so that it always 
acquires the lock and removes the thread from the threadlist (change 
"tls_sentry here(wait)" to "tls_sentry here(INFINITE)"), the bug goes away 
and the multithreaded program is stable.

Alternativelly - the crash can be fixed if we change "_my_tls.remove (0)" 
to "_my_tls.remove (INFINITE)" in thread_wrapper (though, there is another 
_my_tls.remove (0) call in dll_entry in winsup/cygwin/init.cc and it could 
trigger the same crash)

I'd like to ask - what's the reason for not waiting for the lock in 
remove_tls? If the lock is already locked, remove_tls does nothing - but 
the _cygtls structure is freed anyway, so that there is dangling pointer 
no the thread list. Do you think that we can drop this "wait" argument and 
always wait for the lock in remove_tls?

Another possible bug - when find_tls exits, it drops the tls_sentry lock 
and returns the pointer to _cygtls. What happens if the thread owning the 
tls exits at this point? It seems that there is nothing that prevents it 
from exiting and that the caller of find_tls (sigpacket::process) will 
work with a pointer to invalid thread. It seems that we need to add some 
reference count to _cygtls to prevent it from disappearing while we are 
trying to send a signal to it. (or keep tls_sentry::lock locked until 
sigpacket::process is done with the signal, though I don't know if keeping 
the lock for so long won't cause deadlocks).

Mikulas

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2014-12-04  9:42 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-11-20 20:24 Instability with signals and threads Mikulas Patocka
2014-11-21  4:24 ` Mikulas Patocka
2014-11-21 10:28 ` Corinna Vinschen
2014-11-21 10:38   ` Corinna Vinschen
2014-11-21 14:43 ` Mikulas Patocka
2014-11-21 14:46   ` Mikulas Patocka
2014-11-21 14:47   ` Corinna Vinschen
2014-11-21 16:30     ` Marco Atzeri
2014-11-21 16:37       ` Corinna Vinschen
2014-11-21 20:36         ` Yaakov Selkowitz
2014-11-21 20:44           ` Corinna Vinschen
2014-11-28 21:13             ` Yaakov Selkowitz
2014-11-28 22:35     ` Corinna Vinschen
2014-12-04  9:42       ` Corinna Vinschen
  -- strict thread matches above, loose matches on Subject: below --
2014-11-19 16:43 Mikulas Patocka
2014-11-19 17:30 ` Mikulas Patocka
2014-11-20 10:00 ` Corinna Vinschen
2014-11-20 16:22   ` Corinna Vinschen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).