public inbox for pthreads-win32@sourceware.org
* RE: New pthread_once implementation
       [not found] <97ffb3105052709425ce1126a@mail.gmail.com>
@ 2005-05-28  1:30 ` Vladimir Kliatchko
  2005-05-28  3:46   ` Ross Johnson
  0 siblings, 1 reply; 10+ messages in thread
From: Vladimir Kliatchko @ 2005-05-28  1:30 UTC (permalink / raw)
  To: 'Gottlob Frege'; +Cc: 'Ross Johnson', pthreads-win32

Nice catch. Let me see if I can fix it.

Note that the same problem exists in the currently released event-based
implementation (cvs version 1.16):

thread1 comes in, starts initing
thread2 creates event, starts waiting
thread3 comes in, starts waiting
thread1 is cancelled, signals event
thread2 wakes up, proceeds to the point right before the ResetEvent
thread3 wakes up, closes event handle
thread2 resets the closed handle

Re: your previous message:
>"If only one thread ever comes in, and is canceled in the init_routine, 
>then the semaphore is never cleaned up."

If only one thread ever comes in, and is canceled in the init_routine, then
the semaphore is never created to begin with, right?

Also, regarding my previous comment to Ross about the very high cost of using
InterlockedExchangeAdd for MBR:
I did some simple benchmarking. Running pthread_once 50,000,000 times on my
pretty slow single-CPU machine takes about 2.1 seconds. Replacing
InterlockedExchangeAdd with a simple read brings it down to 0.6 seconds. This
looks significant.
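
For reference, the benchmark loop looked roughly like this (a sketch only;
the empty init routine and the use of QueryPerformanceCounter here are
illustrative, not the exact harness):

#include <windows.h>
#include <stdio.h>
#include <pthread.h>

static pthread_once_t once = PTHREAD_ONCE_INIT;

static void init_routine(void)
{
  /* intentionally empty: we are timing the already-done fast path */
}

int main(void)
{
  LARGE_INTEGER freq, t0, t1;
  long i;

  QueryPerformanceFrequency(&freq);
  QueryPerformanceCounter(&t0);
  for (i = 0; i < 50000000L; i++)
    {
      /* after the first call, only the 'done' check is exercised */
      pthread_once(&once, init_routine);
    }
  QueryPerformanceCounter(&t1);
  printf("%.2f seconds\n",
         (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart);
  return 0;
}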

-----Original Message-----
From: Gottlob Frege [mailto:gottlobfrege@gmail.com] 
Sent: Friday, May 27, 2005 12:42 PM
To: Ross Johnson
Cc: Vladimir Kliatchko
Subject: Re: New pthread_once implementation

thread1 comes in, starts initing
thread2 creates sema and waits
thread1 starts to cancel - resets control->state
thread3 comes in, goes into init
thread4 comes in, goes into else block
thread1 finishes cancel - releases semaphore
thread2 wakes up
thread2 decrements numSemaUsers to 0
thread4 increments numSemaUsers
thread4 does NOT set new semaphore
thread2 closes semaphore
thread4 tries to wait on closed semaphore...


On 5/27/05, Ross Johnson <ross.johnson@homemail.com.au> wrote:
> Guys,
> 
> Is there anything you want to change before I cast a new release?
> 
> http://sources.redhat.com/cgi-bin/cvsweb.cgi/pthreads/pthread_once.c?
> rev=1.18&content-type=text/x-cvsweb-markup&cvsroot=pthreads-win32
> 
> Thanks.
> Ross
> 
> On Fri, 2005-05-27 at 08:39 -0500, Tim Theisen wrote:
> > I picked up the latest and compiled with VC 7.1.  All tests passed.
> > Then, I ran 100 iterations of the once [1-4] tests.  These tests passed
> > as well.  So, it has my stamp of approval.
> >
> > ...Tim
> > --
> >        Tim Theisen                     Lead Research Software Engineer
> > Phone: +1 608 824 2848                 TomoTherapy Incorporated
> >   Fax: +1 608 824 2996                 1240 Deming Way
> >   Web: http://www.tomotherapy.com      Madison, WI 53717-1954
> >
> >
> >
> > -----Original Message-----
> > From: Ross Johnson [mailto:ross.johnson@homemail.com.au]
> > Sent: Friday, May 27, 2005 02:44
> > To: Tim Theisen
> > Cc: Vladimir Kliatchko; Gottlob Frege
> > Subject: New pthread_once implementation
> >
> >
> > Hi Tim,
> >
> > The current CVS head contains the latest and much simpler implementation
> > of pthread_once just presented on the mailing list. It passes on a UP
> > machine as usual but no-one has run it through an MP system yet. Could
> > you when you get time?
> >
> > Thanks.
> > Ross
> >
> >
> >
> 
>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: New pthread_once implementation
  2005-05-28  1:30 ` New pthread_once implementation Vladimir Kliatchko
@ 2005-05-28  3:46   ` Ross Johnson
  2005-05-28 10:51     ` Vladimir Kliatchko
  0 siblings, 1 reply; 10+ messages in thread
From: Ross Johnson @ 2005-05-28  3:46 UTC (permalink / raw)
  To: Vladimir Kliatchko; +Cc: 'Gottlob Frege', Pthreads-Win32 list

On Fri, 2005-05-27 at 21:30 -0400, Vladimir Kliatchko wrote:
> Nice catch. Let me see if I can fix it.
> 
> Note that the same problem exists in the currently released event-based
> implementation (cvs version 1.16):
> 
> thread1 comes in, start initing
> thread2 creates event, starts waiting
> thread3 comes in starts waiting
> thread1 is cancelled, signals event
> thread2 wakes up, proceeds to the point right before the resetEvent
> thread3 wakes up, closes event handle 
> thread2 resets closed handle

That relies on HANDLE uniqueness and assumes that an error will result. This
is why the 2.6.0 version (and earlier) checks the return code and restores
the Win32 LastError if necessary - for GetLastError transparency.

Remember, these are very rare situations following a cancellation. It
might delay things a little but shouldn't break anything.
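
Schematically, that check looks something like this (a sketch only; the
helper name is made up and the real 2.6.0 code differs in detail):

#include <windows.h>

/* Sketch: preserve GetLastError() across a ResetEvent that may fail
 * benignly when a racing waiter has already closed the handle. */
static void reset_event_transparently(HANDLE event)
{
  DWORD saved = GetLastError();

  if (!ResetEvent(event))
    {
      /* The failure is benign here, so hide it from the application
       * by restoring the saved LastError value. */
      SetLastError(saved);
    }
}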

> Re: you previous message:
> >"If only one thread ever comes in, and is canceled in the init_routine, 
> >then the semaphore is never cleaned up."
> 
> If only one thread ever comes in, and is canceled in the init_routine, then
> the semaphore is never created to begin with, right?
> 
> Also, regarding my previous comment to Ross about very high cost of using
> InterlockedExchangeAdd for MBR:
> I did some simple benchmarking. Running pthread_once 50,000,000 on my pretty
> slow single CPU machine takes about 2.1 seconds. Replacing
> InterlockedExchangeAdd with simple read brings it down to 0.6 seconds. This
> looks significant.

Using the PTW32_INTERLOCKED_COMPARE_EXCHANGE macro as in your latest (in
CVS) version and building the library for inlined functions (nmake VC-
inlined) and x86 architecture causes customised versions of
InterlockedCompareExchange to be used, and this results in inlined asm.
Same for PTW32_INTERLOCKED_EXCHANGE.

Also, on single-CPU x86, the library dynamically switches to using
'cmpxchg' rather than 'lock cmpxchg' to avoid locking the bus. This
appears to match what the kernel32.dll versions do. On non-x86
architectures the kernel32.dll versions are called, with call overhead.

PTW32_INTERLOCKED_EXCHANGE_ADD could be added, as could other
architectures. See ptw32_InterlockedCompareExchange.c
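
Schematically the dispatch looks like this (a sketch only; the names are
illustrative, see ptw32_InterlockedCompareExchange.c for the real code):

#include <windows.h>

static int ptw32_smp_system = 1; /* assume MP until detection says otherwise */

static void ptw32_detect_cpus(void) /* hypothetical init-time helper */
{
  SYSTEM_INFO si;
  GetSystemInfo(&si);
  ptw32_smp_system = (si.dwNumberOfProcessors > 1);
}

/* UP-only x86 path: 'cmpxchg' without the 'lock' prefix (MSVC inline asm). */
static LONG ptw32_up_cmpxchg(volatile LONG *dest, LONG exchange, LONG comparand)
{
  LONG result;
  __asm {
    mov  ecx, dest
    mov  edx, exchange
    mov  eax, comparand
    cmpxchg dword ptr [ecx], edx   ; no bus lock: safe on a single CPU only
    mov  result, eax
  }
  return result;
}

static LONG ptw32_cmpxchg(volatile LONG *dest, LONG exchange, LONG comparand)
{
  return ptw32_smp_system
    ? InterlockedCompareExchange(dest, exchange, comparand) /* locked */
    : ptw32_up_cmpxchg(dest, exchange, comparand);
}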

> -----Original Message-----
> From: Gottlob Frege [mailto:gottlobfrege@gmail.com] 
> Sent: Friday, May 27, 2005 12:42 PM
> To: Ross Johnson
> Cc: Vladimir Kliatchko
> Subject: Re: New pthread_once implementation
> 
> thread1 comes in, start initing
> thread2 creates sema and waits
> thread1 starts to cancel - resets control->state
> thread3 comes in, goes into init
> thread4 comes in, goes into else block
> thread1 finishes cancel - releases semaphore
> thread2 wakes up
> thread2 decrements numSemaUsers to 0
> thread4 increments numSemaUsers
> thread4 does NOT set new semaphore
> thread2 closes semaphore
> thread4 tries to wait on closed semaphore...
> 
> 
> On 5/27/05, Ross Johnson <ross.johnson@homemail.com.au> wrote:
> > Guys,
> > 
> > Is there anything you want to change before I cast a new release?
> > 
> > http://sources.redhat.com/cgi-bin/cvsweb.cgi/pthreads/pthread_once.c?
> > rev=1.18&content-type=text/x-cvsweb-markup&cvsroot=pthreads-win32
> > 
> > Thanks.
> > Ross
> > 
> > On Fri, 2005-05-27 at 08:39 -0500, Tim Theisen wrote:
> > > I picked up the latest and compiled with VC 7.1.  All tests passed.
> > > Then, I ran 100 iterations of the once [1-4] tests.  These tests passed
> > > as well.  So, it has my stamp of approval.
> > >
> > > ...Tim
> > > --
> > >        Tim Theisen                     Lead Research Software Engineer
> > > Phone: +1 608 824 2848                 TomoTherapy Incorporated
> > >   Fax: +1 608 824 2996                 1240 Deming Way
> > >   Web: http://www.tomotherapy.com      Madison, WI 53717-1954
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Ross Johnson [mailto:ross.johnson@homemail.com.au]
> > > Sent: Friday, May 27, 2005 02:44
> > > To: Tim Theisen
> > > Cc: Vladimir Kliatchko; Gottlob Frege
> > > Subject: New pthread_once implementation
> > >
> > >
> > > Hi Tim,
> > >
> > > The current CVS head contains the latest and much simpler implementation
> > > of pthread_once just presented on the mailing list. It passes on a UP
> > > machine as usual but no-one has run it through an MP system yet. Could
> > > you when you get time?
> > >
> > > Thanks.
> > > Ross
> > >
> > >
> > >
> > 
> >
> 
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: New pthread_once implementation
  2005-05-28  3:46   ` Ross Johnson
@ 2005-05-28 10:51     ` Vladimir Kliatchko
  2005-05-28 13:54       ` Ross Johnson
  0 siblings, 1 reply; 10+ messages in thread
From: Vladimir Kliatchko @ 2005-05-28 10:51 UTC (permalink / raw)
  To: 'Ross Johnson'; +Cc: 'Gottlob Frege', pthreads-win32



> -----Original Message-----
> From: pthreads-win32-owner@sources.redhat.com [mailto:pthreads-win32-
> owner@sources.redhat.com] On Behalf Of Ross Johnson
> Sent: Friday, May 27, 2005 11:48 PM
> To: Vladimir Kliatchko
> Cc: 'Gottlob Frege'; Pthreads-Win32 list
> Subject: RE: New pthread_once implementation
> 
> On Fri, 2005-05-27 at 21:30 -0400, Vladimir Kliatchko wrote:
> > Nice catch. Let me see if I can fix it.
> >
> > Note that the same problem exists in the currently released event-based
> > implementation (cvs version 1.16):
> >
> > thread1 comes in, start initing
> > thread2 creates event, starts waiting
> > thread3 comes in starts waiting
> > thread1 is cancelled, signals event
> > thread2 wakes up, proceeds to the point right before the resetEvent
> > thread3 wakes up, closes event handle
> > thread2 resets closed handle
> 
> Relies on HANDLE uniqueness and assumes that an error will result. This
> is why the 2.6.0 version (and earlier) checks the return code and
> restores Win32 LastError if necessary - for GetLastError transparency.

Does Windows guarantee that the handles are not reused? What happens if a
thread closes a handle while another thread is blocked on it? Is any of this
in Microsoft documentation? Consider the following scenario for the
event-based implementation:

thread1 comes in, starts initing
thread2 creates event, starts waiting
thread3 comes in, starts waiting
thread1 is cancelled, signals event
thread2 wakes up, proceeds to the point right before the ResetEvent
thread3 wakes up, proceeds to the point right before it closes the event handle
thread2 resets the handle, clears state, proceeds to the point right before
setting state to 'done'
thread4 comes in, proceeds to the point right before waiting on the event
thread3 closes the event handle
threads 3, 4, and 2 race to close, wait on, and set the event

Basically it appears, if I am not confused, the handle may be closed while
(or before) another thread calls wait/reset/setevent. The handle may even be
closed while BOTH a thread is waiting and another thread is calling
SetEvent. Are all of these ok? Does it mean similar things are ok for the
semaphore based version and we just need to restore lasterror?

> 
> Remember, these are very rare situations following a cancellation. It
> might delay things a little but shouldn't break anything.
> 
> > Re: you previous message:
> > >"If only one thread ever comes in, and is canceled in the init_routine,
> > >then the semaphore is never cleaned up."
> >
> > If only one thread ever comes in, and is canceled in the init_routine,
> then
> > the semaphore is never created to begin with, right?
> >
> > Also, regarding my previous comment to Ross about very high cost of
> using
> > InterlockedExchangeAdd for MBR:
> > I did some simple benchmarking. Running pthread_once 50,000,000 on my
> pretty
> > slow single CPU machine takes about 2.1 seconds. Replacing
> > InterlockedExchangeAdd with simple read brings it down to 0.6 seconds.
> This
> > looks significant.
> 
> Using the PTW32_INTERLOCKED_COMPARE_EXCHANGE macro as in your latest (in
> CVS) version and building the library for inlined functions (nmake VC-
> inlined) and x86 architecture causes customised versions of
> InterlockedCompareExchange to be used, and this results in inlined asm.
> Same for PTW32_INTERLOCKED_EXCHANGE.
> 
> Also, on single-CPU x86, the library dynamically switches to using
> 'cmpxchg' rather than 'lock cmpxchg' to avoid locking the bus. This
> appears to match what the kernel32.dll versions do. On non-x86
> architectures the kernel32.dll versions are called, with call overhead.
> 
> PTW32_INTERLOCKED_EXCHANGE_ADD could be added, as could other
> architectures. See ptw32_InterlockedCompareExchange.c

I have rerun my benchmark with VC-inlined. The difference is now less
significant (0.9 vs. 0.6 seconds) but still noticeable. I guess cmpxchg even
without locking is quite expensive. On multi-CPU systems the difference
should be much higher due to the time it takes to lock the bus and to the
contention it may cause. It sounded as if you did not care much to try to
optimize it. I did not mean to suggest that we have to do it right now
either; I just wanted to get your opinion on whether we want to deal with
this in the future.

> 
> > -----Original Message-----
> > From: Gottlob Frege [mailto:gottlobfrege@gmail.com]
> > Sent: Friday, May 27, 2005 12:42 PM
> > To: Ross Johnson
> > Cc: Vladimir Kliatchko
> > Subject: Re: New pthread_once implementation
> >
> > thread1 comes in, start initing
> > thread2 creates sema and waits
> > thread1 starts to cancel - resets control->state
> > thread3 comes in, goes into init
> > thread4 comes in, goes into else block
> > thread1 finishes cancel - releases semaphore
> > thread2 wakes up
> > thread2 decrements numSemaUsers to 0
> > thread4 increments numSemaUsers
> > thread4 does NOT set new semaphore
> > thread2 closes semaphore
> > thread4 tries to wait on closed semaphore...
> >
> >
> > On 5/27/05, Ross Johnson <ross.johnson@homemail.com.au> wrote:
> > > Guys,
> > >
> > > Is there anything you want to change before I cast a new release?
> > >
> > > http://sources.redhat.com/cgi-bin/cvsweb.cgi/pthreads/pthread_once.c?
> > > rev=1.18&content-type=text/x-cvsweb-markup&cvsroot=pthreads-win32
> > >
> > > Thanks.
> > > Ross
> > >
> > > On Fri, 2005-05-27 at 08:39 -0500, Tim Theisen wrote:
> > > > I picked up the latest and compiled with VC 7.1.  All tests passed.
> > > > Then, I ran 100 iterations of the once [1-4] tests.  These tests
> passed
> > > > as well.  So, it has my stamp of approval.
> > > >
> > > > ...Tim
> > > > --
> > > >        Tim Theisen                     Lead Research Software
> Engineer
> > > > Phone: +1 608 824 2848                 TomoTherapy Incorporated
> > > >   Fax: +1 608 824 2996                 1240 Deming Way
> > > >   Web: http://www.tomotherapy.com      Madison, WI 53717-1954
> > > >
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: Ross Johnson [mailto:ross.johnson@homemail.com.au]
> > > > Sent: Friday, May 27, 2005 02:44
> > > > To: Tim Theisen
> > > > Cc: Vladimir Kliatchko; Gottlob Frege
> > > > Subject: New pthread_once implementation
> > > >
> > > >
> > > > Hi Tim,
> > > >
> > > > The current CVS head contains the latest and much simpler
> implementation
> > > > of pthread_once just presented on the mailing list. It passes on a
> UP
> > > > machine as usual but no-one has run it through an MP system yet.
> Could
> > > > you when you get time?
> > > >
> > > > Thanks.
> > > > Ross
> > > >
> > > >
> > > >
> > >
> > >
> >
> >


^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: New pthread_once implementation
  2005-05-28 10:51     ` Vladimir Kliatchko
@ 2005-05-28 13:54       ` Ross Johnson
  2005-05-28 14:29         ` Vladimir Kliatchko
  0 siblings, 1 reply; 10+ messages in thread
From: Ross Johnson @ 2005-05-28 13:54 UTC (permalink / raw)
  To: Vladimir Kliatchko; +Cc: 'Gottlob Frege', Pthreads-Win32 list

On Sat, 2005-05-28 at 06:51 -0400, Vladimir Kliatchko wrote:
> 
> > -----Original Message-----
> > From: pthreads-win32-owner@sources.redhat.com [mailto:pthreads-win32-
> > owner@sources.redhat.com] On Behalf Of Ross Johnson
> > Sent: Friday, May 27, 2005 11:48 PM
> > To: Vladimir Kliatchko
> > Cc: 'Gottlob Frege'; Pthreads-Win32 list
> > Subject: RE: New pthread_once implementation
> > 
> > On Fri, 2005-05-27 at 21:30 -0400, Vladimir Kliatchko wrote:
> > > Nice catch. Let me see if I can fix it.
> > >
> > > Note that the same problem exists in the currently released event-based
> > > implementation (cvs version 1.16):
> > >
> > > thread1 comes in, start initing
> > > thread2 creates event, starts waiting
> > > thread3 comes in starts waiting
> > > thread1 is cancelled, signals event
> > > thread2 wakes up, proceeds to the point right before the resetEvent
> > > thread3 wakes up, closes event handle
> > > thread2 resets closed handle
> > 
> > Relies on HANDLE uniqueness and assumes that an error will result. This
> > is why the 2.6.0 version (and earlier) checks the return code and
> > restores Win32 LastError if necessary - for GetLastError transparency.
> 
> Does Windows guarantee that the handles are not reused? What happens if a
> thread closes a handle while another thread is blocked on it? Is any of this
> in Microsoft documentation? Consider the following scenario for the
> event-based implementation:

Well, apparently they're not unique when recycled, so there is a bug
here to fix in both versions:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dngenlib/html/msdn_handles1.asp
[Under "Native Windows NT Objects"]
"Unlike the handles that are maintained by the Win32 USER and GDI
subsystem components, handles to native objects under Windows NT are not
unique; that is, upon destruction of an object, the corresponding handle
may be recycled and will look exactly like the handle to the destroyed
object."

But they are local to the process, rather than system wide if that
helps.
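
The recycling is easy to demonstrate (a sketch; whether the two values
actually collide on a given run is up to the kernel):

#include <windows.h>
#include <stdio.h>

int main(void)
{
  HANDLE a = CreateEvent(NULL, FALSE, FALSE, NULL);
  HANDLE saved = a;

  CloseHandle(a);
  /* A new, unrelated object will often get the recycled handle value. */
  a = CreateEvent(NULL, FALSE, FALSE, NULL);

  printf("old=%p new=%p%s\n", (void *)saved, (void *)a,
         saved == a ? " (recycled)" : "");
  CloseHandle(a);
  return 0;
}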

> > > Also, regarding my previous comment to Ross about very high cost of
> > using
> > > InterlockedExchangeAdd for MBR:
> > > I did some simple benchmarking. Running pthread_once 50,000,000 on my
> > pretty
> > > slow single CPU machine takes about 2.1 seconds. Replacing
> > > InterlockedExchangeAdd with simple read brings it down to 0.6 seconds.
> > This
> > > looks significant.
> > 
> > Using the PTW32_INTERLOCKED_COMPARE_EXCHANGE macro as in your latest (in
> > CVS) version and building the library for inlined functions (nmake VC-
> > inlined) and x86 architecture causes customised versions of
> > InterlockedCompareExchange to be used, and this results in inlined asm.
> > Same for PTW32_INTERLOCKED_EXCHANGE.
> > 
> > Also, on single-CPU x86, the library dynamically switches to using
> > 'cmpxchg' rather than 'lock cmpxchg' to avoid locking the bus. This
> > appears to match what the kernel32.dll versions do. On non-x86
> > architectures the kernel32.dll versions are called, with call overhead.
> > 
> > PTW32_INTERLOCKED_EXCHANGE_ADD could be added, as could other
> > architectures. See ptw32_InterlockedCompareExchange.c
> 
> I have rerun my benchmark with VC-inline. The difference is now less
> significant 0.9 vs 0.6 but still noticeable. I guess cmpxchg even without
> locking is quite expensive. On multi-CPU systems the difference should be
> much higher due to the time it takes to lock the bus and to the contention
> it may cause. It sounded as if you did not care much to try to optimize it.
> I did not mean to suggest that we have to do it right now either. I just
> wanted to get your opinion on whether we want to deal with this in the
> future.

By all means include any optimisation you think is worthwhile. I was
just pointing out that the difference isn't necessarily 2.1 v 0.6.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: New pthread_once implementation
  2005-05-28 13:54       ` Ross Johnson
@ 2005-05-28 14:29         ` Vladimir Kliatchko
  2005-05-29 12:58           ` Vladimir Kliatchko
  2005-05-30 14:48           ` Ross Johnson
  0 siblings, 2 replies; 10+ messages in thread
From: Vladimir Kliatchko @ 2005-05-28 14:29 UTC (permalink / raw)
  To: 'Ross Johnson'; +Cc: 'Gottlob Frege', pthreads-win32

[-- Attachment #1: Type: text/plain, Size: 4787 bytes --]

What do you think of the attached implementation? I am still analyzing it,
but it passes the tests and appears to be free of that problem. It does have
one minor glitch, though: if two threads come in, the semaphore is created.
If both are cancelled and no new calls are made to finish the job, the
semaphore is never destroyed. I am not sure how big a deal this is.

Re. optimizations: Great, I will try to do something.

Thnx,
--vlad 

> -----Original Message-----
> From: pthreads-win32-owner@sources.redhat.com [mailto:pthreads-win32-
> owner@sources.redhat.com] On Behalf Of Ross Johnson
> Sent: Saturday, May 28, 2005 9:55 AM
> To: Vladimir Kliatchko
> Cc: 'Gottlob Frege'; Pthreads-Win32 list
> Subject: RE: New pthread_once implementation
> 
> On Sat, 2005-05-28 at 06:51 -0400, Vladimir Kliatchko wrote:
> >
> > > -----Original Message-----
> > > From: pthreads-win32-owner@sources.redhat.com [mailto:pthreads-win32-
> > > owner@sources.redhat.com] On Behalf Of Ross Johnson
> > > Sent: Friday, May 27, 2005 11:48 PM
> > > To: Vladimir Kliatchko
> > > Cc: 'Gottlob Frege'; Pthreads-Win32 list
> > > Subject: RE: New pthread_once implementation
> > >
> > > On Fri, 2005-05-27 at 21:30 -0400, Vladimir Kliatchko wrote:
> > > > Nice catch. Let me see if I can fix it.
> > > >
> > > > Note that the same problem exists in the currently released event-
> based
> > > > implementation (cvs version 1.16):
> > > >
> > > > thread1 comes in, start initing
> > > > thread2 creates event, starts waiting
> > > > thread3 comes in starts waiting
> > > > thread1 is cancelled, signals event
> > > > thread2 wakes up, proceeds to the point right before the resetEvent
> > > > thread3 wakes up, closes event handle
> > > > thread2 resets closed handle
> > >
> > > Relies on HANDLE uniqueness and assumes that an error will result.
> This
> > > is why the 2.6.0 version (and earlier) checks the return code and
> > > restores Win32 LastError if necessary - for GetLastError transparency.
> >
> > Does Windows guarantee that the handles are not reused? What happens if
> a
> > thread closes a handle while another thread is blocked on it? Is any of
> this
> > in Microsoft documentation? Consider the following scenario for the
> > event-based implementation:
> 
> Well, apparently they're not unique when recycled, so there is a bug
> here to fix in both versions:
> http://msdn.microsoft.com/library/default.asp?url=/library/en-
> us/dngenlib/html/msdn_handles1.asp
> [Under "Native Windows NT Objects"]
> "Unlike the handles that are maintained by the Win32 USER and GDI
> subsystem components, handles to native objects under Windows NT are not
> unique; that is, upon destruction of an object, the corresponding handle
> may be recycled and will look exactly like the handle to the destroyed
> object."
> 
> But they are local to the process, rather than system wide if that
> helps.
> 
> > > > Also, regarding my previous comment to Ross about very high cost of
> > > using
> > > > InterlockedExchangeAdd for MBR:
> > > > I did some simple benchmarking. Running pthread_once 50,000,000 on
> my
> > > pretty
> > > > slow single CPU machine takes about 2.1 seconds. Replacing
> > > > InterlockedExchangeAdd with simple read brings it down to 0.6
> seconds.
> > > This
> > > > looks significant.
> > >
> > > Using the PTW32_INTERLOCKED_COMPARE_EXCHANGE macro as in your latest
> (in
> > > CVS) version and building the library for inlined functions (nmake VC-
> > > inlined) and x86 architecture causes customised versions of
> > > InterlockedCompareExchange to be used, and this results in inlined
> asm.
> > > Same for PTW32_INTERLOCKED_EXCHANGE.
> > >
> > > Also, on single-CPU x86, the library dynamically switches to using
> > > 'cmpxchg' rather than 'lock cmpxchg' to avoid locking the bus. This
> > > appears to match what the kernel32.dll versions do. On non-x86
> > > architectures the kernel32.dll versions are called, with call
> overhead.
> > >
> > > PTW32_INTERLOCKED_EXCHANGE_ADD could be added, as could other
> > > architectures. See ptw32_InterlockedCompareExchange.c
> >
> > I have rerun my benchmark with VC-inline. The difference is now less
> > significant 0.9 vs 0.6 but still noticeable. I guess cmpxchg even
> without
> > locking is quite expensive. On multi-CPU systems the difference should
> be
> > much higher due to the time it takes to lock the bus and to the
> contention
> > it may cause. It sounded as if you did not care much to try to optimize
> it.
> > I did not mean to suggest that we have to do it right now either. I just
> > wanted to get your opinion on whether we want to deal with this in the
> > future.
> 
> By all means include any optimisation you think is worthwhile. I was
> just pointing out that the difference isn't necessarily 2.1 v 0.6.
> 


[-- Attachment #2: vk_pthread_once4.c --]
[-- Type: text/plain, Size: 3358 bytes --]

#define PTHREAD_ONCE_INIT       {0, 0, 0, 0}

enum ptw32_once_state {
  PTW32_ONCE_INIT      = 0x0,
  PTW32_ONCE_STARTED   = 0x1,
  PTW32_ONCE_DONE      = 0x2
};

struct pthread_once_t_
{
  int          state;
  int          reserved;
  int          numSemaphoreUsers;
  HANDLE       semaphore;
};

static void PTW32_CDECL
ptw32_once_on_init_cancel (void * arg)
{
  pthread_once_t * once_control = (pthread_once_t *) arg;

  (void) PTW32_INTERLOCKED_EXCHANGE((LPLONG)&once_control->state, 
				    (LONG)PTW32_ONCE_INIT);

  if (InterlockedExchangeAdd((LPLONG)&once_control->semaphore, 
			     0L)) /* MBR fence */
    {
      ReleaseSemaphore(once_control->semaphore, 1, NULL);
    }
}

static int
ptw32_once (pthread_once_t * once_control, void (*init_routine) (void))
{
  int state;
  HANDLE sema;

  while ((state = PTW32_INTERLOCKED_COMPARE_EXCHANGE(
		     (PTW32_INTERLOCKED_LPLONG)&once_control->state,
		     (PTW32_INTERLOCKED_LONG)PTW32_ONCE_STARTED,
		     (PTW32_INTERLOCKED_LONG)PTW32_ONCE_INIT)) != 
	                                                      PTW32_ONCE_DONE)
    {
      if (PTW32_ONCE_INIT == state)
	{

#ifdef _MSC_VER
#pragma inline_depth(0)
#endif

	  pthread_cleanup_push(ptw32_once_on_init_cancel, 
			       (void *) once_control);
	  (*init_routine)();
	  pthread_cleanup_pop(0);

#ifdef _MSC_VER
#pragma inline_depth()
#endif

	  (void) PTW32_INTERLOCKED_EXCHANGE((LPLONG)&once_control->state, 
					    (LONG)PTW32_ONCE_DONE);

	  /*
	   * we didn't create the semaphore.
	   * it is only there if there is someone waiting.
	   */
	  if (InterlockedExchangeAdd((LPLONG)&once_control->semaphore, 
				     0L)) /* MBR fence */
	    {
	      ReleaseSemaphore(once_control->semaphore, 
			       once_control->numSemaphoreUsers, NULL);
	    }
	}
      else
	{
	  InterlockedIncrement((LPLONG)&once_control->numSemaphoreUsers);
	  if (!InterlockedExchangeAdd((LPLONG)&once_control->semaphore, 
				      0L)) /* MBR fence */
	    {
	      sema = CreateSemaphore(NULL, 0, INT_MAX, NULL);
	      if (PTW32_INTERLOCKED_COMPARE_EXCHANGE(
		       (PTW32_INTERLOCKED_LPLONG)&once_control->semaphore,
		       (PTW32_INTERLOCKED_LONG)sema,
		       (PTW32_INTERLOCKED_LONG)0))
		{
		  CloseHandle(sema);
		}
	    }

	  /*
	   * Check 'state' again in case the initting thread has finished or
	   * cancelled and left before seeing that there was a semaphore.
	   */
	  if (InterlockedExchangeAdd((LPLONG)&once_control->state, 
				     0L) == PTW32_ONCE_STARTED)
	    {
	      WaitForSingleObject(once_control->semaphore, INFINITE);
	    }

	  InterlockedDecrement((LPLONG)&once_control->numSemaphoreUsers);
	}
    }

  if (0 == InterlockedExchangeAdd((LPLONG)&once_control->numSemaphoreUsers,
				  0)) /* MBR */
    {
      if ((sema = (HANDLE) PTW32_INTERLOCKED_EXCHANGE(
		      (LPLONG)&once_control->semaphore,
		      (LONG)0)))
	{
	  CloseHandle(sema);
	}
    }
  
  return 0;
}

int
pthread_once (pthread_once_t * once_control, void (*init_routine) (void))
{
  int result;

  if (once_control == NULL || init_routine == NULL)
    {
      result = EINVAL;
    }
  else
    {
      if (InterlockedExchangeAdd((LPLONG)&once_control->state, 
				 0L) != PTW32_ONCE_DONE) /* MBR */
	{
	  result = ptw32_once(once_control, init_routine);
	}
      else 
	{
	  result = 0;
	}
    }

  return (result);

}				/* pthread_once */


^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: New pthread_once implementation
  2005-05-28 14:29         ` Vladimir Kliatchko
@ 2005-05-29 12:58           ` Vladimir Kliatchko
  2005-05-30 14:48           ` Ross Johnson
  1 sibling, 0 replies; 10+ messages in thread
From: Vladimir Kliatchko @ 2005-05-29 12:58 UTC (permalink / raw)
  To: 'Ross Johnson'; +Cc: 'Gottlob Frege', pthreads-win32

[-- Attachment #1: Type: text/plain, Size: 6169 bytes --]

Hi,
I tried fixing that last problem (the semaphore leak when multiple threads
are cancelled and the initialization is never completed), but it appears the
reference-counting approach is inherently flawed. So I had to re-implement
the whole thing from scratch. This implementation uses MCS locks, which are
not only fast but also require no clean-up, thus avoiding the above problem.
The implementation of MCS locks, which I had to build, probably deserves a
module (.c and .h) of its own since we may want to use them in other places.
This implementation has passed all the tests (again on a single CPU only)
and I hope it is finally bug-free now.
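
For illustration, guarding any critical region with the attached API looks
like this (a sketch; 'my_lock' and 'critical_region' are placeholders):

/* The node lives on the acquiring thread's stack, so no shared state
 * ever needs cleaning up; that is the property exploited above. */
static ptw32_mcs_lock my_lock = 0;      /* zero-initialised == free */

void critical_region(void)
{
  ptw32_mcs_node node;                  /* one per acquisition, on the stack */

  ptw32_mcs_lock_aquire(&my_lock, &node);
  /* ... exclusive work here ... */
  ptw32_mcs_lock_release(&node);
}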

Pls, let me know what you think.
--vlad

PS.
Sorry it takes me so many iterations. I feel like I have been spamming you
with too many versions of this routine.

> -----Original Message-----
> From: pthreads-win32-owner@sources.redhat.com [mailto:pthreads-win32-
> owner@sources.redhat.com] On Behalf Of Vladimir Kliatchko
> Sent: Saturday, May 28, 2005 10:30 AM
> To: 'Ross Johnson'
> Cc: 'Gottlob Frege'; pthreads-win32@sources.redhat.com
> Subject: RE: New pthread_once implementation
> 
> What do you think of the attached implementation? I am still analyzing it,
> but it passes the tests and appears to be free of that problem. It does
> have
> one minor glitch though:
> If two threads come in, the semaphore is created. If both are cancelled
> and
> no new calls a made to finish the job, the semaphore is never destroyed.
> I am not sure how big a deal this is.
> 
> Re. optimizations: Great, I will try to do something.
> 
> Thnx,
> --vlad
> 
> > -----Original Message-----
> > From: pthreads-win32-owner@sources.redhat.com [mailto:pthreads-win32-
> > owner@sources.redhat.com] On Behalf Of Ross Johnson
> > Sent: Saturday, May 28, 2005 9:55 AM
> > To: Vladimir Kliatchko
> > Cc: 'Gottlob Frege'; Pthreads-Win32 list
> > Subject: RE: New pthread_once implementation
> >
> > On Sat, 2005-05-28 at 06:51 -0400, Vladimir Kliatchko wrote:
> > >
> > > > -----Original Message-----
> > > > From: pthreads-win32-owner@sources.redhat.com [mailto:pthreads-
> win32-
> > > > owner@sources.redhat.com] On Behalf Of Ross Johnson
> > > > Sent: Friday, May 27, 2005 11:48 PM
> > > > To: Vladimir Kliatchko
> > > > Cc: 'Gottlob Frege'; Pthreads-Win32 list
> > > > Subject: RE: New pthread_once implementation
> > > >
> > > > On Fri, 2005-05-27 at 21:30 -0400, Vladimir Kliatchko wrote:
> > > > > Nice catch. Let me see if I can fix it.
> > > > >
> > > > > Note that the same problem exists in the currently released event-
> > based
> > > > > implementation (cvs version 1.16):
> > > > >
> > > > > thread1 comes in, start initing
> > > > > thread2 creates event, starts waiting
> > > > > thread3 comes in starts waiting
> > > > > thread1 is cancelled, signals event
> > > > > thread2 wakes up, proceeds to the point right before the
> resetEvent
> > > > > thread3 wakes up, closes event handle
> > > > > thread2 resets closed handle
> > > >
> > > > Relies on HANDLE uniqueness and assumes that an error will result.
> > This
> > > > is why the 2.6.0 version (and earlier) checks the return code and
> > > > restores Win32 LastError if necessary - for GetLastError
> transparency.
> > >
> > > Does Windows guarantee that the handles are not reused? What happens
> if
> > a
> > > thread closes a handle while another thread is blocked on it? Is any
> of
> > this
> > > in Microsoft documentation? Consider the following scenario for the
> > > event-based implementation:
> >
> > Well, apparently they're not unique when recycled, so there is a bug
> > here to fix in both versions:
> > http://msdn.microsoft.com/library/default.asp?url=/library/en-
> > us/dngenlib/html/msdn_handles1.asp
> > [Under "Native Windows NT Objects"]
> > "Unlike the handles that are maintained by the Win32 USER and GDI
> > subsystem components, handles to native objects under Windows NT are not
> > unique; that is, upon destruction of an object, the corresponding handle
> > may be recycled and will look exactly like the handle to the destroyed
> > object."
> >
> > But they are local to the process, rather than system wide if that
> > helps.
> >
> > > > > Also, regarding my previous comment to Ross about very high cost
> of
> > > > using
> > > > > InterlockedExchangeAdd for MBR:
> > > > > I did some simple benchmarking. Running pthread_once 50,000,000 on
> > my
> > > > pretty
> > > > > slow single CPU machine takes about 2.1 seconds. Replacing
> > > > > InterlockedExchangeAdd with simple read brings it down to 0.6
> > seconds.
> > > > This
> > > > > looks significant.
> > > >
> > > > Using the PTW32_INTERLOCKED_COMPARE_EXCHANGE macro as in your latest
> > (in
> > > > CVS) version and building the library for inlined functions (nmake
> VC-
> > > > inlined) and x86 architecture causes customised versions of
> > > > InterlockedCompareExchange to be used, and this results in inlined
> > asm.
> > > > Same for PTW32_INTERLOCKED_EXCHANGE.
> > > >
> > > > Also, on single-CPU x86, the library dynamically switches to using
> > > > 'cmpxchg' rather than 'lock cmpxchg' to avoid locking the bus. This
> > > > appears to match what the kernel32.dll versions do. On non-x86
> > > > architectures the kernel32.dll versions are called, with call
> > overhead.
> > > >
> > > > PTW32_INTERLOCKED_EXCHANGE_ADD could be added, as could other
> > > > architectures. See ptw32_InterlockedCompareExchange.c
> > >
> > > I have rerun my benchmark with VC-inline. The difference is now less
> > > significant 0.9 vs 0.6 but still noticeable. I guess cmpxchg even
> > without
> > > locking is quite expensive. On multi-CPU systems the difference should
> > be
> > > much higher due to the time it takes to lock the bus and to the
> > contention
> > > it may cause. It sounded as if you did not care much to try to
> optimize
> > it.
> > > I did not mean to suggest that we have to do it right now either. I
> just
> > > wanted to get your opinion on whether we want to deal with this in the
> > > future.
> >
> > By all means include any optimisation you think is worthwhile. I was
> > just pointing out that the difference isn't necessarily 2.1 v 0.6.
> >


[-- Attachment #2: vk_pthread_once5.c --]
[-- Type: text/plain, Size: 4834 bytes --]

#define PTHREAD_ONCE_INIT       {PTW32_FALSE, 0, 0, 0}

struct pthread_once_t_
{
  int             done;
  void           *lock;
  int             reserved1;
  int             reserved2;
};

struct ptw32_mcs_node_
{
  struct ptw32_mcs_node_ **lock;        /* ptr to tail of queue */
  struct ptw32_mcs_node_  *next;        /* ptr to successor in queue */
  LONG                     readyFlag;   /* set after lock is released by
                                           predecessor */
  LONG                     nextFlag;    /* set after 'next' ptr is set by
                                           successor */
};

typedef struct ptw32_mcs_node_   ptw32_mcs_node;
typedef struct ptw32_mcs_node_  *ptw32_mcs_lock;

/*
 * ptw32_mcs_flag_set -- notify another thread about an event.
 * 
 * Set event if an event handle has been stored in the flag, and
 * set flag to -1 otherwise. Note that -1 cannot be a valid handle value.
 */
static void 
ptw32_mcs_flag_set(LONG *flag)
{
  HANDLE e = (HANDLE)PTW32_INTERLOCKED_COMPARE_EXCHANGE(
						(PTW32_INTERLOCKED_LPLONG)flag,
						(PTW32_INTERLOCKED_LONG)-1,
						(PTW32_INTERLOCKED_LONG)0);
  if (0 != e)
    {
      /* another thread has already stored an event handle in the flag */
      SetEvent(e);
    }
}

/*
 * ptw32_mcs_flag_wait -- wait for notification from another thread.
 * 
 * Store an event handle in the flag and wait on it if the flag has not been
 * set, and proceed without creating an event otherwise.
 */
static void 
ptw32_mcs_flag_wait(LONG *flag)
{
  if (0 == InterlockedExchangeAdd((LPLONG)flag, 0)) /* MBR fence */
    {
      /* the flag is not set. create event. */

      HANDLE e = CreateEvent(NULL, PTW32_FALSE, PTW32_FALSE, NULL);

      if (0 == PTW32_INTERLOCKED_COMPARE_EXCHANGE(
				       (PTW32_INTERLOCKED_LPLONG)flag,
				       (PTW32_INTERLOCKED_LONG)e,
				       (PTW32_INTERLOCKED_LONG)0))
	{
	  /* stored handle in the flag. wait on it now. */
	  WaitForSingleObject(e, INFINITE);
	}

      CloseHandle(e);
    }
}

/*
 * ptw32_mcs_lock_aquire -- acquire an MCS lock.
 * 
 * See: 
 * J. M. Mellor-Crummey and M. L. Scott.
 * Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors.
 * ACM Transactions on Computer Systems, 9(1):21-65, Feb. 1991.
 */
static void 
ptw32_mcs_lock_aquire(ptw32_mcs_lock  *lock, 
		      ptw32_mcs_node  *node)
{
  ptw32_mcs_node  *pred;
  
  node->lock = lock;
  node->nextFlag = 0;
  node->readyFlag = 0;
  node->next = 0; /* initially, no successor */
  
  /* queue for the lock */
  pred = (ptw32_mcs_node *)PTW32_INTERLOCKED_EXCHANGE((LPLONG)lock, 
						      (LONG)node);
  if (0 != pred)
    {
      /* the lock was not free. link behind predecessor. */
      pred->next = node;
      ptw32_mcs_flag_set(&pred->nextFlag);
      ptw32_mcs_flag_wait(&node->readyFlag);
    }
}

/*
 * ptw32_mcs_lock_release -- release an MCS lock.
 * 
 * See: 
 * J. M. Mellor-Crummey and M. L. Scott.
 * Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors.
 * ACM Transactions on Computer Systems, 9(1):21-65, Feb. 1991.
 */
static void 
ptw32_mcs_lock_release(ptw32_mcs_node  *node)
{
  ptw32_mcs_lock *lock = node->lock;
  ptw32_mcs_node *next = (ptw32_mcs_node *)
    InterlockedExchangeAdd((LPLONG)&node->next, 0); /* MBR fence */

  if (0 == next)
    {
      /* no known successor */

      if (node == (ptw32_mcs_node *)PTW32_INTERLOCKED_COMPARE_EXCHANGE(
				       (PTW32_INTERLOCKED_LPLONG)lock,
				       (PTW32_INTERLOCKED_LONG)0,
				       (PTW32_INTERLOCKED_LONG)node))
	{
	  /* no successor, lock is free now */
	  return;
	}
  
      /* wait for successor */
      ptw32_mcs_flag_wait(&node->nextFlag);
      next = (ptw32_mcs_node *)InterlockedExchangeAdd((LPLONG)&node->next, 0);
                                                               /* MBR fence */
    }

  /* pass the lock */
  ptw32_mcs_flag_set(&next->readyFlag);
}

static void PTW32_CDECL
ptw32_once_on_init_cancel (void * arg)
{
  /* when the initting thread is cancelled we have to release the lock */
  ptw32_mcs_node *node = (ptw32_mcs_node*)arg;
  ptw32_mcs_lock_release(node);
}

int
pthread_once (pthread_once_t * once_control, void (*init_routine) (void))
{
  if (once_control == NULL || init_routine == NULL)
    {
      return EINVAL;
    }
  
  if (!InterlockedExchangeAdd((LPLONG)&once_control->done, 0)) /* MBR fence */
    {
      ptw32_mcs_node node;
      ptw32_mcs_lock_aquire((ptw32_mcs_lock *)&once_control->lock, &node);

      if (!once_control->done) 
	{

#ifdef _MSC_VER
#pragma inline_depth(0)
#endif

	  pthread_cleanup_push(ptw32_once_on_init_cancel, (void *)&node);
	  (*init_routine)();
	  pthread_cleanup_pop(0);

#ifdef _MSC_VER
#pragma inline_depth()
#endif

	  once_control->done = PTW32_TRUE;
	}

      ptw32_mcs_lock_release(&node);
    }

  return 0;

}				/* pthread_once */

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: New pthread_once implementation
  2005-05-28 14:29         ` Vladimir Kliatchko
  2005-05-29 12:58           ` Vladimir Kliatchko
@ 2005-05-30 14:48           ` Ross Johnson
  2005-05-30 15:26             ` Vladimir Kliatchko
  1 sibling, 1 reply; 10+ messages in thread
From: Ross Johnson @ 2005-05-30 14:48 UTC (permalink / raw)
  To: Vladimir Kliatchko; +Cc: Gottlob Frege, Pthreads-Win32 list

Hi Vlad,

The nice thing about your implementation using semaphores was that: even
though you could release just one waiter on cancellation, all waiting
threads could be released in one call to the kernel when exiting
normally. In your MCS version, the dequeueing involves sequential calls
to SetEvent, which could be much slower in comparison. That's my only
concern with it. The threat of an async cancellation leaving waiters
stranded was a concern at one point, but none of the previous
implementations of this routine has been safe against it either.

Still pondering your previous version (and not yet convinced that it's
fatally flawed), I've tried another variation.

In this variation, the cancellation handler doesn't reset state to INIT,
but to a new state == CANCELLED so that any newly arriving threads plus
the awoken waiter are prevented from becoming the new initter until
state can be reset to INIT in the wait path [by one of those threads]
when semaphore is guaranteed to be valid. I think this removes any races
between semaphore closure and operations.

[NB. in the test before the WaitForSingleObject call, the == is now >=]

This variation passes repeated runs of once4.c (aggressive cancellation
with varying priority threads hitting the once_control) on my uni-
processor. I also went as far as adding Sleep(1); after every semicolon
and left-curly brace to try to break it.

PS. I'm also perhaps too conscious of 'spamming' the list with endless
versions of this stubborn little routine, but this is the purpose of the
list, so I'm not personally going to worry about it. I'm sure anyone who
finds it irritating will filter it or something.


#define PTHREAD_ONCE_INIT       {0, 0, 0, 0}

enum ptw32_once_state {
  PTW32_ONCE_INIT      = 0x0,
  PTW32_ONCE_DONE      = 0x1,
  PTW32_ONCE_STARTED   = 0x2,
  PTW32_ONCE_CANCELLED = 0x3
};

struct pthread_once_t_
{
  int          state;
  int          reserved;
  int          numSemaphoreUsers;
  HANDLE       semaphore;
};

static void PTW32_CDECL
ptw32_once_init_routine_cleanup(void * arg)
{
  pthread_once_t * once_control = (pthread_once_t *) arg;

  /*
   * Continue to direct new threads into the wait path until the waiter that we
   * release or a new thread can reset state to INIT.
   */
  (void) PTW32_INTERLOCKED_EXCHANGE((LPLONG)&once_control->state, (LONG)PTW32_ONCE_CANCELLED);

  if (InterlockedExchangeAdd((LPLONG)&once_control->semaphore, 0L)) /* MBR fence */
    {
      ReleaseSemaphore(once_control->semaphore, 1, NULL);
    }
}

int
pthread_once (pthread_once_t * once_control, void (*init_routine) (void))
{
  int result;
  int state;
  HANDLE sema;

  if (once_control == NULL || init_routine == NULL)
    {
      result = EINVAL;
      goto FAIL0;
    }
  else
    {
      result = 0;
    }

  while ((state = (int)
	  PTW32_INTERLOCKED_COMPARE_EXCHANGE((PTW32_INTERLOCKED_LPLONG)&once_control->state,
					     (PTW32_INTERLOCKED_LONG)PTW32_ONCE_STARTED,
					     (PTW32_INTERLOCKED_LONG)PTW32_ONCE_INIT))
	 != PTW32_ONCE_DONE)
    {
      if (PTW32_ONCE_INIT == state)
        {

#ifdef _MSC_VER
#pragma inline_depth(0)
#endif

          pthread_cleanup_push(ptw32_once_init_routine_cleanup, (void *) once_control);
          (*init_routine)();
          pthread_cleanup_pop(0);

#ifdef _MSC_VER
#pragma inline_depth()
#endif

          (void) PTW32_INTERLOCKED_EXCHANGE((LPLONG)&once_control->state, 
                                            (LONG)PTW32_ONCE_DONE);

          /*
           * we didn't create the semaphore.
           * it is only there if there is someone waiting.
           */
          if (InterlockedExchangeAdd((LPLONG)&once_control->semaphore, 0L)) /* MBR fence */
            {
              ReleaseSemaphore(once_control->semaphore, 
                               once_control->numSemaphoreUsers, NULL);
            }
        }
      else
        {
          if (1 == InterlockedIncrement((LPLONG)&once_control->numSemaphoreUsers))
            {
              sema = CreateSemaphore(NULL, 0, INT_MAX, NULL);

              if (PTW32_INTERLOCKED_COMPARE_EXCHANGE((PTW32_INTERLOCKED_LPLONG)&once_control->semaphore,
						     (PTW32_INTERLOCKED_LONG)sema,
						     (PTW32_INTERLOCKED_LONG)0))
                {
                  CloseHandle(sema);
                }
            }

	  /*
	   * If initter was cancelled then state is CANCELLED.
	   * Until state is reset to INIT, all new threads will enter the wait path.
	   * The woken waiter, if it exists, will also re-enter the wait path, but
	   * either it or a new thread will reset state = INIT here, continue around the Wait,
	   * and become the new initter. Any thread that is suspended in the wait path before
	   * this point will hit this check. Any thread suspended between this check and
	   * the Wait will wait on a valid semaphore, and possibly continue through it
	   * if the cancellation handler has incremented (released) it and there were
	   * no waiters.
	   */
	  (void) PTW32_INTERLOCKED_COMPARE_EXCHANGE((PTW32_INTERLOCKED_LPLONG)&once_control->state,
						    (PTW32_INTERLOCKED_LONG)PTW32_ONCE_INIT,
						    (PTW32_INTERLOCKED_LONG)PTW32_ONCE_CANCELLED);

	  /*
	   * Check 'state' again in case the initting thread has finished
	   * and left before seeing that there was a semaphore.
	   */
          if (InterlockedExchangeAdd((LPLONG)&once_control->state, 0L) >= PTW32_ONCE_STARTED)
            {
              WaitForSingleObject(once_control->semaphore, INFINITE);
            }

          if (0 == InterlockedDecrement((LPLONG)&once_control->numSemaphoreUsers))
            {
              /* we were last */
              if ((sema =
		   (HANDLE) PTW32_INTERLOCKED_EXCHANGE((LPLONG)&once_control->semaphore, (LONG)0)))
                {
                  CloseHandle(sema);
                }
            }
        }
    }

  /*
   * ------------
   * Failure Code
   * ------------
   */
FAIL0:
  return (result);
}                               /* pthread_once */


^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: New pthread_once implementation
  2005-05-30 14:48           ` Ross Johnson
@ 2005-05-30 15:26             ` Vladimir Kliatchko
  2005-05-31 16:28               ` Gottlob Frege
  0 siblings, 1 reply; 10+ messages in thread
From: Vladimir Kliatchko @ 2005-05-30 15:26 UTC (permalink / raw)
  To: 'Gottlob Frege', pthreads-win32

T1 comes in, proceeds to [0064]
T2 comes in, proceeds to [0119]
T1 is cancelled
T3 comes in, loops around, resets the state, proceeds to [0064]
T2 wakes up, proceeds to right before [0125]
T4 comes in and proceeds to right before [0119]
T3 proceeds to right before [0080]
T2,T3,T4 are ready to race for CloseHandle, ReleaseSemaphore, and
WaitForSingleObject respectively

Regarding the MCS version:
It can be quite expensive, but only when multiple threads call pthread_once
simultaneously. Also, the overhead is proportional to the number of threads,
so that:
1 thread - no overhead
2 simultaneous threads - the same overhead as in the semaphore-based version
3 or more simultaneous threads - higher overhead - but is this an important
case?

[0001]  #define PTHREAD_ONCE_INIT       {0, 0, 0, 0}
[0002]  
[0003]  enum ptw32_once_state {
[0004]    PTW32_ONCE_INIT      = 0x0,
[0005]    PTW32_ONCE_DONE      = 0x1,
[0006]    PTW32_ONCE_STARTED   = 0x2,
[0007]    PTW32_ONCE_CANCELLED = 0x3
[0008]  };
[0009]  
[0010]  struct pthread_once_t_
[0011]  {
[0012]    int          state;
[0013]    int          reserved;
[0014]    int          numSemaphoreUsers;
[0015]    HANDLE       semaphore;
[0016]  };
[0017]  
[0018]  static void PTW32_CDECL
[0019]  ptw32_once_init_routine_cleanup(void * arg) {
[0020]    pthread_once_t * once_control = (pthread_once_t *) arg;
[0021]  
[0022]    /*
[0023]     * Continue to direct new threads into the wait path until the waiter that we
[0024]     * release or a new thread can reset state to INIT.
[0025]     */
[0026]    (void) PTW32_INTERLOCKED_EXCHANGE((LPLONG)&once_control->state, (LONG)PTW32_ONCE_CANCELLED);
[0027]  
[0028]    if (InterlockedExchangeAdd((LPLONG)&once_control->semaphore, 0L)) /* MBR fence */
[0029]      {
[0030]        ReleaseSemaphore(once_control->semaphore, 1, NULL);
[0031]      }
[0032]  }
[0033]  
[0034]  int
[0035]  pthread_once (pthread_once_t * once_control, void (*init_routine) (void)) {
[0036]    int result;
[0037]    int state;
[0038]    HANDLE sema;
[0039]  
[0040]    if (once_control == NULL || init_routine == NULL)
[0041]      {
[0042]        result = EINVAL;
[0043]        goto FAIL0;
[0044]      }
[0045]    else
[0046]      {
[0047]        result = 0;
[0048]      }
[0049]  
[0050]    while ((state = (int)
[0051]            PTW32_INTERLOCKED_COMPARE_EXCHANGE((PTW32_INTERLOCKED_LPLONG)&once_control->state,
[0052]                                               (PTW32_INTERLOCKED_LONG)PTW32_ONCE_STARTED,
[0053]                                               (PTW32_INTERLOCKED_LONG)PTW32_ONCE_INIT))
[0054]           != PTW32_ONCE_DONE)
[0055]      {
[0056]        if (PTW32_ONCE_INIT == state)
[0057]          {
[0058]  
[0059]  #ifdef _MSC_VER
[0060]  #pragma inline_depth(0)
[0061]  #endif
[0062]  
[0063]            pthread_cleanup_push(ptw32_once_init_routine_cleanup, (void *) once_control);
[0064]            (*init_routine)();
[0065]            pthread_cleanup_pop(0);
[0066]  
[0067]  #ifdef _MSC_VER
[0068]  #pragma inline_depth()
[0069]  #endif
[0070]  
[0071]            (void) PTW32_INTERLOCKED_EXCHANGE((LPLONG)&once_control->state,
[0072]                                              (LONG)PTW32_ONCE_DONE);
[0073]  
[0074]            /*
[0075]             * we didn't create the semaphore.
[0076]             * it is only there if there is someone waiting.
[0077]             */
[0078]            if (InterlockedExchangeAdd((LPLONG)&once_control->semaphore, 0L)) /* MBR fence */
[0079]              {
[0080]                ReleaseSemaphore(once_control->semaphore,
[0081]                                 once_control->numSemaphoreUsers, NULL);
[0082]              }
[0083]          }
[0084]        else
[0085]          {
[0086]            if (1 == InterlockedIncrement((LPLONG)&once_control->numSemaphoreUsers))
[0087]              {
[0088]                sema = CreateSemaphore(NULL, 0, INT_MAX, NULL);
[0089]  
[0090]                if (PTW32_INTERLOCKED_COMPARE_EXCHANGE((PTW32_INTERLOCKED_LPLONG)&once_control->semaphore,
[0091]                                                       (PTW32_INTERLOCKED_LONG)sema,
[0092]                                                       (PTW32_INTERLOCKED_LONG)0))
[0093]                  {
[0094]                    CloseHandle(sema);
[0095]                  }
[0096]              }
[0097]  
[0098]            /*
[0099]             * If initter was cancelled then state is CANCELLED.
[0100]             * Until state is reset to INIT, all new threads will enter the wait path.
[0101]             * The woken waiter, if it exists, will also re-enter the wait path, but
[0102]             * either it or a new thread will reset state = INIT here, continue around the Wait,
[0103]             * and become the new initter. Any thread that is suspended in the wait path before
[0104]             * this point will hit this check. Any thread suspended between this check and
[0105]             * the Wait will wait on a valid semaphore, and possibly continue through it
[0106]             * if the cancellation handler has incremented (released) it and there were
[0107]             * no waiters.
[0108]             */
[0109]            (void) PTW32_INTERLOCKED_COMPARE_EXCHANGE((PTW32_INTERLOCKED_LPLONG)&once_control->state,
[0110]                                                      (PTW32_INTERLOCKED_LONG)PTW32_ONCE_INIT,
[0111]                                                      (PTW32_INTERLOCKED_LONG)PTW32_ONCE_CANCELLED);
[0112]  
[0113]            /*
[0114]             * Check 'state' again in case the initting thread has finished
[0115]             * and left before seeing that there was a semaphore.
[0116]             */
[0117]            if (InterlockedExchangeAdd((LPLONG)&once_control->state, 0L) >= PTW32_ONCE_STARTED)
[0118]              {
[0119]                WaitForSingleObject(once_control->semaphore, INFINITE);
[0120]              }
[0121]  
[0122]            if (0 == InterlockedDecrement((LPLONG)&once_control->numSemaphoreUsers))
[0123]              {
[0124]                /* we were last */
[0125]                if ((sema =
[0126]                     (HANDLE) PTW32_INTERLOCKED_EXCHANGE((LPLONG)&once_control->semaphore, (LONG)0)))
[0127]                  {
[0128]                    CloseHandle(sema);
[0129]                  }
[0130]              }
[0131]          }
[0132]      }
[0133]  
[0134]    /*
[0135]     * ------------
[0136]     * Failure Code
[0137]     * ------------
[0138]     */
[0139]  FAIL0:
[0140]    return (result);
[0141]  }                               /* pthread_once */
[0142]  

> -----Original Message-----
> From: Ross Johnson [mailto:ross.johnson@homemail.com.au]
> Sent: Monday, May 30, 2005 5:56 AM
> To: Vladimir Kliatchko
> Cc: Gottlob Frege; Pthreads-Win32 list
> Subject: RE: New pthread_once implementation
> 
> Hi Vlad,
> 
> The nice thing about your implementation using semaphores was that: even
> though you could release just one waiter on cancellation, all waiting
> threads could be released in one call to the kernel when exiting
> normally. In your MCS version, the dequeueing involves sequential calls
> to SetEvent, which could be much slower in comparison. That's my only
> concern with it. The threat of an async cancellation leaving waiters
> stranded was a concern at one point, but none of the previous
> implementations of this routine has been safe against it either.
> 
> Still pondering your previous version (and not yet convinced that it's
> fatally flawed), I've tried another variation.
> 
> In this variation, the cancellation handler doesn't reset state to INIT,
> but to a new state == CANCELLED so that any newly arriving threads plus
> the awoken waiter are prevented from becoming the new initter until
> state can be reset to INIT in the wait path [by one of those threads]
> when semaphore is guaranteed to be valid. I think this removes any races
> between semaphore closure and operations.
> 
> [NB. in the test before the WaitForSingleObject call, the == is now >=]
> 
> This variation passes repeated runs of once4.c (aggressive cancellation
> with varying priority threads hitting the once_control) on my uni-
> processor. I also went as far as adding Sleep(1); after every semicolon
> and left-curly brace to try to break it.
> 
> PS. I'm also perhaps too conscious of 'spamming' the list with endless
> versions of this stubborn little routine, but this is the purpose of the
> list, so I'm not personally going to worry about it. I'm sure anyone who
> finds it irritating will filter it or something.
> 
> 
> #define PTHREAD_ONCE_INIT       {0, 0, 0, 0}
> 
> enum ptw32_once_state {
>   PTW32_ONCE_INIT      = 0x0,
>   PTW32_ONCE_DONE      = 0x1,
>   PTW32_ONCE_STARTED   = 0x2,
>   PTW32_ONCE_CANCELLED = 0x3
> };
> 
> struct pthread_once_t_
> {
>   int          state;
>   int          reserved;
>   int          numSemaphoreUsers;
>   HANDLE       semaphore;
> };
> 
> static void PTW32_CDECL
> ptw32_once_init_routine_cleanup(void * arg)
> {
>   pthread_once_t * once_control = (pthread_once_t *) arg;
> 
>   /*
>    * Continue to direct new threads into the wait path until the waiter
>    * that we release or a new thread can reset state to INIT.
>    */
>   (void) PTW32_INTERLOCKED_EXCHANGE((LPLONG)&once_control->state,
>                                     (LONG)PTW32_ONCE_CANCELLED);
> 
>   if (InterlockedExchangeAdd((LPLONG)&once_control->semaphore, 0L)) /* MBR fence */
>     {
>       ReleaseSemaphore(once_control->semaphore, 1, NULL);
>     }
> }
> 
> int
> pthread_once (pthread_once_t * once_control, void (*init_routine) (void))
> {
>   int result;
>   int state;
>   HANDLE sema;
> 
>   if (once_control == NULL || init_routine == NULL)
>     {
>       result = EINVAL;
>       goto FAIL0;
>     }
>   else
>     {
>       result = 0;
>     }
> 
>   while ((state = (int)
>           PTW32_INTERLOCKED_COMPARE_EXCHANGE((PTW32_INTERLOCKED_LPLONG)&once_control->state,
>                                              (PTW32_INTERLOCKED_LONG)PTW32_ONCE_STARTED,
>                                              (PTW32_INTERLOCKED_LONG)PTW32_ONCE_INIT))
>          != PTW32_ONCE_DONE)
>     {
>       if (PTW32_ONCE_INIT == state)
>         {
> 
> #ifdef _MSC_VER
> #pragma inline_depth(0)
> #endif
> 
>           pthread_cleanup_push(ptw32_once_init_routine_cleanup, (void *) once_control);
>           (*init_routine)();
>           pthread_cleanup_pop(0);
> 
> #ifdef _MSC_VER
> #pragma inline_depth()
> #endif
> 
>           (void) PTW32_INTERLOCKED_EXCHANGE((LPLONG)&once_control->state,
>                                             (LONG)PTW32_ONCE_DONE);
> 
>           /*
>            * we didn't create the semaphore.
>            * it is only there if there is someone waiting.
>            */
>           if (InterlockedExchangeAdd((LPLONG)&once_control->semaphore, 0L)) /* MBR fence */
>             {
>               ReleaseSemaphore(once_control->semaphore,
>                                once_control->numSemaphoreUsers, NULL);
>             }
>         }
>       else
>         {
>           if (1 == InterlockedIncrement((LPLONG)&once_control->numSemaphoreUsers))
>             {
>               sema = CreateSemaphore(NULL, 0, INT_MAX, NULL);
> 
>               if (PTW32_INTERLOCKED_COMPARE_EXCHANGE((PTW32_INTERLOCKED_LPLONG)&once_control->semaphore,
>                                                      (PTW32_INTERLOCKED_LONG)sema,
>                                                      (PTW32_INTERLOCKED_LONG)0))
>                 {
>                   CloseHandle(sema);
>                 }
>             }
> 
>           /*
>            * If initter was cancelled then state is CANCELLED.
>            * Until state is reset to INIT, all new threads will enter the wait path.
>            * The woken waiter, if it exists, will also re-enter the wait path, but
>            * either it or a new thread will reset state = INIT here, continue around the Wait,
>            * and become the new initter. Any thread that is suspended in the wait path before
>            * this point will hit this check. Any thread suspended between this check and
>            * the Wait will wait on a valid semaphore, and possibly continue through it
>            * if the cancellation handler has incremented (released) it and there were
>            * no waiters.
>            */
>           (void) PTW32_INTERLOCKED_COMPARE_EXCHANGE((PTW32_INTERLOCKED_LPLONG)&once_control->state,
>                                                     (PTW32_INTERLOCKED_LONG)PTW32_ONCE_INIT,
>                                                     (PTW32_INTERLOCKED_LONG)PTW32_ONCE_CANCELLED);
> 
>           /*
>            * Check 'state' again in case the initting thread has finished
>            * and left before seeing that there was a semaphore.
>            */
>           if (InterlockedExchangeAdd((LPLONG)&once_control->state, 0L) >= PTW32_ONCE_STARTED)
>             {
>               WaitForSingleObject(once_control->semaphore, INFINITE);
>             }
> 
>           if (0 == InterlockedDecrement((LPLONG)&once_control->numSemaphoreUsers))
>             {
>               /* we were last */
>               if ((sema =
>                    (HANDLE) PTW32_INTERLOCKED_EXCHANGE((LPLONG)&once_control->semaphore, (LONG)0)))
>                 {
>                   CloseHandle(sema);
>                 }
>             }
>         }
>     }
> 
>   /*
>    * ------------
>    * Failure Code
>    * ------------
>    */
> FAIL0:
>   return (result);
> }                               /* pthread_once */
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: New pthread_once implementation
  2005-05-30 15:26             ` Vladimir Kliatchko
@ 2005-05-31 16:28               ` Gottlob Frege
  2005-06-01  3:02                 ` Ross Johnson
  0 siblings, 1 reply; 10+ messages in thread
From: Gottlob Frege @ 2005-05-31 16:28 UTC (permalink / raw)
  To: Vladimir Kliatchko; +Cc: pthreads-win32

On 5/30/05, Vladimir Kliatchko <vladimir@kliatchko.com> wrote:
> 
> Regaring MCS version:
> It can be quite expensing, but only when multiple threads call pthread_once
> simultaneously. Also, the overhead is proportional to the number of threads
> so that:
> 1 thread - no overhead
> 2 simultaneous threads - the same overhead as in semaphore based version
> 3 or more simultaneous threads - higher overhead - but is this an important
> case?
> 

I think that might be reasonable.

Here's another tack we could try:

In the case of cancel, release the semaphore BUT do NOT reset the
control state.  In the waiting case, after waiting, check if state ==
done.  If not done, we know something went wrong - fall back to using
a named mutex.  Note that ALL new threads coming in will go to the
wait case and then into the named-mutex code - after a cancel, no threads
try to call init in the normal way.

Does that make sense?
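
Roughly, the fallback might look like this (a sketch only; the mutex name,
the helper, and the 'done' flag are placeholders, not a worked-out patch):

/* Hypothetical named-mutex fallback, entered only after a cancel has
 * been detected; a real version would derive a unique mutex name from
 * the once_control address and the process id. */
static void ptw32_once_fallback(pthread_once_t *once_control,
                                void (*init_routine)(void))
{
  HANDLE m = CreateMutex(NULL, FALSE, TEXT("Local\\ptw32_once_fallback"));

  WaitForSingleObject(m, INFINITE);
  if (!once_control->done)              /* re-check under the mutex */
    {
      (*init_routine)();
      once_control->done = PTW32_TRUE;
    }
  ReleaseMutex(m);
  CloseHandle(m);
}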

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: New pthread_once implementation
  2005-05-31 16:28               ` Gottlob Frege
@ 2005-06-01  3:02                 ` Ross Johnson
  0 siblings, 0 replies; 10+ messages in thread
From: Ross Johnson @ 2005-06-01  3:02 UTC (permalink / raw)
  To: Gottlob Frege; +Cc: Vladimir Kliatchko, Pthreads-Win32 list

On Tue, 2005-05-31 at 12:28 -0400, Gottlob Frege wrote:
> I think that might be reasonable.
> 
> Here's another tact we could try:
> 
> In the case of cancel, release the semaphore BUT do NOT reset the
> control state.  In the waiting case, after waiting, check if state ==
> done.  If not done, we know something went wrong - fall back to using
> a named mutex.  Note that ALL new threads coming in will go to the
> wait case then into the namedmutex code - after a cancel, no threads
> try to call init in the normal way.
> 
> Does that make sense?

The cancellation handler must change 'state' before checking for the
semaphore, otherwise there's no possibility of managing the race between
the cancelled thread and a thread that has just entered the wait path
but not yet created the semaphore. That waiter will be stranded until a
new thread comes by.

The form of pthread_once with the MCS queue-based lock is the same as
that proposed by Alexander. Vlad's MCS lock implementation is the
missing element: it is light-weight in the uncontended case; it is
efficient otherwise (including attempting to minimise cache-coherence
operations on MP systems); and it can be incorporated without changing
the library's ABI. I think it deserves inclusion in the next release.

If there are refinements that can be made to pthread_once later then
they can be considered, but in the meantime we will have a fully
working, robust, and efficient pthread_once.

Ross


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2005-06-01  3:02 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <97ffb3105052709425ce1126a@mail.gmail.com>
2005-05-28  1:30 ` New pthread_once implementation Vladimir Kliatchko
2005-05-28  3:46   ` Ross Johnson
2005-05-28 10:51     ` Vladimir Kliatchko
2005-05-28 13:54       ` Ross Johnson
2005-05-28 14:29         ` Vladimir Kliatchko
2005-05-29 12:58           ` Vladimir Kliatchko
2005-05-30 14:48           ` Ross Johnson
2005-05-30 15:26             ` Vladimir Kliatchko
2005-05-31 16:28               ` Gottlob Frege
2005-06-01  3:02                 ` Ross Johnson
