From: Vladimir Kliatchko
Subject: RE: New pthread_once implementation
Date: Sun, 29 May 2005 12:58:00 -0000
To: 'Ross Johnson'
Cc: 'Gottlob Frege', pthreads-win32@sources.redhat.com
In-reply-to: <0IH700C30EV237@mta8.srv.hcvlny.cv.net>

Hi,

I tried fixing that last problem (the semaphore leak when multiple threads
are cancelled and the initialization is never completed), but it appears
that the reference-counting approach is inherently flawed, so I had to
re-implement the whole thing from scratch.

This implementation uses MCS locks, which are not only fast but also
require no clean-up, thus avoiding the problem above. The implementation
of MCS locks, which I had to build, probably deserves a module (.c and .h)
of its own, since we may want to use them in other places; a sketch of
what that interface might look like follows below.
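A minimal sketch of such a header, reusing the type and function names from
the attached code (the file name and include guard here are only
placeholders, and the functions would become extern instead of static):

/* ptw32_mcs_lock.h -- hypothetical stand-alone MCS lock module.
 * The types mirror the attached vk_pthread_once5.c. */
#ifndef PTW32_MCS_LOCK_H
#define PTW32_MCS_LOCK_H

#include <windows.h>

struct ptw32_mcs_node_
{
  struct ptw32_mcs_node_ **lock; /* ptr to tail of queue */
  struct ptw32_mcs_node_ *next;  /* ptr to successor in queue */
  LONG readyFlag;                /* set after lock is released by predecessor */
  LONG nextFlag;                 /* set after 'next' ptr is set by successor */
};

typedef struct ptw32_mcs_node_ ptw32_mcs_node;
typedef struct ptw32_mcs_node_ *ptw32_mcs_lock;

/* Acquire the lock. 'node' is caller-provided (typically stack) storage
 * that must remain valid until the matching release. */
void ptw32_mcs_lock_acquire (ptw32_mcs_lock *lock, ptw32_mcs_node *node);

/* Release the lock previously acquired with 'node'. */
void ptw32_mcs_lock_release (ptw32_mcs_node *node);

#endif /* PTW32_MCS_LOCK_H */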
This implementation has passed all the tests (again, on a single CPU only),
and I hope it is finally bug-free now. Please let me know what you think.

--vlad

PS. Sorry it takes me so many iterations. I feel like I have been spamming
you with too many versions of this routine.

> -----Original Message-----
> From: pthreads-win32-owner@sources.redhat.com
> [mailto:pthreads-win32-owner@sources.redhat.com] On Behalf Of Vladimir
> Kliatchko
> Sent: Saturday, May 28, 2005 10:30 AM
> To: 'Ross Johnson'
> Cc: 'Gottlob Frege'; pthreads-win32@sources.redhat.com
> Subject: RE: New pthread_once implementation
>
> What do you think of the attached implementation? I am still analyzing
> it, but it passes the tests and appears to be free of that problem. It
> does have one minor glitch, though: if two threads come in, the
> semaphore is created. If both are cancelled and no new calls are made to
> finish the job, the semaphore is never destroyed. I am not sure how big
> a deal this is.
>
> Re. optimizations: Great, I will try to do something.
>
> Thnx,
> --vlad
>
> > -----Original Message-----
> > From: pthreads-win32-owner@sources.redhat.com
> > [mailto:pthreads-win32-owner@sources.redhat.com] On Behalf Of Ross
> > Johnson
> > Sent: Saturday, May 28, 2005 9:55 AM
> > To: Vladimir Kliatchko
> > Cc: 'Gottlob Frege'; Pthreads-Win32 list
> > Subject: RE: New pthread_once implementation
> >
> > On Sat, 2005-05-28 at 06:51 -0400, Vladimir Kliatchko wrote:
> > >
> > > > -----Original Message-----
> > > > From: pthreads-win32-owner@sources.redhat.com
> > > > [mailto:pthreads-win32-owner@sources.redhat.com] On Behalf Of Ross
> > > > Johnson
> > > > Sent: Friday, May 27, 2005 11:48 PM
> > > > To: Vladimir Kliatchko
> > > > Cc: 'Gottlob Frege'; Pthreads-Win32 list
> > > > Subject: RE: New pthread_once implementation
> > > >
> > > > On Fri, 2005-05-27 at 21:30 -0400, Vladimir Kliatchko wrote:
> > > > > Nice catch. Let me see if I can fix it.
> > > > >
> > > > > Note that the same problem exists in the currently released
> > > > > event-based implementation (cvs version 1.16):
> > > > >
> > > > > thread1 comes in, starts initing
> > > > > thread2 creates event, starts waiting
> > > > > thread3 comes in, starts waiting
> > > > > thread1 is cancelled, signals event
> > > > > thread2 wakes up, proceeds to the point right before the
> > > > > ResetEvent
> > > > > thread3 wakes up, closes event handle
> > > > > thread2 resets the closed handle
> > > >
> > > > Relies on HANDLE uniqueness and assumes that an error will result.
> > > > This is why the 2.6.0 version (and earlier) checks the return code
> > > > and restores Win32 LastError if necessary - for GetLastError
> > > > transparency.
> > >
> > > Does Windows guarantee that the handles are not reused? What happens
> > > if a thread closes a handle while another thread is blocked on it?
> > > Is any of this in Microsoft documentation? Consider the following
> > > scenario for the event-based implementation:
> >
> > Well, apparently they're not unique when recycled, so there is a bug
> > here to fix in both versions:
> > http://msdn.microsoft.com/library/default.asp?url=/library/en-
> > us/dngenlib/html/msdn_handles1.asp
> > [Under "Native Windows NT Objects"]
> > "Unlike the handles that are maintained by the Win32 USER and GDI
> > subsystem components, handles to native objects under Windows NT are
> > not unique; that is, upon destruction of an object, the corresponding
> > handle may be recycled and will look exactly like the handle to the
> > destroyed object."
> >
> > But they are local to the process, rather than system-wide, if that
> > helps.
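For concreteness, the return-code check described above amounts to
something like the following sketch (reset_event_transparently is a
made-up name, and the actual 2.6.0 code differs in detail):

#include <windows.h>

/* Reset the event, but restore the thread's LastError if ResetEvent
 * fails because another thread already closed the handle. Handle
 * recycling defeats this check: a recycled handle may name a different,
 * still-live object, and ResetEvent then "succeeds" against it. */
static void
reset_event_transparently (HANDLE event)
{
  DWORD lastError = GetLastError ();

  if (!ResetEvent (event))
    {
      /* Likely ERROR_INVALID_HANDLE: the event was closed under us.
       * Restore the saved value for GetLastError transparency. */
      SetLastError (lastError);
    }
}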
> > > > > Also, regarding my previous comment to Ross about the very high
> > > > > cost of using InterlockedExchangeAdd for MBR:
> > > > > I did some simple benchmarking. Running pthread_once 50,000,000
> > > > > times on my pretty slow single-CPU machine takes about 2.1
> > > > > seconds. Replacing InterlockedExchangeAdd with a simple read
> > > > > brings it down to 0.6 seconds. This looks significant.
> > > >
> > > > Using the PTW32_INTERLOCKED_COMPARE_EXCHANGE macro as in your
> > > > latest (in CVS) version and building the library for inlined
> > > > functions (nmake VC-inlined) and x86 architecture causes
> > > > customised versions of InterlockedCompareExchange to be used, and
> > > > this results in inlined asm. Same for PTW32_INTERLOCKED_EXCHANGE.
> > > >
> > > > Also, on single-CPU x86, the library dynamically switches to using
> > > > 'cmpxchg' rather than 'lock cmpxchg' to avoid locking the bus.
> > > > This appears to match what the kernel32.dll versions do. On
> > > > non-x86 architectures the kernel32.dll versions are called, with
> > > > call overhead.
> > > >
> > > > PTW32_INTERLOCKED_EXCHANGE_ADD could be added, as could other
> > > > architectures. See ptw32_InterlockedCompareExchange.c
> > >
> > > I have rerun my benchmark with VC-inlined. The difference is now
> > > less significant, 0.9 vs 0.6, but still noticeable. I guess cmpxchg
> > > even without locking is quite expensive. On multi-CPU systems the
> > > difference should be much higher, due to the time it takes to lock
> > > the bus and to the contention it may cause. It sounded as if you did
> > > not care much to try to optimize it. I did not mean to suggest that
> > > we have to do it right now either. I just wanted to get your opinion
> > > on whether we want to deal with this in the future.
> >
> > By all means include any optimisation you think is worthwhile. I was
> > just pointing out that the difference isn't necessarily 2.1 vs 0.6.
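The timing harness behind those figures need be nothing more elaborate
than a tight loop over pthread_once; the sketch below is a hypothetical
reconstruction of such a benchmark, not the code actually used for the
numbers above:

#include <pthread.h>
#include <stdio.h>
#include <time.h>

static pthread_once_t once = PTHREAD_ONCE_INIT;

static void
init_routine (void)
{
  /* empty: the loop times the already-initialized fast path,
   * not the one-time initialization itself */
}

int
main (void)
{
  long i;
  clock_t start = clock ();

  /* 50,000,000 calls; all but the first take the fast path */
  for (i = 0; i < 50000000L; i++)
    pthread_once (&once, init_routine);

  printf ("%.2f seconds\n",
          (double) (clock () - start) / CLOCKS_PER_SEC);
  return 0;
}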
Attachment: vk_pthread_once5.c

#define PTHREAD_ONCE_INIT {PTW32_FALSE, 0, 0, 0}

struct pthread_once_t_
{
  int   done;
  void *lock;
  int   reserved1;
  int   reserved2;
};

struct ptw32_mcs_node_
{
  struct ptw32_mcs_node_ **lock; /* ptr to tail of queue */
  struct ptw32_mcs_node_ *next;  /* ptr to successor in queue */
  LONG readyFlag;                /* set after lock is released by predecessor */
  LONG nextFlag;                 /* set after 'next' ptr is set by successor */
};

typedef struct ptw32_mcs_node_ ptw32_mcs_node;
typedef struct ptw32_mcs_node_ *ptw32_mcs_lock;

/*
 * ptw32_mcs_flag_set -- notify another thread about an event.
 *
 * Set event if an event handle has been stored in the flag, and
 * set flag to -1 otherwise. Note that -1 cannot be a valid handle value.
 */
static void
ptw32_mcs_flag_set (LONG *flag)
{
  HANDLE e = (HANDLE)PTW32_INTERLOCKED_COMPARE_EXCHANGE(
		(PTW32_INTERLOCKED_LPLONG)flag,
		(PTW32_INTERLOCKED_LONG)-1,
		(PTW32_INTERLOCKED_LONG)0);

  if (0 != e)
    {
      /* another thread has already stored an event handle in the flag */
      SetEvent(e);
    }
}

/*
 * ptw32_mcs_flag_wait -- wait for notification from another thread.
 *
 * Store an event handle in the flag and wait on it if the flag has not been
 * set, and proceed without creating an event otherwise.
 */
static void
ptw32_mcs_flag_wait (LONG *flag)
{
  if (0 == InterlockedExchangeAdd((LPLONG)flag, 0)) /* MBR fence */
    {
      /* the flag is not set. create event. */
      HANDLE e = CreateEvent(NULL, PTW32_FALSE, PTW32_FALSE, NULL);

      if (0 == PTW32_INTERLOCKED_COMPARE_EXCHANGE(
		    (PTW32_INTERLOCKED_LPLONG)flag,
		    (PTW32_INTERLOCKED_LONG)e,
		    (PTW32_INTERLOCKED_LONG)0))
	{
	  /* stored handle in the flag. wait on it now. */
	  WaitForSingleObject(e, INFINITE);
	}

      CloseHandle(e);
    }
}

/*
 * ptw32_mcs_lock_acquire -- acquire an MCS lock.
 *
 * See:
 * J. M. Mellor-Crummey and M. L. Scott.
 * Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors.
 * ACM Transactions on Computer Systems, 9(1):21-65, Feb. 1991.
 */
static void
ptw32_mcs_lock_acquire (ptw32_mcs_lock *lock, ptw32_mcs_node *node)
{
  ptw32_mcs_node *pred;

  node->lock = lock;
  node->nextFlag = 0;
  node->readyFlag = 0;
  node->next = 0; /* initially, no successor */

  /* queue for the lock */
  pred = (ptw32_mcs_node *)PTW32_INTERLOCKED_EXCHANGE((LPLONG)lock, (LONG)node);

  if (0 != pred)
    {
      /* the lock was not free. link behind predecessor. */
      pred->next = node;
      ptw32_mcs_flag_set(&pred->nextFlag);
      ptw32_mcs_flag_wait(&node->readyFlag);
    }
}

/*
 * ptw32_mcs_lock_release -- release an MCS lock.
 *
 * See:
 * J. M. Mellor-Crummey and M. L. Scott.
 * Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors.
 * ACM Transactions on Computer Systems, 9(1):21-65, Feb. 1991.
 */
static void
ptw32_mcs_lock_release (ptw32_mcs_node *node)
{
  ptw32_mcs_lock *lock = node->lock;
  ptw32_mcs_node *next = (ptw32_mcs_node *)
    InterlockedExchangeAdd((LPLONG)&node->next, 0); /* MBR fence */

  if (0 == next)
    {
      /* no known successor */
      if (node == (ptw32_mcs_node *)PTW32_INTERLOCKED_COMPARE_EXCHANGE(
			(PTW32_INTERLOCKED_LPLONG)lock,
			(PTW32_INTERLOCKED_LONG)0,
			(PTW32_INTERLOCKED_LONG)node))
	{
	  /* no successor, lock is free now */
	  return;
	}

      /* wait for successor */
      ptw32_mcs_flag_wait(&node->nextFlag);
      next = (ptw32_mcs_node *)
	InterlockedExchangeAdd((LPLONG)&node->next, 0); /* MBR fence */
    }

  /* pass the lock */
  ptw32_mcs_flag_set(&next->readyFlag);
}

static void PTW32_CDECL
ptw32_once_on_init_cancel (void *arg)
{
  /* when the initting thread is cancelled we have to release the lock */
  ptw32_mcs_node *node = (ptw32_mcs_node *)arg;
  ptw32_mcs_lock_release(node);
}

int
pthread_once (pthread_once_t *once_control, void (*init_routine) (void))
{
  if (once_control == NULL || init_routine == NULL)
    {
      return EINVAL;
    }

  if (!InterlockedExchangeAdd((LPLONG)&once_control->done, 0)) /* MBR fence */
    {
      ptw32_mcs_node node;

      ptw32_mcs_lock_acquire((ptw32_mcs_lock *)&once_control->lock, &node);

      if (!once_control->done)
	{
#ifdef _MSC_VER
#pragma inline_depth(0)
#endif
	  pthread_cleanup_push(ptw32_once_on_init_cancel, (void *)&node);
	  (*init_routine)();
	  pthread_cleanup_pop(0);
#ifdef _MSC_VER
#pragma inline_depth()
#endif
	  once_control->done = PTW32_TRUE;
	}

      ptw32_mcs_lock_release(&node);
    }

  return 0;
} /* pthread_once */
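To illustrate why no clean-up is needed, here is a hypothetical example
(it assumes the definitions above) of using the MCS pair directly outside
pthread_once; the queue node lives on the caller's stack, so a cancelled
or exiting thread leaves nothing behind to leak:

static ptw32_mcs_lock counterLock = 0; /* a free MCS lock is a null tail */
static int counter = 0;

static void
increment_counter (void)
{
  ptw32_mcs_node node; /* per-caller queue node, stack-allocated */

  ptw32_mcs_lock_acquire (&counterLock, &node);
  counter++; /* critical section */
  ptw32_mcs_lock_release (&node);
}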