public inbox for cygwin@cygwin.com
 help / color / mirror / Atom feed
* Possible race in SYSV IPC (semaphores)
@ 2012-10-26 18:37 Lavrentiev, Anton (NIH/NLM/NCBI) [C]
  2012-10-27 10:24 ` Corinna Vinschen
  0 siblings, 1 reply; 18+ messages in thread
From: Lavrentiev, Anton (NIH/NLM/NCBI) [C] @ 2012-10-26 18:37 UTC (permalink / raw)
  To: cygwin

Hi,

For now, I can only report the observed (mis) behavior of SYSV semop() call,
which (on the client side) gets manifested as the following:

transport_layer_pipes::connect: lost connection to cygserver, error = 2

(this code then does a by-hand adjustment with semctl(SETVAL)).

Note that there is a dedicated cygserver process running for my single-threaded
application.

Looking at the debugging output of cygserver, this is what I see in the log
(around the only time semctl() is logged there):

cygserver: /home/corinna/src/cygwin/cygwin-1.7.15/cygwin-1.7.15-1/src/cygwin-1.7.15/winsup/cygserver/transport_pipes.cc, line 132: Try to create named pipe: \\.\pipe\cygwin-13a7ed34cc1953a9-lpc
cygserver: /home/corinna/src/cygwin/cygwin-1.7.15/cygwin-1.7.15-1/src/cygwin-1.7.15/winsup/cygserver/transport_pipes.cc, line 132: Try to create named pipe: \\.\pipe\cygwin-13a7ed34cc1953a9-lpc

Note the double pipe creation call, and only a single "exit" log line such as:

cygserver: /home/corinna/src/cygwin/cygwin-1.7.15/cygwin-1.7.15-1/src/cygwin-1.7.15/winsup/cygserver/sem.cc, line 81: leaving (3416)

Cygserver does not stop (also, since SIGSYS is set to ignore in the program,
it also keeps running -- although, not always quite successfully once the semop()
failure occurred.)

The semaphore operations are very intensive; and involve arrays of 5 sems at
some times;  also, there are quite large chunks of shmem updated every now
and then.

I studied the source of cygserver, and noticed that pipe_instance (transport_pipes.cc)
is not declared "volatile".  This is strange because the compiler can rearrange lines
of code that include this variable, otherwise.  And that seems rather critical.

Right now what I observe, is that SYSV IPC is unreliable, and I'm yet to figure
out why;  the very same code (the locking logic) works on Linux/Solaris/Mac for
years and on thousands (yes, that many) of hosts.  With CYGWIN the instability can
appear within a wide range of run time: from just a few minutes to some long hours,
rather randomly.

Any input can be greatly appreciated.

Anton Lavrentiev
Contractor NIH/NLM/NCBI


--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Possible race in SYSV IPC (semaphores)
  2012-10-26 18:37 Possible race in SYSV IPC (semaphores) Lavrentiev, Anton (NIH/NLM/NCBI) [C]
@ 2012-10-27 10:24 ` Corinna Vinschen
  0 siblings, 0 replies; 18+ messages in thread
From: Corinna Vinschen @ 2012-10-27 10:24 UTC (permalink / raw)
  To: cygwin

On Oct 26 14:36, Lavrentiev, Anton (NIH/NLM/NCBI) [C] wrote:
> Hi,
> 
> For now, I can only report the observed (mis) behavior of SYSV semop() call,
> which (on the client side) gets manifested as the following:
> 
> transport_layer_pipes::connect: lost connection to cygserver, error = 2
> 
> (this code then does a by-hand adjustment with semctl(SETVAL)).
> 
> Note that there is a dedicated cygserver process running for my single-threaded
> application.
> 
> Looking at the debugging output of cygserver, this is what I see in the log
> (around the only time semctl() is logged there):
> 
> cygserver: /home/corinna/src/cygwin/cygwin-1.7.15/cygwin-1.7.15-1/src/cygwin-1.7.15/winsup/cygserver/transport_pipes.cc, line 132: Try to create named pipe: \\.\pipe\cygwin-13a7ed34cc1953a9-lpc
> cygserver: /home/corinna/src/cygwin/cygwin-1.7.15/cygwin-1.7.15-1/src/cygwin-1.7.15/winsup/cygserver/transport_pipes.cc, line 132: Try to create named pipe: \\.\pipe\cygwin-13a7ed34cc1953a9-lpc
> 
> Note the double pipe creation call, and only a single "exit" log line such as:
> cygserver: /home/corinna/src/cygwin/cygwin-1.7.15/cygwin-1.7.15-1/src/cygwin-1.7.15/winsup/cygserver/sem.cc, line 81: leaving (3416)

That's normal, AFAICS.  Look into cygserver.cc, method
server_submission_loop::request_loop().  It calls _transport->accept()
in a loop.  _transport->accept always prints the "Try to create" log
message.  Every time a client connects, it will return, the cleint
request is put into the request queue, and then the loop returns to
call _transport->accept().  So you see this log message one more time
than clients have connected up to that point in time.

> Cygserver does not stop (also, since SIGSYS is set to ignore in the program,
> it also keeps running -- although, not always quite successfully once the semop()
> failure occurred.)
> 
> The semaphore operations are very intensive; and involve arrays of 5 sems at
> some times;  also, there are quite large chunks of shmem updated every now
> and then.
> 
> I studied the source of cygserver, and noticed that pipe_instance (transport_pipes.cc)
> is not declared "volatile".  This is strange because the compiler can rearrange lines
> of code that include this variable, otherwise.  And that seems rather critical.

You could build cygserver with the variable set to volatile and see if
it changes anything.  But I doubt it.

> Right now what I observe, is that SYSV IPC is unreliable, and I'm yet to figure
> out why;

For testing it would be most helpful if you would create a STC.  Ideally
it would be the shortest possible code which allows to reproduce the issue.

Other than that, is it possible that you just stumble over a restriction
in the sizes of the IPC objects?  Did you try to tweak /etc/cygserver.conf
for a start?


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: Possible race in SYSV IPC (semaphores)
  2012-11-27  7:13                     ` Corinna Vinschen
  2012-11-27 14:37                       ` Lavrentiev, Anton (NIH/NLM/NCBI) [C]
@ 2012-11-28 21:20                       ` Lavrentiev, Anton (NIH/NLM/NCBI) [C]
  1 sibling, 0 replies; 18+ messages in thread
From: Lavrentiev, Anton (NIH/NLM/NCBI) [C] @ 2012-11-28 21:20 UTC (permalink / raw)
  To: cygwin

> > > I just checked in the change.
> >
> > Thank you.  I will test it as soon as it's available in a snapshot.

I've rebuilt cygserver, and it looks/works fine!

Anton Lavrentiev
Contractor NIH/NLM/NCBI


^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: Possible race in SYSV IPC (semaphores)
  2012-11-27  7:13                     ` Corinna Vinschen
@ 2012-11-27 14:37                       ` Lavrentiev, Anton (NIH/NLM/NCBI) [C]
  2012-11-28 21:20                       ` Lavrentiev, Anton (NIH/NLM/NCBI) [C]
  1 sibling, 0 replies; 18+ messages in thread
From: Lavrentiev, Anton (NIH/NLM/NCBI) [C] @ 2012-11-27 14:37 UTC (permalink / raw)
  To: cygwin

> You don't have space (or quota) issues on your hard disk by any chance?

Nope, no quota whatsoever.  Yes, the file grows big but there is plenty of free space on the disk.
The machine is pretty slow, though, a dual-core thing.  There's definitely another case of a race
(the point of lock-up is never the same), but luckily it affects only the cygserver running in the verbose
mode (so maybe, there is a FILE* sync thing -- a side effect of the active logging).

Anton Lavrentiev
Contractor NIH/NLM/NCBI

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Possible race in SYSV IPC (semaphores)
  2012-11-26 18:09                   ` Lavrentiev, Anton (NIH/NLM/NCBI) [C]
  2012-11-26 18:14                     ` Lavrentiev, Anton (NIH/NLM/NCBI) [C]
@ 2012-11-27  7:13                     ` Corinna Vinschen
  2012-11-27 14:37                       ` Lavrentiev, Anton (NIH/NLM/NCBI) [C]
  2012-11-28 21:20                       ` Lavrentiev, Anton (NIH/NLM/NCBI) [C]
  1 sibling, 2 replies; 18+ messages in thread
From: Corinna Vinschen @ 2012-11-27  7:13 UTC (permalink / raw)
  To: cygwin

On Nov 26 18:09, Lavrentiev, Anton (NIH/NLM/NCBI) [C] wrote:
> > I just checked in the change.
> 
> Thank you.  I will test it as soon as it's available in a snapshot.
> 
> > ...especially since I can't reproduce this.  I tried with cygserver
> > before and after my patch and in both cases your semaphore testcase
> > worked as expected with -d -l 7.
> 
> I see.  Maybe this will help.  My sequence of actions:
> 
> In an elevated cmd.exe, from Cygwin's /bin:
> 
> cygrunsrv.exe -I cygserver -x /usr/sbin/cygserver.exe -a "-d -l 7"
> cygrunsrv.exe -S cygserver
> 
> In a regular cmd.exe:
> 
> a.exe
> 
> (this is my testcase that locks/unlocks the semaphores).
> It never gets past iteration 1000.

Nope, no luck.  WJFFM.

You don't have space (or quota) issues on your hard disk by any chance?
The stderr output will go to /var/log/cygserver.log.  The file has a
size of about 60 Megs after 20K iterations and every run of the cygserver
service will append to the file.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: Possible race in SYSV IPC (semaphores)
  2012-11-26 18:09                   ` Lavrentiev, Anton (NIH/NLM/NCBI) [C]
@ 2012-11-26 18:14                     ` Lavrentiev, Anton (NIH/NLM/NCBI) [C]
  2012-11-27  7:13                     ` Corinna Vinschen
  1 sibling, 0 replies; 18+ messages in thread
From: Lavrentiev, Anton (NIH/NLM/NCBI) [C] @ 2012-11-26 18:14 UTC (permalink / raw)
  To: cygwin

Correction:

> cygrunsrv.exe -I cygserver -x /usr/sbin/cygserver.exe -a "-d -l 7"

  cygrunsrv.exe -I cygserver -p /usr/sbin/cygserver.exe -a "-d -l 7"

Anton Lavrentiev
Contractor NIH/NLM/NCBI


^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: Possible race in SYSV IPC (semaphores)
  2012-11-26 12:54                 ` Corinna Vinschen
@ 2012-11-26 18:09                   ` Lavrentiev, Anton (NIH/NLM/NCBI) [C]
  2012-11-26 18:14                     ` Lavrentiev, Anton (NIH/NLM/NCBI) [C]
  2012-11-27  7:13                     ` Corinna Vinschen
  0 siblings, 2 replies; 18+ messages in thread
From: Lavrentiev, Anton (NIH/NLM/NCBI) [C] @ 2012-11-26 18:09 UTC (permalink / raw)
  To: cygwin

> I just checked in the change.

Thank you.  I will test it as soon as it's available in a snapshot.

> ...especially since I can't reproduce this.  I tried with cygserver
> before and after my patch and in both cases your semaphore testcase
> worked as expected with -d -l 7.

I see.  Maybe this will help.  My sequence of actions:

In an elevated cmd.exe, from Cygwin's /bin:

cygrunsrv.exe -I cygserver -x /usr/sbin/cygserver.exe -a "-d -l 7"
cygrunsrv.exe -S cygserver

In a regular cmd.exe:

a.exe

(this is my testcase that locks/unlocks the semaphores).
It never gets past iteration 1000.

Also, if I start cygserver just as a regular process (not as a background
thing), so that it logs right to the window, it does not lock up.

Anton Lavrentiev
Contractor NIH/NLM/NCBI


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Possible race in SYSV IPC (semaphores)
  2012-11-23 17:59               ` Corinna Vinschen
@ 2012-11-26 12:54                 ` Corinna Vinschen
  2012-11-26 18:09                   ` Lavrentiev, Anton (NIH/NLM/NCBI) [C]
  0 siblings, 1 reply; 18+ messages in thread
From: Corinna Vinschen @ 2012-11-26 12:54 UTC (permalink / raw)
  To: cygwin

On Nov 23 18:59, Corinna Vinschen wrote:
> On Nov 23 17:47, Lavrentiev, Anton (NIH/NLM/NCBI) [C] wrote:
> > > I agree.  I'll do that in the next couple of days.
> > 
> > Thank you.  Please let me know when you do, so that I can
> > check the relevant snapshot out.
> 
> Yes, I'll try.

I just checked in the change.

> > Incidentally, have you been able to confirm the dead-locking
> > that I reported earlier (below)?  I just ran the patched cygserver
> > with the "-d -l 7" arguments, along with my test case, and both
> > dead-locked just as before.
> 
> No, I didn't test that.  If you have any chance to debug that, I'd
> appreciate it.

...especially since I can't reproduce this.  I tried with cygserver
before and after my patch and in both cases your semaphore testcase
worked as expectd with -d -l 7.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Possible race in SYSV IPC (semaphores)
  2012-11-23 17:47             ` Lavrentiev, Anton (NIH/NLM/NCBI) [C]
@ 2012-11-23 17:59               ` Corinna Vinschen
  2012-11-26 12:54                 ` Corinna Vinschen
  0 siblings, 1 reply; 18+ messages in thread
From: Corinna Vinschen @ 2012-11-23 17:59 UTC (permalink / raw)
  To: cygwin

On Nov 23 17:47, Lavrentiev, Anton (NIH/NLM/NCBI) [C] wrote:
> > I agree.  I'll do that in the next couple of days.
> 
> Thank you.  Please let me know when you do, so that I can
> check the relevant snapshot out.

Yes, I'll try.

> Incidentally, have you been able to confirm the dead-locking
> that I reported earlier (below)?  I just ran the patched cygserver
> with the "-d -l 7" arguments, along with my test case, and both
> dead-locked just as before.

No, I didn't test that.  If you have any chance to debug that, I'd
appreciate it.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: Possible race in SYSV IPC (semaphores)
  2012-11-23 17:29           ` Corinna Vinschen
@ 2012-11-23 17:47             ` Lavrentiev, Anton (NIH/NLM/NCBI) [C]
  2012-11-23 17:59               ` Corinna Vinschen
  0 siblings, 1 reply; 18+ messages in thread
From: Lavrentiev, Anton (NIH/NLM/NCBI) [C] @ 2012-11-23 17:47 UTC (permalink / raw)
  To: cygwin

> I agree.  I'll do that in the next couple of days.

Thank you.  Please let me know when you do, so that I can
check the relevant snapshot out.

Incidentally, have you been able to confirm the dead-locking
that I reported earlier (below)?  I just ran the patched cygserver
with the "-d -l 7" arguments, along with my test case, and both
dead-locked just as before.

> The test case attached from my earlier post from today, also shows
> a race (dead-lock, actually) occurring in CYGSERVER, when it is run
> with the "-d -l 7" command line (even when it is only one application,
> the test case is using it).  There is an internal disconcert between
> the cygserver threads.
> 
> The log ends with this, and there is no way to stop the server
> (like with "cygrunsrv -E") other than killing it from the task
> manager.

Anton Lavrentiev
Contractor NIH/NLM/NCBI




^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Possible race in SYSV IPC (semaphores)
  2012-11-23 17:16         ` Lavrentiev, Anton (NIH/NLM/NCBI) [C]
@ 2012-11-23 17:29           ` Corinna Vinschen
  2012-11-23 17:47             ` Lavrentiev, Anton (NIH/NLM/NCBI) [C]
  0 siblings, 1 reply; 18+ messages in thread
From: Corinna Vinschen @ 2012-11-23 17:29 UTC (permalink / raw)
  To: cygwin

On Nov 23 17:16, Lavrentiev, Anton (NIH/NLM/NCBI) [C] wrote:
> > This should avoid the race (*and* work...)
> > Please give it a try.
> 
> Thanks!  I've tried both patches (pipe race + semadj), and they do seem to work!

Thanks for testing.

> Although (not being a party pooper :-), I think that all the logic
> around "pipe_instance" can now be dropped entirely, and benefit from eliminating
> an unnecessary synchronization point (pipe_instance is now only used
> for logging a proper pipe ID, which is not even seen in regular use,
> and for an assert(), which again is only there in debug mode builds).
> So there is no need to arrange Enter'ing/Leave'ing of the critical section
> to do just the hidden things, yet the working code would be a bit cleaner.

I agree.  I'll do that in the next couple of days.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: Possible race in SYSV IPC (semaphores)
  2012-11-23 16:44       ` Corinna Vinschen
@ 2012-11-23 17:16         ` Lavrentiev, Anton (NIH/NLM/NCBI) [C]
  2012-11-23 17:29           ` Corinna Vinschen
  0 siblings, 1 reply; 18+ messages in thread
From: Lavrentiev, Anton (NIH/NLM/NCBI) [C] @ 2012-11-23 17:16 UTC (permalink / raw)
  To: cygwin

> This should avoid the race (*and* work...)
> Please give it a try.

Thanks!  I've tried both patches (pipe race + semadj), and they do seem to work!

Although (not being a party pooper :-), I think that all the logic
around "pipe_instance" can now be dropped entirely, and benefit from eliminating
an unnecessary synchronization point (pipe_instance is now only used
for logging a proper pipe ID, which is not even seen in regular use,
and for an assert(), which again is only there in debug mode builds).
So there is no need to arrange Enter'ing/Leave'ing of the critical section
to do just the hidden things, yet the working code would be a bit cleaner.

Anton Lavrentiev
Contractor NIH/NLM/NCBI


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Possible race in SYSV IPC (semaphores)
  2012-11-23 13:34     ` Corinna Vinschen
@ 2012-11-23 16:44       ` Corinna Vinschen
  2012-11-23 17:16         ` Lavrentiev, Anton (NIH/NLM/NCBI) [C]
  0 siblings, 1 reply; 18+ messages in thread
From: Corinna Vinschen @ 2012-11-23 16:44 UTC (permalink / raw)
  To: cygwin

On Nov 23 14:33, Corinna Vinschen wrote:
> On Nov 23 14:10, Corinna Vinschen wrote:
> > On Nov 23 12:36, Corinna Vinschen wrote:
> > > On Nov 19 21:06, Lavrentiev, Anton (NIH/NLM/NCBI) [C] wrote:
> > > > Hello again,
> > > > 
> > > > I can now positively confirm the race condition in cygserver w.r.t. the named
> > > > pipe used to serialize SYSV requests through the server.  The race is due to
> > > > that transport_layer_pipes::accept() (bool *const recoverable) (file: transport_pipes.cc) does actually _create_ the pipe when pipe_instance == 0
> > > > (ironically, transport_layer_pipes::listen() does not create any OS primitives
> > > > at all!).
> > > > 
> > > > This means that under heavy load, cygserver threads may all end up processing
> > > > their requests and closing all instances of the pipe (bringing pipe_instance == 0)
> > > > yet not being able to get to the point of accepting new request (that is, to
> > > > re-create the pipe).  For the client (user process), this looks like the pipe
> > > > does not exist (during that very tiny period of time), and the following message
> > > > gets printed:
> > > > 
> > > > Iteration 3016
> > > >       1 [main] a 4872 transport_layer_pipes::connect: lost connection to cygserver, error = 2
> > > 
> > > Thanks for analyzing this situation.  IIUC, that means if we create the
> > > pipe in listen(), and then create another pipe instance per accepting
> > > thread in accept(), we will always have at least one instance of the pipe,
> > > whatever the load, right?
> > > 
> > > Are you set up to test a patch?  If so, I'd propose the below patch.  It
> > > would be nice if you could test it in your environment.
> > 
> > Forget it.  That doesn't work.
> > 
> > Surprisingly, the next client calling CreateFile is *not* connected to
> > any server side of the pipe which calls ConnectNamedPipe, but apparently
> > the pipes are used in the order of creation.  Thus, the first client
> > hangs waiting for the pipe instance created in the listen() method, but
> > since the server never calls ConnectNamedPipe on that instance, the
> > client will wait forever.
> > 
> > Bummer.
> > 
> > Back to the drawing board...
> 
> Try the below one.  It also connects to the listening pipe in listen(),
> so there's a single instance of the pipe connected and blocked for
> further use forever.  This should avoid the race (*and* work...)
> Please give it a try.

You can also simply try the today's snapshot version of cygserver from
http://cygwin.com/snapshots/  It contains this patch, as well as the
semaphore patch I mailed to the other thread.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Possible race in SYSV IPC (semaphores)
  2012-11-23 13:11   ` Corinna Vinschen
@ 2012-11-23 13:34     ` Corinna Vinschen
  2012-11-23 16:44       ` Corinna Vinschen
  0 siblings, 1 reply; 18+ messages in thread
From: Corinna Vinschen @ 2012-11-23 13:34 UTC (permalink / raw)
  To: cygwin

On Nov 23 14:10, Corinna Vinschen wrote:
> On Nov 23 12:36, Corinna Vinschen wrote:
> > On Nov 19 21:06, Lavrentiev, Anton (NIH/NLM/NCBI) [C] wrote:
> > > Hello again,
> > > 
> > > I can now positively confirm the race condition in cygserver w.r.t. the named
> > > pipe used to serialize SYSV requests through the server.  The race is due to
> > > that transport_layer_pipes::accept() (bool *const recoverable) (file: transport_pipes.cc) does actually _create_ the pipe when pipe_instance == 0
> > > (ironically, transport_layer_pipes::listen() does not create any OS primitives
> > > at all!).
> > > 
> > > This means that under heavy load, cygserver threads may all end up processing
> > > their requests and closing all instances of the pipe (bringing pipe_instance == 0)
> > > yet not being able to get to the point of accepting new request (that is, to
> > > re-create the pipe).  For the client (user process), this looks like the pipe
> > > does not exist (during that very tiny period of time), and the following message
> > > gets printed:
> > > 
> > > Iteration 3016
> > >       1 [main] a 4872 transport_layer_pipes::connect: lost connection to cygserver, error = 2
> > 
> > Thanks for analyzing this situation.  IIUC, that means if we create the
> > pipe in listen(), and then create another pipe instance per accepting
> > thread in accept(), we will always have at least one instance of the pipe,
> > whatever the load, right?
> > 
> > Are you set up to test a patch?  If so, I'd propose the below patch.  It
> > would be nice if you could test it in your environment.
> 
> Forget it.  That doesn't work.
> 
> Surprisingly, the next client calling CreateFile is *not* connected to
> any server side of the pipe which calls ConnectNamedPipe, but apparently
> the pipes are used in the order of creation.  Thus, the first client
> hangs waiting for the pipe instance created in the listen() method, but
> since the server never calls ConnectNamedPipe on that instance, the
> client will wait forever.
> 
> Bummer.
> 
> Back to the drawing board...

Try the below one.  It also connects to the listening pipe in listen(),
so there's a single instance of the pipe connected and blocked for
further use forever.  This should avoid the race (*and* work...)
Please give it a try.


Corinna


        * cygserver.cc (main): Call listen right after creating the
        transport.
        * transport_pipes.cc (transport_layer_pipes::listen): Create
        first instance of the named pipe here.  Connect the client side
	to block it for further use by the system.
        (transport_layer_pipes::accept): Don't handle first pipe instance
        here.  Change debug output accordingly.


Index: cygserver.cc
===================================================================
RCS file: /cvs/src/src/winsup/cygserver/cygserver.cc,v
retrieving revision 1.16
diff -u -p -r1.16 cygserver.cc
--- cygserver.cc	10 Oct 2011 15:48:54 -0000	1.16
+++ cygserver.cc	23 Nov 2012 13:32:22 -0000
@@ -715,15 +715,15 @@ main (const int argc, char *argv[])
   transport_layer_base *const transport = create_server_transport ();
   assert (transport);
 
+  if (transport->listen () == -1)
+    return 1;
+
   process_cache cache (process_cache_size, cleanup_threads);
 
   server_submission_loop submission_loop (&request_queue, transport, &cache);
 
   request_queue.add_submission_loop (&submission_loop);
 
-  if (transport->listen () == -1)
-    return 1;
-
   cache.start ();
 
   request_queue.start ();
Index: transport_pipes.cc
===================================================================
RCS file: /cvs/src/src/winsup/cygserver/transport_pipes.cc,v
retrieving revision 1.14
diff -u -p -r1.14 transport_pipes.cc
--- transport_pipes.cc	14 Feb 2012 14:12:11 -0000	1.14
+++ transport_pipes.cc	23 Nov 2012 13:32:22 -0000
@@ -107,7 +107,32 @@ transport_layer_pipes::listen ()
 
   _is_listening_endpoint = true;
 
-  /* no-op */
+  debug ("Try to create named pipe: %ls", _pipe_name);
+
+  HANDLE listen_pipe =
+    CreateNamedPipeW (_pipe_name,
+		      PIPE_ACCESS_DUPLEX | FILE_FLAG_FIRST_PIPE_INSTANCE,
+		      PIPE_TYPE_BYTE | PIPE_WAIT, PIPE_UNLIMITED_INSTANCES,
+		      0, 0, 1000, &sec_all_nih);
+  if (listen_pipe != INVALID_HANDLE_VALUE)
+    {
+      HANDLE connect_pipe =
+	CreateFileW (_pipe_name, GENERIC_READ | GENERIC_WRITE, 0, &sec_all_nih,
+		     OPEN_EXISTING, 0, NULL);
+      if (connect_pipe == INVALID_HANDLE_VALUE)
+	{
+	  CloseHandle (listen_pipe);
+	  listen_pipe = INVALID_HANDLE_VALUE;
+	}
+    }
+
+  if (listen_pipe == INVALID_HANDLE_VALUE)
+    {
+      system_printf ("failed to create named pipe: "
+		     "is the daemon already running?");
+      return -1;
+    }
+
   return 0;
 }
 
@@ -125,38 +150,19 @@ transport_layer_pipes::accept (bool *con
   // Read: http://www.securityinternals.com/research/papers/namedpipe.php
   // See also the Microsoft security bulletins MS00-053 and MS01-031.
 
-  // FIXME: Remove FILE_CREATE_PIPE_INSTANCE.
-
-  const bool first_instance = (pipe_instance == 0);
-
-  debug ("Try to create named pipe: %ls", _pipe_name);
+  debug ("Try to create named pipe instance %ld: %ls",
+	 pipe_instance + 1, _pipe_name);
 
   const HANDLE accept_pipe =
-    CreateNamedPipeW (_pipe_name,
-		     (PIPE_ACCESS_DUPLEX
-		      | (first_instance ? FILE_FLAG_FIRST_PIPE_INSTANCE : 0)),
-		     (PIPE_TYPE_BYTE | PIPE_WAIT),
-		     PIPE_UNLIMITED_INSTANCES,
-		     0, 0, 1000,
-		     &sec_all_nih);
-
-  const bool duplicate = (accept_pipe == INVALID_HANDLE_VALUE
-			  && pipe_instance == 0
-			  && GetLastError () == ERROR_ACCESS_DENIED);
+    CreateNamedPipeW (_pipe_name, PIPE_ACCESS_DUPLEX,
+		      PIPE_TYPE_BYTE | PIPE_WAIT, PIPE_UNLIMITED_INSTANCES,
+		      0, 0, 1000, &sec_all_nih);
 
   if (accept_pipe != INVALID_HANDLE_VALUE)
     InterlockedIncrement (&pipe_instance);
 
   LeaveCriticalSection (&pipe_instance_lock);
 
-  if (duplicate)
-    {
-      *recoverable = false;
-      system_printf ("failed to create named pipe: "
-		     "is the daemon already running?");
-      return NULL;
-    }
-
   if (accept_pipe == INVALID_HANDLE_VALUE)
     {
       debug_printf ("error creating pipe (%lu).", GetLastError ());
@@ -164,8 +170,6 @@ transport_layer_pipes::accept (bool *con
       return NULL;
     }
 
-  assert (accept_pipe);
-
   if (!ConnectNamedPipe (accept_pipe, NULL)
       && GetLastError () != ERROR_PIPE_CONNECTED)
     {

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Possible race in SYSV IPC (semaphores)
  2012-11-23 11:37 ` Corinna Vinschen
@ 2012-11-23 13:11   ` Corinna Vinschen
  2012-11-23 13:34     ` Corinna Vinschen
  0 siblings, 1 reply; 18+ messages in thread
From: Corinna Vinschen @ 2012-11-23 13:11 UTC (permalink / raw)
  To: cygwin

On Nov 23 12:36, Corinna Vinschen wrote:
> On Nov 19 21:06, Lavrentiev, Anton (NIH/NLM/NCBI) [C] wrote:
> > Hello again,
> > 
> > I can now positively confirm the race condition in cygserver w.r.t. the named
> > pipe used to serialize SYSV requests through the server.  The race is due to
> > that transport_layer_pipes::accept() (bool *const recoverable) (file: transport_pipes.cc) does actually _create_ the pipe when pipe_instance == 0
> > (ironically, transport_layer_pipes::listen() does not create any OS primitives
> > at all!).
> > 
> > This means that under heavy load, cygserver threads may all end up processing
> > their requests and closing all instances of the pipe (bringing pipe_instance == 0)
> > yet not being able to get to the point of accepting new request (that is, to
> > re-create the pipe).  For the client (user process), this looks like the pipe
> > does not exist (during that very tiny period of time), and the following message
> > gets printed:
> > 
> > Iteration 3016
> >       1 [main] a 4872 transport_layer_pipes::connect: lost connection to cygserver, error = 2
> 
> Thanks for analyzing this situation.  IIUC, that means if we create the
> pipe in listen(), and then create another pipe instance per accepting
> thread in accept(), we will always have at least one instance of the pipe,
> whatever the load, right?
> 
> Are you set up to test a patch?  If so, I'd propose the below patch.  It
> would be nice if you could test it in your environment.

Forget it.  That doesn't work.

Surprisingly, the next client calling CreateFile is *not* connected to
any server side of the pipe which calls ConnectNamedPipe, but apparently
the pipes are used in the order of creation.  Thus, the first client
hangs waiting for the pipe instance created in the listen() method, but
since the server never calls ConnectNamedPipe on that instance, the
client will wait forever.

Bummer.

Back to the drawing board...


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Possible race in SYSV IPC (semaphores)
  2012-11-19 21:06 Lavrentiev, Anton (NIH/NLM/NCBI) [C]
@ 2012-11-23 11:37 ` Corinna Vinschen
  2012-11-23 13:11   ` Corinna Vinschen
  0 siblings, 1 reply; 18+ messages in thread
From: Corinna Vinschen @ 2012-11-23 11:37 UTC (permalink / raw)
  To: cygwin

On Nov 19 21:06, Lavrentiev, Anton (NIH/NLM/NCBI) [C] wrote:
> Hello again,
> 
> I can now positively confirm the race condition in cygserver w.r.t. the named
> pipe used to serialize SYSV requests through the server.  The race is due to
> that transport_layer_pipes::accept() (bool *const recoverable) (file: transport_pipes.cc) does actually _create_ the pipe when pipe_instance == 0
> (ironically, transport_layer_pipes::listen() does not create any OS primitives
> at all!).
> 
> This means that under heavy load, cygserver threads may all end up processing
> their requests and closing all instances of the pipe (bringing pipe_instance == 0)
> yet not being able to get to the point of accepting new request (that is, to
> re-create the pipe).  For the client (user process), this looks like the pipe
> does not exist (during that very tiny period of time), and the following message
> gets printed:
> 
> Iteration 3016
>       1 [main] a 4872 transport_layer_pipes::connect: lost connection to cygserver, error = 2

Thanks for analyzing this situation.  IIUC, that means if we create the
pipe in listen(), and then create another pipe instance per accepting
thread in accept(), we will always have at least one instance of the pipe,
whatever the load, right?

Are you set up to test a patch?  If so, I'd propose the below patch.  It
would be nice if you could test it in your environment.


Thanks,
Corinna


	* cygserver.cc (main): Call listen right after creating the
	transport.
	* transport_pipes.cc (transport_layer_pipes::listen): Create
	first instance of the named pipe here.
	(transport_layer_pipes::accept): Don't handle first pipe instance
	here.  Change debug output accordingly.

Index: cygserver.cc
===================================================================
RCS file: /cvs/src/src/winsup/cygserver/cygserver.cc,v
retrieving revision 1.16
diff -u -p -r1.16 cygserver.cc
--- cygserver.cc	10 Oct 2011 15:48:54 -0000	1.16
+++ cygserver.cc	23 Nov 2012 11:31:45 -0000
@@ -715,15 +715,15 @@ main (const int argc, char *argv[])
   transport_layer_base *const transport = create_server_transport ();
   assert (transport);
 
+  if (transport->listen () == -1)
+    return 1;
+
   process_cache cache (process_cache_size, cleanup_threads);
 
   server_submission_loop submission_loop (&request_queue, transport, &cache);
 
   request_queue.add_submission_loop (&submission_loop);
 
-  if (transport->listen () == -1)
-    return 1;
-
   cache.start ();
 
   request_queue.start ();
Index: transport_pipes.cc
===================================================================
RCS file: /cvs/src/src/winsup/cygserver/transport_pipes.cc,v
retrieving revision 1.14
diff -u -p -r1.14 transport_pipes.cc
--- transport_pipes.cc	14 Feb 2012 14:12:11 -0000	1.14
+++ transport_pipes.cc	23 Nov 2012 11:31:45 -0000
@@ -107,7 +107,21 @@ transport_layer_pipes::listen ()
 
   _is_listening_endpoint = true;
 
-  /* no-op */
+  debug ("Try to create named pipe: %ls", _pipe_name);
+
+  const HANDLE listen_pipe =
+    CreateNamedPipeW (_pipe_name,
+		      PIPE_ACCESS_DUPLEX | FILE_FLAG_FIRST_PIPE_INSTANCE,
+		      PIPE_TYPE_BYTE | PIPE_WAIT, PIPE_UNLIMITED_INSTANCES,
+		      0, 0, 1000, &sec_all_nih);
+
+  if (listen_pipe == INVALID_HANDLE_VALUE)
+    {
+      system_printf ("failed to create named pipe: "
+		     "is the daemon already running?");
+      return -1;
+    }
+
   return 0;
 }
 
@@ -125,38 +139,19 @@ transport_layer_pipes::accept (bool *con
   // Read: http://www.securityinternals.com/research/papers/namedpipe.php
   // See also the Microsoft security bulletins MS00-053 and MS01-031.
 
-  // FIXME: Remove FILE_CREATE_PIPE_INSTANCE.
-
-  const bool first_instance = (pipe_instance == 0);
-
-  debug ("Try to create named pipe: %ls", _pipe_name);
+  debug ("Try to create named pipe instance %ld: %ls",
+	 pipe_instance + 1, _pipe_name);
 
   const HANDLE accept_pipe =
-    CreateNamedPipeW (_pipe_name,
-		     (PIPE_ACCESS_DUPLEX
-		      | (first_instance ? FILE_FLAG_FIRST_PIPE_INSTANCE : 0)),
-		     (PIPE_TYPE_BYTE | PIPE_WAIT),
-		     PIPE_UNLIMITED_INSTANCES,
-		     0, 0, 1000,
-		     &sec_all_nih);
-
-  const bool duplicate = (accept_pipe == INVALID_HANDLE_VALUE
-			  && pipe_instance == 0
-			  && GetLastError () == ERROR_ACCESS_DENIED);
+    CreateNamedPipeW (_pipe_name, PIPE_ACCESS_DUPLEX,
+		      PIPE_TYPE_BYTE | PIPE_WAIT, PIPE_UNLIMITED_INSTANCES,
+		      0, 0, 1000, &sec_all_nih);
 
   if (accept_pipe != INVALID_HANDLE_VALUE)
     InterlockedIncrement (&pipe_instance);
 
   LeaveCriticalSection (&pipe_instance_lock);
 
-  if (duplicate)
-    {
-      *recoverable = false;
-      system_printf ("failed to create named pipe: "
-		     "is the daemon already running?");
-      return NULL;
-    }
-
   if (accept_pipe == INVALID_HANDLE_VALUE)
     {
       debug_printf ("error creating pipe (%lu).", GetLastError ());
@@ -164,8 +159,6 @@ transport_layer_pipes::accept (bool *con
       return NULL;
     }
 
-  assert (accept_pipe);
-
   if (!ConnectNamedPipe (accept_pipe, NULL)
       && GetLastError () != ERROR_PIPE_CONNECTED)
     {


-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: Possible race in SYSV IPC (semaphores)
@ 2012-11-19 21:06 Lavrentiev, Anton (NIH/NLM/NCBI) [C]
  2012-11-23 11:37 ` Corinna Vinschen
  0 siblings, 1 reply; 18+ messages in thread
From: Lavrentiev, Anton (NIH/NLM/NCBI) [C] @ 2012-11-19 21:06 UTC (permalink / raw)
  To: cygwin

Hello again,

I can now positively confirm the race condition in cygserver w.r.t. the named
pipe used to serialize SYSV requests through the server.  The race is due to
that transport_layer_pipes::accept() (bool *const recoverable) (file: transport_pipes.cc) does actually _create_ the pipe when pipe_instance == 0
(ironically, transport_layer_pipes::listen() does not create any OS primitives
at all!).

This means that under heavy load, cygserver threads may all end up processing
their requests and closing all instances of the pipe (bringing pipe_instance == 0)
yet not being able to get to the point of accepting new request (that is, to
re-create the pipe).  For the client (user process), this looks like the pipe
does not exist (during that very tiny period of time), and the following message
gets printed:

Iteration 3016
      1 [main] a 4872 transport_layer_pipes::connect: lost connection to cygserver, error = 2

Note "error 2" for ERROR_FILE_NOT_FOUND.

The error is so subtle it is very difficult to reproduce with the default number of
threads in cygserver, 10 (although, not impossible, esp. on a moderately-to-heavily
loaded system).

The error is not due to deadlocking (reported earlier today) -- the server remains
afloat coming to CreateNamedPipe() just a jiffy too late, it's the client that suffers
from the pipe disappearance.  Deadlocking of the logging cygserver seems to remain a
yet another issue.

Anton Lavrentiev
Contractor NIH/NLM/NCBI

P.S. The code to reveal the race is the very same test program posted today
(stripped of the BUGx conditionals):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/sem.h>


#define SEMKEY 1


union semun {
    int val;                    /* value for SETVAL */
    struct semid_ds *buf;       /* buffer for IPC_STAT, IPC_SET */
    unsigned short  *array;     /* array for GETALL, SETALL */
};


static void doCriticalWork(void)
{
    return;
}


int main(void)
{
    struct sembuf lock[2];
    int n, semid;

    if ((semid = semget(SEMKEY, 2, IPC_CREAT | 0666)) < 0) {
        perror("semget(IPC_CREATE)");
        return 1;
    }

    lock[0].sem_num = 0;
    lock[0].sem_op  = 0;
    lock[0].sem_flg = IPC_NOWAIT;
    lock[1].sem_num = 0;
    lock[1].sem_op  = 1;

    lock[1].sem_flg = SEM_UNDO;

    if (semop(semid, lock, 2) != 0) {
        perror("semop(LOCK[0])");
        return 1;
    }

    for (n = 0; ; ++n) {
        static const union semun arg = { 0 };
        int error;

        printf("Iteration %d\n", n);

        lock[0].sem_num = 1;
        lock[0].sem_op  = 0; /* precondition:  [1] == 0 */
        lock[0].sem_flg = 0;
        lock[1].sem_num = 1;
        lock[1].sem_op  = 1; /* postcondition: [1] == 1 */
        lock[1].sem_flg = SEM_UNDO;

        if (semop(semid, lock, 2) < 0) {
            error = errno;
            fprintf(stderr, "semop(LOCK[1]): %d, %s\n",
                    error, strerror(error));
            break;
        }

        doCriticalWork();

        lock[0].sem_num =  1;
        lock[0].sem_op  = -1;
        lock[0].sem_flg =  SEM_UNDO | IPC_NOWAIT;

        if (semop(semid, lock, 1) < 0) {
            error = errno;
            fprintf(stderr, "semop(UNLOCK[1]): %d, %s\n",
                    error, strerror(error));
            break;
        }
    }

    return 1;
}


^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: Possible race in SYSV IPC (semaphores)
@ 2012-11-19 16:23 Lavrentiev, Anton (NIH/NLM/NCBI) [C]
  0 siblings, 0 replies; 18+ messages in thread
From: Lavrentiev, Anton (NIH/NLM/NCBI) [C] @ 2012-11-19 16:23 UTC (permalink / raw)
  To: cygwin

> For testing it would be most helpful if you would create a STC.

The test case attached from my earlier post from today, also shows
a race (dead-lock, actually) occurring in CYGSERVER, when it is run
with the "-d -l 7" command line (even when it is only one application,
the test case is using it).  There is an internal disconcert between
the cygserver threads.

The log ends with this, and there is no way to stop the server
(like with "cygrunsrv -S") other than killing it from the task manager:

cygserver: /home/corinna/src/cygwin/cygwin-1.7.15/cygwin-1.7.15-1/src/cygwin-1.7.15/winsup/cygserver/transport_pipes.cc, line 132: Try to create named pipe: \\.\pipe\cygwin-13a7ed34cc1953a9-lpccygserver: /home/corinna/src/cygwin/cygwin-1.7.15/cygwin-1.7.15-1/src/cygwin-1.7.15/winsup/cygserver/process.cc, line 287: Try hold(4956)
cygserver: /home/corinna/src/cygwin/cygwin-1.7.15/cygwin-1.7.15-1/src/cygwin-1.7.15/winsup/cygserver/process.cc, line 287: holding (4956)

Anton Lavrentiev
Contractor NIH/NLM/NCBI


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2012-11-28 21:20 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-10-26 18:37 Possible race in SYSV IPC (semaphores) Lavrentiev, Anton (NIH/NLM/NCBI) [C]
2012-10-27 10:24 ` Corinna Vinschen
2012-11-19 16:23 Lavrentiev, Anton (NIH/NLM/NCBI) [C]
2012-11-19 21:06 Lavrentiev, Anton (NIH/NLM/NCBI) [C]
2012-11-23 11:37 ` Corinna Vinschen
2012-11-23 13:11   ` Corinna Vinschen
2012-11-23 13:34     ` Corinna Vinschen
2012-11-23 16:44       ` Corinna Vinschen
2012-11-23 17:16         ` Lavrentiev, Anton (NIH/NLM/NCBI) [C]
2012-11-23 17:29           ` Corinna Vinschen
2012-11-23 17:47             ` Lavrentiev, Anton (NIH/NLM/NCBI) [C]
2012-11-23 17:59               ` Corinna Vinschen
2012-11-26 12:54                 ` Corinna Vinschen
2012-11-26 18:09                   ` Lavrentiev, Anton (NIH/NLM/NCBI) [C]
2012-11-26 18:14                     ` Lavrentiev, Anton (NIH/NLM/NCBI) [C]
2012-11-27  7:13                     ` Corinna Vinschen
2012-11-27 14:37                       ` Lavrentiev, Anton (NIH/NLM/NCBI) [C]
2012-11-28 21:20                       ` Lavrentiev, Anton (NIH/NLM/NCBI) [C]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).