public inbox for ecos-discuss@sourceware.org
* [ECOS] Re: accept() FreeBSD hangs when out of resources
@ 2007-06-11 23:15 Tad
  2007-06-12  3:51 ` Andrew Lunn
  0 siblings, 1 reply; 11+ messages in thread
From: Tad @ 2007-06-11 23:15 UTC (permalink / raw)
  To: ecos-discuss

>> accept() won't return and won't timeout (>12hrs) when listen() indicates 
>> a new connection, if out of sockets/file-descriptors and all TCP 
>> connections are in ESTABLISHED state.
> 
> Where exactly is it blocked. Please could you provide a call stack.

I couldn't see why it would hang either, Andrew, but it seems to do so
reliably.

I wish I could help more.  I've submitted 20 hrs' worth of digging.  My
system doesn't have any gdb or printf capabilities.  I think I gave enough
of a reproduction scenario for someone with gdb capabilities to take it
further.



-- 
Before posting, please read the FAQ: http://ecos.sourceware.org/fom/ecos
and search the list archive: http://ecos.sourceware.org/ml/ecos-discuss


* Re: [ECOS] Re: accept() FreeBSD hangs when out of resources
  2007-06-11 23:15 [ECOS] Re: accept() FreeBSD hangs when out of resources Tad
@ 2007-06-12  3:51 ` Andrew Lunn
  2007-06-12  3:57   ` Tad
  2007-06-12  4:05   ` [ECOS] Re: Re: accept() FreeBSD hangs when out of resources Tad
  0 siblings, 2 replies; 11+ messages in thread
From: Andrew Lunn @ 2007-06-12  3:51 UTC (permalink / raw)
  To: Tad; +Cc: ecos-discuss

On Mon, Jun 11, 2007 at 03:42:07PM -0800, Tad wrote:
> >>accept() won't return and won't timeout (>12hrs) when listen() indicates 
> >>a new connection, if out of sockets/file-descriptors and all TCP 
> >>connections are in ESTABLISHED state.
> >
> >Where exactly is it blocked. Please could you provide a call stack.
> 
> Couldn't see why it would hang either, Andrew, but seems to reliably.
> 
> Wish I could help more.  Submitted 20 hrs of digging.  My system doesn't 
> have any gdb or printf capabilities.  Think I gave enough reproduction 
> situation for someone with gdb capabilities to take it further.

For situations like this I find working on the synthetic target much
better. You have full gdb support, diag_printf etc. 

What I would ideally like is a test case we can add to the standard
tests. The test case should fail now, but once we have fixed the problem
we can keep the test case for regression tests.

   Andrew


* Re: [ECOS] Re: accept() FreeBSD hangs when out of resources
  2007-06-12  3:51 ` Andrew Lunn
@ 2007-06-12  3:57   ` Tad
  2007-06-12  6:54     ` Andrew Lunn
  2007-06-12  4:05   ` [ECOS] Re: Re: accept() FreeBSD hangs when out of resources Tad
  1 sibling, 1 reply; 11+ messages in thread
From: Tad @ 2007-06-12  3:57 UTC (permalink / raw)
  To: ecos-discuss

Andrew Lunn wrote:
> On Mon, Jun 11, 2007 at 03:42:07PM -0800, Tad wrote:
>   
>>>> accept() won't return and won't timeout (>12hrs) when listen() indicates 
>>>> a new connection, if out of sockets/file-descriptors and all TCP 
>>>> connections are in ESTABLISHED state.
>>>>         
>>> Where exactly is it blocked. Please could you provide a call stack.

It's possible that the block is somewhere such as this "FIXME" code that 
wasn't finished in sys/kern/sockio.c

    /*
     * At this point we know that there is at least one connection
     * ready to be accepted. Remove it from the queue prior to
     * allocating the file descriptor for it since falloc() may
     * block allowing another process to accept the connection
     * instead.
     */
    so = TAILQ_FIRST(&head->so_comp);
    TAILQ_REMOVE(&head->so_comp, so, so_list);
    head->so_qlen--;

#if 0 // FIXME
    fflag = lfp->f_flag;
    error = falloc(p, &nfp, &fd);
    if (error) {
        /*
         * Probably ran out of file descriptors. Put the
         * unaccepted connection back onto the queue and
         * do another wakeup so some other process might
         * have a chance at it.
         */
        TAILQ_INSERT_HEAD(&head->so_comp, so, so_list);
        head->so_qlen++;
        wakeup_one(&head->so_timeo);
        splx(s);
        goto done;
    }
    fhold(nfp);
    p->p_retval[0] = fd;

    /* connection has been removed from the listen queue */
    KNOTE(&head->so_rcv.sb_sel.si_note, 0);
#endif

    so->so_state &= ~SS_COMP;
    so->so_head = NULL;

    cyg_selinit(&so->so_rcv.sb_sel);
    cyg_selinit(&so->so_snd.sb_sel);

    new_fp->f_type      = DTYPE_SOCKET;
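
The requeue-on-failure idea in the disabled block can be looked at in
isolation. What follows is only a minimal host-side sketch, not the eCos or
FreeBSD code: struct conn, struct listener, fds_left, try_alloc_fd() and
take_connection() are hypothetical names made up for illustration, and only
the TAILQ macros from <sys/queue.h> are real.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-ins for the socket structures in the excerpt. */
struct conn {
    int id;
    TAILQ_ENTRY(conn) link;
};
TAILQ_HEAD(conn_queue, conn);

struct listener {
    struct conn_queue completed;  /* plays the role of so_comp */
    int qlen;                     /* plays the role of so_qlen */
};

/* Hypothetical descriptor budget: allocation fails once it reaches 0. */
int fds_left = 1;

int try_alloc_fd(void)
{
    if (fds_left <= 0)
        return -1;
    fds_left--;
    return 0;
}

/*
 * Take the first completed connection off the queue. If no descriptor
 * can be allocated, put the connection back at the head of the queue so
 * that it is not lost, and return NULL.
 */
struct conn *take_connection(struct listener *l)
{
    struct conn *c;

    assert(l != NULL);
    c = TAILQ_FIRST(&l->completed);
    if (c == NULL)
        return NULL;              /* nothing waiting */

    TAILQ_REMOVE(&l->completed, c, link);
    l->qlen--;

    if (try_alloc_fd() != 0) {
        /* Out of descriptors: requeue and restore the count. */
        TAILQ_INSERT_HEAD(&l->completed, c, link);
        l->qlen++;
        return NULL;
    }
    return c;
}
```

With the requeue in place, a failed allocation leaves the queue and its
length exactly as they were before the call.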


* Re: [ECOS] Re: Re: accept() FreeBSD hangs when out of resources
  2007-06-12  3:51 ` Andrew Lunn
  2007-06-12  3:57   ` Tad
@ 2007-06-12  4:05   ` Tad
  2007-06-12 11:06     ` Andrew Lunn
  1 sibling, 1 reply; 11+ messages in thread
From: Tad @ 2007-06-12  4:05 UTC (permalink / raw)
  To: ecos-discuss

Andrew Lunn wrote:
> What i would ideally like is a test case we can add to the standard
> tests. The test case should fail now, but once we have fixed the problem
> we can keep the test case for regression tests.

16 rapid HTTP POSTs to any ATHTTP server compiled with 16 max sockets 
should lock the server up forever (as long as they arrive within 
CYG_HTTPD_SOCKET_IDLE_TIMEOUT (300) secs, so the TCP conns stay in 
ESTABLISHED rather than TIME_WAIT).

FWIW, I raised both of the max file settings (NFD, NFILES?) while keeping 
MAX_SOCKETS the same, and saw no change.


* Re: [ECOS] Re: accept() FreeBSD hangs when out of resources
  2007-06-12  3:57   ` Tad
@ 2007-06-12  6:54     ` Andrew Lunn
  2007-06-12 15:37       ` Tad
  2007-06-12 16:08       ` [ECOS] listen (x, 0) on new TCP incoming connections doesn't stop select()/accept() Tad
  0 siblings, 2 replies; 11+ messages in thread
From: Andrew Lunn @ 2007-06-12  6:54 UTC (permalink / raw)
  To: Tad; +Cc: eCos Disuss

On Mon, Jun 11, 2007 at 04:05:57PM -0800, Tad wrote:
> Andrew Lunn wrote:
> >On Mon, Jun 11, 2007 at 03:42:07PM -0800, Tad wrote:
> >  
> >>>>accept() won't return and won't timeout (>12hrs) when listen() 
> >>>>indicates a new connection, if out of sockets/file-descriptors and all 
> >>>>TCP connections are in ESTABLISHED state.
> >>>>        
> >>>Where exactly is it blocked. Please could you provide a call stack.
> 
> It's possible that the block is somewhere such as this "FIXME" code that 
> wasn't finished in sys/kern/sockio.c

Yes, I already looked at this code. However, this code is creating a
new file descriptor, whereas the way eCos works is that the file
descriptor has already been allocated and is passed into the function
as a parameter. So I went back and looked at what calls this function
and where the file descriptor is allocated. That code does appear to
correctly handle insufficient resources.

So I really need more information, e.g. the test case, or a backtrace
from when the thread is blocked.

     Andrew
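
The ordering described here (reserve the descriptor first, then hand it to
the lower layer) can be sketched as below. This is an illustration under
assumptions only, not the eCos fileio code: TABLE_SIZE, struct file_slot,
slot_alloc(), slot_free(), lower_accept() and upper_accept() are all
hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TABLE_SIZE 4                 /* hypothetical descriptor table size */

struct file_slot {
    int   in_use;
    void *data;                      /* would point at the accepted socket */
};

static struct file_slot table[TABLE_SIZE];

/* Reserve the lowest free slot; return its index, or -1 if none is free. */
int slot_alloc(void)
{
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (!table[i].in_use) {
            table[i].in_use = 1;
            return i;
        }
    }
    return -1;
}

void slot_free(int fd)
{
    assert(fd >= 0 && fd < TABLE_SIZE);
    table[fd].in_use = 0;
    table[fd].data = NULL;
}

/* Hypothetical lower layer: fills in a slot the caller already reserved. */
int lower_accept(struct file_slot *new_fp, void *pending)
{
    if (pending == NULL)
        return EWOULDBLOCK;          /* nothing to accept */
    new_fp->data = pending;
    return 0;
}

/* Upper layer: reserve first, then accept; report EMFILE when full. */
int upper_accept(void *pending, int *err)
{
    int fd = slot_alloc();
    if (fd < 0) {
        *err = EMFILE;               /* table full, lower layer not called */
        return -1;
    }
    *err = lower_accept(&table[fd], pending);
    if (*err != 0) {
        slot_free(fd);               /* give the reserved slot back */
        return -1;
    }
    return fd;
}
```

In this arrangement the lower layer never has to allocate a descriptor, so
a full table is reported by the caller before any connection is dequeued.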


* Re: [ECOS] Re: Re: accept() FreeBSD hangs when out of resources
  2007-06-12  4:05   ` [ECOS] Re: Re: accept() FreeBSD hangs when out of resources Tad
@ 2007-06-12 11:06     ` Andrew Lunn
  2007-06-12 11:19       ` Andrew Lunn
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Lunn @ 2007-06-12 11:06 UTC (permalink / raw)
  To: Tad; +Cc: ecos-discuss

On Mon, Jun 11, 2007 at 04:14:45PM -0800, Tad wrote:
> Andrew Lunn wrote:
> >What i would ideally like is a test case we can add to the standard
> >tests. The test case should fail now, but once we have fixed the problem
> >we can keep the test case for regression tests.
> 
> 16 rapid http POSTS to any ATHTTP server compiled with 16 max sockets 
> should lock the server up forever (as long as they're 
> <CYG_HTTPD_SOCKET_IDLE_TIMEOUT(300) secs so the TCP conns stay in 
> ESTABLISHED rather than TIME_WAIT)

So, maybe you can modify the server test case, assuming there is one,
by adding a new thread which makes 16 connections to 127.0.0.1:80.

   Andrew
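
The connecting side of such a test could look roughly like the sketch
below. It is a host-side POSIX illustration, not an eCos test case:
open_and_hold() and close_all() are hypothetical helper names, and running
the same calls from a thread on the target is assumed rather than shown.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Open up to `count` TCP connections to ip:port and keep them open.
 * The descriptors are stored in fds[]; the return value is the number
 * of connections that were actually established.
 */
int open_and_hold(const char *ip, unsigned short port, int count, int *fds)
{
    struct sockaddr_in addr;
    int opened = 0;

    assert(ip != NULL && fds != NULL && count >= 0);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1)
        return 0;                      /* not a valid IPv4 address */

    while (opened < count) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            break;                     /* client side ran out of sockets */
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
            close(fd);
            break;                     /* server refused or timed out */
        }
        fds[opened++] = fd;            /* hold the connection open */
    }
    return opened;
}

/* Close every descriptor previously returned by open_and_hold(). */
void close_all(const int *fds, int count)
{
    for (int i = 0; i < count; i++)
        close(fds[i]);
}
```

Calling open_and_hold("127.0.0.1", 80, 16, fds) and then leaving the
descriptors open would hold 16 connections in ESTABLISHED while the server
thread is observed.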


* Re: [ECOS] Re: Re: accept() FreeBSD hangs when out of resources
  2007-06-12 11:06     ` Andrew Lunn
@ 2007-06-12 11:19       ` Andrew Lunn
       [not found]         ` <466F2FC7.8060704@ds3switch.com>
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Lunn @ 2007-06-12 11:19 UTC (permalink / raw)
  To: Tad; +Cc: eCos Disuss

On Tue, Jun 12, 2007 at 05:51:04AM +0200, Andrew Lunn wrote:
> On Mon, Jun 11, 2007 at 04:14:45PM -0800, Tad wrote:
> > Andrew Lunn wrote:
> > >What i would ideally like is a test case we can add to the standard
> > >tests. The test case should fail now, but once we have fixed the problem
> > >we can keep the test case for regression tests.
> > 
> > 16 rapid http POSTS to any ATHTTP server compiled with 16 max sockets 
> > should lock the server up forever (as long as they're 
> > <CYG_HTTPD_SOCKET_IDLE_TIMEOUT(300) secs so the TCP conns stay in 
> > ESTABLISHED rather than TIME_WAIT)
> 
> So, maybe you can modify the server test case, assuming there is one,
> by adding a new thread which makes 16 connections to 127.0.0.1:80.

Actually, tcp_lo_test.c probably has 90% of the code you need for
writing a much simpler test case.

        Andrew


* Re: [ECOS] Re: accept() FreeBSD hangs when out of resources
  2007-06-12  6:54     ` Andrew Lunn
@ 2007-06-12 15:37       ` Tad
  2007-06-12 15:49         ` Lars Povlsen
  2007-06-12 16:08       ` [ECOS] listen (x, 0) on new TCP incoming connections doesn't stop select()/accept() Tad
  1 sibling, 1 reply; 11+ messages in thread
From: Tad @ 2007-06-12 15:37 UTC (permalink / raw)
  To: eCos Disuss



Andrew Lunn wrote:
> On Mon, Jun 11, 2007 at 04:05:57PM -0800, Tad wrote:
>   
>> Andrew Lunn wrote:
>>     
>>> On Mon, Jun 11, 2007 at 03:42:07PM -0800, Tad wrote:
>>>  
>>>       
>>>>>> accept() won't return and won't timeout (>12hrs) when listen() 
>>>>>> indicates a new connection, if out of sockets/file-descriptors and all 
>>>>>> TCP connections are in ESTABLISHED state.
>>>>>>        
>>>>>>             
>>>>> Where exactly is it blocked. Please could you provide a call stack.
>>>>>           
More info:
It seems to be dependent on CYGNUM_FILEIO_NFILE rather than 
CYGPKG_NET_MAXSOCKETS.  Reducing NFILE below MAXSOCKETS causes accept() to 
hang with fewer established connections than before the reduction.



* RE: [ECOS] Re: accept() FreeBSD hangs when out of resources
  2007-06-12 15:37       ` Tad
@ 2007-06-12 15:49         ` Lars Povlsen
  0 siblings, 0 replies; 11+ messages in thread
From: Lars Povlsen @ 2007-06-12 15:49 UTC (permalink / raw)
  To: eCos Disuss


This seems a lot like the problem I've seen - and reported on 17/4-07.
I've been able to occasionally reproduce it manually with a browser
(MSIE), but enabling TCP debug logging causes the problem to go away
(not occur).

AFAICS, it is a race condition in the TCP stack causing socket buffers
to be leaked (forever). Calling cyg_kmem_print_stats() displays the
problem (but you need reset to recover :-() :

Network stack mbuf stats:
   mbufs 97, clusters 60, free clusters 1
   Failed to get 0 times
   Waited to get 0 times
   Drained queues to get 0 times
VM zone 'ripcb':
  Total: 64, Free: 64, Allocs: 0, Frees: 0, Fails: 0
VM zone 'tcpcb':
  Total: 64, Free: 61, Allocs: 353, Frees: 350, Fails: 0
VM zone 'udpcb':
  Total: 64, Free: 63, Allocs: 4, Frees: 3, Fails: 0
VM zone 'socket':
  Total: 64, *Free: 0*, Allocs: 365, Frees: 293, Fails: 8
Misc mpool: total   98304, free    4192, max free block 3748
Mbufs pool: total   81792, free   69248, blocksize  128
Clust pool: total  163840, free   38912, blocksize 2048

FWIW, I have not had time to dig into this (as my attempts to produce a
test bench have failed...)

---Lars


* [ECOS] listen (x, 0) on new TCP incoming connections doesn't stop select()/accept()
  2007-06-12  6:54     ` Andrew Lunn
  2007-06-12 15:37       ` Tad
@ 2007-06-12 16:08       ` Tad
  1 sibling, 0 replies; 11+ messages in thread
From: Tad @ 2007-06-12 16:08 UTC (permalink / raw)
  To: eCos Disuss

FWIW, as far as I can tell, and without fully understanding the internals of 
listen():

It appears that attempting to stop accepting incoming TCP connections by 
calling listen(x, backlog=0) (after an initial listen(x, >0), if that's 
relevant) doesn't stop incoming TCP SYN connection requests from being 
ACK'd and appearing in select() and accept().  I thought it should have.
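
For reference, POSIX allows a backlog of 0 to still accept connections,
with the queue length set to an implementation-defined minimum, so
listen(x, 0) is not a portable way to stop new connections. One
alternative, sketched here under the assumption that the application can
afford to re-create the listening socket later, is to close it.
stop_listening() is a hypothetical helper name, not an eCos or atHTTP
function.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Stop accepting new connections by closing the listening socket
 * instead of relying on listen(fd, 0). The caller's descriptor is set
 * to -1 so that it cannot be passed to select() or accept() again.
 * Returns 0 on success (or if already stopped), -1 if close() failed.
 */
int stop_listening(int *listen_fd)
{
    int rc;

    assert(listen_fd != NULL);
    if (*listen_fd < 0)
        return 0;                /* already stopped */

    rc = close(*listen_fd);
    *listen_fd = -1;
    return rc;
}
```

To resume, the server would create, bind and listen on a fresh socket once
descriptors have been freed.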



* [ECOS] "Fix" for atHTTP and HTTP socket requirements with mozilla POSTS
       [not found]         ` <466F2FC7.8060704@ds3switch.com>
@ 2007-06-13  0:09           ` Tad
  0 siblings, 0 replies; 11+ messages in thread
From: Tad @ 2007-06-13  0:09 UTC (permalink / raw)
  To: ecos-discuss

"Fix" for hanging atHTTP client requests on out-of-sockets.

Background:
It's somewhat known that atHTTP will "pause" for several minutes when 
it runs out of sockets.  One reason this can happen is that mozilla 
opens a new TCP connection for each POST or chunked-transfer (I think) 
GET, which requires a new socket for each.  The remnant sockets are 
eventually (300 sec default) shut down by atHTTP and then enter the TCP 
TIME_WAIT state, which lasts 2xMSL or something like another 2-4 minutes -- 
but that's a long time.  BTW, this assumes you don't hit the bug where 
accept() hangs when out of sockets.  I'll post a solution for that in a 
couple of days.

"Solution:"
Mozilla (on XP) appears to get smart after opening about 10 TCP 
connections, and starts FIN,ACK-ing them to shut them down, so atHTTP 
doesn't have to sit in TIME_WAIT or wait for the atHTTP timeout on the old 
connections.

So, setting CYGPKG_NET_MAXSOCKETS to something > 10 x the number of users, 
plus a couple for sockets that other eCos apps have open, will allow 
"unlimited" mozilla (XP) client requests without any timeouts.
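
As a worked example of that sizing rule, here is a small sketch. The
factor of 10 is only the observation above about mozilla on XP, and
CONNS_PER_USER and min_max_sockets() are hypothetical names, not eCos
configuration options.

```c
#include <assert.h>

/* Roughly ten connections per browser user, per the observation above. */
#define CONNS_PER_USER 10

/*
 * Smallest socket count that is strictly greater than
 * CONNS_PER_USER * users + other_sockets.
 */
int min_max_sockets(int users, int other_sockets)
{
    assert(users >= 0 && other_sockets >= 0);
    return CONNS_PER_USER * users + other_sockets + 1;
}
```

For 3 users and 2 sockets held by other eCos applications this gives
10 * 3 + 2 + 1 = 33, so CYGPKG_NET_MAXSOCKETS would be set to at least 33.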


