[ECOS] BSD socket stall

public inbox for ecos-discuss@sourceware.org

* [ECOS] BSD socket stall
@ 2012-09-08 17:43 henry mahler
  2012-09-13 13:14 ` Bernd Edlinger
       [not found] ` <BAY146-W727B35516EEA4449F4ECCE4910@phx.gbl>
  0 siblings, 2 replies; 7+ messages in thread
From: henry mahler @ 2012-09-08 17:43 UTC (permalink / raw)
  To: ecos-discuss

Hi,

I am working with an Atmel-based ARM board with the BSD stack configured. The
board seems to have an issue with sending data on a UDP socket.
We are sending a lot of data through the socket, and in rare instances the
socket seems to “stall”.
Attaching gdb to the target shows the sending thread paused in “sosend”
at the “sblock”. I can see why the socket could stall when it is at a
high-water mark.
But the socket does not seem to recover from this condition.
My confusion is that I do not understand or see how the socket handles this
condition.
Could someone point me in the direction of how this condition is handled, or
in the direction of what our implementation is missing?

Thanks,
Henry Mahler
Time Domain Corp.  

--
Before posting, please read the FAQ: http://ecos.sourceware.org/fom/ecos
and search the list archive: http://ecos.sourceware.org/ml/ecos-discuss


* RE: [ECOS] BSD socket stall
  2012-09-08 17:43 [ECOS] BSD socket stall henry mahler
@ 2012-09-13 13:14 ` Bernd Edlinger
  2012-09-13 21:02   ` Lambrecht Jürgen
       [not found] ` <BAY146-W727B35516EEA4449F4ECCE4910@phx.gbl>
  1 sibling, 1 reply; 7+ messages in thread
From: Bernd Edlinger @ 2012-09-13 13:14 UTC (permalink / raw)
  To: ecos-discuss


Hello Henry,
 
> I am working with an Atmel-based ARM board with the BSD stack configured. The
> board seems to have an issue with sending data on a UDP socket.
> We are sending a lot of data through the socket, and in rare instances the
> socket seems to “stall”.
> Attaching gdb to the target shows the sending thread paused in “sosend”
> at the “sblock”. I can see why the socket could stall when it is at a
> high-water mark.
> But the socket does not seem to recover from this condition.
> My confusion is that I do not understand or see how the socket handles this
> condition.
> Could someone point me in the direction of how this condition is handled, or
> in the direction of what our implementation is missing?
The BSD stack uses a simple spin lock to prevent multiple threads from
entering the send path at the same time. That means that if you are using
more than one thread to send data on the UDP socket, one thread was probably
interrupted while it was inside the sosend function. This spin lock is really,
really simple. For instance, it does not use priority inheritance at all.
And if your spin lock is occupied 99.9% of the time, you're out of luck too.
 
Therefore, I'd suggest you place an eCos Kernel Mutex object with priority
inheritance around the sendto function call(s).
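
A minimal sketch of this suggestion, written against POSIX threads so that it can be compiled on a host; it is not eCos code. The names send_lock, send_lock_init and locked_sendto are made up for illustration. On eCos the same pattern would use a kernel mutex (cyg_mutex_t) with priority inheritance enabled in the kernel configuration, which is an assumption about your setup.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical application-level lock shared by all sending threads. */
static pthread_mutex_t send_lock;

/* Call once before any sender thread starts. Returns 0 on success. */
int send_lock_init(void)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;
    /* Priority inheritance: a low-priority holder is boosted while a
     * higher-priority sender waits for the lock. */
    rc = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (rc == 0)
        rc = pthread_mutex_init(&send_lock, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Wrapper with the same arguments as sendto(): only one thread at a
 * time reaches the socket layer, so sblock is never contended. */
ssize_t locked_sendto(int fd, const void *buf, size_t len, int flags,
                      const struct sockaddr *to, socklen_t tolen)
{
    ssize_t n;

    pthread_mutex_lock(&send_lock);
    n = sendto(fd, buf, len, flags, to, tolen);
    pthread_mutex_unlock(&send_lock);
    return n;
}
```

Every sender thread would call locked_sendto() instead of sendto(); the lock is released on both the success and the error path, so a failed send cannot leave other threads blocked.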
 
By the way, there may be another issue when you use unicast UDP sends.
 
It is as follows: if your ARP entry expires while your application sends
many UDP telegrams in a very short time, you will lose some packets while
the BSD stack is waiting for the ARP response. The ARP timeout is 20 minutes
by default, so you can expect some data loss every 20 minutes.
 
I worked around this by sending a unicast ARP request whenever a message is
sent after 90-99% of the ARP timeout has elapsed.
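
The refresh decision can be sketched as a small predicate. This is only an illustration of the idea, not the code from the patch: the function name, the 90% threshold constant and the use of seconds are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Default ARP entry lifetime mentioned above: 20 minutes, in seconds. */
#define ARP_TIMEOUT_S   (20u * 60u)
/* Hypothetical threshold: refresh once 90% of the lifetime is used up. */
#define ARP_REFRESH_PCT 90u

/* Returns true when an entry of the given age is still valid but close
 * enough to expiry that a unicast ARP request should be sent now, so the
 * entry is renewed before outgoing packets have to wait for a reply. */
bool arp_should_refresh(uint32_t age_s)
{
    uint32_t threshold_s = (ARP_TIMEOUT_S * ARP_REFRESH_PCT) / 100u;

    return age_s >= threshold_s && age_s < ARP_TIMEOUT_S;
}
```

With these values the refresh window opens at 1080 seconds and closes when the entry expires at 1200 seconds.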
 
Recently I fixed that and a lot of other issues in the BSD stack:
You might like to try this: http://bugs.ecos.sourceware.org/show_bug.cgi?id=1001656
 
And maybe the improved AT91 Ethernet driver too: http://bugs.ecos.sourceware.org/show_bug.cgi?id=1001649
 
I had spurious interrupts with the original driver on the AT91SAM9G45 when I started
two or more flood pings at the same time, but this must also happen with other AT91 devices.
 
Which one are you using?
 
Regards
Bernd Edlinger 		 	   		  


* Re: [ECOS] BSD socket stall
  2012-09-13 13:14 ` Bernd Edlinger
@ 2012-09-13 21:02   ` Lambrecht Jürgen
  0 siblings, 0 replies; 7+ messages in thread
From: Lambrecht Jürgen @ 2012-09-13 21:02 UTC (permalink / raw)
  To: Bernd Edlinger; +Cc: ecos-discuss

Hello Bernd,

Nice to read that you worked on the same problems as I did a while ago.

On 09/13/2012 03:14 PM, Bernd Edlinger wrote:
> Hello Henry,
>
>> I am working with an Atmel-based ARM board with the BSD stack configured. The
>> board seems to have an issue with sending data on a UDP socket.
>> We are sending a lot of data through the socket, and in rare instances the
>> socket seems to “stall”.
>> Attaching gdb to the target shows the sending thread paused in “sosend”
>> at the “sblock”. I can see why the socket could stall when it is at a
>> high-water mark.
>> But the socket does not seem to recover from this condition.
>> My confusion is that I do not understand or see how the socket handles this
>> condition.
>> Could someone point me in the direction of how this condition is handled, or
>> in the direction of what our implementation is missing?
> The BSD stack uses a simple spin lock to prevent multiple threads from
> entering the send path at the same time. That means that if you are using
> more than one thread to send data on the UDP socket, one thread was probably
> interrupted while it was inside the sosend function. This spin lock is really,
> really simple. For instance, it does not use priority inheritance at all.
> And if your spin lock is occupied 99.9% of the time, you're out of luck too.
>
> Therefore, I'd suggest you place an eCos Kernel Mutex object with priority
> inheritance around the sendto function call(s).
>
> By the way, there may be another issue when you use unicast UDP sends.
>
> It is as follows: if your ARP entry expires while your application sends
> many UDP telegrams in a very short time, you will lose some packets while
> the BSD stack is waiting for the ARP response. The ARP timeout is 20 minutes
> by default, so you can expect some data loss every 20 minutes.
>
> I worked around this by sending a unicast ARP request whenever a message is
> sent after 90-99% of the ARP timeout has elapsed.
I used a static ARP entry to work around this, and filed it as a bug in 
our bug-tracking system.
>
> Recently I fixed that and a lot of other issues in the BSD stack:
> You might like to try this: http://bugs.ecos.sourceware.org/show_bug.cgi?id=1001656
I will try to find time to integrate and use your patch.
>
> And maybe the improved AT91 Ethernet driver too: http://bugs.ecos.sourceware.org/show_bug.cgi?id=1001649
I mailed my improvements long ago 
(http://old.nabble.com/bugs-in-AT91-Ethernet-driver-td17569021.html), 
but never found time for a proper patch :-(.

Kind regards,
Jürgen
>
> I had spurious interrupts with the original driver on the AT91SAM9G45 when I started
> two or more flood pings at the same time, but this must also happen with other AT91 devices.
>
> Which one are you using?
>
> Regards
> Bernd Edlinger 		 	   		
>


-- 
Jürgen Lambrecht
R&D Associate
Tel: +32 (0)51 303045    Fax: +32 (0)51 310670
http://www.televic-rail.com
Televic Rail NV - Leo Bekaertlaan 1 - 8870 Izegem - Belgium
Company number 0825.539.581 - RPR Kortrijk


* RE: [ECOS] BSD socket stall
       [not found]   ` <CAH0LssZ1+fGEESuzmZuRn0G+8D+yOcSaXu8ZBXruR36OUzdQxQ@mail.gmail.com>
  2012-09-14  8:21     ` Bernd Edlinger
@ 2012-09-14  8:21     ` Bernd Edlinger
  2012-09-14 14:19       ` Bernd Edlinger
  1 sibling, 1 reply; 7+ messages in thread
From: Bernd Edlinger @ 2012-09-14  8:21 UTC (permalink / raw)
  To: henry.mahler, ecos-discuss


Henry,
> Hi Bernd,
>
> Thank you for the reply. Our implementation did have two threads
> sending data on one socket.
> We have changed the code so that each thread has its own socket.
>
> I did look through the modifications for the BSD stack, but I did not see
> where the spin lock was changed to a mutex. Is that change in the diff? As a
> matter of fact, I do not see where the spin lock is implemented either.
>
I did not touch that code, because I do not have too much contention in that
spin lock:
int
sb_lock(sb)
    register struct sockbuf *sb;
{
    int error;

    /* Wait until the current holder clears SB_LOCK. */
    while (sb->sb_flags & SB_LOCK) {
        /* Tell the holder that someone needs a wakeup. */
        sb->sb_flags |= SB_WANT;
        error = tsleep((caddr_t)&sb->sb_flags,
            (sb->sb_flags & SB_NOINTR) ? PSOCK : PSOCK|PCATCH,
            "sblock", 0);
        if (error)
            return (error);
    }
    /* The lock is free: take it. */
    sb->sb_flags |= SB_LOCK;
    return (0);
}

This waits for the SB_LOCK bit to clear and then sets SB_LOCK again.
What might have happened here is a priority inversion.
However, this might also be a real bug...
I am not sure at the moment whether this code is missing
the splnet mutex.
Currently, in sosend():
    error = sblock(&so->so_snd, SBLOCKWAIT(so,flags));
    if (error)
        goto out;
=>
    s = splnet();
    error = sblock(&so->so_snd, SBLOCKWAIT(so,flags));
    splx(s);
    if (error)
        goto out;
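
The effect of the proposed change can be modelled on a host with POSIX threads. This is a sketch under assumptions, not the eCos implementation: struct sockbuf_model and the sbm_* functions are invented names, the mutex stands in for splnet()/splx(), and the condition variable stands in for tsleep()/wakeup().

```c
#include <pthread.h>

#define SB_LOCK 0x01   /* buffer is locked */
#define SB_WANT 0x02   /* someone is waiting for the lock */

/* Hypothetical stand-in for struct sockbuf; only the flags matter here. */
struct sockbuf_model {
    int             sb_flags;
    pthread_mutex_t guard;   /* plays the role of splnet()/splx() */
    pthread_cond_t  wakeup;  /* plays the role of tsleep()/wakeup() */
};

void sbm_init(struct sockbuf_model *sb)
{
    sb->sb_flags = 0;
    pthread_mutex_init(&sb->guard, NULL);
    pthread_cond_init(&sb->wakeup, NULL);
}

/* Same loop as sb_lock(), but every test and update of sb_flags happens
 * while the guard is held, so it cannot interleave with sbm_unlock(). */
void sbm_lock(struct sockbuf_model *sb)
{
    pthread_mutex_lock(&sb->guard);
    while (sb->sb_flags & SB_LOCK) {
        sb->sb_flags |= SB_WANT;
        /* Releases the guard while sleeping, reacquires it on wakeup. */
        pthread_cond_wait(&sb->wakeup, &sb->guard);
    }
    sb->sb_flags |= SB_LOCK;
    pthread_mutex_unlock(&sb->guard);
}

void sbm_unlock(struct sockbuf_model *sb)
{
    pthread_mutex_lock(&sb->guard);
    sb->sb_flags &= ~SB_LOCK;
    if (sb->sb_flags & SB_WANT) {
        sb->sb_flags &= ~SB_WANT;
        pthread_cond_broadcast(&sb->wakeup);
    }
    pthread_mutex_unlock(&sb->guard);
}
```

The point of the model is that the flag word is never read or written outside the guard, which is what wrapping sblock() in splnet()/splx() is meant to achieve.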

What are the priorities of your writing threads?
And are there other threads with priorities in between?
> Back to changing the code so that each thread has its own socket. Is
> there any history of a socket having problems when it receives
> data but no thread is reading from that socket? The socket is intended
> for transmit only, but another node happened to send data to that socket and
> the data was allowed to stay in the socket unread.
Good point. I usually allocate 1-2 megabytes for MBUFs, but that might
not always be possible.
If you are concerned that the socket accumulates packet buffers,
there is a socket option SO_RCVBUF; you could try setting it to zero, and then
the socket should discard any garbage that is received accidentally.
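
A minimal sketch of that call using the standard sockets API. The helper name limit_rcvbuf is made up, and the size is left as a parameter because some BSD-derived stacks may reject or round very small values; whether zero is accepted on a given eCos configuration is an assumption to verify, so the return value should be checked.

```c
#include <sys/types.h>
#include <sys/socket.h>

/* Ask the stack to cap the receive buffer of a transmit-only socket.
 * Returns 0 on success, -1 with errno set if the stack refuses the size. */
int limit_rcvbuf(int fd, int bytes)
{
    return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof bytes);
}
```

It would be called once, right after socket() and before the first send, for example limit_rcvbuf(fd, 0), falling back to a small positive size if that call fails.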
> Thanks
> Henry
>
Regards
Bernd Edlinger 		 	   		  


* RE: [ECOS] BSD socket stall
  2012-09-14  8:21     ` Bernd Edlinger
@ 2012-09-14 14:19       ` Bernd Edlinger
       [not found]         ` <CAH0LssbJX2f5U6wtxe+KjbqPZXDTJqPm8Awua6KC+o5oOxQx8Q@mail.gmail.com>
  0 siblings, 1 reply; 7+ messages in thread
From: Bernd Edlinger @ 2012-09-14 14:19 UTC (permalink / raw)
  To: henry.mahler, ecos-discuss

Ok Henry,

I tried it in the debugger, and the mutex is really missing, so this could easily cause a deadlock as you described.

I updated bug 1001656 to include this fix. The same bug is in soreceive() and shutdown(SHUT_RD/SHUT_RDWR) too.

Could you do me a favor, please: try the new BSD stack together with the previous version
of your software, and tell me whether this fixes the problem now?

Thanks
Bernd. 		 	   		  


* RE: [ECOS] BSD socket stall
       [not found]         ` <CAH0LssbJX2f5U6wtxe+KjbqPZXDTJqPm8Awua6KC+o5oOxQx8Q@mail.gmail.com>
@ 2012-09-15 17:44           ` Bernd Edlinger
  0 siblings, 0 replies; 7+ messages in thread
From: Bernd Edlinger @ 2012-09-15 17:44 UTC (permalink / raw)
  To: henry.mahler, ecos-discuss

Hi Henry,
> We will try out this change, but we have already changed our application
> to have separate sockets for each thread. I am not sure we could provide
> an answer to the question of whether this change fixes the issue in our
> case.
>
> I am just surprised that an issue like this could still be in the BSD
> stack after all these years. I mean, eCos and the FreeBSD stack have been
> out for what, 10+ years? Am I clueless for having assumed that having two
> threads send on one socket was OK? I believed that sockets would be
> thread-safe; I guess that is not the case.
> Thanks, 
> Henry 
For UDP sockets that use case is perfectly OK.
You will see, the new version will handle this correctly,
except for the still possible priority inversion.
For TCP sockets the results are undefined, except
when the message size is always exactly 1 byte.

Why did that problem not occur before? Hard to tell.
For instance, I believe that the problem arises from
interrupting this statement in sb_lock():
sb->sb_flags |= SB_WANT;
That will be an atomic instruction like OR [bx],0x2 on Intel,
but at least 3 assembler instructions on ARM.
So you should have no problem at all on Intel.
Still, this is a perfect example of what happens
when you use a condition object without a mutex.
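
One interleaving that is consistent with this hypothesis can be replayed deterministically. The sketch below is an illustration only and is not taken from the eCos source: the step_* helpers and replay_lost_update are invented names that split the |= into the load, OR and store a load/store CPU needs.

```c
#define SB_LOCK 0x01   /* buffer is locked */
#define SB_WANT 0x02   /* someone is waiting for the lock */

/* The three steps behind "sb_flags |= SB_WANT" on a load/store CPU. */
static int  step_load(const int *flags)     { return *flags; }
static int  step_or(int reg)                { return reg | SB_WANT; }
static void step_store(int *flags, int reg) { *flags = reg; }

/* Replays one unlucky schedule:
 *   the waiter loads the flags while SB_LOCK is set and is preempted,
 *   the owner clears SB_LOCK, sees no SB_WANT and wakes nobody,
 *   the waiter resumes and stores its stale copy.
 * Returns the final value of the flags word. */
int replay_lost_update(void)
{
    int flags = SB_LOCK;          /* the owner currently holds the lock */
    int reg;

    reg = step_load(&flags);      /* waiter: load          */
    flags &= ~SB_LOCK;            /* owner:  unlock        */
    reg = step_or(reg);           /* waiter: OR in SB_WANT */
    step_store(&flags, reg);      /* waiter: store         */
    return flags;
}
```

After this schedule the flags word holds SB_LOCK and SB_WANT although no thread owns the lock, so the waiter goes to sleep in tsleep() and nothing is left to wake it, which would match the stall seen at "sblock".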

But you should also check for spurious interrupts.
They are likely to occur due to the "Tickle Loop"
in the BSD stack, especially at a high rate.
My latest AT91 Ethernet driver does not need this
any more, and avoids the spurious interrupts even
if the stack polls the IRQ from time to time.

Therefore I would recommend you check this list of
important patches which we at Softing developed over
the last year (I must apologize, the list is too long,
but we walk on thin ice as you know, and most of these
bug fixes are obviously badly needed):
Bug   20804: Misbehavior of printf %e/%g format
Bug 1001522: Array index out of bounds in tftp_server.c
Bug 1001629: bsd stack uses wrong timeout values if hz != 100
Bug 1001633: DHCP Client may hang
Bug 1001634: A code review of dlmalloc.cxx revealed several weaknesses
Bug 1001635: wrong results from Cyg_StdioStream::read
Bug 1001637: fcntl() fails to handle F_GETFL, F_SETFL
Bug 1001639: Problems with i2c.cxx
Bug 1001641: Erase function in flashiodev.c and flashiodevlegacy.c handle "err_address" differently
Bug 1001645: Recursive Posix Mutexes
Bug 1001648: flash_init() behaves differently if CYGHWR_IO_FLASH_DEVICE==1
Bug 1001649: AT91 hal extension
Bug 1001654: diag_printf truncates the values in %llu and %llx formats
Bug 1001655: eth_drv_send stack_corruption with CYGFUN_LWIP_MODE_SIMPLE
Bug 1001656: FreeBSD: add AF_PACKET socket family
Bug 1001657: httpd server should parse request header lines
It might help to understand what the application for these patches is,
especially the new transacted PHY interface and the packet sockets.
Think of PTPv2: Here we have to exchange very complex data over SMI
with the PHY, and the PTP packets may be in raw ethernet format.
That is what finally led to these enhancements.
Regards,
Bernd Edlinger 		 	   		  
