Re: [ECOS] Re: Problems with "Scheduler lock not zero"

* Re: [ECOS] Re: Problems with "Scheduler lock not zero"
@ 2007-07-22 22:38 Jürgen Lambrecht
  2007-07-24 11:21 ` Andrew Lunn
  0 siblings, 1 reply; 10+ messages in thread
From: Jürgen Lambrecht @ 2007-07-22 22:38 UTC (permalink / raw)
  To: ecos-discuss, Andrew Lunn

Hello Andrew,

You say below that there have been problems on the ARM platform not
correctly dealing with spurious interrupts.

Is this a problem of the ARM processor or a problem in eCos?
If it is a problem in eCos, I would like to try to solve it. I'm on holiday, so I have time to do some coding for fun, not for work :-).

You also say below that "Scheduler lock not zero" can be caused by a thread exiting with the scheduler locked. In release software, that is, without assertions, does this crash the software, or does eCos handle it?


Kind regards,
Juergen

Øyvind Harboe wrote:

    On 11/7/06, Andrew Lunn <andrew@lunn.ch> wrote:

        Hi Folks

        I think it is unlikely this is your problem, but I will mention it
        anyway. I once had an assertion failure in the same place. It was
        caused by a thread exiting with the scheduler locked.

        Another thing to check is whether you have any spurious interrupts.
        There have been problems on the ARM platform not correctly dealing
        with this. Since spurious interrupts generally mean broken hardware,
        it's not the easiest thing to test and debug.

    After fixing a bug in our DDR controller (implemented in FPGA), we can
    no longer reproduce the problem..... Crossing fingers.... :-)

    It is hard enough to tell how a system behaves when it works, but to
    explain what a *nearly* working DDR controller could possibly result
    in is pretty much impossible.

    The main symptom of our broken DDR controller was that the whole
    system locked up. We'll run some more overnight testing, but it looks
    like the "Scheduler lock not zero" assert failure was just another,
    albeit unfathomable, manifestation of that same problem.





-- 
Before posting, please read the FAQ: http://ecos.sourceware.org/fom/ecos
and search the list archive: http://ecos.sourceware.org/ml/ecos-discuss

* Re: [ECOS] Re: Problems with "Scheduler lock not zero"
  2007-07-22 22:38 [ECOS] Re: Problems with "Scheduler lock not zero" Jürgen Lambrecht
@ 2007-07-24 11:21 ` Andrew Lunn
  0 siblings, 0 replies; 10+ messages in thread
From: Andrew Lunn @ 2007-07-24 11:21 UTC (permalink / raw)
  To: Juergen Lambrecht; +Cc: ecos-discuss, Andrew Lunn

On Mon, Jul 23, 2007 at 12:36:17AM +0200, Jürgen Lambrecht wrote:
> Hello Andrew,
>
> You say below that there have been problems on the ARM platform not 
> correctly dealing with spurious interrupts.
>
> Is this a problem of the ARM processor or a problem in eCos? If it is a 
> problem in eCos, I would like to try to solve it. I'm on holiday, so I 
> have time to do some coding for fun, not for work :-).

Spurious interrupts are generally a hardware problem. A spurious
interrupt means something generated an interrupt, but the advanced
interrupt controller does not know what caused it. It is also very
hard to test code which deals with this, unless you have some broken
hardware.
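
To illustrate what "does not know what caused the interrupt" means,
here is a minimal sketch in plain C. It models the controller's pending
sources as a bitmask. The names decode_irq, dispatch and IRQ_SPURIOUS
are hypothetical and are not the eCos HAL or any real interrupt
controller API.

```c
#include <assert.h>
#include <stdint.h>

#define IRQ_SPURIOUS (-1)   /* hypothetical marker: no source found */

/* Return the lowest-numbered pending source, or IRQ_SPURIOUS when the
 * CPU took an interrupt but the controller shows nothing pending. */
static int decode_irq(uint32_t pending)
{
    for (int i = 0; i < 32; i++) {
        if (pending & (UINT32_C(1) << i))
            return i;
    }
    return IRQ_SPURIOUS;
}

/* Counters standing in for real handlers in this toy model. */
static unsigned handled_count;
static unsigned spurious_count;

/* Dispatch one interrupt: a spurious one is counted and ignored rather
 * than being passed to a handler that has nothing to service. */
static void dispatch(uint32_t pending)
{
    if (decode_irq(pending) == IRQ_SPURIOUS)
        spurious_count++;
    else
        handled_count++;
}
```

Calling dispatch(0) takes the spurious path, while dispatch(0x10)
is treated as source 4. Counting spurious events separately is only one
possible policy; it makes flaky hardware visible without stopping the
system.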

> You also say below that "Scheduler lock not zero" can be caused by a thread 
> exiting with the scheduler locked. In release software, that is, without 
> assertions, does this crash the software, or does eCos handle it?

The scheduler is left locked, so the system just stops running
threads. When I found this problem, the relevant bit of code was
examined, and it is not easy to do anything about it. Also, doing
anything about this would just hide bugs in the application code. The
assert seems to be the correct thing to do.
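
A minimal sketch of the failure mode, assuming a simplified nesting
counter rather than the real kernel. The model_* functions and the two
thread bodies are hypothetical; they only imitate the pairing that
cyg_scheduler_lock() and cyg_scheduler_unlock() require in the eCos
kernel C API.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: a nesting counter, not the real eCos scheduler. */
static unsigned sched_lock;

static void model_lock(void)   { sched_lock++; }
static void model_unlock(void) { assert(sched_lock > 0); sched_lock--; }

/* Thread switches are only permitted while the count is zero. */
static bool model_can_switch(void) { return sched_lock == 0; }

/* Buggy thread body: returns on the error path without unlocking. */
static void buggy_thread(bool error)
{
    model_lock();
    if (error)
        return;            /* BUG: lock count stays at 1 */
    model_unlock();
}

/* Fixed thread body: every exit path releases the lock. */
static void fixed_thread(bool error)
{
    model_lock();
    (void)error;           /* same work on both paths in this toy */
    model_unlock();
}
```

After buggy_thread(true) returns, model_can_switch() stays false, which
corresponds to the "system just stops running threads" behaviour
described above for builds without asserts.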

       Andrew

* Re: [ECOS] Re: Problems with "Scheduler lock not zero"
  2006-11-08 15:31       ` Øyvind Harboe
  2007-07-06  8:44         ` Jürgen Lambrecht
@ 2007-07-06  8:55         ` Jürgen Lambrecht
  1 sibling, 0 replies; 10+ messages in thread
From: Jürgen Lambrecht @ 2007-07-06  8:55 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Øyvind Harboe, eCos Discussion

Øyvind Harboe wrote:

> On 11/7/06, Andrew Lunn <andrew@lunn.ch> wrote:
> 
>> Hi Folks
>>
>> I think it is unlikely this is your problem, but I will mention it
>> anyway. I once had an assertion failure in the same place. It was
>> caused by a thread exiting with the scheduler locked.
>>
>> Another thing to check is whether you have any spurious interrupts.
>> There have been problems on the ARM platform not correctly dealing
>> with this. Since spurious interrupts generally mean broken hardware,
>> it's not the easiest thing to test and debug.

I have the option CYGIMP_HAL_COMMON_INTERRUPTS_IGNORE_SPURIOUS set.
Would that not solve this?
> 
> 
> After fixing a bug in our DDR controller (implemented in FPGA), we can
> no longer reproduce the problem..... Crossing fingers.... :-)
> 
> It is hard enough to tell how a system behaves when it works, but to
> explain what a *nearly* working DDR controller could possibly result
> in is pretty much impossible.
> 
> The main symptom of our broken DDR controller was that the whole
> system locked up. We'll run some more overnight testing, but it looks
> like the "Scheduler lock not zero" assert failure was just another,
> albeit unfathomable, manifestation of that same problem.
> 

-- 
Jürgen Lambrecht
Diksmuidse Heerweg 338
8200 Sint-Andries
Tel: +32 (0)50 842901
GSM: +32 (0)476 313389

* Re: [ECOS] Re: Problems with "Scheduler lock not zero"
  2006-11-08 15:31       ` Øyvind Harboe
@ 2007-07-06  8:44         ` Jürgen Lambrecht
  2007-07-06  8:55         ` Jürgen Lambrecht
  1 sibling, 0 replies; 10+ messages in thread
From: Jürgen Lambrecht @ 2007-07-06  8:44 UTC (permalink / raw)
  To: eCos Discussion

Hello,

We are now running tests with big UDP and TCP packets, bigger than the Ethernet MTU, so IP has to fragment them.
From time to time we then get the error "Scheduler lock not zero".
Does anybody have experience with fragmented IP packets in eCos?
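
For reference, a rough sketch of how many IPv4 fragments a UDP datagram
needs for a given MTU. It assumes a 20-byte IPv4 header without options
and an MTU larger than the two headers; ipv4_fragment_count is a
hypothetical helper, not part of the eCos network stack.

```c
#include <assert.h>
#include <stddef.h>

#define IPV4_HEADER_LEN 20u  /* assumption: no IP options */
#define UDP_HEADER_LEN   8u

/* Number of IPv4 packets needed to carry one UDP datagram.
 * Precondition: mtu > IPV4_HEADER_LEN + UDP_HEADER_LEN. */
static size_t ipv4_fragment_count(size_t udp_payload, size_t mtu)
{
    size_t ip_payload = udp_payload + UDP_HEADER_LEN;
    size_t room = mtu - IPV4_HEADER_LEN;

    /* Fits in a single packet: no fragmentation at all. */
    if (ip_payload <= room)
        return 1;

    /* Every fragment except the last carries a multiple of 8 bytes. */
    size_t per_fragment = room & ~(size_t)7;
    return (ip_payload + per_fragment - 1) / per_fragment;
}
```

With an Ethernet MTU of 1500, a 1472-byte UDP payload still fits in one
packet, while 1473 bytes already needs two fragments.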

Øyvind Harboe wrote:
> On 11/7/06, Andrew Lunn <andrew@lunn.ch> wrote:
> 
>> Hi Folks
>>
>> I think it is unlikely this is your problem, but I will mention it
>> anyway. I once had an assertion failure in the same place. It was
>> caused by a thread exiting with the scheduler locked.
>>
>> Another thing to check is whether you have any spurious interrupts.
>> There have been problems on the ARM platform not correctly dealing
>> with this. Since spurious interrupts generally mean broken hardware,
>> it's not the easiest thing to test and debug.
> 
> 
> After fixing a bug in our DDR controller (implemented in FPGA), we can
> no longer reproduce the problem..... Crossing fingers.... :-)
> 
> It is hard enough to tell how a system behaves when it works, but to
> explain what a *nearly* working DDR controller could possibly result
> in is pretty much impossible.
> 
> The main symptom of our broken DDR controller was that the whole
> system locked up. We'll run some more overnight testing, but it looks
> like the "Scheduler lock not zero" assert failure was just another,
> albeit unfathomable, manifestation of that same problem.
> 

This will be the first thing to investigate I guess.

* Re: [ECOS] Re: Problems with "Scheduler lock not zero"
  2006-11-07 22:31     ` Andrew Lunn
  2006-11-08  7:08       ` Øyvind Harboe
@ 2006-11-08 15:31       ` Øyvind Harboe
  2007-07-06  8:44         ` Jürgen Lambrecht
  2007-07-06  8:55         ` Jürgen Lambrecht
  1 sibling, 2 replies; 10+ messages in thread
From: Øyvind Harboe @ 2006-11-08 15:31 UTC (permalink / raw)
  To: eCos Discussion

On 11/7/06, Andrew Lunn <andrew@lunn.ch> wrote:
> Hi Folks
>
> I think it is unlikely this is your problem, but I will mention it
> anyway. I once had an assertion failure in the same place. It was
> caused by a thread exiting with the scheduler locked.
>
> Another thing to check is whether you have any spurious interrupts.
> There have been problems on the ARM platform not correctly dealing
> with this. Since spurious interrupts generally mean broken hardware,
> it's not the easiest thing to test and debug.

After fixing a bug in our DDR controller (implemented in FPGA), we can
no longer reproduce the problem..... Crossing fingers.... :-)

It is hard enough to tell how a system behaves when it works, but to
explain what a *nearly* working DDR controller could possibly result
in is pretty much impossible.

The main symptom of our broken DDR controller was that the whole
system locked up. We'll run some more overnight testing, but it looks
like the "Scheduler lock not zero" assert failure was just another,
albeit unfathomable, manifestation of that same problem.

-- 
Øyvind Harboe
http://www.zylin.com

* [ECOS] Re: Problems with "Scheduler lock not zero"
  2006-11-07 21:04   ` Øyvind Harboe
  2006-11-07 22:31     ` Andrew Lunn
@ 2006-11-08 13:38     ` Jürgen Lambrecht
  1 sibling, 0 replies; 10+ messages in thread
From: Jürgen Lambrecht @ 2006-11-08 13:38 UTC (permalink / raw)
  To: Øyvind Harboe, eCos Discussion

Øyvind Harboe wrote:
> On 11/7/06, Jürgen Lambrecht <J.Lambrecht@televic.com> wrote:
> 
>> Hello Harboe,
> 
> 
> I'm wondering.... What would be the observable effects of the
> scheduler lock count not being zero if asserts weren't enabled?
> 
> If the answer is that everything, except timeslicing, would work just
> fine, then I may have observed this on another project.
We're now mainly testing release versions, so with asserts disabled.
> 
>>
>> I am using the MLQ scheduler again instead of the bitmap scheduler,
>> and I have not seen the error again.
> 
> 
> I'm using the MLQ scheduler and I do see the problem.
> 
> Perhaps you still have the problem, only more rarely?
> 
> It would not surprise me one bit if this problem is timing sensitive
> and pretty much anything could make it come or go.
> 
> From my ecos.ecc:
> 
> 
> cdl_component CYGSEM_KERNEL_SCHED_MLQUEUE {
>    # Flavor: bool
>    # No user value, uncomment the following line to provide one.
>    # user_value 1
>    # value_source default
>    # Default value: 1
> 
> 
>> My eCos is from Feb 15 2006, so I have that patch from Nick.
> 
> 
> I fetched a fresh version today, and the problem exists with our HAL &
> CVS HEAD. Since this problem appears to be rare, I would suspect that
> a) either our HAL is somehow provoking a rare problem or b) our HAL is
> busted. We're using the opencores ethermac, otherwise it is basically
> an EB40a.
> 
Our HAL is based on the EB55, with a memory-mapped ethermac, an I2C driver and an extended TFTP server.
> 
> As an experiment I disabled timeslicing (since I'm using pthreads this
> requires a bit of hacking) and the problem persists.
Be careful: to be able to use the bitmap scheduler, I had to make sure each priority was unique. By default, both the main thread and the tftpd thread have priority 10 (CYGNUM_LIBC_MAIN_THREAD_PRIORITY and CYGPKG_NET_TFTPD_THREAD_PRIORITY), so I changed the priorities (to 8 and 9).
We now use the MLQ scheduler again, but I kept the bitmap priorities of the main and tftpd threads, so I no longer use timeslicing. Maybe that is the reason we don't have the problem anymore?
(The networking threads also have default priorities, CYGPKG_NET_THREAD_PRIORITY and CYGPKG_NET_FAST_THREAD_PRIORITY (7 and 6).)
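
The uniqueness requirement can be checked with a few lines of plain C.
This is only a sketch: priorities_unique is a hypothetical helper, not
an eCos function, and the sample values are the priorities quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true when no two entries share a priority, which is what a
 * one-thread-per-priority (bitmap-style) scheduler requires. */
static bool priorities_unique(const int *prio, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (prio[i] == prio[j])
                return false;
    return true;
}
```

With the defaults {10, 10, 7, 6} (main, tftpd, net, fast net) the check
fails; with {8, 9, 7, 6} it passes.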
> 

* Re: [ECOS] Re: Problems with "Scheduler lock not zero"
  2006-11-07 22:31     ` Andrew Lunn
@ 2006-11-08  7:08       ` Øyvind Harboe
  2006-11-08 15:31       ` Øyvind Harboe
  1 sibling, 0 replies; 10+ messages in thread
From: Øyvind Harboe @ 2006-11-08  7:08 UTC (permalink / raw)
  To: eCos Discussion

On 11/7/06, Andrew Lunn <andrew@lunn.ch> wrote:
> Hi Folks
>
> I think it is unlikely this is your problem, but I will mention it
> anyway. I once had an assertion failure in the same place. It was
> caused by a thread exiting with the scheduler locked.

Given that I'm running the dhcp_test from eCos, I'd be *really*
surprised to learn that that code had this bug.  I'll check if any
threads are exiting at all during the ping test.

> Another thing to check is whether you have any spurious interrupts.
> There have been problems on the ARM platform not correctly dealing
> with this. Since spurious interrupts generally mean broken hardware,
> it's not the easiest thing to test and debug.

We're developing the hardware ourselves (the opencores ethermac), so
this is a very valuable clue. Thanks!

-- 
Øyvind Harboe
http://www.zylin.com

* Re: [ECOS] Re: Problems with "Scheduler lock not zero"
  2006-11-07 21:04   ` Øyvind Harboe
@ 2006-11-07 22:31     ` Andrew Lunn
  2006-11-08  7:08       ` Øyvind Harboe
  2006-11-08 15:31       ` Øyvind Harboe
  2006-11-08 13:38     ` Jürgen Lambrecht
  1 sibling, 2 replies; 10+ messages in thread
From: Andrew Lunn @ 2006-11-07 22:31 UTC (permalink / raw)
  To: Øyvind Harboe; +Cc: Jürgen Lambrecht, eCos Discussion

Hi Folks

I think it is unlikely this is your problem, but I will mention it
anyway. I once had an assertion failure in the same place. It was
caused by a thread exiting with the scheduler locked.

Another thing to check is whether you have any spurious interrupts.
There have been problems on the ARM platform not correctly dealing
with this. Since spurious interrupts generally mean broken hardware,
it's not the easiest thing to test and debug.

    Andrew

* [ECOS] Re: Problems with "Scheduler lock not zero"
  2006-11-07 16:01 ` Jürgen Lambrecht
@ 2006-11-07 21:04   ` Øyvind Harboe
  2006-11-07 22:31     ` Andrew Lunn
  2006-11-08 13:38     ` Jürgen Lambrecht
  0 siblings, 2 replies; 10+ messages in thread
From: Øyvind Harboe @ 2006-11-07 21:04 UTC (permalink / raw)
  To: Jürgen Lambrecht; +Cc: eCos Discussion

On 11/7/06, Jürgen Lambrecht <J.Lambrecht@televic.com> wrote:
> Hello Harboe,

I'm wondering.... What would be the observable effects of the
scheduler lock count not being zero if asserts weren't enabled?

If the answer is that everything, except timeslicing, would work just
fine, then I may have observed this on another project.

>
> I am using the MLQ scheduler again instead of the bitmap scheduler,
> and I have not seen the error again.

I'm using the MLQ scheduler and I do see the problem.

Perhaps you still have the problem, only more rarely?

It would not surprise me one bit if this problem is timing sensitive
and pretty much anything could make it come or go.

From my ecos.ecc:

cdl_component CYGSEM_KERNEL_SCHED_MLQUEUE {
    # Flavor: bool
    # No user value, uncomment the following line to provide one.
    # user_value 1
    # value_source default
    # Default value: 1

> My eCos is from Feb 15 2006, so I have that patch from Nick.

I fetched a fresh version today, and the problem exists with our HAL &
CVS HEAD. Since this problem appears to be rare, I would suspect that
a) either our HAL is somehow provoking a rare problem or b) our HAL is
busted. We're using the opencores ethermac, otherwise it is basically
an EB40a.

As an experiment I disabled timeslicing (since I'm using pthreads this
requires a bit of hacking) and the problem persists.

-- 
Øyvind Harboe
http://www.zylin.com

* [ECOS] Re: Problems with "Scheduler lock not zero"
       [not found] <c09652430611070722u281d3291sac51828f476aa014@mail.gmail.com>
@ 2006-11-07 16:01 ` Jürgen Lambrecht
  2006-11-07 21:04   ` Øyvind Harboe
  0 siblings, 1 reply; 10+ messages in thread
From: Jürgen Lambrecht @ 2006-11-07 16:01 UTC (permalink / raw)
  To: Øyvind Harboe, eCos Discussion

Hello Harboe,

I am using the MLQ scheduler again instead of the bitmap scheduler,
and I have not seen the error again.
My eCos is from Feb 15 2006, so I have that patch from Nick.

Kind regards,

Jürgen Lambrecht
Development Engineer
Televic Transport Systems
http://www.televic.com
Televic NV / SA (main office)  	
Leo Bekaertlaan 1
B-8870 Izegem
Tel: +32 (0)51 303045
Fax: +32 (0)51 310670



Øyvind Harboe wrote:

> Hi Jürgen,
>
> did you ever find out anything more about the problem you had?
>
> http://sources.redhat.com/ml/ecos-discuss/2006-08/msg00083.html
>
>
> We're seeing a similar problem:
>
> http://sources.redhat.com/ml/ecos-discuss/2006-11/msg00072.html
>

Thread overview: 10+ messages
2007-07-22 22:38 [ECOS] Re: Problems with "Scheduler lock not zero" Jürgen Lambrecht
2007-07-24 11:21 ` Andrew Lunn
     [not found] <c09652430611070722u281d3291sac51828f476aa014@mail.gmail.com>
2006-11-07 16:01 ` Jürgen Lambrecht
2006-11-07 21:04   ` Øyvind Harboe
2006-11-07 22:31     ` Andrew Lunn
2006-11-08  7:08       ` Øyvind Harboe
2006-11-08 15:31       ` Øyvind Harboe
2007-07-06  8:44         ` Jürgen Lambrecht
2007-07-06  8:55         ` Jürgen Lambrecht
2006-11-08 13:38     ` Jürgen Lambrecht