* Bugs in eCos SMP scheduler
From: sandeep @ 2004-09-06 11:29 UTC (permalink / raw)
To: ecos-devel; +Cc: Nick Garnett
hi nick,
Lately I have observed a new scheduler problem: the sched_lock count was
going negative (0xFFFFFFFF/0xFFFFFFFE) while the holder of the lock was
HAL_SMP_CPU_NONE (i.e. no processor owned the scheduler lock).
The cause is this:
although the sched_lock increment path makes sure that only the owner can
increment the count, zero_sched_lock/set_sched_lock/get_sched_lock don't respect
the notion of an owner (processor) of sched_lock.
This also introduces race conditions in the system, as the scenarios below show.
Consider a two-processor configuration with threads T1, T2, T3, T4, ...
running on processors P1 and P2.
- initially no processor owns the lock and the sched_lock count is 0.
- T1 (on P2) completes its execution of thread_entry and later, in the
user-specified thread entry function, takes the scheduler lock (owner = P2, count = 1)
- T2 (on P1) is still executing thread_entry and executes
zero_sched_lock (owner = NONE, count = 0)
- T1 (on P2) unlocks the scheduler, and the scheduler lock count is now -1
(0xFFFFFFFF for a 32-bit data type), with owner NONE.
Another variation of the previous scenario:
- T1 (on P2) takes the sched lock (owner = P2, count = 1)
- T2 (on P1) executes zero_sched_lock (owner = NONE, count = 0)
- T1 (on P2) takes the sched lock again (owner = P2, count = 1)
** count should have been 2 **
- T1 (on P2) unlocks the scheduler, which causes it to enter unlock_inner and
choose another thread to run.
** scheduling shouldn't have happened at this point **
- the next time T1 runs on any processor, it continues with its second
scheduler unlock (which decrements the current sched lock value, irrespective of
whether anyone else is the owner) and the mess continues.
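The two interleavings above can be replayed in a small single-threaded model. This is only a sketch under stated assumptions: SchedLockModel, CPU_NONE, P1, P2, replay_first and count_after_nested_lock are hypothetical names, not eCos code, and the spinlock that a real non-owner would wait on is left out because the replays never reach that case.

```cpp
#include <cstdint>

// Hypothetical model of the lock state described in this mail; the names are
// illustrative and this is not the eCos source.
constexpr std::uint32_t CPU_NONE = 0xFFFFFFFFu;  // stands in for HAL_SMP_CPU_NONE
constexpr std::uint32_t P1 = 0, P2 = 1;

struct SchedLockModel {
    std::uint32_t count = 0;
    std::uint32_t owner = CPU_NONE;

    // Increment path: a CPU claims a free lock, and only the owner counts up.
    // (A real non-owner would spin here; the replays never reach that case.)
    void lock(std::uint32_t cpu) {
        if (owner == CPU_NONE) owner = cpu;
        if (owner == cpu) ++count;
    }

    // Decrement path as described in the mail: no owner check.
    void unlock() {
        --count;                          // unsigned wrap: 0 becomes 0xFFFFFFFF
        if (count == 0) owner = CPU_NONE;
    }

    // Zeroing path as described in the mail: no owner check either.
    void zero() {
        count = 0;
        owner = CPU_NONE;
    }
};

// First interleaving: T1 locks on P2, T2 zeroes on P1, T1 unlocks.
inline SchedLockModel replay_first() {
    SchedLockModel s;
    s.lock(P2);   // owner = P2, count = 1
    s.zero();     // T2 on P1: owner = NONE, count = 0
    s.unlock();   // T1 on P2: count wraps to 0xFFFFFFFF, owner stays NONE
    return s;
}

// Second interleaving: the nested lock is counted as 1 instead of 2.
inline std::uint32_t count_after_nested_lock() {
    SchedLockModel s;
    s.lock(P2);   // owner = P2, count = 1
    s.zero();     // T2 on P1: owner = NONE, count = 0
    s.lock(P2);   // owner = P2, count = 1 (should have been 2)
    return s.count;
}
```

In this model replay_first() ends with the count wrapped around and no owner, which matches the state observed on the target, and count_after_nested_lock() shows why the first unlock in the second scenario already drops the count to zero.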
The current sched-lock increment code does a lock++ when a processor becomes
the owner of the lock for the first time (instead of setting it to 1). Hence, if
the sched_lock value had become -1 as in the previous case and no CPU was the
owner, it becomes 0 rather than 1 and the system is in a mess.
For this small aspect, the fix is to replace "lock++" with "lock = 1", but
the larger problem remains.
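The difference between the two first-acquisition strategies can be shown with two tiny helpers. This is a sketch only: first_acquire_increment and first_acquire_assign are hypothetical names and are not taken from the eCos sources.

```cpp
#include <cstdint>

// Hypothetical helpers contrasting the two ways a new owner could set the
// count on its first acquisition. The argument is the count left behind by
// whoever touched the lock last.
inline std::uint32_t first_acquire_increment(std::uint32_t stale_count) {
    return stale_count + 1;   // "lock++": inherits the stale value
}

inline std::uint32_t first_acquire_assign(std::uint32_t /*stale_count*/) {
    return 1;                 // "lock = 1": ignores the stale value
}
```

With a stale count of 0xFFFFFFFF the increment variant yields 0, so the new owner holds a lock that looks free, while the assign variant yields 1.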
A possible (partial) solution:
* Change the sched_lock zeroing code to check whether the processor executing
it is the owner of the sched lock, and proceed with zeroing only in that case.
But this breaks the notion that every eCos thread starts with a sched_lock value
of 0 (a notion carried over from non-SMP eCos). What is the impact of this?
There might not be any.
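An owner-checked zeroing step might look roughly like the following. This is a minimal sketch, not a patch: SchedLockModel, CPU_NONE and zero_if_owner are hypothetical names, and the interrupt masking and spinlock release that a real implementation would need are omitted.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the scheduler lock state; not eCos code.
constexpr std::uint32_t CPU_NONE = 0xFFFFFFFFu;  // stands in for HAL_SMP_CPU_NONE

struct SchedLockModel {
    std::uint32_t count;
    std::uint32_t owner;
};

// Zero the lock only when the calling CPU owns it. Returns true if the lock
// was zeroed, false if the call was ignored because another CPU (or nobody)
// is the owner.
inline bool zero_if_owner(SchedLockModel& s, std::uint32_t this_cpu) {
    if (s.owner != this_cpu) {
        return false;         // not ours: leave count and owner untouched
    }
    s.count = 0;
    s.owner = CPU_NONE;
    return true;
}
```

With this check, the zero_sched_lock call made by T2 on P1 in the scenarios above would leave the lock held by P2 untouched.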
Unless I am missing something obvious, a fair amount of change might be
required to make eCos SMP-safe, ranging from changes to unlock_inner to the
possibility that the non-SMP eCos scheduling model cannot be extended to SMP.
I sat on these observations and tried to analyse them a bit before writing
this mail today, and I am still considering possible flaws in my
analysis/understanding of eCos. But the observations mentioned in this mail (and
others not mentioned) can't be explained any other way, at least as of now.
To save the reader's time, the SMP startup flow is as follows:
Cyg_Scheduler::start()
--> HAL_SMP_CPU_START
--> cyg_hal_smp_start
--> cyg_hal_smp_startup
--> cyg_kernel_smp_startup (takes scheduler lock and calls start_cpu)
--> start_cpu (gets next thread to schedule and loads it)
--> thread_entry (zeroes scheduler lock and calls actual thread entry point
specified by the user during thread creation)
--
regards
sandeep
--------------------------------------------------------------------------
Walk softly and carry a megawatt laser.
--------------------------------------------------------------------------
* Re: Bugs in eCos SMP scheduler
From: sandeep @ 2004-09-07 12:54 UTC (permalink / raw)
To: Nick Garnett; +Cc: ecos-devel
hi nick,
Apologies for the previous post. I guess looking at hal_smp.h and smp.hxx
simultaneously caused the confusion. Since you folks have tested SMP eCos on
i386 using the Red Hat/eCosCentric testbeds, the bugs shouldn't be in SMP eCos
itself, so the problem(s?) should be in the HAL.
I feel I have identified at least one problem. The example situation given in
the earlier post would happen with the existing HAL as it stands (in some
scenarios). I will handle it locally, and will share with the list, from a
non-company email ID, any issues I observe in the non-HAL parts of eCos, if that
helps make eCos sturdier.
--
regards
sandeep
--------------------------------------------------------------------------
God doesn't play dice.
-- Albert Einstein
--------------------------------------------------------------------------