public inbox for overseers@sourceware.org
* Status on sourceware.org stability
@ 2004-05-10 14:30 Matthew Galgoci
  2004-05-10 15:14 ` Christopher Faylor
  0 siblings, 1 reply; 3+ messages in thread
From: Matthew Galgoci @ 2004-05-10 14:30 UTC (permalink / raw)
  To: overseers


Ok folks, here is the current status on sourceware:

With the increased bandwidth and increased RAM in sourceware, we are now
pushing the disk subsystem harder than we ever have before. The blinken
lights on the disks are pegged solid. This increased load is exposing
driver bugs far more often: we previously encountered them only once every
few weeks, and we are now hitting them once every couple of hours.

Note that the kernel on sourceware, at least for most of the weekend, was the
same kernel we had been running for the past two months.

Also note that the new RAM in sourceware is registered ECC RAM.

Some background on the disk subsystem used on sourceware.org:

The disk subsystem consists of 5 SCSI disks, on top of which we run hardware
RAID 5 in the form of Adaptec's RAID controller. This RAID controller, in
combination with a hardware-specific OS-level driver, presents any number of
logical disks to the OS.

The OS-level driver in Linux is the aacraid driver. This driver has a number
of problems, which we have encountered in the past as sourceware mysteriously
locking hard. This used to happen once every couple of weeks. We are now seeing
these hard locks every couple of hours, with an accompanying kernel traceback.

This traceback is the key to finding the problem. However, the call trace is so
long it scrolls off the screen.

To catch this kernel traceback in full, I've attached a serial console to
sourceware. With the complete trace, I can get our engineering group to fix
the problem in the aacraid driver. The author of the driver is going to assist
as well.

I will remain on call to attend to sourceware until this problem is resolved.

If you have any questions, please post them to overseers and I will answer them
as best I can.

Regards.

Matthew Galgoci

-- 
Matthew Galgoci
System Administrator and Sr. Manager of Ruminants
Red Hat, Inc
919.754.3700 x44155


* Re: Status on sourceware.org stability
  2004-05-10 14:30 Status on sourceware.org stability Matthew Galgoci
@ 2004-05-10 15:14 ` Christopher Faylor
  2004-05-10 20:10   ` Jonathan Larmour
  0 siblings, 1 reply; 3+ messages in thread
From: Christopher Faylor @ 2004-05-10 15:14 UTC (permalink / raw)
  To: overseers

On Mon, May 10, 2004 at 10:10:12AM -0400, Matthew Galgoci wrote:
>The OS level driver in linux is the aacraid driver.  This driver has a
>number of problems, which we have encountered in the past as sourceware
>mysteriously locking hard.  This used to happen once every couple of
>weeks.  We are now seeing these hard locks every couple of hours, with
>an accompanying kernel traceback.

That is a bit of an exaggeration.  sourceware did not lock hard once
every couple of weeks.

Also, when the system was rebooted it used the latest errata kernel,
which was different from what it had been running.  I should have
mentioned that before.  I installed the new kernel a little while
before the system went down.  It was the latest stock RH 9.0 kernel.

cgf


* Re: Status on sourceware.org stability
  2004-05-10 15:14 ` Christopher Faylor
@ 2004-05-10 20:10   ` Jonathan Larmour
  0 siblings, 0 replies; 3+ messages in thread
From: Jonathan Larmour @ 2004-05-10 20:10 UTC (permalink / raw)
  To: overseers

Christopher Faylor wrote:
> Also, when the system was rebooted it used the latest errata kernel,
> which was different from what it had been running.  I should have
> mentioned that before.  I installed the new kernel a little while
> before the system went down.  It was the latest stock RH 9.0 kernel.

If it helps any, which it probably doesn't, I had problems with the latest
RH9 kernel on my home machine here (only 768M RAM and no RAID) and
downgraded again after all sorts of weirdness with the VM and large
applications.

Jifl
-- 
eCosCentric    http://www.eCosCentric.com/    The eCos and RedBoot experts
--["No sense being pessimistic, it wouldn't work anyway"]-- Opinions==mine
