public inbox for overseers@sourceware.org
* mysterious reboots
@ 2004-04-08 17:46 Angela Marie Thomas
  2004-04-08 17:54 ` Frank Ch. Eigler
  2004-04-08 18:06 ` Matthew Galgoci
  0 siblings, 2 replies; 7+ messages in thread
From: Angela Marie Thomas @ 2004-04-08 17:46 UTC (permalink / raw)
  To: overseers


This is the second or third time in recent memory that the system
has mysteriously rebooted with no one owning up to doing it.
I'm very concerned about this.  Every time there's a mysterious
reboot, I have to prepare for the potential for a complete system
reload from backups or at the very least some kind of verification
that the files are OK.

I'd very much like to get to the bottom of mysterious reboots.
So far all of the usual suspects have replied to the current thread
and none have owned up to doing the reboot.  This implies to me
that the system spontaneously rebooted, which I consider a pretty
bad sign.  Can we please have a rule that anyone who reboots the
system or authorizes a reboot of the system must notify overseers,
preferably before the fact, so we all know what is going on?

WRT the backups specifically, they still run every night.  However,
the system was down while the backups tried to run last night so
my snapshot is from Wed morning with an incomplete /sourceware/www
backup.  I'm doing an rsync -n run right now to see how much data
wants to be updated.

--Angela

* Re: mysterious reboots
  2004-04-08 17:46 mysterious reboots Angela Marie Thomas
@ 2004-04-08 17:54 ` Frank Ch. Eigler
  2004-04-08 18:17   ` Angela Marie Thomas
  2004-04-08 18:06 ` Matthew Galgoci
  1 sibling, 1 reply; 7+ messages in thread
From: Frank Ch. Eigler @ 2004-04-08 17:54 UTC (permalink / raw)
  To: angela; +Cc: overseers

Hi -


angela wrote:

> This is the second or third time in recent memory that the system
> has mysteriously rebooted with no one owning up to doing it.
> [...]

The machine hung while I was logged on early this morning, which
fact I announced in email to overseers.  mgalgoci oversaw its
reboot two hours later.  I guess announcing that the reboot was
done seemed unnecessary after the fact.  Sorry.

> I'd very much like to get to the bottom of mysterious reboots.

Yes, I wish too that we had more insight into what's going on
inside an ailing machine.

> WRT the backups specifically, they still run every night.  However,
> the system was down while the backups tried to run last night so
> my snapshot is from Wed morning with an incomplete /sourceware/www
> backup.  I'm doing an rsync -n run right now to see how much data
> wants to be updated.

Thank you very much.


- FChE

* Re: mysterious reboots
  2004-04-08 17:46 mysterious reboots Angela Marie Thomas
  2004-04-08 17:54 ` Frank Ch. Eigler
@ 2004-04-08 18:06 ` Matthew Galgoci
  2004-04-08 18:19   ` Matthew Galgoci
  2004-04-08 18:42   ` Jonathan Larmour
  1 sibling, 2 replies; 7+ messages in thread
From: Matthew Galgoci @ 2004-04-08 18:06 UTC (permalink / raw)
  To: overseers

On Thu, 8 Apr 2004, Angela Marie Thomas wrote:
> This is the second or third time in recent memory that the system
> has mysteriously rebooted with no one owning up to doing it.
> I'm very concerned about this.  Every time there's a mysterious
> reboot, I have to prepare for the potential for a complete system
> reload from backups or at the very least some kind of verification
> that the files are OK.

The system appears to have locked up just before 6am EDT. I power cycled
it remotely around 7:30am EDT. Frank was on the box just after it came
up and I assumed he would take a look at things.

Fwiw, I watched the raid array start up via remote serial console:

[snip]
Dell PowerEdge Expandable RAID Controller 3/Di, BIOS V2.7-0 [Build 3546]
(c) 1998-2001 Adaptec, Inc. All Rights Reserved.
                                                                                
    Press <Ctrl><A> for Configuration Utility!
Waiting for Array Controller #0 to start.... \
                                                                                
Array Controller started
Controller monitor V2.7-0, Build 3546
Controller kernel  V2.7-0, Build 3546
Controller POST operation successful
[end snip]

Note the lack of error messages concerning a failed disk or running in 
degraded mode.

I do know that there are open issues with the aacraid driver that manifest
as random hard locks, sometimes occurring infrequently, other times the driver
doesn't last a week. It depends on the version of the aacraid hardware and
also the version of the aacraid firmware.

For more info: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=92129

Filtering out the crap, it looks like Adaptec has the problem fixed in the
latest version of the driver.

-- 
Matthew Galgoci
System Administrator and Sr. Manager of Ruminants
Red Hat, Inc
919.754.3700 x44155

* Re: mysterious reboots
  2004-04-08 17:54 ` Frank Ch. Eigler
@ 2004-04-08 18:17   ` Angela Marie Thomas
  0 siblings, 0 replies; 7+ messages in thread
From: Angela Marie Thomas @ 2004-04-08 18:17 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: overseers


> The machine hung while I was logged on early this morning, which
> fact I announced in email to overseers.  mgalgoci oversaw its
> reboot two hours later.  I guess announcing that the reboot was
> done seemed unnecessary after the fact.  Sorry.

I interpreted your mail as being as mystified as others about
the reboot (i.e. the machine hung while you were using it and then
inexplicably rebooted).  I think being explicit will help ensure
everyone is on the same page.

--Angela

* Re: mysterious reboots
  2004-04-08 18:06 ` Matthew Galgoci
@ 2004-04-08 18:19   ` Matthew Galgoci
  2004-04-08 18:42   ` Jonathan Larmour
  1 sibling, 0 replies; 7+ messages in thread
From: Matthew Galgoci @ 2004-04-08 18:19 UTC (permalink / raw)
  To: overseers

> I do know that there are open issues with the aacraid driver that manifest
> as random hard locks, sometimes occurring infrequently, other times the driver
> doesn't last a week. It depends on the version of the aacraid hardware and
> also the version of the aacraid firmware.
> 
> For more info: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=92129
> 

I fired off an email to Mark Salyzyn with the relevant information, asking for his
take on the situation regarding the recommended aacraid firmware and the recommended
version of the aacraid Linux driver. When I hear back I will post his comments to this list.

-- 
Matthew Galgoci
System Administrator and Sr. Manager of Ruminants
Red Hat, Inc
919.754.3700 x44155

* Re: mysterious reboots
  2004-04-08 18:06 ` Matthew Galgoci
  2004-04-08 18:19   ` Matthew Galgoci
@ 2004-04-08 18:42   ` Jonathan Larmour
  2004-04-08 18:47     ` Matthew Galgoci
  1 sibling, 1 reply; 7+ messages in thread
From: Jonathan Larmour @ 2004-04-08 18:42 UTC (permalink / raw)
  To: Matthew Galgoci; +Cc: overseers

Matthew Galgoci wrote:
> 
> The system appears to have locked up just before 6am EDT. I power cycled
> it remotely around 7:30am EDT. Frank was on the box just after it came
> up and I assumed he would take a look at things.
> 
[snip]
> I do know that there are open issues with the aacraid driver that manifest
> as random hard locks, sometimes occurring infrequently, other times the driver
> doesn't last a week. It depends on the version of the aacraid hardware and
> also the version of the aacraid firmware.

So the question then becomes whether any filesystems are corrupted, even if 
they were unmounted successfully (and therefore were marked as clean and 
thus not checked on reboot). I know from experience that Linux kernels 
don't take fs corruption, however caused, kindly, and oopses etc. are 
certainly possible.

Jifl
-- 
eCosCentric    http://www.eCosCentric.com/    The eCos and RedBoot experts
--["No sense being pessimistic, it wouldn't work anyway"]-- Opinions==mine

* Re: mysterious reboots
  2004-04-08 18:42   ` Jonathan Larmour
@ 2004-04-08 18:47     ` Matthew Galgoci
  0 siblings, 0 replies; 7+ messages in thread
From: Matthew Galgoci @ 2004-04-08 18:47 UTC (permalink / raw)
  To: Jonathan Larmour; +Cc: overseers

On Thu, 8 Apr 2004, Jonathan Larmour wrote:

> Matthew Galgoci wrote:
> > 
> > The system appears to have locked up just before 6am EDT. I power cycled
> > it remotely around 7:30am EDT. Frank was on the box just after it came
> > up and I assumed he would take a look at things.
> > 
> [snip]
> > I do know that there are open issues with the aacraid driver that manifest
> > as random hard locks, sometimes occurring infrequently, other times the driver
> > doesn't last a week. It depends on the version of the aacraid hardware and
> > also the version of the aacraid firmware.
> 
> So the question then becomes whether any filesystems are corrupted, even if 
> they were unmounted successfully (and therefore were marked as clean and 
> thus not checked on reboot). I know from experience that Linux kernels 
> don't take fs corruption, however caused, kindly, and oopses etc. are 
> certainly possible.

I agree. My money is on cvs-temp. cgf is going to schedule an outage for me to
fsck the filesystems.

-- 
Matthew Galgoci
System Administrator and Sr. Manager of Ruminants
Red Hat, Inc
919.754.3700 x44155
