hard disk fun == down time

public inbox for overseers@sourceware.org

* hard disk fun == down time
From: Christopher Faylor @ 2001-03-29 18:31 UTC (permalink / raw)
  To: overseers

My attempts to use /dev/hda4 last night proved disastrous.  I believe
that there is a kernel bug which causes Linux to screw up with very
large hard disks like /dev/hda.

The end result was that when I wrote to /dev/hda4, I was also randomly
writing to /dev/hda3.

It obviously took a long time for us to recover from this and I *know*
that there are still issues on /sourceware/www (aka /dev/hda3).  My apologies
for the long down time.  We're going to investigate what went wrong here
and try to improve on it.

If you see any oddness on this partition please send email to overseers.
I hope that we'll be able to have everything completely restored in the
next week or so.  For the time being, basic functionality on this
partition seems to be restored.

You can look at web pages but the most affected part of the partition
seemed to be the mailing list archives.  There is probably information
missing there.

Phew.  It's hard doing sysadmin duties when you're 3000 miles from the
machine in question.

cgf

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: hard disk fun == down time
From: Jason Molenda @ 2001-03-29 20:34 UTC (permalink / raw)
  To: overseers

On Thu, Mar 29, 2001 at 09:27:37PM -0500, Christopher Faylor wrote:

> The end result was that when I wrote to /dev/hda4, I was also randomly
> writing to /dev/hda3.

Ouch!  It wasn't just a partition problem where they were overlapping
or something like that?  I've seen that happen before, where you
write to one of the partitions and you're scrubbing all over one
of the other ones in the process.
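
Jason's overlapping-partition theory is easy to test once the partition table is in hand. The sketch below is hypothetical: it assumes each partition is described by a start sector and a length in sectors (as a partitioning tool would report them), and the sector numbers in the example are invented for illustration, not taken from the sourceware machine. Two partitions overlap when one starts before the previous one ends.

```python
from typing import List, NamedTuple, Tuple


class Partition(NamedTuple):
    """A partition as a half-open sector range [start, start + length)."""
    name: str
    start: int   # first sector
    length: int  # size in sectors


def find_overlaps(parts: List[Partition]) -> List[Tuple[str, str]]:
    """Return every pair of partition names whose sector ranges intersect."""
    overlaps = []
    ordered = sorted(parts, key=lambda p: p.start)
    for i, a in enumerate(ordered):
        a_end = a.start + a.length
        for b in ordered[i + 1:]:
            if b.start >= a_end:
                break  # list is sorted, so no later partition can overlap a
            overlaps.append((a.name, b.name))
    return overlaps


# Hypothetical table: hda4 starts before hda3 ends.
table = [
    Partition("hda1", 63, 1_000_000),
    Partition("hda3", 1_000_063, 5_000_000),
    Partition("hda4", 5_500_000, 2_000_000),
]
print(find_overlaps(table))  # [('hda3', 'hda4')]
```

Sorting by start sector means each partition only needs to be compared with its successors until one begins past its end.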

If that 9GB SCSI drive is still around, the vast majority of the
web site is static.  99% of all the web pages on the site are
mailing lists, and only a small fraction of those web archives have
changed since the drive swap (i.e. the current month/quarter/year
archives).  All of the other archives could be copied back from the
SCSI drive on top of whatever's there now.
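
Jason's restore plan (copy the old archives back, except the few that have changed since the swap) can be sketched as a filtered copy. This is a minimal sketch under stated assumptions: old_root and live_root are hypothetical mount points for the old SCSI disk and the live partition, swap_time is the moment of the drive swap, and a live file whose modification time is at or after swap_time is taken to be one that has been updated since (such as the current month's archive) and is left alone. It also assumes the corruption did not scramble modification times; if it did, an explicit list of paths to skip would be safer.

```python
import os
import shutil
from datetime import datetime
from typing import List


def restore_static_files(old_root: str, live_root: str, swap_time: datetime) -> List[str]:
    """Copy every file from old_root onto live_root unless the live copy was
    modified at or after swap_time (i.e. it has been updated since the swap).

    Returns the sorted relative paths that were copied.
    """
    cutoff = swap_time.timestamp()
    copied = []
    for dirpath, _dirnames, filenames in os.walk(old_root):
        for fname in filenames:
            src = os.path.join(dirpath, fname)
            rel = os.path.relpath(src, old_root)
            dst = os.path.join(live_root, rel)
            # Live copy updated since the swap (e.g. this month's archive):
            # the old disk's version is stale, so leave it alone.
            if os.path.exists(dst) and os.path.getmtime(dst) >= cutoff:
                continue
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.copy2(src, dst)  # overwrites dst; copies mtime from src
            copied.append(rel)
    return sorted(copied)
```

shutil.copy2 preserves the old file's modification time, so restored archives keep their original dates.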

Jason

* Re: hard disk fun == down time
From: Christopher Faylor @ 2001-03-29 20:56 UTC (permalink / raw)
  To: Jason Molenda; +Cc: overseers

On Thu, Mar 29, 2001 at 08:32:36PM -0800, Jason Molenda wrote:
>On Thu, Mar 29, 2001 at 09:27:37PM -0500, Christopher Faylor wrote:
>>The end result was that when I wrote to /dev/hda4, I was also randomly
>>writing to /dev/hda3.
>
>Ouch! It wasn't just a partition problem where they were overlapping or
>something like that?  I've seen that happen before, where you write to
>one of the partitions and you're scrubbing all over one of the other
>ones in the process.

Nah.  I'm pretty sure it was a Linux problem.  I had the same problem
here (at home).  I spent a week staring at partition tables and then
one day I decided to upgrade the OS.  The problem just disappeared.

Both Jeff and I looked at the partition table and it looked fine,
AFAWCT.

>If that 9GB SCSI drive is still around, the vast majority of the
>web site is static.  99% of all the web pages on the site are
>mailing lists, and only a small fraction of those web archives have
>changed since the drive swap (i.e. the current month/quarter/year
>archives).  All of the other archives could be copied back from the
>SCSI drive on top of whatever's there now.

Yep.  That's what we plan on doing.  I think the old SCSI disk is more
recent than any of the tape backups of that disk.

Btw, I've just learned that /dev/hda1 was affected too.  Sigh.

cgf

* Re: hard disk fun == down time
From: Jason Molenda @ 2001-03-29 21:06 UTC (permalink / raw)
  To: overseers

On Thu, Mar 29, 2001 at 09:27:37PM -0500, Christopher Faylor wrote:

> The end result was that when I wrote to /dev/hda4, I was also randomly
> writing to /dev/hda3.

I was just talking with Marc Rovner on the phone and mentioned the
problem.  He suggested that some old BIOSes had kooky confusion
about larger hard drives, and if Linux didn't know that the BIOS
might report incorrect information, hilarity could ensue.  Marc
said the trick was to specify explicitly the size of the drive
(cylinders, sectors, etc.) in Linux and everything would be OK.
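
To see why a geometry mismatch could send writes to the wrong place, consider the classic mapping from a cylinder/head/sector address to a linear block address: LBA = (C * heads + H) * sectors_per_track + (S - 1). The sketch below assumes two illustrative geometries (a 255-head translated geometry and a 16-head native one, both with 63 sectors per track); these are not the geometry of the drive discussed here. The same CHS address lands on very different sectors depending on which geometry is assumed, which is one way pinning the geometry explicitly, as Marc suggests, could matter.

```python
def chs_to_lba(c: int, h: int, s: int, heads: int, sectors_per_track: int) -> int:
    """Map a CHS address to an LBA under a given geometry.

    Cylinders and heads count from 0; sectors count from 1.
    """
    if not 0 <= h < heads:
        raise ValueError("head out of range for this geometry")
    if not 1 <= s <= sectors_per_track:
        raise ValueError("sector out of range for this geometry")
    return (c * heads + h) * sectors_per_track + (s - 1)


# Illustrative geometries (assumptions, not the real drive's values).
TRANSLATED = {"heads": 255, "sectors_per_track": 63}
NATIVE = {"heads": 16, "sectors_per_track": 63}

addr = (100, 5, 1)  # cylinder 100, head 5, sector 1
print(chs_to_lba(*addr, **TRANSLATED))  # 1606815
print(chs_to_lba(*addr, **NATIVE))      # 101115
```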

FWIW.

Jason

* Re: hard disk fun == down time
From: Christopher Faylor @ 2001-03-29 22:14 UTC (permalink / raw)
  To: Jason Molenda; +Cc: overseers

On Thu, Mar 29, 2001 at 09:05:51PM -0800, Jason Molenda wrote:
>On Thu, Mar 29, 2001 at 09:27:37PM -0500, Christopher Faylor wrote:
>
>> The end result was that when I wrote to /dev/hda4, I was also randomly
>> writing to /dev/hda3.
>
>I was just talking with Marc Rovner on the phone and mentioned the
>problem.  He suggested that some old BIOSes had kooky confusion
>about larger hard drives, and if Linux didn't know that the BIOS
>might report incorrect information, hilarity could ensue.  Marc
>said the trick was to specify explicitly the size of the drive
>(cylinders, sectors, etc.) in Linux and everything would be OK.
>
>FWIW.

Could be, but this happened to me with a BIOS from December 2000.
No amount of BIOS fiddling seemed to help.

cgf

* Re: hard disk fun == down time
From: law @ 2001-04-01 20:46 UTC (permalink / raw)
  To: Jason Molenda; +Cc: overseers

  In message <20010329203236.A2834@shell17.ba.best.com> you write:
  > On Thu, Mar 29, 2001 at 09:27:37PM -0500, Christopher Faylor wrote:
  > 
  > > The end result was that when I wrote to /dev/hda4, I was also randomly
  > > writing to /dev/hda3.
  > 
  > Ouch!  It wasn't just a partition problem where they were overlapping
  > or something like that?  I've seen that happen before, where you
  > write to one of the partitions and you're scrubbing all over one
  > of the other ones in the process.
Nyet. 

My guess is there's a kernel bug of some kind, possibly related to using
DMA on the IDE disks.  Dunno.


  > If that 9GB SCSI drive is still around, the vast majority of the
  > web site is static.  99% of all the web pages on the site are
  > mailing lists, and only a small fraction of those web archives have
  > changed since the drive swap (i.e. the current month/quarter/year
  > archives).  All of the other archives could be copied back from the
  > SCSI drive on top of whatever's there now.
I've still got it.  And using it to replace the lost data is the plan.

jeff
