public inbox for overseers@sourceware.org
* ongoing sourceware.org recovery from disk corruption
@ 2017-08-15 13:35 Frank Ch. Eigler
  2017-08-15 14:02 ` Joseph Myers
  2017-08-15 16:23 ` Joseph Myers
  0 siblings, 2 replies; 6+ messages in thread
From: Frank Ch. Eigler @ 2017-08-15 13:35 UTC (permalink / raw)
  To: Sourceware Overseers

Hi -

As you probably know, we had a planned shutdown last night for
installation of a PCI SSD card into sourceware.org=gcc.gnu.org,
expecting to benefit from much greater read speeds.  Its installation
went fine, and the machine came back up fine.  The planned LVM2
operations were begun to make the PCI SSD card an LV raid1 mirror, and
the HDD LV mirror half was made 'writemostly'.

This is when things started going wrong.  I believe a kernel bug
(rhel6 2.6.32+) caused the mostly-null SSD LV mirror half to start
answering -some- reads, even from regions that had not yet finished
their initial mirror sync.  This messed up ext4's brain, which started
corrupting metadata and some file content on the HDD half.  Within a
few minutes, it was clear something was wrong; the SSD mirroring was
shut down and the machine was rebooted.  That stopped any further
corruption.

Unfortunately, within those few minutes, a large number of files were
corrupted.  While we have backups on /sourceware2 (now frozen) from
late the previous night (Aug. 13), the new work makes us loath to just
switch back to the backup and ditch the 24+ hours of un-backed-up work
before the corruption, and the new bits of work committed since then.

So we're proceeding to restore bits, file by file, when/as corruption
is found.  It's silly laborious, and we'll appreciate your patience
and help identifying affected files.  The version control repositories
appear fine now, /ftp is getting mass-restored (since it's apprx. all
old), so the most important stuff seems OK.  There are reports of some
mailing list archives and wiki pages being broken; will look at those
next.  Please come hang out on #overseers on irc.freenode.net to chat.

Sorry about this inconvenience.  We (I) did not anticipate kernel bugs
messing up a Perfect Plan for speeding up our treasured little box.
In a little while, we'll try again, but with paranoid staging, some
manual fresh backups onto our backup server, and LVM snapshotting.


- FChE

* Re: ongoing sourceware.org recovery from disk corruption
  2017-08-15 13:35 ongoing sourceware.org recovery from disk corruption Frank Ch. Eigler
@ 2017-08-15 14:02 ` Joseph Myers
  2017-08-16 13:35   ` Joseph Myers
  2017-08-15 16:23 ` Joseph Myers
  1 sibling, 1 reply; 6+ messages in thread
From: Joseph Myers @ 2017-08-15 14:02 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: Sourceware Overseers

On Tue, 15 Aug 2017, Frank Ch. Eigler wrote:

> So we're proceeding to restore bits, file by file, when/as corruption
> is found.  It's silly laborious, and we'll appreciate your patience
> and help identifying affected files.  The version control repositories
> appear fine now, /ftp is getting mass-restored (since it's apprx. all
> old), so the most important stuff seems OK.  There are reports of some
> mailing list archives and wiki pages being broken; will look at those
> next.  Please come hang out on #overseers on irc.freenode.net to chat.

Suggested general comparison methodology (for the whole backup against 
live data, not just areas we know to be broken, and with due disregard of 
any areas we are confident have been safely fixed now - and of course any 
areas of old files that shouldn't change at all can just be restored with 
rsync -c without such comparisons needed):

* If the file contents have changed but the timestamp hasn't, it's almost 
certainly corruption (especially if the changes involve blocks of NUL 
bytes) and restoring the old version makes sense.

* If a file has disappeared, it's very probably corruption and restoring 
the old file is very probably safe (it *could* be a temporary file 
captured by the backup, or something that was properly removed, but more 
likely something lost by metadata corruption).

* If a file is new since the last backup, it may or may not also be 
corrupted, but there's not much that can be done about it if it is 
corrupted (except in special cases, e.g. getting copies of release files 
or version control data from elsewhere).

* If a file's contents and timestamp have both changed, it may or may not 
also be corrupted, and if it is it may or may not make sense to restore 
from backup.  Depending on how many such files there are, we may need to 
consider case by case what should be done for particular kinds of files 
(e.g. if they should be text, checking for NUL bytes would help indicate 
corruption).
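
As an illustration only (not a script used in the actual recovery), the
four cases above could be sketched in Python roughly as follows.  The
function names classify and has_nul_bytes and the returned labels are
hypothetical; the sketch assumes both paths name regular files and that
at least one of them exists.

```python
import filecmp
import os


def classify(backup_path, live_path):
    """Classify one backup/live file pair per the four cases above.

    Assumption: regular files only; symlinks and directories are out of scope.
    """
    backup_exists = os.path.exists(backup_path)
    live_exists = os.path.exists(live_path)
    if backup_exists and not live_exists:
        return "missing"    # very probably lost to metadata corruption
    if live_exists and not backup_exists:
        return "new"        # nothing in the backup to compare against

    b = os.stat(backup_path)
    l = os.stat(live_path)
    # shallow=False forces a byte-by-byte comparison instead of trusting
    # size and mtime, which is exactly what cannot be trusted here.
    if b.st_size == l.st_size and filecmp.cmp(backup_path, live_path,
                                              shallow=False):
        return "unchanged"

    # Compare whole seconds only; the backup may have dropped nanoseconds.
    same_second = (b.st_mtime_ns // 1_000_000_000
                   == l.st_mtime_ns // 1_000_000_000)
    # Same timestamp but different contents: almost certainly corruption.
    # Different timestamp as well: needs case-by-case review.
    return "corrupt" if same_second else "review"


def has_nul_bytes(path, chunk_size=1 << 20):
    """Return True if the file contains a NUL byte (a hint for text files)."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return False
            if b"\0" in chunk:
                return True
```

For pairs labelled "review" that should be text, has_nul_bytes on the
live copy gives one cheap extra signal before deciding whether to
restore.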

-- 
Joseph S. Myers
joseph@codesourcery.com

* Re: ongoing sourceware.org recovery from disk corruption
  2017-08-15 13:35 ongoing sourceware.org recovery from disk corruption Frank Ch. Eigler
  2017-08-15 14:02 ` Joseph Myers
@ 2017-08-15 16:23 ` Joseph Myers
  1 sibling, 0 replies; 6+ messages in thread
From: Joseph Myers @ 2017-08-15 16:23 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: Sourceware Overseers

Is there any plan regarding Bugzilla (and anything else using databases, I 
don't know what there might be beyond Bugzilla)?  Inform all the affected 
projects of the exact period for which data was lost (extends back to 
Sunday afternoon at least according to 
<https://gcc.gnu.org/ml/gcc/2017-08/msg00146.html>) and ask them to 
systematically redo changes / refile bugs (possibly with new bug numbers) 
based on -bugs list archives, or something else?  For GCC Bugzilla that's 
GCC and Classpath as affected projects; rather more projects for 
Sourceware Bugzilla though probably much less activity to restore in 
total.

For GCC Bugzilla, the change from 
<https://gcc.gnu.org/ml/gcc-bugs/2017-08/msg01289.html> appears to be 
present, but not that from 
<https://gcc.gnu.org/ml/gcc-bugs/2017-08/msg01290.html> (i.e. a cut-off 
early Sunday morning UTC).  I don't know for sourceware Bugzilla but it's 
at least consistent with that from checking glibc-bugs (but with a larger 
gap in time between present and absent changes; there are lots of other 
projects that might help narrow down the time).

-- 
Joseph S. Myers
joseph@codesourcery.com

* Re: ongoing sourceware.org recovery from disk corruption
  2017-08-15 14:02 ` Joseph Myers
@ 2017-08-16 13:35   ` Joseph Myers
  2017-08-18 22:30     ` Joseph Myers
  0 siblings, 1 reply; 6+ messages in thread
From: Joseph Myers @ 2017-08-16 13:35 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: Sourceware Overseers

On Tue, 15 Aug 2017, Joseph Myers wrote:

> Suggested general comparison methodology (for the whole backup against 
> live data, not just areas we know to be broken, and with due disregard of 
> any areas we are confident have been safely fixed now - and of course any 
> areas of old files that shouldn't change at all can just be restored with 
> rsync -c without such comparisons needed):
> 
> * If the file contents have changed but the timestamp hasn't, it's almost 
> certainly corruption (especially if the changes involve blocks of NUL 
> bytes) and restoring the old version makes sense.

To be clear: by timestamp I'm referring to the *seconds* part of the 
timestamp, not the nanoseconds (whereas e.g. Python stat returns float 
values for timestamps including nanoseconds by default).  Whatever 
comparison is used needs to compare seconds only.  E.g., for one case of a 
corrupted file I noticed updating an old src checkout:

-r--r--r--. 1 corinna src 106630 2015-01-14 09:56:02.000000000 +0000 /sourceware2/projects/src-home/cvsfiles/src/libgloss/or1k/include/or1k-sprs.h,v
-r--r--r--. 1 corinna src 106630 2015-01-14 09:56:02.398317319 +0000 /sourceware/projects/src-home/cvsfiles/src/libgloss/or1k/include/or1k-sprs.h,v

(apparently the /sourceware2 backup process didn't preserve nanoseconds; I 
think rsync may not support them).
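
A minimal sketch of a seconds-only comparison in Python, for
illustration: st_mtime_ns is the integer-nanosecond field of
os.stat_result, so integer division avoids the float st_mtime
altogether.  The helper name same_mtime_second is hypothetical.

```python
import os


def same_mtime_second(path_a, path_b):
    """Return True if both files were last modified in the same whole second.

    Nanoseconds are discarded, so a backup copy whose timestamp ends in
    .000000000 still matches a live copy ending in .398317319.
    """
    a = os.stat(path_a).st_mtime_ns // 1_000_000_000
    b = os.stat(path_b).st_mtime_ns // 1_000_000_000
    return a == b
```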

-- 
Joseph S. Myers
joseph@codesourcery.com

* Re: ongoing sourceware.org recovery from disk corruption
  2017-08-16 13:35   ` Joseph Myers
@ 2017-08-18 22:30     ` Joseph Myers
  2017-08-21 11:06       ` Joseph Myers
  0 siblings, 1 reply; 6+ messages in thread
From: Joseph Myers @ 2017-08-18 22:30 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: Sourceware Overseers

On Wed, 16 Aug 2017, Joseph Myers wrote:

> To be clear: by timestamp I'm referring to the *seconds* part of the 
> timestamp, not the nanoseconds (whereas e.g. Python stat returns float 
> values for timestamps including nanoseconds by default).  Whatever 
> comparison is used needs to compare seconds only.  E.g., for one case of a 
> corrupted file I noticed updating an old src checkout:
> 
> -r--r--r--. 1 corinna src 106630 2015-01-14 09:56:02.000000000 +0000 /sourceware2/projects/src-home/cvsfiles/src/libgloss/or1k/include/or1k-sprs.h,v
> -r--r--r--. 1 corinna src 106630 2015-01-14 09:56:02.398317319 +0000 /sourceware/projects/src-home/cvsfiles/src/libgloss/or1k/include/or1k-sprs.h,v

Any updates on the status of the search for / fix of such 
same-timestamp-different-contents files?  This pair still have different 
contents, for example.

On the missing-files front, in another cvs update I noticed that 
/sourceware/projects/gcc-home/cvsfiles/wwwdocs/htdocs/onlinedocs is 
missing the entire gcc-4.8.1 subdirectory (present in the /sourceware2 
copy).  Presumably there are various other such missing files and 
directories (but as the search for them only needs to compare directory 
contents, not read the files themselves, it should be quicker).
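
One way such a directory-only search might look, as a hedged Python
sketch: it walks the backup tree and reports anything with no
counterpart in the live tree, without opening any file.  The function
name missing_from_live and its arguments are hypothetical, and the two
roots are assumed to mirror each other's layout.

```python
import os


def missing_from_live(backup_root, live_root):
    """Yield paths (relative to backup_root) that are absent under live_root.

    Only directory listings and existence checks are used; no file
    contents are read.
    """
    for dirpath, dirnames, filenames in os.walk(backup_root):
        rel_dir = os.path.relpath(dirpath, backup_root)
        # Iterate over a copy so entries can be pruned from dirnames.
        for name in list(dirnames):
            rel_path = os.path.normpath(os.path.join(rel_dir, name))
            if not os.path.lexists(os.path.join(live_root, rel_path)):
                yield rel_path
                # The whole subtree is missing: report it once and
                # do not descend into it.
                dirnames.remove(name)
        for name in filenames:
            rel_path = os.path.normpath(os.path.join(rel_dir, name))
            if not os.path.lexists(os.path.join(live_root, rel_path)):
                yield rel_path
```

A missing directory is reported once rather than file by file, which
keeps the output short when an entire subdirectory has vanished.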

-- 
Joseph S. Myers
joseph@codesourcery.com

* Re: ongoing sourceware.org recovery from disk corruption
  2017-08-18 22:30     ` Joseph Myers
@ 2017-08-21 11:06       ` Joseph Myers
  0 siblings, 0 replies; 6+ messages in thread
From: Joseph Myers @ 2017-08-21 11:06 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: Sourceware Overseers

On Fri, 18 Aug 2017, Joseph Myers wrote:

> On Wed, 16 Aug 2017, Joseph Myers wrote:
> 
> > To be clear: by timestamp I'm referring to the *seconds* part of the 
> > timestamp, not the nanoseconds (whereas e.g. Python stat returns float 
> > values for timestamps including nanoseconds by default).  Whatever 
> > comparison is used needs to compare seconds only.  E.g., for one case of a 
> > corrupted file I noticed updating an old src checkout:
> > 
> > -r--r--r--. 1 corinna src 106630 2015-01-14 09:56:02.000000000 +0000 /sourceware2/projects/src-home/cvsfiles/src/libgloss/or1k/include/or1k-sprs.h,v
> > -r--r--r--. 1 corinna src 106630 2015-01-14 09:56:02.398317319 +0000 /sourceware/projects/src-home/cvsfiles/src/libgloss/or1k/include/or1k-sprs.h,v
> 
> Any updates on the status of the search for / fix of such 
> same-timestamp-different-contents files?  This pair still have different 
> contents, for example.

I see that ,v file has now been fixed, thanks, while some other 
same-timestamp-different-contents fixes remain pending.

> On the missing-files front, in another cvs update I noticed that 
> /sourceware/projects/gcc-home/cvsfiles/wwwdocs/htdocs/onlinedocs is 
> missing the entire gcc-4.8.1 subdirectory (present in the /sourceware2 
> copy).  Presumably there are various other such missing files and 
> directories (but as the search for them only needs to compare directory 
> contents, not read the files themselves, it should be quicker).

I've identified a further form of corruption that could be automatically 
searched for.  I was looking at the wiki problems (GCC wiki, at least, 
missing various pages, as can be seen by the links in grey on its home 
page).  For example, <https://gcc.gnu.org/wiki/OpenACC> is missing.  Now 
the relevant files are:

$ ls -l /sourceware{,2}/projects/gcc-home/wikidata/data/pages/OpenACC
/sourceware2/projects/gcc-home/wikidata/data/pages/OpenACC:
total 16
drwxrwx---. 2 apache apache 4096 Jul 27 18:47 cache
-rw-rw----. 1 apache apache    9 Jul 27 18:47 current
-rw-rw----. 1 apache apache 3902 Jul 27 18:47 edit-log
drwxrwx---. 2 apache apache 4096 Jul 27 18:47 revisions

/sourceware/projects/gcc-home/wikidata/data/pages/OpenACC:
total 0
-rw-rw----. 1 apache apache 0 Aug 15 01:24 edit-log

So there are missing files and directories, which should be covered by a 
global restore to /sourceware of files and directories present only on 
/sourceware2 (minus any cases where it seems such a restore is 
inappropriate for some reason).  But also the edit-log file ended up being 
zero-size with a new timestamp - so also needs restoring from 
/sourceware2.  That suggests another search: for files that are zero-size 
on /sourceware but non-zero size on /sourceware2 (which also should 
generally be pretty safe to restore from the /sourceware2 copies).
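
A rough Python sketch of that zero-size search, for illustration only:
it walks the live tree and reports files that are empty there but
non-empty in the backup.  The name truncated_in_live is hypothetical,
and the two roots are assumed to share the same layout.

```python
import os


def truncated_in_live(backup_root, live_root):
    """Yield relative paths that are zero-size live but non-empty in backup.

    Only stat information is used; no file contents are read.
    """
    for dirpath, _dirnames, filenames in os.walk(live_root):
        rel_dir = os.path.relpath(dirpath, live_root)
        for name in filenames:
            rel_path = os.path.normpath(os.path.join(rel_dir, name))
            live_path = os.path.join(live_root, rel_path)
            backup_path = os.path.join(backup_root, rel_path)
            try:
                live_size = os.lstat(live_path).st_size
                backup_size = os.lstat(backup_path).st_size
            except FileNotFoundError:
                # No backup copy (or the file vanished mid-scan): nothing
                # to restore from, so skip it.
                continue
            if live_size == 0 and backup_size > 0:
                yield rel_path
```

Files that are legitimately empty in both trees are not reported, so the
result is a candidate list for restoring from the backup copies.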

(The timestamps on the empty edit-log files for missing pages vary; cf. the 
LinkTimeOptimizationFAQ and JIT pages as other examples.  Possibly the empty 
edit-log files result from something the wiki software does on 
encountering corrupted page data, rather than resulting directly from the 
corruption itself - but anyway, the edit-log files that are now zero-size 
should be restored along with other missing/corrupted wiki page data, in 
any wiki.)

-- 
Joseph S. Myers
joseph@codesourcery.com
