* ongoing sourceware.org recovery from disk corruption
@ 2017-08-15 13:35 Frank Ch. Eigler
2017-08-15 14:02 ` Joseph Myers
2017-08-15 16:23 ` Joseph Myers
0 siblings, 2 replies; 6+ messages in thread
From: Frank Ch. Eigler @ 2017-08-15 13:35 UTC (permalink / raw)
To: Sourceware Overseers
Hi -
As you probably know, we had a planned shutdown last night for
installation of a PCI SSD card into sourceware.org=gcc.gnu.org,
expecting to benefit from much greater read speeds. Its installation
went fine, and the machine came back up fine. The planned LVM2
operations were begun to make the PCI SSD card an LV raid1 mirror, and
the HDD LV mirror half was made 'writemostly'.
This is when things started going wrong. I believe a kernel bug
(rhel6 2.6.32+) caused the mostly-null SSD LV mirror half to start
answering -some- reads, even from regions that had not yet finished
their initial mirror sync. This messed up ext4's brain, which started
corrupting metadata and some file content on the HDD half. Within a
few minutes, it was clear something was wrong, the SSD mirroring was
shut down and the machine was rebooted. That stopped any further
corruption.
Unfortunately, within those few minutes, a large number of files were
corrupted. While we have backups on /sourceware2 (now frozen) from
late the previous night (Aug. 13), the new work makes us loath to just
switch back to the backup and ditch the 24+ hours of un-backed-up work
before the corruption, and the new bits of work committed since then.
So we're proceeding to restore bits, file by file, when/as corruption
is found. It's silly laborious, and we'll appreciate your patience
and help identifying affected files. The version control repositories
appear fine now, /ftp is getting mass-restored (since it's apprx. all
old), so the most important stuff seems OK. There are reports of some
mailing list archives and wiki pages being broken; will look at those
next. Please come hang out on #overseers on irc.freenode.net to chat.
Sorry about this inconvenience. We (I) did not anticipate kernel bugs
messing up a Perfect Plan for speeding up our treasured little box.
In a little while, we'll try again, but with paranoid staging, some
manual fresh backups onto our backup server, and LVM snapshotting.
- FChE
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: ongoing sourceware.org recovery from disk corruption
2017-08-15 13:35 ongoing sourceware.org recovery from disk corruption Frank Ch. Eigler
@ 2017-08-15 14:02 ` Joseph Myers
2017-08-16 13:35 ` Joseph Myers
2017-08-15 16:23 ` Joseph Myers
1 sibling, 1 reply; 6+ messages in thread
From: Joseph Myers @ 2017-08-15 14:02 UTC (permalink / raw)
To: Frank Ch. Eigler; +Cc: Sourceware Overseers
On Tue, 15 Aug 2017, Frank Ch. Eigler wrote:
> So we're proceeding to restore bits, file by file, when/as corruption
> is found. It's silly laborious, and we'll appreciate your patience
> and help identifying affected files. The version control repositories
> appear fine now, /ftp is getting mass-restored (since it's apprx. all
> old), so the most important stuff seems OK. There are reports of some
> mailing list archives and wiki pages being broken; will look at those
> next. Please come hang out on #overseers on irc.freenode.net to chat.
Suggested general comparison methodology (for the whole backup against
live data, not just areas we know to be broken, and with due disregard of
any areas we are confident have been safely fixed now - and of course any
areas of old files that shouldn't change at all can just be restored with
rsync -c without such comparisons needed):
* If the file contents have changed but the timestamp hasn't, it's almost
certainly corruption (especially if the changes involve blocks of NUL
bytes) and restoring the old version makes sense.
* If a file has disappeared, it's very probably corruption and restoring
the old file is very probably safe (it *could* be a temporary file
captured by the backup, or something that was properly removed, but more
likely something lost by metadata corruption).
* If a file is new since the last backup, it may or may not also be
corrupted, but there's not much that can be done about it if it is
corrupted (except in special cases, e.g. getting copies of release files
or version control data from elsewhere).
* If a file's contents and timestamp have both changed, it may or may not
also be corrupted, and if it is it may or may not make sense to restore
from backup. Depending on how many such files there are, we may need to
consider case by case what should be done for particular kinds of files
(e.g. if they should be text, checking for NUL bytes would help indicate
corruption).
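The four cases above could be sketched roughly as follows. This is only an illustration, not a tool anyone on this thread actually ran: the function names, the category labels, and the 512-byte NUL-run threshold are all assumptions made for the example, and timestamps are compared at whole-second granularity.

```python
import filecmp
from pathlib import Path


def has_nul_run(path, run_length=512):
    """Return True if the file contains run_length consecutive NUL bytes.

    The threshold is an arbitrary assumption. The whole file is read into
    memory, which is fine for a sketch but not for very large files.
    """
    with open(path, "rb") as f:
        return (b"\0" * run_length) in f.read()


def classify(backup_root, live_root, rel_path):
    """Classify one relative path by comparing the backup and live copies."""
    backup = Path(backup_root) / rel_path
    live = Path(live_root) / rel_path
    if not backup.exists():
        return "new-since-backup"   # may or may not be corrupt; nothing to restore
    if not live.exists():
        return "missing-on-live"    # very probably safe to restore
    if filecmp.cmp(backup, live, shallow=False):
        return "unchanged"          # byte-for-byte identical
    # Contents differ: compare modification times in whole seconds only.
    backup_sec = backup.stat().st_mtime_ns // 1_000_000_000
    live_sec = live.stat().st_mtime_ns // 1_000_000_000
    if backup_sec == live_sec:
        return "likely-corrupt"     # changed contents, unchanged timestamp
    return "changed-review"         # both changed; needs case-by-case review
```

For files in the "changed-review" group that should be text, something like `has_nul_run(live_path)` could help flag probable corruption before deciding whether to restore.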
--
Joseph S. Myers
joseph@codesourcery.com
* Re: ongoing sourceware.org recovery from disk corruption
2017-08-15 13:35 ongoing sourceware.org recovery from disk corruption Frank Ch. Eigler
2017-08-15 14:02 ` Joseph Myers
@ 2017-08-15 16:23 ` Joseph Myers
1 sibling, 0 replies; 6+ messages in thread
From: Joseph Myers @ 2017-08-15 16:23 UTC (permalink / raw)
To: Frank Ch. Eigler; +Cc: Sourceware Overseers
Is there any plan regarding Bugzilla (and anything else using databases, I
don't know what there might be beyond Bugzilla)? Inform all the affected
projects of the exact period for which data was lost (extends back to
Sunday afternoon at least according to
<https://gcc.gnu.org/ml/gcc/2017-08/msg00146.html>) and ask them to
systematically redo changes / refile bugs (possibly with new bug numbers)
based on -bugs list archives, or something else? For GCC Bugzilla that's
GCC and Classpath as affected projects; rather more projects for
Sourceware Bugzilla though probably much less activity to restore in
total.
For GCC Bugzilla, the change from
<https://gcc.gnu.org/ml/gcc-bugs/2017-08/msg01289.html> appears to be
present, but not that from
<https://gcc.gnu.org/ml/gcc-bugs/2017-08/msg01290.html> (i.e. a cut-off
early Sunday morning UTC). I don't know for sourceware Bugzilla but it's
at least consistent with that from checking glibc-bugs (but with a larger
gap in time between present and absent changes; there are lots of other
projects that might help narrow down the time).
--
Joseph S. Myers
joseph@codesourcery.com
* Re: ongoing sourceware.org recovery from disk corruption
2017-08-15 14:02 ` Joseph Myers
@ 2017-08-16 13:35 ` Joseph Myers
2017-08-18 22:30 ` Joseph Myers
0 siblings, 1 reply; 6+ messages in thread
From: Joseph Myers @ 2017-08-16 13:35 UTC (permalink / raw)
To: Frank Ch. Eigler; +Cc: Sourceware Overseers
On Tue, 15 Aug 2017, Joseph Myers wrote:
> Suggested general comparison methodology (for the whole backup against
> live data, not just areas we know to be broken, and with due disregard of
> any areas we are confident have been safely fixed now - and of course any
> areas of old files that shouldn't change at all can just be restored with
> rsync -c without such comparisons needed):
>
> * If the file contents have changed but the timestamp hasn't, it's almost
> certainly corruption (especially if the changes involve blocks of NUL
> bytes) and restoring the old version makes sense.
To be clear: by timestamp I'm referring to the *seconds* part of the
timestamp, not the nanoseconds (whereas e.g. Python stat returns float
values for timestamps including nanoseconds by default). Whatever
comparison is used needs to compare seconds only. E.g., for one case of a
corrupted file I noticed updating an old src checkout:
-r--r--r--. 1 corinna src 106630 2015-01-14 09:56:02.000000000 +0000 /sourceware2/projects/src-home/cvsfiles/src/libgloss/or1k/include/or1k-sprs.h,v
-r--r--r--. 1 corinna src 106630 2015-01-14 09:56:02.398317319 +0000 /sourceware/projects/src-home/cvsfiles/src/libgloss/or1k/include/or1k-sprs.h,v
(apparently the /sourceware2 backup process didn't preserve nanoseconds; I
think rsync may not support them).
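A seconds-only comparison in Python might look like the sketch below. The helper names are made up for illustration; `st_mtime_ns` is the integer-nanosecond field of `os.stat()`, and integer division drops the sub-second part without going through floats.

```python
import os


def mtime_seconds(path):
    """Modification time in whole seconds, discarding the nanosecond part."""
    return os.stat(path).st_mtime_ns // 1_000_000_000


def same_mtime_second(path_a, path_b):
    """True if both files were last modified within the same second."""
    return mtime_seconds(path_a) == mtime_seconds(path_b)
```

With this, the two or1k-sprs.h,v copies listed above would count as having the same timestamp even though only one of them carries a non-zero nanosecond field.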
--
Joseph S. Myers
joseph@codesourcery.com
* Re: ongoing sourceware.org recovery from disk corruption
2017-08-16 13:35 ` Joseph Myers
@ 2017-08-18 22:30 ` Joseph Myers
2017-08-21 11:06 ` Joseph Myers
0 siblings, 1 reply; 6+ messages in thread
From: Joseph Myers @ 2017-08-18 22:30 UTC (permalink / raw)
To: Frank Ch. Eigler; +Cc: Sourceware Overseers
On Wed, 16 Aug 2017, Joseph Myers wrote:
> To be clear: by timestamp I'm referring to the *seconds* part of the
> timestamp, not the nanoseconds (whereas e.g. Python stat returns float
> values for timestamps including nanoseconds by default). Whatever
> comparison is used needs to compare seconds only. E.g., for one case of a
> corrupted file I noticed updating an old src checkout:
>
> -r--r--r--. 1 corinna src 106630 2015-01-14 09:56:02.000000000 +0000 /sourceware2/projects/src-home/cvsfiles/src/libgloss/or1k/include/or1k-sprs.h,v
> -r--r--r--. 1 corinna src 106630 2015-01-14 09:56:02.398317319 +0000 /sourceware/projects/src-home/cvsfiles/src/libgloss/or1k/include/or1k-sprs.h,v
Any updates on the status of the search for / fix of such
same-timestamp-different-contents files? This pair still have different
contents, for example.
On the missing-files front, in another cvs update I noticed that
/sourceware/projects/gcc-home/cvsfiles/wwwdocs/htdocs/onlinedocs is
missing the entire gcc-4.8.1 subdirectory (present in the /sourceware2
copy). Presumably there are various other such missing files and
directories (but as the search for them only needs to compare directory
contents, not read the files themselves, it should be quicker).
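As a rough sketch of that directory-only search (the function name and its behaviour are assumptions for illustration, not an existing script): walk the backup tree, list the matching live directory, and report names that exist only in the backup. A directory that is missing entirely is reported once rather than file by file.

```python
import os


def missing_on_live(backup_root, live_root):
    """Yield relative paths present under backup_root but absent under live_root.

    Only directory listings are compared; no file contents are read.
    """
    for dirpath, dirnames, filenames in os.walk(backup_root):
        rel_dir = os.path.relpath(dirpath, backup_root)
        live_dir = os.path.normpath(os.path.join(live_root, rel_dir))
        try:
            live_names = set(os.listdir(live_dir))
        except OSError:
            # Live side is unreadable or not a directory: treat as empty.
            live_names = set()
        for name in sorted(dirnames + filenames):
            if name not in live_names:
                yield os.path.normpath(os.path.join(rel_dir, name))
        # Prune: do not descend into directories already reported as missing.
        dirnames[:] = [d for d in dirnames if d in live_names]
```

Run against the two trees, this would be expected to report the gcc-4.8.1 directory as a single entry; whether each reported path should actually be restored is still a judgment call.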
--
Joseph S. Myers
joseph@codesourcery.com
* Re: ongoing sourceware.org recovery from disk corruption
2017-08-18 22:30 ` Joseph Myers
@ 2017-08-21 11:06 ` Joseph Myers
0 siblings, 0 replies; 6+ messages in thread
From: Joseph Myers @ 2017-08-21 11:06 UTC (permalink / raw)
To: Frank Ch. Eigler; +Cc: Sourceware Overseers
On Fri, 18 Aug 2017, Joseph Myers wrote:
> On Wed, 16 Aug 2017, Joseph Myers wrote:
>
> > To be clear: by timestamp I'm referring to the *seconds* part of the
> > timestamp, not the nanoseconds (whereas e.g. Python stat returns float
> > values for timestamps including nanoseconds by default). Whatever
> > comparison is used needs to compare seconds only. E.g., for one case of a
> > corrupted file I noticed updating an old src checkout:
> >
> > -r--r--r--. 1 corinna src 106630 2015-01-14 09:56:02.000000000 +0000 /sourceware2/projects/src-home/cvsfiles/src/libgloss/or1k/include/or1k-sprs.h,v
> > -r--r--r--. 1 corinna src 106630 2015-01-14 09:56:02.398317319 +0000 /sourceware/projects/src-home/cvsfiles/src/libgloss/or1k/include/or1k-sprs.h,v
>
> Any updates on the status of the search for / fix of such
> same-timestamp-different-contents files? This pair still have different
> contents, for example.
I see that ,v file has now been fixed, thanks, while some other
same-timestamp-different-contents fixes remain pending.
> On the missing-files front, in another cvs update I noticed that
> /sourceware/projects/gcc-home/cvsfiles/wwwdocs/htdocs/onlinedocs is
> missing the entire gcc-4.8.1 subdirectory (present in the /sourceware2
> copy). Presumably there are various other such missing files and
> directories (but as the search for them only needs to compare directory
> contents, not read the files themselves, it should be quicker).
I've identified a further form of corruption that could be automatically
searched for. I was looking at the wiki problems (GCC wiki, at least,
missing various pages, as can be seen by the links in grey on its home
page). For example, <https://gcc.gnu.org/wiki/OpenACC> is missing. Now
the relevant files are:
$ ls -l /sourceware{,2}/projects/gcc-home/wikidata/data/pages/OpenACC
/sourceware2/projects/gcc-home/wikidata/data/pages/OpenACC:
total 16
drwxrwx---. 2 apache apache 4096 Jul 27 18:47 cache
-rw-rw----. 1 apache apache 9 Jul 27 18:47 current
-rw-rw----. 1 apache apache 3902 Jul 27 18:47 edit-log
drwxrwx---. 2 apache apache 4096 Jul 27 18:47 revisions
/sourceware/projects/gcc-home/wikidata/data/pages/OpenACC:
total 0
-rw-rw----. 1 apache apache 0 Aug 15 01:24 edit-log
So there are missing files and directories, which should be covered by a
global restore to /sourceware of files and directories present only on
/sourceware2 (minus any cases where it seems such a restore is
inappropriate for some reason). But also the edit-log file ended up being
zero-size with a new timestamp - so also needs restoring from
/sourceware2. That suggests another search: for files that are zero-size
on /sourceware but non-zero size on /sourceware2 (which also should
generally be pretty safe to restore from the /sourceware2 copies).
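A possible sketch of that zero-size search follows; the function name is hypothetical and this is not something already run on the server. It only looks at regular files that exist on both sides, leaving files missing from /sourceware to the separate missing-files search.

```python
import os
import stat


def truncated_on_live(backup_root, live_root):
    """Yield relative paths that are empty on live but non-empty in the backup."""
    for dirpath, _dirnames, filenames in os.walk(backup_root):
        rel_dir = os.path.relpath(dirpath, backup_root)
        for name in filenames:
            backup_path = os.path.join(dirpath, name)
            live_path = os.path.join(live_root, rel_dir, name)
            try:
                live_st = os.lstat(live_path)
            except OSError:
                continue  # missing on live: handled by the missing-files search
            backup_st = os.lstat(backup_path)
            if (stat.S_ISREG(live_st.st_mode)
                    and stat.S_ISREG(backup_st.st_mode)
                    and live_st.st_size == 0
                    and backup_st.st_size > 0):
                yield os.path.normpath(os.path.join(rel_dir, name))
```

Only stat information is needed here, so like the missing-files search it should not have to read any file contents.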
(The timestamps on the empty edit-log files for missing pages vary; cf. the
LinkTimeOptimizationFAQ, JIT pages as other examples. Possibly the empty
edit-log files result from something the wiki software does on
encountering corrupted page data, rather than resulting directly from the
corruption itself - but anyway, the edit-log files that are now zero-size
should be restored along with other missing/corrupted wiki page data, in
any wiki.)
--
Joseph S. Myers
joseph@codesourcery.com