system woes likely due to hardware/fs problems

public inbox for overseers@sourceware.org
 help / color / mirror / Atom feed

* system woes likely due to hardware/fs problems
@ 2004-04-06 21:04 Angela Marie Thomas
  2004-04-06 21:14 ` Frank Ch. Eigler
  2004-04-06 21:15 ` Ian Lance Taylor
  0 siblings, 2 replies; 13+ messages in thread
From: Angela Marie Thomas @ 2004-04-06 21:04 UTC (permalink / raw)
  To: overseers


Geoff and I were in Australia for 3 weeks so I'm behind in overseers
mail but did notice the recent htdig discussion and my backups today
weren't able to complete in a timely fashion either.  I took a look at
the system and eventually was able to find the following errors:

Apr  4 22:00:03 sourceware kernel: EXT3-fs error (device lvm(58,7)): ext3_add_entry: bad entry in directory #107864: rec_len %% 4 != 0 - offset=0, inode=1296707338, rec_len=28261, name_len=117
Apr  5 23:46:35 sourceware kernel: EXT3-fs error (device lvm(58,7)): ext3_add_entry: bad entry in directory #14523: rec_len %% 4 != 0 - offset=0, inode=2037591876, rec_len=12147, name_len=47
Apr  5 23:46:38 sourceware kernel: EXT3-fs warning (device lvm(58,7)): empty_dir: bad directory (dir #14521) - no `.' or `..'
Apr  5 23:46:38 sourceware kernel: EXT3-fs warning (device lvm(58,7)): ext3_rmdir: empty directory has nlink!=2 (3)
Apr  6 00:42:48 sourceware kernel: EXT3-fs error (device lvm(58,7)): ext3_readdir: bad entry in directory #83388: rec_len %% 4 != 0 - offset=0, inode=1701999700, rec_len=11621, name_len=115
Apr  6 00:42:48 sourceware kernel: EXT3-fs error (device lvm(58,7)): ext3_add_entry: bad entry in directory #83399: rec_len %% 4 != 0 - offset=0, inode=174613358, rec_len=8257, name_len=47
Apr  6 00:42:54 sourceware kernel: EXT3-fs warning (device lvm(58,7)): empty_dir: bad directory (dir #83388) - no `.' or `..'
Apr  6 00:42:54 sourceware kernel: EXT3-fs warning (device lvm(58,7)): ext3_rmdir: empty directory has nlink!=2 (3)
Apr  6 08:59:38 sourceware kernel: EXT3-fs error (device lvm(58,7)): ext3_add_entry: bad entry in directory #64668: rec_len %% 4 != 0 - offset=0, inode=1667458900, rec_len=13101, name_len=95
Apr  6 08:59:41 sourceware kernel: EXT3-fs warning (device lvm(58,7)): empty_dir: bad directory (dir #64668) - no `.' or `..'
Apr  6 08:59:41 sourceware kernel: EXT3-fs warning (device lvm(58,7)): ext3_rmdir: empty directory has nlink!=2 (4)

This may indicate more serious hardware issues.  I think it would be 
prudent to take the system down to verify the consistency of the fileystems
and take a look at the RAID diagnostics to see whether we have hardware
issues or just a random glitch causing woe.

Obviously ample notice must be given out before taking the system down.

--Angela

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: system woes likely due to hardware/fs problems
  2004-04-06 21:04 system woes likely due to hardware/fs problems Angela Marie Thomas
@ 2004-04-06 21:14 ` Frank Ch. Eigler
  2004-04-06 21:15 ` Ian Lance Taylor
  1 sibling, 0 replies; 13+ messages in thread
From: Frank Ch. Eigler @ 2004-04-06 21:14 UTC (permalink / raw)
  To: angela; +Cc: overseers

[-- Attachment #1: Type: text/plain, Size: 494 bytes --]

Hi -


> Apr  4 22:00:03 sourceware kernel: EXT3-fs error (device lvm(58,7)): ext3_add_entry: bad entry in directory #107864: rec_len %% 4 != 0 - offset=0, inode=1296707338, rec_len=28261, name_len=117
> Apr  5 23:46:35 sourceware kernel: EXT3-fs error (device lvm(58,7)): ext3_add_entry: bad entry in directory #14523: rec_len %% 4 != 0 - offset=0, inode=2037591876, rec_len=12147, name_len=47
> [...]

All these appear to relate to cvs-tmp, which thankfully is
small and disposable.


- FChE

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: system woes likely due to hardware/fs problems
  2004-04-06 21:04 system woes likely due to hardware/fs problems Angela Marie Thomas
  2004-04-06 21:14 ` Frank Ch. Eigler
@ 2004-04-06 21:15 ` Ian Lance Taylor
  2004-04-06 21:36   ` Angela Marie Thomas
  1 sibling, 1 reply; 13+ messages in thread
From: Ian Lance Taylor @ 2004-04-06 21:15 UTC (permalink / raw)
  To: angela; +Cc: overseers

Angela Marie Thomas <angela@foam.wonderslug.com> writes:

> Geoff and I were in Australia for 3 weeks so I'm behind in overseers
> mail but did notice the recent htdig discussion and my backups today
> weren't able to complete in a timely fashion either.  I took a look at
> the system and eventually was able to find the following errors:
> 
> Apr  4 22:00:03 sourceware kernel: EXT3-fs error (device lvm(58,7)): ext3_add_entry: bad entry in directory #107864: rec_len %% 4 != 0 - offset=0, inode=1296707338, rec_len=28261, name_len=117

...

> This may indicate more serious hardware issues.  I think it would be 
> prudent to take the system down to verify the consistency of the fileystems
> and take a look at the RAID diagnostics to see whether we have hardware
> issues or just a random glitch causing woe.

I also noticed those errors, but they are all on 58,7 which is
/sourceware/cvs-tmp, which is not very interesting, and is certainly
vulnerable to things like a directory getting rm -rf'ed by two
different processes simultaneously.  But if they could indicate
hardware problems, then more investigation would obviously be
appropriate.

To me it does seem that the system is mainly waiting for disk I/O, and
that the processes waiting are mainly running CVS.  But that is more
of a gut feeling based on top and vmstat, not something I can really
prove.

I looked at some network packets, and on a packet count basis, not a
data size basis, I saw numbers like this:
    SMTP: 32%
    HTTP: 27%
    FTP: 27%
    CVS: 12%

As can be seen, FTP seems to be a pretty big network user, and could
most easily be pushed off to mirror sites.  But I don't know that
network usage is our real problem here.

I note that the bandwidth reports at
    http://sourceware.org/sourceware/bandwidth/
appear to no longer be updated.

Ian

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: system woes likely due to hardware/fs problems
  2004-04-06 21:15 ` Ian Lance Taylor
@ 2004-04-06 21:36   ` Angela Marie Thomas
  2004-04-06 22:39     ` Matthew Galgoci
  0 siblings, 1 reply; 13+ messages in thread
From: Angela Marie Thomas @ 2004-04-06 21:36 UTC (permalink / raw)
  To: Ian Lance Taylor; +Cc: overseers

> I also noticed those errors, but they are all on 58,7 which is
> /sourceware/cvs-tmp, which is not very interesting, and is certainly
> vulnerable to things like a directory getting rm -rf'ed by two
> different processes simultaneously.  But if they could indicate
> hardware problems, then more investigation would obviously be
> appropriate.
> 
> To me it does seem that the system is mainly waiting for disk I/O, and
> that the processes waiting are mainly running CVS.  But that is more
> of a gut feeling based on top and vmstat, not something I can really
> prove.

The last time I noticed significant delays with my backups was a
disk problem.  It's true that cvs-tmp isn't interesting, but I also
wouldn't expect this to be a common problem which is why I thought
it would be worthwhile to make sure the RAID array is OK.  If there's
some way to access the RAID diagnostics or get some kind of hardware
status without taking the machine down, that would of course be ideal.
I think it would be easy to overlook the system running in degraded
mode, for example, with the volume of output in /var/log/messages.

> As can be seen, FTP seems to be a pretty big network user, and could
> most easily be pushed off to mirror sites.  But I don't know that
> network usage is our real problem here.

I'm pretty sure it's not a network issue.  Everything seems to point to
a disk IO issue.

--Angela

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: system woes likely due to hardware/fs problems
  2004-04-06 21:36   ` Angela Marie Thomas
@ 2004-04-06 22:39     ` Matthew Galgoci
  2004-04-06 22:48       ` Jason Molenda
  2004-04-07  2:49       ` Christopher Faylor
  0 siblings, 2 replies; 13+ messages in thread
From: Matthew Galgoci @ 2004-04-06 22:39 UTC (permalink / raw)
  To: overseers; +Cc: fche

> The last time I noticed significant delays with my backups was a
> disk problem.  It's true that cvs-tmp isn't interesting, but I also
> wouldn't expect this to be a common problem which is why I thought
> it would be worthwhile to make sure the RAID array is OK.  If there's
> some way to access the RAID diagnostics or get some kind of hardware
> status without taking the machine down, that would of course be ideal.
> I think it would be easy to overlook the system running in degraded
> mode, for example, with the volume of output in /var/log/messages.
> 
> > As can be seen, FTP seems to be a pretty big network user, and could
> > most easily be pushed off to mirror sites.  But I don't know that
> > network usage is our real problem here.
> 
> I'm pretty sure it's not a network issue.  Everything seems to point to
> a disk IO issue.

The aacraid card will spew failed/degraded messags to syslog.

I think that the issue is workload related. If nobody objects I'm going to install
the psacct process accounting package to gather some statistics on what the work load 
profile looks like over time.

Having observed the system for a bit, it seems like this process could be
causing issues:

/sourceware/www/sourceware/htdocs/cygwin/cgi-bin2/package-grep.cgi

How long has the package grep scipt been around? Seems like a fairly expensive
cgi script by the looks of it.

-- 
Matthew Galgoci
System Administrator and Sr. Manager of Ruminants
Red Hat, Inc
919.754.3700 x44155

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: system woes likely due to hardware/fs problems
  2004-04-06 22:39     ` Matthew Galgoci
@ 2004-04-06 22:48       ` Jason Molenda
  2004-04-06 22:53         ` Jason Molenda
  2004-04-06 23:32         ` Angela Marie Thomas
  2004-04-07  2:49       ` Christopher Faylor
  1 sibling, 2 replies; 13+ messages in thread
From: Jason Molenda @ 2004-04-06 22:48 UTC (permalink / raw)
  To: Matthew Galgoci; +Cc: overseers, fche

For what it's worth, /var/log/monitors has a periodic sampling of
a few different measures.  It's been well over a year since I've
looked at it so maybe it's completely broken, but my guess is that
it's still pretty much working.

The most important part is that it gives you some historical
data.  Whenever we have a system meltdown situation, everyone
logs in and says "Thirty httpds!  Sweet god!" (or whatever,
I don't mean to slam anyone poking at this, or any program), but 
that may be a perfectly normal load.

As for abusers -- I swear, if you look at any of the services on
the system, I'll bet you a nickel there's someone abusing it.
httpd, ftp, cvs, rsync - there's always some ignorant person doing
some while (1) system ("cvs update"); out there on the net over a
fat pipe.  Short of looking over the logs carefully by hand to spot
these folks, there's not much to do about it.

As for the load from the cygwin.com side of the site, the two
biggest resource consumers are definitely cygwin and gcc.  If we
wanted to reduce load, moving both of those projects away would be
a very effective way of accomplishing it -- no other projects even
come close to those two.  But for the most part, the system is
capable of handling this load.  It gets a little dicey on occasion,
especially when gcc releases become imminent and everyone is doing
a million "cvs update"s a day on branches, but the system survives
and generally works.

J

PS-  I don't mean to discourage anyone who is looking at the causes
of the current performance problem -- don't ever doubt the ability
of a small group of committed people can improve sourceware.
Indeed, it is the only thing that ever has.  :)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: system woes likely due to hardware/fs problems
  2004-04-06 22:48       ` Jason Molenda
@ 2004-04-06 22:53         ` Jason Molenda
  2004-04-06 23:32         ` Angela Marie Thomas
  1 sibling, 0 replies; 13+ messages in thread
From: Jason Molenda @ 2004-04-06 22:53 UTC (permalink / raw)
  To: Matthew Galgoci; +Cc: overseers, fche

On Tue, Apr 06, 2004 at 03:47:57PM -0700, Jason Molenda wrote:
> For what it's worth, /var/log/monitors has a periodic sampling of
> a few different measures.  It's been well over a year since I've
> looked at it so maybe it's completely broken, but my guess is that
> it's still pretty much working.



BTW I should note that most of the scripts do stuff like give the
ps entries for all running cvs processes.  So it's most useful like


cd /var/log/montiors
cd all-cvs
wc -l */*

and you get a list like

      2 2004-04-05/03:15
      2 2004-04-05/03:30
      8 2004-04-05/03:45
     18 2004-04-05/04:00
     19 2004-04-05/04:15
     19 2004-04-05/04:30
     20 2004-04-05/04:45
      6 2004-04-05/05:00
      8 2004-04-05/05:15
      5 2004-04-05/05:30
      6 2004-04-05/05:45
      8 2004-04-05/06:00
     16 2004-04-05/06:15
     16 2004-04-05/06:30
     15 2004-04-05/06:45
     10 2004-04-05/07:00
     10 2004-04-05/07:15
     13 2004-04-05/07:30
     16 2004-04-05/07:45
     12 2004-04-05/08:00


etc.  This particular one seems to be including cvsupd in its
output, so it's always off by one.  But you get the idea.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: system woes likely due to hardware/fs problems
  2004-04-06 22:48       ` Jason Molenda
  2004-04-06 22:53         ` Jason Molenda
@ 2004-04-06 23:32         ` Angela Marie Thomas
  2004-04-06 23:59           ` Jason Molenda
  1 sibling, 1 reply; 13+ messages in thread
From: Angela Marie Thomas @ 2004-04-06 23:32 UTC (permalink / raw)
  To: Jason Molenda; +Cc: Matthew Galgoci, overseers, fche

> PS-  I don't mean to discourage anyone who is looking at the causes
> of the current performance problem -- don't ever doubt the ability
> of a small group of committed people can improve sourceware.
> Indeed, it is the only thing that ever has.  :)

I wasn't very concerned until I noticed that my backups hadn't
finished.  The last bit that gets backed up is /sourceware/www and
normally takes about an hour.  Today it was running for 4+ hours
and still hadn't finished.  4+ hours is closer to "doing it from
scratch" time.  The last time time this started happening, it was
a bad disk so when I saw FS errors, I was worried.

Looking at /var/log/monitors is defintely a good idea.  Just a
very quick peek makes me think we just got a lot more cvs requests
than usual.  Starting sometime last night and continuing today,
it looks like we're seeing 2X-3X more cvs processes than usual.
That could account for the slowness pretty easily.  Observing the
processes for the last hour or so, they're not hung or anything.
I'll keep a close eye on the backups for the rest of the week and
let folks know if I continue to see abnormally long times.

Also, Geoff told me he got the following error message today:
  % cvs -q update -d
  cannot mkdir /sourceware/cvs-tmp/cvs-serv22568/libstdc++-v3/testsuite/22_locale/num_put/put/char No such file or directory

So it's possible the cvs-tmp inconsistencies will eventually cause
problems for people (or this is a coincidence).

--Angela

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: system woes likely due to hardware/fs problems
  2004-04-06 23:32         ` Angela Marie Thomas
@ 2004-04-06 23:59           ` Jason Molenda
  2004-04-07  0:06             ` Ian Lance Taylor
  2004-04-07  2:55             ` Christopher Faylor
  0 siblings, 2 replies; 13+ messages in thread
From: Jason Molenda @ 2004-04-06 23:59 UTC (permalink / raw)
  To: angela; +Cc: Matthew Galgoci, overseers, fche

On Tue, Apr 06, 2004 at 04:48:40PM -0700, Angela Marie Thomas wrote:

> Looking at /var/log/monitors is defintely a good idea.  Just a
> very quick peek makes me think we just got a lot more cvs requests
> than usual.  Starting sometime last night and continuing today,
> it looks like we're seeing 2X-3X more cvs processes than usual.
> That could account for the slowness pretty easily.  Observing the

My guess is that we're seeing a backlog of cvs processes.  It may
be that the cvs load is no different from last week, but the
processes are taking long to complete this week so they're piling
up.  That could easily have been instigated by a long-running htdig
process.

BTW I untarred the logs from March -- the load on the server really
decreases over the week, so it's not too fair to compare April 1
(Thursday) to April 5th (Monday).  The load from this week is still
way over last week, of course.

Incidentally, if gcc is nearing a release or something like that,
that's going to be a major contributor to the current problems.
An htdig process running too long, plus a gcc release, would be
enough to cause quite a backlog.

J

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: system woes likely due to hardware/fs problems
  2004-04-06 23:59           ` Jason Molenda
@ 2004-04-07  0:06             ` Ian Lance Taylor
  2004-04-08 21:19               ` Gerald Pfeifer
  2004-04-07  2:55             ` Christopher Faylor
  1 sibling, 1 reply; 13+ messages in thread
From: Ian Lance Taylor @ 2004-04-07  0:06 UTC (permalink / raw)
  To: Jason Molenda; +Cc: angela, Matthew Galgoci, overseers, fche

Jason Molenda <jason-swarelist@molenda.com> writes:

> Incidentally, if gcc is nearing a release or something like that,
> that's going to be a major contributor to the current problems.
> An htdig process running too long, plus a gcc release, would be
> enough to cause quite a backlog.

It is perhaps worth noting that the gcc 3.3 snapshot process runs at
16:40 on Monday, from the gcc crontab.  16:40UTC is 12:40EDT and
9:40PDT, so the gcc 3.3 snapshot process start running at a pretty
high point of sourceware activity.  It's fairly time consuming--
yesterday, when the load was very high, I think bzip was running for
over an hour.

The gcc 3.4 snapshot process runs at 16:42 on Wednesday.

It would probably be a good idea to run the gcc snapshot processes at
a different time of day.

Ian

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: system woes likely due to hardware/fs problems
  2004-04-06 22:39     ` Matthew Galgoci
  2004-04-06 22:48       ` Jason Molenda
@ 2004-04-07  2:49       ` Christopher Faylor
  1 sibling, 0 replies; 13+ messages in thread
From: Christopher Faylor @ 2004-04-07  2:49 UTC (permalink / raw)
  To: overseers

On Tue, Apr 06, 2004 at 06:39:51PM -0400, Matthew Galgoci wrote:
>Having observed the system for a bit, it seems like this process could be
>causing issues:
>
>/sourceware/www/sourceware/htdocs/cygwin/cgi-bin2/package-grep.cgi
>
>How long has the package grep scipt been around? Seems like a fairly expensive
>cgi script by the looks of it.

This script is heavily used by cygwin users and has been around for some
time.

I'll take a look at optimizing it when I get a chance but I'm not
convinced that it really is the cause of system problems.

cgf

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: system woes likely due to hardware/fs problems
  2004-04-06 23:59           ` Jason Molenda
  2004-04-07  0:06             ` Ian Lance Taylor
@ 2004-04-07  2:55             ` Christopher Faylor
  1 sibling, 0 replies; 13+ messages in thread
From: Christopher Faylor @ 2004-04-07  2:55 UTC (permalink / raw)
  To: Jason Molenda; +Cc: angela, Matthew Galgoci, overseers, fche

On Tue, Apr 06, 2004 at 04:59:42PM -0700, Jason Molenda wrote:
>On Tue, Apr 06, 2004 at 04:48:40PM -0700, Angela Marie Thomas wrote:
>
>> Looking at /var/log/monitors is defintely a good idea.  Just a
>> very quick peek makes me think we just got a lot more cvs requests
>> than usual.  Starting sometime last night and continuing today,
>> it looks like we're seeing 2X-3X more cvs processes than usual.
>> That could account for the slowness pretty easily.  Observing the
>
>My guess is that we're seeing a backlog of cvs processes.  It may
>be that the cvs load is no different from last week, but the
>processes are taking long to complete this week so they're piling
>up.  That could easily have been instigated by a long-running htdig
>process.

I think that subversion.gnu.org may be having more problems so it may be
that we're seeing more anoncvs activity as a result.

cgf

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: system woes likely due to hardware/fs problems
  2004-04-07  0:06             ` Ian Lance Taylor
@ 2004-04-08 21:19               ` Gerald Pfeifer
  0 siblings, 0 replies; 13+ messages in thread
From: Gerald Pfeifer @ 2004-04-08 21:19 UTC (permalink / raw)
  To: Ian Lance Taylor; +Cc: Jason Molenda, angela, Matthew Galgoci, overseers, fche

On Wed, 6 Apr 2004, Ian Lance Taylor wrote:
> It would probably be a good idea to run the gcc snapshot processes at
> a different time of day.

I'll try to work on that tomorrow.

Gerald

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2004-04-08 21:19 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-04-06 21:04 system woes likely due to hardware/fs problems Angela Marie Thomas
2004-04-06 21:14 ` Frank Ch. Eigler
2004-04-06 21:15 ` Ian Lance Taylor
2004-04-06 21:36   ` Angela Marie Thomas
2004-04-06 22:39     ` Matthew Galgoci
2004-04-06 22:48       ` Jason Molenda
2004-04-06 22:53         ` Jason Molenda
2004-04-06 23:32         ` Angela Marie Thomas
2004-04-06 23:59           ` Jason Molenda
2004-04-07  0:06             ` Ian Lance Taylor
2004-04-08 21:19               ` Gerald Pfeifer
2004-04-07  2:55             ` Christopher Faylor
2004-04-07  2:49       ` Christopher Faylor

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).