|
Original poster |
Posted on 2009-5-8 21:30:36
|
7 May 2009 22:03:43 UTC
I came in this morning and went about my normal chores, including checking the raw data pipeline. We have automated scripts to do most of the work, including one called "splitter_janitor" which finds files ready for deletion, takes some action, and mails me/Jeff the results. Well, I didn't get any mail. So I looked at the system in question, thumper, and found the script was hung. Some poking around led me to discover that thumper was having trouble mounting directories on server ewen (Eric's hydrogen study server, which actually crashed yesterday but came up again just fine). Well, other machines were mounting ewen just fine. So what gives?
Sometimes the automounter needs a kick, so I restarted that. No dice. I restarted nfs/nfslock to no avail either. Hunh. Around this time I noticed the primary master science database, also on thumper, had gotten wedged. Great. Eric/Jeff were brought into the fold, but nobody had any great ideas as to what was wrong and therefore how to fix it. We started killing processes one by one, including the database engine itself, which could only be stopped with a kill -9 (which isn't optimal, but Informix has always been perfect at recovering from such ugly shutdowns). Even with an empty process queue we still had mounting problems.
Normally one of the first things to try is a reboot, as this is easy and usually works, but we were loath to reboot thumper since (as you might remember if you are an avid reader of these threads) its root RAID has some funkiness where, even if it's healthy, it will show up as degraded (and require a long resync) upon reboot. But we had no choice at this point, so we rebooted it, and sure enough the system booted just fine (and we could mount everything again). That's the good news. The bad news is that our fears were realized, and we're in the middle of another long, painful root drive resync. The system is functional in the meantime, so really it's not that big a deal - it's just annoying, and perhaps a bit scary.
Well, that ate up my whole morning. Then I moved on to my PowerPoint/PHP tasks until Bob noticed the science database load was strangely low. This led to more snooping around, finally finding that our system vader (where the assimilators run) was having trouble mounting bruno's disks (where the result files are). So we weren't inserting results, which explains the bored science database. I rebooted vader, which is much easier than thumper, and that broke another dam.
- Matt |
|