Thread starter: BiscuiT

SETI@home 2008 Technical News

Posted 2008-5-22 07:55:17

21 May 2008 22:16:59 UTC

The BOINC MySQL replica wrapped up its resync. This morning Bob did some testing to see if we can improve our failure/recovery situation. MySQL allows different levels of log commitment to disk: commit only when the buffer is full, commit at least once a second, or commit on every transaction. We've been sticking with the middle option, as that affords us the most protection without heavy disk I/O - the worst case is that we lose one second's worth of data. However, we've proven a couple of times now that we do many updates per second (hundreds, in fact), and that's enough to bring the master/replica majorly out of sync if one crashes before being able to commit. So today we tried the last option, expecting an increase in disk I/O, and sure enough this commit level brought the database to its knees almost instantly. We tried this first on the replica and thought its software RAID or low number of spindles was causing the headache, but applying it to the heftier master had the same effect. So it's back to the drawing board on that front: we don't have the server capacity to commit on every transaction. Maybe there are other screws we can tighten to make this possible. Bob's looking into that. More tests to come, or we'll just put this on the back burner.
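
For the curious: the MySQL setting that roughly corresponds to these three behaviors - assuming InnoDB tables, which is an assumption on my part and not spelled out above - is innodb_flush_log_at_trx_commit. A minimal sketch:

    # my.cnf sketch - the three flush-to-disk policies, roughly:
    #   0 = write/flush the log about once a second (least safe, least I/O)
    #   2 = write the log at each commit, flush to disk about once a second
    #   1 = flush to disk on every commit (safest, heaviest I/O)
    [mysqld]
    innodb_flush_log_at_trx_commit = 2

    -- or, to experiment without restarting mysqld:
    SET GLOBAL innodb_flush_log_at_trx_commit = 1;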

Other than that... Got FC9 running on my desktop. So two computers are upgraded now, and I'm getting to understand all the gotchas. Also, Jeff and I are actually discussing SERENDIP again. Ever hear of that? That's the project we were working on before SETI@home happened, and it's been in limbo for about 10 years. But as Dan continues to build SERENDIP-like spectrometer boards to help other SETI scientists around the world, these other projects may want to incorporate our data collection/analysis software, so we'd better dust that off sooner rather than later. In the process we can maybe throw the old SERENDIP IV data into the same database as SETI@home to buff up our sensitivity even more. That's the hope, anyway.

- Matt

Posted 2008-5-23 07:45:41

22 May 2008 22:35:37 UTC

More database poking/prodding today. Tweaking different mysql variables (and even adding "noatime" and "nodiratime" to the mount options of the data partitions) didn't really help all that much with the transaction committing stuff I was whining about yesterday. So be it. Bob and I also found this morning that our science database indexes were in need of rebuilding as well. Every few weeks we need to run an "update statistics" query to keep those indexes in line.
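
For anyone playing along at home, the two tweaks mentioned above look roughly like this (the device, mount point, and table name below are made up for illustration):

    # /etc/fstab - skip access-time updates on a data partition
    /dev/sdb1   /mysql-data   ext3   defaults,noatime,nodiratime   1 2

    -- Informix - refresh optimizer/index statistics (run every few weeks)
    UPDATE STATISTICS MEDIUM FOR TABLE signal;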

Slowly working my way through the OS upgrade queue. We're getting FC9 installed on one of three recently donated servers (dual 2.80GHz Xeon / 4 GB RAM) so we can finally start getting these (and another equally powerful P4 server with more RAM, also recently donated) thrown into the fold. The use of these is still up for debate, though they will all be perfectly good general backup/redundant/compute servers. We are definitely missing some redundancy on the backend. I mean, we do have server "maul" sitting around, which is quite powerful, but being a test model donated by Intel it has an engineering motherboard with keyboard/mouse issues, so we don't want to trust it with anything that needs 24/7 uptime - instead it's up and running as a test/compute server, i.e. if it goes offline for any period of time we won't be sad.

Anything else? Just some work on more internal data plots for data integrity checking, and the final bits and pieces of that proposal which is due tomorrow.

- Matt

Posted 2008-5-28 07:32:39

27 May 2008 21:23:45 UTC

Long holiday weekend (Memorial Day). On the actual day off (yesterday) the BOINC web/download server was misbehaving. In theory I should have been able to connect to the KVM from home but that wasn't working properly (couldn't access via the web due to incompatibilities with newer JRE versions, couldn't access via the standalone client since I ain't got no Windows machines and the client only works on Windows, etc.) so I had to drive up to the lab to kick it in person. No big deal - just a runaway job that clobbered the process queue. Had the usual database backup outage today. Not much news to report.

To answer RHWhelan from my last thread:

> ...it seems that most of the data we analyze gets dumped soon after we report.

Not sure what you mean by "dumped," but nothing important is getting thrown out. Your SETI@home client reduces about 350K of raw data into a few signals, which get plopped into a result file and uploaded to our server. Once these signals are verified and put into our master database, the result file (and its sister row in the database) are deleted to make way for more. The signals themselves never get deleted.

> It also appears that the real staff spends more time transferring, storing and manipulating data and hardware than actually analyzing the results. I don't mean to be critical, I am actually very devoted to the philosophy of SETI but I must admit it seems a bit futile.

It appears that way because it's completely true. And there's nothing wrong with that. To be clear, the "real staff" running the entire show is me, Jeff, Eric, and Bob - all working part time (combined we're about 3 full-time employees). Anyway... I understand the feelings of frustration due to perceived futility - science takes time; underfunded/understaffed science takes even more. We're only just now turning the corner on the analysis. Until final results start appearing, we're still productively collecting/reducing data - not as interesting, but still quite useful. I don't expect everybody to maintain interest until we have some real data products, and then I expect interest to jump.

> Are there ever any "HITS" or even slightly suspicious data streams?

There are hits and then there are HITS. We haven't really looked for the HITS yet, as we were unable to until very recently (that part is working now in beta). There are no data "streams," as data don't come to us in streams - the earth rotates, so a signal that persists over time and actually originates from outer space will only last a few seconds as our beam passes over it.

When I first started working on SETI in 1997, the group here (just Dan and Jeff at the time) was wrapping up the final analysis of SERENDIP III. Didn't find anything really interesting. Then we started collecting data for SERENDIP IV. We were starting to dig into the final analysis of that data set (about 60GB) when SETI@home came into being and derailed that, though Jeff and I have been plotting to wrap it up sometime soon (once we get the SETI@home final analysis rolling). SERENDIP IV is actually interesting, even with 11-year-old data - the analysis is hardly as deep as SETI@home's, but much wider: the frequency range is about 35 times bigger than SETI@home's. We are also doing Optical SETI, and pulsar searching... The point is, SETI@home isn't all we do, nor is our lab here at Berkeley the only SETI lab on the planet. Nevertheless we do have the biggest, bestest search going by far.

- Matt

Posted 2008-5-29 08:16:50

28 May 2008 20:04:41 UTC

People noticed there were short network "hiccups" during the course of the evening, ending this morning. All of it was quite mysterious - no database problems, no workunit storage server problems, and at first no obvious download server problems. Upon further examination I found the DNS configuration was "lopsided" towards one of the two download servers. We have load balancing software on both machines, so they were sending equal numbers of workunits, but all initial requests hit only one of the two. This hasn't been a problem before, but apparently this week's outage put enough strain on apache that every few hours the load got fairly high, log rotation would take abnormally long (several minutes), and nothing could get through during that time. We are also at our highest active user level in over a year (about 10% higher than a couple of months ago), so maybe that added to the apache/server stress level, and what we were seeing were outage "aftershocks." In any case, I fixed the DNS, so perhaps this won't be so drastic next week (and hopefully for many weeks to come).
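
The fix itself is mundane round-robin DNS - both download servers listed under the single name clients resolve, something along these lines in the zone file (the name and addresses here are placeholders, not our real ones):

    ; successive lookups of dl.example.edu now alternate between the two
    ; download servers instead of always returning the same one first
    dl    IN  A   192.0.2.10
    dl    IN  A   192.0.2.11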

Work on the NTPCkr continues - Jeff uploaded the Hipparcos Catalog to the database, so I added a star count on the science status page for the pixel we are currently observing. Of course, the more stars in a pixel the higher the score. However, there are only about 100,000 catalogued stars and 15,000,000 pixels. So odds are pretty high we are observing zero (known) stars at any given moment.
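
To put numbers on that: 100,000 catalogued stars spread over 15,000,000 pixels works out to 100,000 / 15,000,000 ≈ 0.007 known stars per pixel on average, so at any given moment the pixel under the beam is better than 99% likely to contain no catalogued star at all.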

Oh yeah the idle splitter processes - a couple were shirking their duties. I told them to stop slacking off and get back to work. Not that we needed them but it looks bad to have 'em sitting around doing nothing (in reality they were stuck on some stale trigger files).

- Matt

Posted 2008-5-30 08:01:54

29 May 2008 22:40:14 UTC

I spent the entire day so far (and will certainly continue after writing this missive) doing nothing anybody will ever care about - mostly revolving around php programming for the upcoming letter drive (more on that later). My desktop was getting funky X errors, so I decided it was due for a reboot, and then it wouldn't come up again. This new Fedora Core 9 distro apparently yum'ed in something which broke the boot loader. An hour or two spent trying to suss that out, and ultimately reinstalling the OS, and I'm back in business.

We did have a software meeting earlier - we're getting back on track with various stagnant analysis/database projects. Also discussed the Google Sky map stuff - they get their images from many different sources, so it's still unclear what epoch the coordinates are in. No simple official statements like, "Google Sky coordinates are entirely in J2000." So we're going to have this cosmetic issue where the image data on the science status page may not exactly line up with our reality (which is J2000). In any case, this is hardly a scientific issue as it doesn't affect our analysis - just what's in that neat little Google window.

- Matt

Posted 2008-6-3 08:39:41

2 Jun 2008 18:58:32 UTC

Early Sunday morning I discovered the assimilators were all failing. Immediate analysis uncovered zero smoking guns. All the assimilators were choking on the same subset of results, and all while inserting pulses. Plus the actual processes were seg-faulting before they could produce any useful error codes. Checking the failing result files and database entries showed nothing obvious (all different sizes, submitted at different times, created by different clients, etc.). I did all I could do. I told the other guys (Bob, Jeff, Eric) - Bob's checking the database now for any subtle weird behaviour (once again, I found no obvious problems yesterday) and Jeff's recompiling the assimilator code (perhaps a version that outputs useful error information). In the meantime, the assimilation queue grows, and our disk usage grows with it (as we haven't deleted anything in over a day) - sooner rather than later I'll have to stop the splitters to prevent storage disasters. I'll update this thread if we figure out what's up on that front.
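
If the recompile doesn't cough up anything useful, the other standard trick for a process that segfaults without explanation is to let it dump core and inspect the wreckage with gdb. A rough sketch (the binary name and flags here are hypothetical, not our actual invocation):

    ulimit -c unlimited        # allow core dumps in this shell
    ./sah_assimilator -d 3     # run a single assimilator by hand and wait for the crash
    gdb ./sah_assimilator core # load the core file (may be named core.<pid>)
    (gdb) bt                   # print the stack trace at the point of the segfault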

The only other real gripe right now is that our data recorder system at Arecibo is only seeing one of two data drives. Not a tragedy - we can still record data but this will put additional strain on the operators down there until we figure out why.

- Matt

Posted 2008-6-4 08:15:57

3 Jun 2008 21:46:01 UTC

Good news. The science database problems were far less severe than we thought. Short story: we ran out of space. Long story: due to a slightly confusing configuration, we thought we had run out of extents for reasons unclear. Informix categorizes all usable storage space into dbspaces, fragments, chunks, extents... maybe more things, I'm not sure. We've had problems in the past where we ran out of extents long before running out of actual disk space, and we thought that's what had happened again. The solution for that is painful - basically like rebuilding a RAID system (unload everything, recreate, and reload). Luckily we discovered we had some fragments/chunks misaligned (some fragments had more chunks than others), so all we had to do was add more chunks, and we had plenty of disk space for that. We added enough to get by for now, and will do more when we catch up from the queue draining/filling.
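
For the Informix-curious, adding a chunk to an existing dbspace is a one-liner - something like the following (the dbspace name, path, and size are made up; sizes are given in KB):

    # add a 4 GB chunk to the dbspace holding the signal tables
    onspaces -a sci_dbspace -p /informix/chunks/sci_chunk09 -o 0 -s 4194304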

We had our usual outage today (for BOINC database backup/compression, etc.). Between the usual recovery for that and the recovery for all the above it may be a bumpy ride for the next 24 hours or so.

Yesterday afternoon server "bane" (one of the two download servers) was having mounting issues which required a reboot to clean up. I was home at the time and rebooted it remotely. Of course, as with my desktop last week, a new kernel had been yum'ed in recently and messed up grub for some reason, so it wouldn't load the OS. I had to get Jeff, who was still at the lab, to boot from the emergency DVD and bring the system up on an older kernel. While bane was down half the download connections were failing, but retries were usually successful as we have the two redundant servers.
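
The eventual fix, once somebody can actually get at the machine, is just pointing grub back at a kernel that works - roughly this in /boot/grub/grub.conf (a sketch; the kernel names are placeholders):

    # entries are numbered from zero; point "default" at the older, known-good one
    default=1
    timeout=5
    title Fedora (new kernel that will not boot)
            root (hd0,0)
            kernel /vmlinuz-NEW ro root=LABEL=/
            initrd /initrd-NEW.img
    title Fedora (older kernel)
            root (hd0,0)
            kernel /vmlinuz-OLD ro root=LABEL=/
            initrd /initrd-OLD.img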

Today I got server anakin more officially racked up (actually just sitting in a rack, directly on top of a UPS) to ultimately become the new scheduler. It's a recently donated (used) dual Xeon that is actually less powerful than our current scheduler, ptolemy, but should be able to handle the job just fine. We plan on making ptolemy, with its 16 mostly unused drive bays, a network storage server to replace our ageing Network Appliance server, which fell out of service long ago and whose many drives are dying with some regularity - infrequent, but still worrisome.

- Matt

Posted 2008-6-5 07:56:17

4 Jun 2008 20:06:25 UTC

Things are continuing to clear up nicely since the science database kerfuffle earlier this week. The assimilator queue is still large, but now that everything is more or less "caught up" it's draining at a pretty good clip.

Probably nobody noticed, but for a while there this morning (actually still, as I type this sentence) we had two scheduling servers - ptolemy and anakin. I finally got anakin up and configured and made it a secondary scheduler to test it out. So once we're ready to convert ptolemy into something else, we'll have another scheduling server in our back pocket.

- Matt

Posted 2008-6-6 08:11:23

5 Jun 2008 21:24:59 UTC

Another mild day in server land. Lots of minor apache issues. There was an annoying web scrape yesterday afternoon that gummed up the works for a moment. This morning I found a bug in the web log rotation script that prevented our public web server from restarting - so it had been running non-stop for weeks, during which the httpd processes bloated in size (apparently there are small/tolerable memory leaks in php/apache/boinc code somewhere). Then later our scheduling server was suddenly unable to run the scheduler cgi. We were dropping connections, so I got alerts about this right away. I had to stop/restart apache twice, though, to get it working again. Not sure why the first restart didn't take.

Jeff's adding more star catalog data to our database. Bob worked on another alert script to better check our current database storage allocations (and prevent another minor mishap like earlier this week). Eric and I swapped drives between his hydrogen server "ewen" and ptolemy (for when the latter becomes a storage server) - ewen freaked out a little bit unexpectedly - we unmounted the filesystems before pulling the drives, but an xfs daemon woke up and thought that particular partition should still be around, etc. No big deal - just a lot of alert e-mails that were scary at first.

- Matt

Posted 2008-6-10 07:27:59

9 Jun 2008 20:52:35 UTC

Over the weekend the scheduler ceased operations on its own again. I was able to fix this remotely Saturday morning and recovery was swift. This was the same problem as earlier in the week, but this time we had a smoking gun: the CGI output log file was maxed out at 2GB in size (this is running on a 32-bit system). Cleaning out the logs solved the problem. The thing is, we've been letting these logs grow to 2GB in size for months without any issue. So why is this a problem all of a sudden? However strange, I put a log rotation script in place to prevent this from happening again any time soon. Funny side note: I would have gotten the alerts faster, but coincidentally the lab-wide mail servers conked out Saturday morning as well. Other than that, nothing much to report the past couple of days.
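
The rotation script is nothing fancy - something like a stock logrotate stanza that rotates well before the 32-bit 2GB ceiling would do it. A sketch (the path and numbers are illustrative, not our actual setup):

    # /etc/logrotate.d/sah_scheduler
    /home/boinc/projects/sah/log/scheduler.log {
        size 500M          # rotate long before the 2GB large-file limit
        rotate 4
        compress
        missingok
        notifempty
        copytruncate       # truncate in place so the running CGI keeps its file handle
    }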

Which brings us to today. Around 12:30 our server closet air conditioning unit died. Within 30 minutes all the servers warmed up by over 5 degrees Celsius and I started getting alerts. This may be a significant problem (i.e. we may need more than just a coolant refill). So depending on how fast we can get the maintenance people up here I might have to shut down parts or all of the project to prevent server burnout. Meanwhile, I have the server closet doors open to help cool things down, much to the annoyance of all the projects on this floor (the fan noise is about 20-30 decibels louder with the doors open). The poor people across the hall from the closet are being deafened - my desk is a few doors down.
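
The alerts come from temperature checks on the servers themselves; the sort of cron job involved might look like this (this is a sketch, not our actual script - the sensor path, threshold, and address are all made up):

    #!/bin/sh
    # run from cron every few minutes; complain if the closet gets too warm
    LIMIT=45000    # millidegrees C, i.e. 45 C
    TEMP=`cat /sys/class/hwmon/hwmon0/device/temp1_input 2>/dev/null`
    if [ -n "$TEMP" ] && [ "$TEMP" -gt "$LIMIT" ]; then
        echo "server closet sensor reads ${TEMP} millidegrees C" | \
            mail -s "closet temperature alert" sysadmin@example.edu
    fi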

- Matt

Posted 2008-6-11 21:18:47

10 Jun 2008 22:20:19 UTC

Normal Tuesday outage. Didn't really do anything special this time around. I did mess around with server "anakin" a bit (the presumptive replacement scheduling server) - for starters it keeps booting up in X (though the inittab says not to) and one of its drives got marked as "defunct" (the hardware RAID is rather confusing - I can't figure out how to "unfail" the drive). Both really minor issues. At least there was zero fallout from the air conditioner failure yesterday. Other than that I'm mostly working on mundane sys admin chores and catching up on some back-end diagnostic/analysis stuff.
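
(For the record, the "inittab says not to" bit is the default runlevel line - on a stock Fedora box, text console vs. X boils down to this one line in /etc/inittab:)

    # /etc/inittab - default runlevel: 3 = multi-user text console, 5 = X11
    id:3:initdefault: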

- Matt

Posted 2008-6-13 07:29:32

11 Jun 2008 21:25:25 UTC

Some general BOINC code got updated on our servers this morning, which broke a couple of things (some pages went blank, and the php "magic quotes" handling got messed up, causing all kinds of backslashes to appear everywhere). I whined to Dave and he fixed it, which is usually how these particular problems sort themselves out. The problem with the web code is that it is being completely or partially used by all kinds of BOINC projects, so a "fix" for one project may unexpectedly end up being a "bug" for another, which is why this kind of thing happens from time to time. We try to keep SETI@home as up to date with the BOINC source tree as possible, even if that means we're on the "bleeding edge." Of course this is all web code, so problems like these are cosmetic and relatively minor in the grand scheme of things. We do more thorough alpha/beta testing of the important back-end functions - you know, the ones that update millions of database records every day.
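
For those wondering what the backslash mess looks like at the code level: with PHP's magic_quotes_gpc turned on, form input arrives pre-escaped, so code that escapes it again (or that assumes it wasn't escaped) sprays backslashes into the page. The usual defensive idiom of that era - a generic sketch, not the actual BOINC fix - is:

    <?php
    // normalize input whether or not magic quotes is on, so the rest of
    // the code can escape it exactly once itself
    $comment = $_POST['comment'];
    if (get_magic_quotes_gpc()) {
        $comment = stripslashes($comment);
    }
    ?>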

Other than that, today has seen more OS installs/RAID manipulations on various donated servers that have been anxiously awaiting their call to duty (I got past the issues I was having yesterday). Slowly but surely we'll get these up and running. I also got a bunch of data drives from Arecibo - it's been a while since we got a batch of fresh data up here, so I'm now lost in data pipeline management mode.

- Matt

Posted 2008-6-18 18:41:47

17 Jun 2008 20:44:23 UTC

Ho hum weekend, which is good. The air conditioning people came up yesterday (Monday) and today to do a follow-up inspection of our server closet system (which failed last week) and found a couple more leaks, which have been repaired. We seem to really be pushing it beyond its limits. Had the usual database outage today. No big whoop there.

Somebody noted earlier that their results were getting validated surprisingly quickly. We didn't change anything. This may have been due to a longer-than-usual run of fast workunits this past weekend - the average turnaround time was roughly 10 hours (about 20%) shorter than normal, meaning pairs were getting matched up that much faster.

A lot of what's been going on the past couple of days has been post-vacation catchup (half the staff was out of town). While I have a zillion other things to do, I discovered a couple of ways to optimize the NTPCkr, so I coded those up and I'm testing them now. Every little speedup on this front helps. Jeff's still working on the scoring part. We're getting there...

- Matt

Posted 2008-6-19 09:19:07

18 Jun 2008 23:16:03 UTC

The assimilator queue grew again. The main culprit this time was the NTPCkr - from here on out I'll simply refer to it as the nitpicker. As a reminder, this is the program that is pretty much the culmination of all our SETI@home data collection and analysis, i.e. it's the thing that'll find the aliens if they exist. All other analyses so far using SETI@home data were cursory by comparison.

Anyway... every so often we're finding that we have "deep" pixels containing tens of thousands of multiplets, each containing thousands of signals. When my "science status page updater" hits one of these it hangs for quite a long time, causing a heavy CPU load on the database server as it tries to wade through this flood of signals gathering statistics. My optimizations (mentioned earlier in the week) helped, but not enough. We may devise/implement more. In any case, the heavy nitpicker load made the assimilators slow down. We killed those particular processes and I think we're catching up again. Slowly.

So the donation processing suite had been choked for a couple of weeks and nobody noticed. This was caused by a suddenly (and silently) more stringent firewall, and masked by several things. We've been getting the donations, just no confirmations. So there are quite a few missing green stars, I imagine. Not exactly sure what to do about that just yet.

- Matt

Posted 2008-6-20 08:55:04

19 Jun 2008 19:41:22 UTC

We're still maintaining an assimilator queue, but it is indeed draining over time. Besides the nitpicker CPU consumption issues addressed yesterday, we're also doing several data transfers down to HPSS (our off-site storage) including a large science database backup, as well as several raw data files (we keep copies of all raw data down there). All these things - the backups, the raw data storage, the nitpicker, and the assimilation of new results - run on thumper (because that's where all the data are). So there's basic I/O contention at the moment.

Other than that I have nothing to report - I've been mostly occupied by bureaucratic/policy tasks for the past while. I was also annoyed to find somebody threw away my plastic fork, which I admit has been sitting used and unwashed on my desk for days, but nevertheless I came to work expecting to eat my lunch with it. The lab kitchen is oddly devoid of utensils. I did find a pile of aged wooden coffee stirrers, out of which I fashioned a pair of makeshift chopsticks.

There's a halo around the sun at the moment. Cool.

- Matt