Thread starter: BiscuiT

SETI@home 2008 Technical News

OP | Posted 2008-6-24 08:52:52

23 Jun 2008 22:22:22 UTC

Another weekend without much ado. Our assimilator queue is low but not exactly pegged at zero. What's causing it to not run as fast as all the other backend processes? Not entirely sure, but we know of several things that happen from time to time which may be the problem (i.e. cause extra load on the science database), or at least aggravate the problem. But for now, it's not even close to a tragedy, so we're just keeping our eye on it.

I guess we did have a disk failure on thumper (the master science database server), or at least a disk complaint. It didn't cause any downtime or data loss, but it's getting us to reconsider our current stance on software vs. hardware RAID. We've been sticking with software RAID due to ease of use and quickness of warning, but we're finding it sometimes doesn't behave the exact way we expect, or sometimes not in the best way. So this event inspired some additional R&D on that front.

I just rebooted the main web server, so that was offline for a couple minutes. No big deal - just some mounting issues that needed to be cleared out.

- Matt

OP | Posted 2008-6-25 08:26:58

24 Jun 2008 21:50:01 UTC

Had the usual outage today. No news there, and we're recovering normally at the moment.

Continuing along the hardware vs. software RAID theme, we have vast experience getting bitten by both - in the early days of SETI@home we got burned by hardware RAID, hence our current general affinity towards software. However, today Jeff and I got over the (very small) hump of learning how to query the recently donated IBM Xseries on-board RAID from within linux and decided that we're going to learn to enjoy living with a zillion different kinds of RAID, each employed based on current needs and resources.

Tomorrow we're going to attempt converting our scheduler to the new-used system "anakin" so we can then convert the current scheduler (ptolemy) into a NAS box (to ultimately replace the NAS taking up one third of our server closet). Expect funky DNS rollout issues.

- Matt

OP | Posted 2008-6-26 08:09:36

25 Jun 2008 22:23:54 UTC

This morning we turned off the scheduling server on ptolemy and started it up on anakin. This basically worked right out of the box. Pretty quickly we determined the lower traffic rates were due to DNS rollout. Despite having the TTL (time to live) on the download name (boinc2.ssl.berkeley.edu) set to 5 minutes, it sometimes takes weeks to fully convince the world the change has been made. This is due to various types of DNS caching I still don't fully understand (why don't they all obey the TTL?). Stopping/restarting the BOINC client sometimes resolves this.
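For the curious, a quick way to see what a given resolver actually believes is to ask it directly and look at the remaining TTL on the answer - a cache that keeps reporting hours or days left is ignoring our 5 minute setting. A minimal Python sketch, assuming the third-party dnspython package (2.0+; older versions spell the call dns.resolver.query):

import dns.resolver

# Ask the local resolver for the A record and inspect its cached TTL.
answer = dns.resolver.resolve("boinc2.ssl.berkeley.edu", "A")
print(answer.rrset.ttl)        # seconds of life left on this cached answer
for record in answer:
    print(record.address)      # the IP(s) this resolver currently believes in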

However, after an hour or so I decided to play nice and turn ptolemy back on, set up so that apache forwards all lagging scheduling requests over to anakin with a "permanently moved" notice. I guess I should have done this from the get-go, but better late than never. Immediately this seemed to help, but only the uploads. Download traffic still remained under some rather low ceiling.

So I checked the two redundant download servers (bane and vader). Turns out bane wasn't serving any download requests. Was it even getting any? That part is a total mystery - nothing changed in any configurations pertaining to these servers. I double checked the DNS updates. No smoking guns there, either. Well, bane has had weird DNS/mounting/apache problems before that a quick reboot cleared up, so after rebooting it seemed to be "better" but not by much. Instead of 0 requests per second before the reboot, it started serving 2 or 3 - vader is serving around 10. What's the deal, then? Perhaps this has to do with our "pound" load balancing utility recognizing bane was having trouble (strangely coincident with but unrelated to the anakin switch) and favoring vader until bane gets better. I filed this under "unrelated and currently harmless problem."

Anyway.. I then noticed (in between doing other tasks, hence the lag) the upload traffic was increasing way beyond expectations. I assumed everything was okay as all the apache logs were reporting no errors, but indeed the requests forwarded from ptolemy to anakin were failing. Why? Because the http headers were missing variables, including the all-important "Content-Length." Why?!! This I have no idea, but apparently somewhere between apache (and/or the BOINC client) the redirected traffic ends up with different and less informative http headers. And so the schedulers on anakin were saying, "I don't know what you want - try again in 10 seconds." This got worse and worse as more clients wrapped up their current workunits and tried to connect.

The solution to all that was to *not* do apache redirects (both 301 and 302 redirects had the same effect) but to use good ol' pound to simply shovel ptolemy's packets towards anakin. This helped all our DNS-lagging clients to finally connect again, but won't help to inform them that the scheduling server has indeed changed. Hopefully the clients will learn on their own in the coming days. We plan to turn off ptolemy outright early next week.

Nitpicker progress has been slowed by database programming issues. Informix has undocumented limits on user-defined lists in certain contexts. We may have to work around all that using something other than lists. Jeff's been banging on this and other similar programming hurdles for a while, hence the lack of recent info. Plus we have yet to sit down and discuss candidate scoring algorithms which will only happen if we can manage to get the four parties involved (Dan, Eric, Jeff, and me) in the same room at the same time without greater problems hanging over our heads. This hasn't happened in, well, months. At least glacial speeds are non-zero speeds.

- Matt

OP | Posted 2008-6-27 08:14:19

26 Jun 2008 21:07:44 UTC

The new scheduler continues to handle its new duties just fine. Slowly but surely people are moving their connections over to this new server, but I'm not convinced the change rate is fast enough to do a wholesale cutover by next week. We shall see.

Funny aside: while getting new-ish donated server "clarke" up yesterday I was annoyed to find that Fedora Core 9 was booting to run level 5 (where it loads the X windowing environment). We don't need X on these servers, so we typically set our servers to boot to run level 3 via a change in /etc/inittab. In doing so, I'd comment out the old line with a "#" and enter in a new line with the adjusted run level. It was still booting up in X. Why? Turns out the latest inittab parser (new with FC9, I guess) ignores "#" comments in inittab, and just looks for lines containing the string "initdefault" and parses the first one it finds. Since I left the old line in there commented out (or so I thought) it was superseding the line I wanted. So much for standards (and clear documentation stating when/how standards change).
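For the curious, here's a toy Python sketch of the gotcha - a parser that just hunts for the string "initdefault" and takes the first hit will happily grab a "#"-commented line (this is my guess at the behavior, not the actual FC9 code):

# The old line "commented out" with '#', followed by the line I wanted.
INITTAB = """\
#id:5:initdefault:
id:3:initdefault:
"""

def naive_default_runlevel(text):
    for line in text.splitlines():
        if "initdefault" in line:       # never checks for a leading '#'
            return line.split(":")[1]   # the runlevel field

print(naive_default_runlevel(INITTAB))  # prints 5, not the intended 3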

Nitpicker weirdness: While finally getting around to testing the few optimizations I made to Jeff's code I found that multiple runs of the nitpicker on the same pixel were producing slightly different results each time. We believe this is due to the order in which the database pulls out rows - unless requested otherwise, databases generally pull things out in whatever order requires the least I/O at that exact point in time (mostly due to page caching or where the many drive arms are currently located in our RAID set). Sorting query output adds significant (and usually unnecessary) overhead. But there are a lot of "fuzzy compares" in the nitpicker (due to floating point computations on different chips you can't expect decimal values to be "exactly exact"). When two items are close enough to be called "duplicates" you only need one, but which one you pick may cause different results down the road. So Jeff is elbow deep in this problem right now.
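To illustrate (a toy Python version, not the actual nitpicker code): with tolerance-based duplicate detection, whichever row the database happens to hand over first is the one that survives, so row order changes the output.

import math

def is_dup(a, b, rel_tol=1e-6):
    # "Fuzzy compare": values computed on different chips are never exactly
    # equal, so call anything within a small tolerance the same signal.
    return math.isclose(a, b, rel_tol=rel_tol)

def dedup(values):
    kept = []
    for v in values:                    # whichever row arrives first wins
        if not any(is_dup(v, k) for k in kept):
            kept.append(v)
    return kept

# Same rows, different order out of the database -> a different survivor.
print(dedup([1.0000001, 1.0000002, 2.0]))   # [1.0000001, 2.0]
print(dedup([1.0000002, 1.0000001, 2.0]))   # [1.0000002, 2.0]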

Apropos of nothing, the entire northern half of the state of California is on fire. The smoke ending up here in the Bay Area is intense. I feel like I'm smoking a couple packs a day just walking around outside. I can smell it sitting here at my desk.

- Matt

OP | Posted 2008-7-1 14:16:13

30 Jun 2008 21:58:57 UTC

A rather static weekend, which is always welcome. This morning I found that, despite DNS changes made several days ago, many clients are still connecting to the old scheduling server. I find this particularly frustrating as there is no legitimate reason for anything to be caching bogus domain information for more than 5 days, especially if said domain had a 5 minute time to live. We need to get to work on this server, so I opened up a currently unused port on one of our non-public servers and gave it the old scheduler IP address to forward along to the new address, thereby acting as a "detour" so we can get to work. Hopefully over time clients will get wind of the correct IP address so we can turn off this detour as well.
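Conceptually the detour is nothing fancier than a dumb relay: listen wherever the stragglers still connect and shovel their bytes through to the real scheduler. Our version is plumbed at the network level, but here's a toy Python sketch of the idea (addresses and port are made up):

import socket
import threading

LISTEN = ("0.0.0.0", 8080)      # hypothetical port the stragglers still hit
TARGET = ("10.0.0.42", 80)      # hypothetical address of the new scheduler

def pipe(src, dst):
    # Shovel bytes one way until either side closes the connection.
    try:
        while True:
            data = src.recv(65536)
            if not data:
                break
            dst.sendall(data)
    finally:
        dst.close()

def main():
    srv = socket.socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(LISTEN)
    srv.listen()
    while True:
        client, _ = srv.accept()
        upstream = socket.create_connection(TARGET)
        threading.Thread(target=pipe, args=(client, upstream), daemon=True).start()
        threading.Thread(target=pipe, args=(upstream, client), daemon=True).start()

if __name__ == "__main__":
    main()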

Eric's back in town. Overheard him and Jeff talking a bit about current nitpicker/database programming woes. Seems like an effective new strategy is being enacted. Other than that, no real news to report and nothing but chores and meetings all day today for me, pretty much.

- Matt

OP | Posted 2008-7-2 08:46:03

1 Jul 2008 22:09:19 UTC

Today's Tuesday, which means we went through the usual database cleanup/backup outage. That went smoothly. As I may have already noted before, the replica mysql server has been regularly failing when actually writing the dump to disk. Our suspicion was that this server was having difficulty reaching the NAS via NFS - and mysql has been ultra-sensitive to any NFS issues. The master server doesn't have this problem, but maybe that's because it's attached to the NAS via a single switch (as opposed to the replica, which is going through at least three switches). Anyway.. we dumped the replica database locally and it worked fine. Our theory was strengthened, though not 100% confirmed.

While the project was down we plucked an old (and pretty much unused) serial console server out of the closet. That saves us an IP address (we get charged per IP address per month as part of university overhead - which is another reason I try to keep our server pool lean and trim). I also cleaned up our current Hurricane Electric network IP address inventory, and in the process found and cleaned up some old, dead entries in the DNS maps. Not sure if this is what has been causing the lingering scheduler-connection problems. We shall see.

As noted in the previous tech news thread, the science status page has been continually showing ALFA (the receiver from which we currently collect data) as "not running" for a while now. This was lost in the noise as ALFA actually hasn't been running much recently, but it still should have been shown as "running" every so often as data trickles in here and there. Looking back at the logs there has been a problem for some time now. We get the telescope-specific data (pointing information, what receivers are on, etc.) every few seconds as it is broadcast to all the projects around the observatory. Perhaps the timing/format of these broadcasts has changed? In any case, I'm finding our script that reads these broadcasts is occasionally missing information, so I made it more insistent. We'll see if that helps.
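"More insistent" here just means retrying: instead of trusting a single empty read of the broadcast, try a few times before giving up and calling the receiver state unknown. A rough Python sketch of the pattern (read_broadcast is a stand-in, not the real script):

import time

def read_with_retries(read_broadcast, attempts=5, delay=2.0):
    # Retry a flaky read of the observatory status broadcast a few times
    # rather than treating one missed packet as "ALFA not running."
    for _ in range(attempts):
        status = read_broadcast()
        if status:                  # got a usable status packet
            return status
        time.sleep(delay)
    return None                     # still nothing - report unknown, not off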

- Matt

OP | Posted 2008-7-3 12:00:43

2 Jul 2008 22:29:10 UTC

Working on ptolemy's conversion into a NAS box today, with the focus on putting bigger drives in it and testing out its onboard RAID controllers. We're finding the hardware RAID to be a bit outdated and not exactly everything we want. For example, it has a 2TB logical drive size limit, and we can't create logical drives using more than half the physical drives (they are split over two separate controllers). I guess we can deal.

Some web/user interfaces got broken over the past 24 hours. First, the credit certificates. Incomplete updates were made which were confusing. Dave cleaned that up. Second, the "special user" tags got reset by accident - this also got cleaned up, but in the process we temporarily gave some users extra powers (the mysql table dumps were comma delimited, so forum signatures containing commas offset the values, blah blah blah).
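The comma-in-signature problem is the classic argument for letting a real CSV parser handle quoting instead of splitting on commas. A tiny Python illustration (made-up row, not our actual schema):

import csv
import io

row = '42,"moderator","Crunching since 1999, proud of it"'

# Naive split: the comma inside the signature shifts every later field.
print(row.split(","))
# ['42', '"moderator"', '"Crunching since 1999', ' proud of it"']

# A csv reader honors the quoting, so the fields stay lined up.
print(next(csv.reader(io.StringIO(row))))
# ['42', 'moderator', 'Crunching since 1999, proud of it']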

Regarding the "ALFA running" bit on the science status page - I think I fixed this, but we haven't collected ALFA data since, and won't for a while, so I don't have truly positive confirmation yet. Not a big crisis either way, though I hope we get more ALFA time soon.

- Matt

OP | Posted 2008-7-4 09:52:21

3 Jul 2008 21:11:53 UTC

Crazy day getting ready for the long July 4th weekend. There was more testing on ptolemy with more depressing results (why isn't it picking up the hot spare when I pulled a drive out from an active array?!). I actually yanked the whole server out of the closet (which required me temporarily shutting down one of the download servers which was physically in the way - but nobody seemed to notice much). We opened it up and found the RAID is indeed on cards and not the motherboard, which is good as this means if we can't get this to ultimately work we can get some 3ware cards (or some such) instead.

Meanwhile, with ptolemy pretty much gone we've been having mounting problems with servers still requesting its disks. No matter how hard you try there's always some dependencies that hide until too late. So it's been a morning full of killing automounter processes, cleaning up stale mounts, deleting bogus trigger files, restarting services, etc. This was mostly hidden from the public - except for several status pages being out of whack. Actually the assimilators all froze but this was hidden behind the stale server status page. Now the queue is pretty large, but it should drain out just fine.

Eric and Jeff are still getting to the bottom of the database/esql interface woes, doing some extreme programming over by Jeff's desk. Converting lists with cryptic, undocumented size limits to blobs. One of the last major hurdles for the first rev of the nitpicker. Then it's doing all the scoring algorithms, which we'll discuss next week.

- Matt

OP | Posted 2008-7-8 08:37:54

7 Jul 2008 22:23:11 UTC

Rather dull holiday weekend except for the fact I was up in Oregon and remotely dealing with several server issues hidden from the public - nothing really newsworthy. Various previously mentioned projects are continuing along: I'm installing an OS on ptolemy in the hopes we can flash upgrade the current RAID cards' software and see if that helps, otherwise we're buying new cards that we *know* work. I might do a bit of physical server shuffling during the weekly outage tomorrow - get some of the newer stuff into the closet - maybe.

Looks like the big "scoring meeting" is also tomorrow, where we will try to settle on our candidate scoring algorithms. Basically we need to pool together our scoring techniques from previous reobservation runs and apply them to the nitpicker which, unlike all prior data analysis, runs and updates in real time as signals flow in. It was easier before, at least in the candidate analysis I've done. You'd turn the crank, look at the results, adjust some variables and turn the crank again. Not so easy to be that casual and change algorithms when the crank is turning 24/7 and a million signals are added every day.

Oh yeah - back to the "ALFA running" problem on the science status page. Turns out we need to recompile our program that peeks at the observatory status broadcasts for our own status pages. This hasn't been recompiled in ages, and much has changed in the meantime. An added complication is that this runs on a Solaris machine down in Puerto Rico, making recompiling old, stale code a challenge. Jeff is tackling that.

- Matt

OP | Posted 2008-7-9 07:56:45

8 Jul 2008 23:19:59 UTC

Weekly outage day (to compress/backup the BOINC database). It lasted a little longer than usual due to some confusion - unbeknownst to me a recent web code update was made that broke the "stop_web" mechanism which keeps the database quiescent during the outage. It's also taking a long time to recover. Not sure why, but we'll see if the clog pushes through. I took advantage of the outage to move server anakin into the closet. We also upgraded the RAID card BIOS to see if that fixes our minor issues with ptolemy's current hardware RAID setup. Well, its logical volume initialization is still way too slow, but maybe we'll live with that if all future resyncs are fast.

Just wrapped up the scoring meeting I mentioned yesterday. The bottom line being: our current scoring algorithms for individual signals (spikes, gaussians, pulses, triplets) are sound, the multiplet scores (interesting groups of signals of a single type) are 99.9% sound, and metacandidate scores (of single sky pixels containing "candidates" like individual signals, multiplets, or stuff observed from previous SETI projects, as well as interesting celestial objects) are still way up for debate as this is where individual philosophies differ, but we'll probably just go with the easiest solution (multiply all the candidate probabilities together) and see what that list looks like. Jeff will write all this up. Maybe we'll even have a science newsletter.
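For what it's worth, the "easiest solution" reads as a straight product of the per-candidate probabilities for a pixel; summing logs keeps that from underflowing once a pixel accumulates lots of candidates. A Python sketch of that reading (not the algorithm we've settled on):

import math

def metacandidate_score(probabilities):
    # Combine the per-candidate probabilities for one sky pixel by
    # multiplying them, done in log space to avoid underflow.
    return math.exp(sum(math.log(p) for p in probabilities))

print(metacandidate_score([0.9, 0.8, 0.95]))   # 0.684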

Jeez... still having a hard time recovering...

- Matt

OP | Posted 2008-7-15 09:20:09

14 Jul 2008 23:07:47 UTC

So the second half of last week was spent trying to figure out why our database server was so painfully slow. Bob, Jeff, Eric, and I were scratching our heads, trying this and that to diagnose and fix this mysterious problem. Everything was fine before the Tuesday outage, nothing changed during the outage, but upon restarting the project we couldn't handle very much load.

We were quick to blame mysql, as it has had random episodes in the past of secretive bookkeeping causing us grief. We ruled this out. We started blaming the "credited job" table which is growing infinitely. This is the table keeping track of which user did which workunit. We do nothing but insert into this table (no random access selects), so why would that be a problem? Nevertheless we turned off inserts (back to writing similar info to flat files for later parsing) to no avail.

Maybe it was hardware? Did a disk fail? Is a disk about to fail? We ruled all that out as well, which brought the focus back onto mysql, with its dozens of server tunables that we've tweaked for various reasons over the years. Did we go too far with some of those variables? We convinced ourselves that wasn't it.

Of course in hindsight the ultimate solution seems obvious: the filesystem where all the data is kept. Just because the hardware seems okay, and I/O rates are normal, doesn't mean the filesystem is happy. And the focus was back on "credited job," as this table is constantly growing and therefore a big ol' file - much bigger than anything else. A file that is constantly growing during all the other inserts and updates that happen as the project is running will likely become interleaved and fragmented to the nth degree. Without fearing data loss we dropped the credited job indexes, and that alone broke the dam. Well, jeez.

We're still catching up from the backlog, but mysql is performing incredibly well at this point. This is good, as we're hoping to release Astropulse before the end of the week. More on that later.

Happy Bastille Day, by the way.

- Matt

OP | Posted 2008-7-16 08:12:39

15 Jul 2008 22:42:09 UTC

Had the typical weekly outage today - the results of which were much happier than last week. We were also hoping to fsck the mysql data drive that gave us grief last week to make sure it's okay, but the outage was taking too long so we'll do that later. We did fire off our weekly science database backup, which quickly failed upon finding a corrupt page or two. This happens from time to time - and it turns out this particular corruption is within an index that we can easily drop and recreate if the usual data-cleanup utility doesn't work. Also, science database replication broke at some recent point, probably because the primary database catching up on backlogged inserts caused some kind of handshake timeout. No big deal - replication is catching up now.

The campus network graphs are all down - they're how we confirm what our current bandwidth usage is. I hope this will get fixed soon. I feel like a doctor without a stethoscope.

- Matt

OP | Posted 2008-7-22 09:25:58

21 Jul 2008 18:49:42 UTC

I was out of the lab since last Wednesday hence the dearth of tech news reports. Though not all that much to report. We had a couple of the usual/typical blips that required minor maintenance, most notably the db_purge process (the thing that keeps the result/workunit tables trim by actually deleting database rows from the BOINC database once the scientific data has been inserted into the science database) - this process hung for some unknown reason and the BOINC db grew great in size. A simple restart fixed that.

As for that index corruption in the science database I mentioned last week, that index was rebuilt just fine, but only after we took one drive in the particular RAID holding these indexes offline - smartd was reporting a lot of errors, so we think that drive was the culprit of the corruption. We'll try to replace it sooner or later (the system is now down to only 47 out of 48 500GB drives).

I haven't fully caught up yet from being gone but I imagine there will be some AstroPulse ramping up to report sooner or later. I see scheduler updates have been made (and I think put into beta). I'll meet with Jeff/Eric later and discuss.

Looks like there will be a campus network outage that affects us this upcoming Wednesday morning - it will last about a half hour, starting at 6:30am (Pacific Time). A couple router upgrades from what I can tell.

- Matt

OP | Posted 2008-7-23 07:46:49

22 Jul 2008 21:16:02 UTC

Yesterday afternoon we installed a new scheduler which included some updates necessary for the upcoming Astropulse rollout. However, our network performance took an immediate hit. After about 10 minutes trying to figure out what was causing this, Jeff and I realized our scheduler switch perfectly coincided with several expensive credit-analysis queries Eric was running, also in regards to the Astropulse rollout. So it wasn't the scheduler - just the database getting overloaded. That got cleared up quickly.

Last night I noticed people complaining about Mac computers being denied work. This is still an issue, probably with the new scheduler implementation, and we'll address it shortly.

We had the regular weekly outage today, during which I tackled some extra things. First off, due to continuing mysql database performance issues we completely dropped the credited_job table (before, we had just dropped the indexes). Reminder: this is the table that connects user ids in the mysql database to result ids in the science database, so we know who did what. This is also the only table in the mysql database that grows without bounds, and therefore has been the cause of many headaches as of late. Don't worry - we have all this data archived in three formats in three different locations, and will continue to collect this data in flat file format. I also checked the integrity of the database filesystem now that it was cleaner. No problems there. I started up the projects and mysql is currently handling well over 2000 queries/sec without breaking a sweat.

- Matt

OP | Posted 2008-7-25 20:04:42

24 Jul 2008 21:35:24 UTC

Astropulse release progress has been slowed by various things. Some necessary updates were made to the generic BOINC scheduler, which we then employed on Monday. After that we found several weird problems, including computers being refused work because their hardware was wrongly deemed inappropriate. At first this seemed like a "Mac only" problem, but as far as I could tell some Macs were still able to get work. In any case, we ultimately fell back to the "old" scheduler this morning. This improved things according to some rough, immediate analysis. The complete set of scheduler problems, their causes, and their solutions is still unclear. We'll chip away at that as Dave works his way through a large e-mail backlog.

Yesterday Dave, Jeff, and I had a "work stoppage" and went for a hardcore hike in the Desolation Wilderness (near Lake Tahoe) - something we've been talking about doing for way too long, as we are all avid hikers. We were joined by my wife and Daniel, a visiting BOINC developer from Spain. Since this is technical news, the technical details are thus: We took the Twin Bridges trailhead (at 6200') up to and beyond Horsetail Falls. This included some surprisingly dangerous boulder scrambling which sapped more energy than originally expected. Our plan to bag Ralston Peak (9200') was reduced to basic exploration up to (and ultimately into) Lake of the Woods (over 8000'). The boulder scrambling downward was even worse, but all knees/ankles survived intact. All told, about 7-8 miles of hiking/scrambling, almost 2000 vertical feet gained and lost, taking about 8 hours including lengthy breaks. I felt poorly acclimated, even though I easily conquered a similar hike in Yosemite (up to the top of Nevada Falls and back) six days earlier. Dave was acclimated but started the hike a bit exhausted as he did about 800 feet of rock climbing in upper Yosemite the previous day.

- Matt
