Thread starter: BiscuiT

SETI@home 2008 Technical News

OP | Posted 2008-10-29 08:46:41

28 Oct 2008 22:59:27 UTC

Today's outage took a little longer than usual. This had mostly to do with the replica mysql database needing to be reloaded from scratch (since it fell behind over the weekend and would take days to catch up otherwise). Plus there was some more index manipulation, en route to a (slightly) more streamlined mysql database. I also replaced the drive that failed on bambi a week ago. So you can stop worrying about that.

Jeff and I spent way too much time fighting with our current raw data pipeline. We get SATA drives up from Arecibo full of data. What happens to this data is a matter of priorities. Do we need to send empty drives back down to Arecibo as soon as possible? Is the splitter data queue low? Is the raw data storage full? Etc. etc. So at any given time we've been either (a) sending data to our offsite archival storage, (b) moving data over to the raw data storage, or (c) both of the above. We're not here 24/7, so to ensure continual data flow we have external SATA drive enclosures on a couple of systems.
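For the curious, the decision logic boils down to a priority check roughly like the following - a minimal sketch, where the paths, thresholds, and file naming are my own illustration, not our actual scripts:

    #!/bin/bash
    # Hypothetical raw-data routing check; paths and thresholds are illustrative.
    RAW_STORE=/disks/raw_data      # where the splitters read from
    QUEUE_LOW=10                   # panic threshold: files left to split

    files_queued=$(ls "$RAW_STORE"/*.dat 2>/dev/null | wc -l)
    pct_used=$(df -P "$RAW_STORE" | awk 'NR==2 {gsub(/%/,""); print $5}')

    if [ "$files_queued" -lt "$QUEUE_LOW" ] && [ "$pct_used" -lt 90 ]; then
        echo "splitter queue low: copy the new drive to $RAW_STORE first"
    else
        echo "queue healthy: archive the drive offsite, then wipe it and ship it back"
    fi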

However, due to various annoying mechanical/form factor reasons, very few of our systems can host these enclosures. Also, the drives should be swappable (otherwise what's the point?), but we're finding that very frequently a drive is pulled, another is put in to be read, and the OS can't see the new drive until we reboot the system. This has been a problem both with the enclosure directly connected to a SATA card and via a SATA to USB converter. We're trying to automate this whole process, but with the drives/enclosures constantly disappearing for no good reason we're up against a wall on this.
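(For those wondering what we try before resorting to a reboot - a sketch, assuming the enclosure hangs off host1; check /sys/class/scsi_host/ for the real host number on your system:)

    # ask the SCSI/SATA layer to rescan for newly inserted drives
    echo "- - -" > /sys/class/scsi_host/host1/scan
    # on the USB-converter path, reloading the driver sometimes helps
    modprobe -r usb-storage && modprobe usb-storage
    # then see whether the new drive finally showed up
    dmesg | tail -20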

- Matt

OP | Posted 2008-10-30 11:05:44

29 Oct 2008 22:55:54 UTC

Well we haven't really gotten completely around the general problems with our raw data drives being unreadable via our tangled web of SATA enclosures and USB converters, etc. However I did find one thing this morning which helped. Turns out one enclosure just simply stopped working. Long story short, upon very careful inspection I found one of the drive bays had a tiny tiny piece of pink fluff wedged in the SATA power plug. The fluff was from our shipping containers to/from Arecibo. Bits of it get torn off from regular use, and it looks like some got stuck on a drive, which then got wedged into the power plug upon insertion into the enclosure. I dug it out, replaced the drives, and they were visible again. At least for now. I do appreciate the "modprobe" suggestion in the last thread, which may help other similar issues.

Jeff and I were discussing a lot of stuff today, focused mainly on future planning and needs, i.e. what are our current bottlenecks, how do we fix them, and then what will our new bottlenecks be? We're resurrecting conversation with campus, possibly to have them research the current cost/feasibility of increasing our bandwidth. We're also internally discussing needs regarding a potential move towards less redundancy - which will pretty much double our load if we decide to keep up with demand, assuming we even can. As well we were scratching our heads about these semi-regular bandwidth spikes that max out our current bandwidth and wreak general havoc for an hour or so at a time.

As for that last item, I found an important clue today. The assimilator code has a memory leak - it's had the leak for years now, but it's usually not a problem. It eventually reaches a limit, fails, then restarts within a few minutes. Today I found the assimilators have been dying quite often recently, and their failures are perfectly in tandem with the upward bumps we see in upload traffic. No surprise, as the assimilators and uploads happen on the same machine (bruno) - so if bloated, resource-consuming assimilators suddenly disappear from the process queue, more resources are suddenly given to uploads.

The story goes on from there, but I have to get back to work and will leave the conclusion until tomorrow. You see, I put in an "assimilator killer" cron job today that runs every two hours, to restart the assimilators regularly and prevent them from bloating too much. I think observing the effects of that over the next 24 hours will inform what I think about other network problems we've been having...
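(The cron job itself is nothing fancy - something along these lines, with the script path being hypothetical:)

    # /etc/crontab: bounce the assimilators every two hours so they
    # never bloat enough to starve the uploads on bruno
    0 */2 * * * boincadm /home/boincadm/bin/restart_assimilators.sh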

- Matt

OP | Posted 2008-10-31 19:33:22

30 Oct 2008 23:00:44 UTC

Okay. So the assimilator memory leak wasn't a problem so much as an effect. Its consumption of resources still needs to be addressed, but it was only affecting itself, and being aggravated by the other problems around it.

Poring through logs I confirmed that the network bursts were indeed due to Astropulse downloads - during the "baseline" 2 out of 100 workunit downloads are Astropulse, but during the "burst" 40 out of 100 are Astropulse. The Astropulse workunits are much larger in size than SETI@home workunits, hence the bandwidth consumption. I also confirmed it wasn't a single (or few) clients hitting us at once - connections were randomly distributed over many IP addresses.
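(Confirming that ratio is a one-liner against the apache access logs - a sketch, where the "ap_" filename convention for spotting Astropulse workunits is my illustrative assumption:)

    # rough Astropulse vs. SETI@home download ratio from an access log
    awk '/GET \/download/ { if ($7 ~ /ap_/) ap++; else sah++ }
         END { printf "astropulse: %d  seti: %d  (%.0f%% ap)\n", ap, sah, 100*ap/(ap+sah) }' access_log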

It finally dawned on me, and like most things it's now painfully obvious in hindsight. The SETI@home and Astropulse splitters have separate high water marks. For SETI@home, if we get above 50000 results ready to send, we temporarily halt splitting. For Astropulse, it is still set pretty low at 2500. Every so often a splitter process checks the size of the queue to see if it should stop. Since there are many SETI@home splitters running at a time, and there is always a delay in transitioning state, thousands of workunits may be generated before the splitters actually realize they are above the high water mark. And then they go to sleep for a while - like an hour or so - until the queue drains enough and they wake up again and get back to work.
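(In spirit, each splitter's periodic check amounts to something like this - a sketch; the database name is illustrative, and server_state=2 is, as I recall, BOINC's "unsent, ready to send" state:)

    # what a splitter does every so often, roughly
    READY=$(mysql -N -e "SELECT COUNT(*) FROM result WHERE server_state=2" seti_boinc)
    if [ "$READY" -gt 50000 ]; then
        sleep 3600    # over the high water mark: nap for an hour or so
    fi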

The thing is, during SETI@home's "sleep until we're needed again" phase the Astropulse splitters continue to run, since they haven't reached their high water mark even though it's much lower - those splitters are fewer in number and run slower. Now remember that when workunits are created, the transitioners also create respective results to "send." New results are id'ed serially - i.e. they are tagged with a number in the database which increases automatically. So during these periods you'll get an area in database id space rich in Astropulse results.

Moving on to the feeder. Since it's stupid regarding application types, it fills its own send queue with the oldest results ready to send regardless of application, and the way mysql works this tends to mean in database id order. Of course with the ready-to-send queue at 50000 or so, we have to send out 50000 results before we finally see the effects of what happened above - many hours, usually. Then suddenly - bam! - 20 times more Astropulse workunits than normal. That arbitrary time delay really confused matters.

Anyway, one easy solution is to make the feeder smarter. It does have an "-allapps" flag to send to all applications equally. We were hesitant to use this before for fear it would give too many shared memory slots to Astropulse - and it may very well cause periods of low work during peak demand, since the feeder now has half the memory for SETI@home workunits that it did before. Nevertheless we turned this on today and it had an immediate, positive smoothing effect. Sweet.
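(For those running their own BOINC projects, the change amounts to a one-word edit to the feeder's daemon line - a sketch; the "-d 3" debug level is just an example setting:)

    # in the project's config.xml, change the feeder daemon entry from
    #   <cmd>feeder -d 3</cmd>
    # to
    #   <cmd>feeder -d 3 -allapps</cmd>
    # then restart the project daemons to pick it up:
    bin/stop && bin/start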

Other than that today... some data pipeline scripting, and continuing discussions amongst the gang regarding changing redundancy to zero - trying to wrap our brains around all the current bottlenecks and what will suffer depending on what we do. As it stands now, our servers most likely will not be able to support reducing redundancy all the way to zero *and* keeping up with current workunit demand. So we have to either improve our server i/o or figure out what other knobs to turn.

- Matt

OP | Posted 2008-11-4 13:10:05

3 Nov 2008 23:46:19 UTC

Yeesh - another rocky weekend, but nothing out of the ordinary. One download server got a headache, the scheduler process felt sick for a while, the workunit storage filled up again thus blocking the splitters... At least we don't have those Astropulse download spikes anymore, but we're still at a loss to exactly explain why bruno is so overloaded - and therefore why the queues can't seem to drain as fast as they used to. Anecdotal evidence suggests the mysql database may seem fine on the surface but is about to collapse at any second, and all those extra milliseconds it takes to respond are causing bruno's processes to get all gummed up. In any case I put some effort into moving as many of these processes elsewhere as I could. I also gave Dave a BOINC feature request - a file_deleter command line option where you can state "only delete results" or "only delete workunits," so you can have file_deleters running on more appropriate systems.

It's raining here in the Bay Area - and this wet weather is very much welcome given a ridiculously long summer of drought and fire, but rain also means our air conditioner isn't working as efficiently. So we got the server closet temperature to worry about on top of everything else.

- Matt

Er... well, rain is something to be happy about too... less load on the air conditioner... (so the problem still isn't solved?)

OP | Posted 2008-11-5 08:07:06

4 Nov 2008 23:26:50 UTC

I don't know if you heard, but today is Election Day in the U.S. Luckily I only had to wait in line an hour to cast my vote, so the usual weekly maintenance outage wasn't delayed. However, I wanted to reboot jocelyn to pick up a new kernel, and had issues upon shutting down and coming back up not unlike those I had with bruno a week or two ago. Namely - the server couldn't find its large storage partition and/or thought it was corrupt. Not sure why, but the data storage partition, which is under LVM control, wasn't being activated. I had to go through the rigamarole of booting from CD, commenting out the mount point in /etc/fstab, rebooting, then typing "vgchange -a y" myself to finally see the partition. Then everything was kosher. So far the projects are coming up just fine, though slowed as the restarted database has to flood its memory caches before reaching maximum efficiency - this usually takes around 30-45 minutes, I think.
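(For the record, the recovery dance from the rescue CD looks roughly like this - the logical volume name here is illustrative:)

    vgscan                               # find the volume groups
    vgchange -a y                        # activate all logical volumes
    mount /dev/VolGroup00/data /mnt      # LV name illustrative
    # ...and comment the offending mount out of /etc/fstab until it's sorted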

Next Tuesday is Veterans Day, so we'll probably have the weekly outage on Wednesday.

- Matt

OP | Posted 2008-11-6 08:28:24

5 Nov 2008 21:32:17 UTC

At 7:30pm last night the scheduler apache server got hung up - probably from all the election night excitement. These apache servers need a kick fairly often. Unfortunately they die in various ways due to various things, so automating the checking of certain pulses doesn't always help - in fact such things usually make systems more complicated and unpredictable. In last night's case it failed during log rotation, which issues an "httpd restart" - this time the hung process didn't die, so port 80 got locked up. I had to log in and kill the zombie httpds by hand before restarting apache. Not a big deal, though it got missed for a couple hours as it was timed perfectly with the entire country busy watching the news.
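(The by-hand cleanup is mundane - roughly:)

    netstat -tlnp | grep :80    # see what's still squatting on port 80
    pkill -9 httpd              # kill the zombie httpds outright
    apachectl start             # then bring apache back up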

- Matt

The scheduler server hung at 7:30 pm and went unnoticed for a couple of hours - perfectly timed with the election coverage.

OP | Posted 2008-11-11 09:52:13

10 Nov 2008 23:24:13 UTC

Reminder: Tuesday (tomorrow) is a national holiday. We'll be having our weekly outage on Wednesday.

It's been a bit of a rocky weekend. A rocky week, actually - since the last regular Tuesday outage the CPU load on anakin (the scheduling server) has been kinda high. Like around 100. The obvious answer - having turned on client logging for debugging on Tuesday - wasn't so obvious at first, as we thought we had ruled that out, and moved on to finding other possible problems, which were all red herrings. Eventually we were brought back to client logging, and Dave made an optimization fix which we tested in beta this morning and Jeff applied to the public project around noon. This brought the load back down to 1. I guess the fix worked.

Our raw data pipeline management still needs work. Lots of bottlenecks make it impossible to keep a steady, automatic flow of work. In a perfect world it would be simple and serial - data drives sent up here, files brought online and acted upon by both multibeam and astropulse splitters, copied down to archival off-site storage, and then deleted. However each step takes a rather long time (hours per file, if not days), and storage is limited, so we have to parallelize as much as possible. One possible effect of this, and one we're seeing now, is that we currently don't have raw data available for astropulse to split. We're loath to copy data back up from the archives unless we really have to. We still might do so, but we are expecting a new shipment of drives directly from Arecibo today or Wednesday, so astropulse at least will be fed then.

The bright side is that this is now very clear on the server status page, since I made some updates to finally split out database counts/rates and splitter activity per application. There are still more updates to be done, but it's now much easier to tell what's going on between the two.

- Matt

Tomorrow is the U.S. Veterans Day public holiday, so the routine maintenance moves to the day after.

OP | Posted 2008-11-13 20:11:22

12 Nov 2008 23:39:34 UTC

Let's see.. we had our weekly Tuesday outage today, since yesterday was a holiday. This meant the database had an extra day to get more bloated, which is perhaps why several queues started falling behind. Actually that probably has more to do with our workunit storage server filling up again, causing general backend malaise. So we were low on downloads for a while there before the outage.

Good news is that I found one reason why our apaches were randomly failing - kind of stupid, actually - there were two httpd log rotation scripts in occasional competition with each other. I think I cleared that up, but automounter/nfs problems are still creeping in there and wreaking havoc. I also finally employed new file_deleter logic to split the deletes between results and workunits, so we can run each specific job on a more appropriate machine. Hopefully that will help speed up the queue drainage.
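(The split amounts to running two deleters with complementary roles - a sketch, with hypothetical flag names, since I'm glossing over how the logic is actually wired in:)

    # on the workunit storage server:
    file_deleter -d 2 -delete_wus_only       # hypothetical flag
    # on the upload/result server (bruno):
    file_deleter -d 2 -delete_results_only   # hypothetical flag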

I also added a tiny bit more logic to the server status page to help make clear which data files are being acted upon by which application. On that front, we were expecting raw data from Arecibo to show up today. It didn't. However, I've been pulling up old raw data files from HPSS for Josh's pulsar testing, and found these haven't been chewed on by Astropulse yet, so I added those to the data queue. So you'll be gettin' Astropulse work soon enough.

As for the project getting all of our stuff off the Network Appliance rack (to free up major amounts of space/power in our closet) we continue to make sloooow progress. Today we moved the boincadm account to this new machine, and so far so good - response times are still pretty snappy. Or snappy enough. The web page server does a lot of random access reading/writing in this directory, so it would be obvious if there were an i/o problem.

For what it's worth my schedule is changing a bit over the next month or so, and I'll be here on Fridays instead of Thursdays. This has nothing to do with anybody except those who expect my tech reports on specific days.

- Matt

OP | Posted 2008-11-15 09:05:09

14 Nov 2008 21:49:19 UTC

Happy Friday. After the Wednesday outage we had some splitter issues - Jeff incorporated new raw data reading logic that changed in our standard internal data handling libraries. This didn't break in tests, but broke in reality. Actually I'm not sure if it actually broke as much as misbehaved. In any case, the splitters tore through all our raw data and called the files "done" so we ran out of work to send for a moment there. I "un-did" these files and Jeff fell back to the old splitter. The project of debugging this is still open as far as I know. In the meantime, the Astropulse splitters are disabled for a reason - Eric and Josh want to fully drain all those workunits before releasing another client.

Meanwhile since we still haven't gotten our shipment of the latest data drives we had to pull data up from our archives to ensure we have enough work to send over the weekend. These are raw files that were surplus at the time and therefore unsplit (and "saved for a rainy day," like today).

We had some more automounter issues, though they are happening far less frequently than before. I added some alerts so Jeff and I will get more warning when such things go awry on any particular system. I also cleaned up the server status page some more. Other than that most of my time has been spent on shuffling big bunches of data around like some shell game in preparation for optimizing file systems (probably early next week). This is mostly internal stuff and has little to do with public server performance.
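(The alerts are nothing clever - a cron'd sanity check along these lines, with the mount point and address being illustrative:)

    #!/bin/bash
    # cron'd every few minutes: holler if an automounted path stops responding
    if ! timeout 10 ls /mnt/raw_data >/dev/null 2>&1; then
        echo "automount of /mnt/raw_data wedged on $(hostname)" \
            | mail -s "mount alert: $(hostname)" admins@example.edu
    fi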

- Matt

OP | Posted 2008-11-18 09:00:23

18 Nov 2008 0:12:43 UTC

So vader went down again over the weekend. Actually, just its ethernet connection went down. We're blaming Network Manager. Anyway, we remotely moved enough services around to cover for vader's absence from the server fold, and got everything working again this morning once back in the lab.

I don't have much time to report on all the other mundane details that occupied the rest of my day. Tomorrow is a standard outage day, during which we hope to get a bulk of the remaining NAS-box shuffling done - one more step towards the major server closet overhaul.

- Matt

OP | Posted 2008-11-19 17:28:51

19 Nov 2008 1:59:09 UTC

Had the usual outage today (weekly mysql database reorg/backup). I also took this opportunity to do what I mentioned yesterday: the remaining last bits of NAS-box shuffling. This included breaking a (currently unused) RAID5 array, putting in bigger drives, and rebuilding it all as a RAID10. However, I quickly came to find the command line utility doesn't allow me to delete single logical drives - it's all or nothing. Not wanting to destroy the root logical drive, I was forced to go into the RAID BIOS, which meant the server (and the web site) had to be brought down temporarily.
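(Our box uses a hardware RAID controller, so all of this happened through its utility and BIOS - but for flavor, the equivalent rebuild with Linux software RAID (mdadm) would go something like this; devices and names are illustrative, and this is emphatically not what our controller's tools look like:)

    mdadm --stop /dev/md1                      # tear down the old RAID5
    mdadm --zero-superblock /dev/sd[b-e]1      # forget its metadata
    # rebuild the same slots (now holding bigger drives) as RAID10
    mdadm --create /dev/md1 --level=10 --raid-devices=4 /dev/sd[b-e]1
    mkfs.ext3 /dev/md1                         # then make a filesystem and mount it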

"Temporarily" became a couple of hours - after doing the reconfigure, the regular BIOS surreptitiously changed the boot drive sequence. This meant the system wasn't booting after that, leading to much confusion and panic (and many long, slow reboots) until I discovered this tricky, pointless switcheroo. Anyway, everything was fine after that and I brought up the new partition and started moving things back to where they are supposed to be.

This included the beta download directory, which uncovered a "bookkeeping" error on our part which meant beta downloads of the new client were broken for the past few days. Oops. That should be fixed now.

We turned on query logging while bringing the project back up, in order to do an inventory and determine any need for more/different indexes. I had to bounce the project again later in the afternoon to turn that logging back off (it eats up too much i/o to just leave on indefinitely). I also spent a lot of time helping the CASPER gang reconfigure their main web server. I'm also supposed to be working on donation drive stuff. Oh well. I'll get to it tomorrow.
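(On newer mysql builds the general query log can be toggled on the fly, roughly as below; on older servers it means a my.cnf edit and a full bounce, which is why this costs us an outage:)

    mysql -e "SET GLOBAL general_log = 'ON'"     # mysql 5.1+ only
    # ...let it collect for a while, then:
    mysql -e "SET GLOBAL general_log = 'OFF'"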

- Matt

OP | Posted 2008-11-20 11:21:31

20 Nov 2008 0:38:26 UTC

Today was a day mostly spent tracking down little problems involving BOINC, Astropulse beta, the Astropulse 5.00 release, the beta SETI@home splitter, raw data pipeline flow, data drives reporting wrong capacities to the OS... Lots of bizarre problem solving.

As for Astropulse 5.00 and an "official" statement (which was requested in the last thread) I just have to step back a moment and tell everybody that these threads are for entertainment purposes only and nothing I say should be considered official. I just work here and happen to suffer from hypergraphia. I do understand this is the most dynamic form of news on this site and so I nag the others to add content here and elsewhere. They never do, and I end up looking like the de facto spokesperson.

In practice, due to the incredible web of resource dependencies behind the scenes here, I have to keep tabs on pretty much every aspect of the whole BOINC/SETI@home/Astropulse family of projects since each program, server, budget constraint, etc. affects everything else. Jeff has to do the same. Nevertheless we can only keep track of so much, and what I believe is going to happen doesn't always necessarily happen.

That said.. from what I know and understand Astropulse queues did drain last week and the new client was released on Friday or Saturday. The vader choke hampered this a little bit, but shouldn't have affected progress on this front that much. Josh is a little puzzled about current results, or lack thereof. That was part of the problem solving today. I still have no real handle on the current Astropulse plans - just temporarily offering my mysql/BOINC expertise to the "Astropulse committee" (Josh and Eric) and then getting back to work on something else.

- Matt

Astropulse 5.00 has started testing.

OP | Posted 2008-11-22 09:57:56

21 Nov 2008 22:29:21 UTC

Let's see. Do I actually have any news to report...? Among other things today I've been working on some web site configuration cleanup, the continual chore of raw data pipeline management, and discussing the general Astropulse game plan with Josh. I think when Jeff and Eric are in the lab we'll all figure out what our exact plans are, and what we need to do to enact these plans. Generally I keep myself out of as many loops as possible for my own sanity, but I have to ramp up on Astropulse sooner or later - it's no longer a "proof of concept" kind of project handled completely by Eric/Josh. Anyway, this is the kind of day where I take care of a small subset of the little things that need fixing - it's been a long week and my brain is unable to deal with big projects, nor do I want to mess around with project-critical stuff (especially as I am the only person on the "systems team" physically here at the lab right now and we're heading into the weekend).

Oh yeah.. keep in mind we are entering holiday hijinx time. Next week will be "short" (even shorter if I get called into jury duty the day before Thanksgiving).

- Matt

OP | Posted 2008-11-25 10:28:58

25 Nov 2008 0:04:53 UTC

Welcome back from the weekend, which was actually relatively painless except for the usual set of automounter issues. We're close to giving up on all that. Today was a day filled with lots of chores - including trying to maximize how much raw data we have online for splitting over the long weekend.

We did have a server hiccup today due to an administrative script corrupting an /etc/passwd file (thanks to the aforementioned automounter problems). It's hard to maintain a server if the "root" user disappears from the passwd file. So I had to boot from DVD to fix the corrupt file. Just so happens this was the server I was having BIOS issues with last week, and they happened again! Without my consent it reset the boot drive sequence, causing a little bit of annoyance and grief. Eric and I are thinking there's a dead battery involved.

Reminder: this is a "short week" for us, thanks to the turkey day.

- Matt

OP | Posted 2008-11-26 08:44:44

25 Nov 2008 23:35:36 UTC

Happy Tuesday. We had the usual outage rigamarole today and should be recovering from that in due time. Right after the backup was finished we restarted mysql with full query logging turned on. We knew this would choke the server a bit, and would just be on temporarily. After about a half hour we had over a million queries in the log, so we brought everything back down and turned logging off. We'll parse this log file, and perhaps others we generate over the next 24 hours, in order to find pesky unoptimized queries, anything that would die if we remove all multidimensional indexes, or queries running far more often than we expected.
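(Parsing a million-line query log is mostly a matter of normalizing away the literals so identical query shapes group together - a rough sketch:)

    # crude frequency count of query shapes from the general log
    grep -iE '(select|insert|update|delete)' query.log \
        | sed -e "s/[0-9][0-9]*/N/g" -e "s/'[^']*'/S/g" \
        | sort | uniq -c | sort -rn | head -20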

Also during the outage I moved some big directories around - more NAS shell games. Other than that I've been reconfiguring some more web server stuff (internal use pages) and trying to maximize the raw data pipeline plumbing to get as much work online as possible. It doesn't help that a lot of our raw data drives are showing weird signs of corruption. Don't worry - we do checksums at every important transfer to ensure the data are sound, and the splitters cannot operate on garbage (there are keyword strings occurring regularly throughout the files). Nevertheless, we're having to throw away some files, which is sad. My spider sense tells me this has to do with our general SATA enclosure mounting/unmounting woes. For example, we're finding drives that are 500GB thinking they are 750GB when mounted. Was this because a drive previously on that mount point was 750GB and some bookkeeping bits haven't been cleared? I dunno, but I'm sure this isn't good.
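(The checksum pass is plain md5 against a manifest of the drive's contents - a sketch; the manifest name and mount point are illustrative:)

    # when a drive arrives: verify everything against its manifest
    cd /mnt/raw_drive && md5sum -c manifest.md5    # flags any corrupt file
    # files that fail get re-shipped or, sadly, tossed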

In a couple hours I get to call a number where an automated voice will tell me if I have to attend jury duty tomorrow or not. I get dragged in for potential jury duty an astonishing amount (pretty much the legal maximum) considering I never actually get selected for trial, and never will.

- Matt
