Views: 42169 | Replies: 184

SETI@home 2008 Technical News

Posted 2008-1-3 14:37:08

Happy New Year! SETI@home rings in 2008; here's hoping this is the year we receive a greeting from an extraterrestrial civilization!

Reposted from the Technical News page of the official SETI@home site.
The posts mostly cover server hardware and software issues. Interested forum members are welcome to reply with translations. Thanks!

The official site also has a dedicated Technical News message board: http://setiathome.berkeley.edu/forum_forum.php?id=21
where you can talk to Matt directly.

[ Last edited by BiscuiT on 2009-1-1 13:11 ]

OP | Posted 2008-1-3 14:37:30

2 Jan 2008 22:54:11 UTC

Happy new year! Actually, being that every moment is the beginning of some arbitrarily defined era, I should be more clear: Happy new calendar year number 2008, whoever uses this particular calendar system which I usually do!

The weekend was busy with the more-and-more-common fast workunits. Discussions at the lab today brought up the fact that about a third of our data will translate into these fast runners, so we'd better turn our attention back towards improving the data pipeline. We picked two pieces of low-hanging fruit today. First, we converted server bane from a redundant web server into a secondary download server; this will help determine whether the bottleneck is the server or the storage. Second, I added a flag to the splitter scripts to select files in beam/polarization pair order rather than filename order. This will help pseudo-randomize the creation of work, and hopefully spread out the painful fast-workunit periods so we aren't so overwhelmed at times.
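A rough sketch of that ordering idea: group input files by channel and deal one file per channel at a time, so fast and slow files end up mixed. The filename pattern and grouping here are assumptions for illustration; the actual splitter scripts aren't shown in the post.

```python
import random
from collections import defaultdict

def pseudo_random_order(filenames):
    """Interleave files across (beam, polarization) channels instead of
    walking them in filename order, which clusters similar workunits."""
    by_channel = defaultdict(list)
    for name in filenames:
        # hypothetical naming scheme: <tape>.<beam>.<pol>
        _, beam, pol = name.rsplit(".", 2)
        by_channel[(beam, pol)].append(name)
    channels = list(by_channel.values())
    random.shuffle(channels)           # randomize channel order
    ordered = []
    while any(channels):
        for group in channels:         # round-robin one file per channel
            if group:
                ordered.append(group.pop(0))
    return ordered

files = ["tape.0.a", "tape.0.b", "tape.1.a", "tape.1.b"]
print(pseudo_random_order(files))
```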

Nevertheless, we have Astropulse coming down the pike, and have a lot of SETI@home data to go through (and we're starting to collect new data again!). So we need to upgrade the network/servers in a big way. And acquire more participants. Not sure how this will all happen yet, but it has to happen.

Meanwhile, we might try another science database index build tomorrow (or soon thereafter). Bob found a way to do so while the database is up and inserting rows, so we might not have to shut down splitters/assimilators during the long build. Cool.

- Matt

Posted 2008-1-3 14:59:48
Could somebody translate this? I recognize all 26 letters, but strung together they're complete gibberish to me.

Posted 2008-1-3 16:22:40
Calling all English experts!

Posted 2008-1-3 18:54:17

Haha, I ran it through Google's auto-translate and the result was pretty comical.

OP | Posted 2008-1-4 23:12:37

3 Jan 2008 20:54:14 UTC

Spreading the workunit creation over several files at once seems to be helping create a healthier mix of fast/slow workunits. However, adding a second download server seems to have confirmed a suspicion of mine (key word: "seems"): that somewhere down the pike we're being capped at 60 Mbits/sec. For a while there we had two download servers and a workunit storage server with plenty of I/O capacity to spare, but still we were hitting a hard 60 Mbit ceiling outbound. Inquiries are being drafted/sent to the appropriate parties. It still could be a local problem, but we're not sure what else to try (given our current hardware).

We are in the middle of building another helpful index on the science database. Looks like Bob's magic informix incantations are working - we can keep the project running simultaneously (though the assimilators might back up a bit). It is always happier around here when work is flowing. To be safe we increased the ready-to-send queue size to one million - we have the disk space now to keep more workunits around. The only downside is that this inflates the result table in the database by approximately 5-10%, which may exercise the RAM on the BOINC database server that much more.

There is another problem Dave and I were poking at today: excessive "out of range" failures on our public web sites. Here's the deal: BOINC clients have a nice GUI which shows you icons, pictures, etc. from different projects as you select which to run on your computer. Where does it get these files? From the project's web servers. This is all well and good, but there are several (hundreds? thousands?) older clients out there making such requests and being met with 416 "range not satisfiable" errors. Why? Because they have already downloaded the image file, but are requesting more bytes beyond the file boundary as if there were more to download. Obviously a bug somewhere, or a change in the way apache handles such things, but there's not much we can do about it. Even though this activity is creating bursts of heavy load on our web servers, this is a fire we're going to let burn for now.
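The 416 logic boils down to a simple satisfiability check. A rough sketch of single-range handling in the spirit of the HTTP spec (this is not Apache's actual code):

```python
def check_range(range_header, file_size):
    """Return (status, start, end) for a single byte-range request.
    Requesting bytes at or past end-of-file is unsatisfiable: that is
    the 416 the old clients keep triggering."""
    units, _, spec = range_header.partition("=")
    if units != "bytes":
        return (200, 0, file_size - 1)       # ignore unknown units
    start_s, _, end_s = spec.partition("-")
    if not start_s:                          # suffix range 'bytes=-N'
        return (206, max(0, file_size - int(end_s)), file_size - 1)
    start = int(start_s)
    end = int(end_s) if end_s else file_size - 1
    if start >= file_size:                   # first byte past EOF
        return (416, None, None)
    return (206, start, min(end, file_size - 1))

print(check_range("bytes=0-499", 1000))   # (206, 0, 499)
print(check_range("bytes=1000-", 1000))   # (416, None, None) -- the bug
```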

The official press release about multi-beam is finally out. This should help on many levels (though I'll be busier making sure the servers can handle any significant load increase). I guess I'll also be shaving every morning in case there is interest from the national television news media.

I guess this is "technical" news: Our desks/chairs/furniture are mostly ancient hand-me-downs, some pieces older than I. We did get some new chair donations recently, but one of them broke - it came loose from its base, causing unsuspecting sitters to suddenly fall forward if their balance wasn't particularly keen. It's been lurking in our lab way too long, coaxing uninformed standers with tired legs to rest upon its comfortable and seemingly stable cushion base. I came to the lab this morning and that evil chair was by my desk with a note taped to it: "Matt - can you please toss this chair?" I guess enough was enough. I dragged it to the dumpster and sent it back to the dark void from whence it came.

- Matt

OP | Posted 2008-1-9 08:48:50

7 Jan 2008 23:28:38 UTC

Lots of weather in the Bay Area over the weekend, leading to many power outages. Luckily our project was not affected.

The new pseudo-random nature of our workunit creation finally worked itself out, and we were sending data at a relatively even pace. Speaking of sending data... At the end of last week my suspicions were confirmed: the router between us and our ISP (a Cisco 2811) has been CPU bound for who-knows-how-long, thus causing an artificial 60 Mbit/sec cap on our outbound packets. Further research will determine whether we can improve its performance or if we need to procure a better router.

We had an assimilator get jammed on a broken result. I had to delete the result to clear the pipes. This happened once before a week or two ago. A little detective work this morning uncovered that both such broken results were processed by optimized clients. I'm just sayin'. This could easily be a coincidence.

Spent a large chunk of the day trying to coax another Intel-donated server to life. We've gotten a lot of stuff from Intel recently, all in varying states of functionality (some missing CPUs, some have test boards, etc.). This particular one (4 2.66GHz CPUs, 8 GB RAM) was dead in the water for a while as it wouldn't respond to any keyboard/mouse. However, the other day I noticed one of the front-side fan modules wasn't seated properly. I adjusted it, and now the server sees all input devices. It's still a little squirrelly, but may be a worthwhile web server after all. We're calling it "maul" (sticking to the current "darth" theme). I'll announce it again if it actually proves to be ready for prime time.

- Matt

OP | Posted 2008-1-9 08:49:07

8 Jan 2008 22:16:52 UTC

So we've been running this annoyingly load-intensive query every day on the BOINC database to clean up results that failed validation. It took up to an hour to run, during which it hogged a bunch of database memory and slowed everything down, including workunit distribution. Why not build an index? Well, indexes still take up disk/memory, the main table field in question is of low cardinality, and we're only hunting for a few thousand out of millions of rows each time. So Bob was looking into implementing a newfangled mysql "trigger" to flag the few rows when they enter this bad state, making them much easier to find without needing the overkill of an index. However, we only discovered today that triggers don't work in our current version of mysql. So we built an index after all. We'll see how much it helps.
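For the curious, the trigger idea looks roughly like this, sketched with SQLite (Python's stdlib) standing in for mysql. The table, column, and outcome code are made up for illustration; the point is that the cleanup job scans a tiny side table instead of millions of result rows.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE result (id INTEGER PRIMARY KEY, outcome INTEGER);
CREATE TABLE failed_validation (result_id INTEGER);
-- When a result enters the bad state, remember its id in the side table.
CREATE TRIGGER flag_failed AFTER UPDATE OF outcome ON result
WHEN NEW.outcome = 6   -- hypothetical 'validation failed' code
BEGIN
    INSERT INTO failed_validation VALUES (NEW.id);
END;
""")
db.executemany("INSERT INTO result VALUES (?, 0)",
               [(i,) for i in range(1000)])
db.execute("UPDATE result SET outcome = 6 WHERE id = 42")
flagged = db.execute("SELECT result_id FROM failed_validation").fetchall()
print(flagged)   # [(42,)]
```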

Other than that and the usual database backup outage this morning, mostly spent the day moving large numbers of files/archives around to prepare to grow the workunit storage space again. I also got the new server (maul - see yesterday's note) up to speed, more or less. Still won't be live for at least a day or two, but it's working. It's a 4x2.66 GHz dual core intel with 4 GB of memory. Looks like another perfect web server to me. Also had to grow our home directory space because, as you know, no matter how much space you have, it's never enough.

Somebody pointed to an article that mentioned the Cisco 2811 has a known throughput rated at about 61 Mbps. This was a surprise to me and Jeff - I guess this wasn't what we were told, and you'd think a router with 100 Mbps ports could reach a theoretical maximum of 100 Mbps. The cap seems to be due to CPU limits, and we are doing tunnel encryption and have a small but still non-zero set of access rules. Anyway, live and learn. And no further progress on that since yesterday.

Another storm is whizzing through. The top third of a 50 foot tree just broke off right outside my lab window. Cool. I understand why people are freaking out about this current weather, but this is nothing compared to the hurricanes I dealt with growing up in downstate NY.

- Matt

OP | Posted 2008-1-10 09:25:50

9 Jan 2008 22:51:15 UTC

More blips and blops in our traffic caused by who-knows-what. We still don't have enough data yet to see if yesterday's BOINC result outcome index build helped with those regular slow validation-fix updates. In any case, I misspoke: we are running a version of MySQL where triggers are available to us - we only have to figure out how to implement them to do what we need. This morning the secondary download server bane was having a mount headache and I had to give it a virtual kick to get it going again. And that router is still a problem, but we're not convinced it's the only problem. Swapped out cables, switches etc. to no avail this morning. I installed some real load balancing between vader and bane (in practice round robin DNS is hardly balanced) which may help.
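The difference between round-robin DNS and real load balancing, in miniature. Server names are borrowed from the post; the request counters are hypothetical, and this is a sketch of the general idea, not what pound actually does internally.

```python
import itertools

servers = {"vader": 0, "bane": 0}   # name -> requests in flight

def round_robin(cycle=itertools.cycle(servers)):
    """Hand out servers in fixed rotation, blind to actual load
    (and DNS round robin is worse: clients cache the answer)."""
    return next(cycle)

def least_loaded():
    """Send the next request to whichever server is least busy."""
    return min(servers, key=servers.get)

print([round_robin() for _ in range(3)])   # ['vader', 'bane', 'vader']

servers["vader"] = 5                 # vader gets slow/busy
picks = [least_loaded() for _ in range(3)]
print(picks)                         # traffic shifts to bane
```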

There was still slowness to the web site as of a few minutes ago. This had nothing to do with recent web code tinkering/updates or database load or any such thing - this was strictly due to the aforementioned router problems, as half the web traffic was going through the same router (the other half over the standard campus network). I just moved the competing traffic onto the campus network as well, so that should improve web site performance in general.

Regarding recent assimilator clogs, we had another one this afternoon. And yes, once again it was from a result produced by an optimized client. This time around I attached a debugger and found the problem was in XML parsing of the result and sure enough with enough eye-squinting I found a couple garbage characters in the uploaded result file. Specifically, in the power-of-time declaration of a pulse. Instead of:

<pot length=211 encoding="x-csv">

It was:

<pot length=211 encoding71x-csv">

So there are two problems. First, something is causing corruption in the xml (the non-standard client? something else on our end?). And second, the assimilator is too sensitive to such corruption. It shouldn't bail out so readily and create these large ready-to-assimilate queues.
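A sketch of the "less sensitive assimilator" idea: parse each result, quarantine the corrupt ones for a human to inspect, and keep going instead of bailing out. The data is hypothetical, and note that real BOINC result files use looser XML than this strict stdlib parser accepts, so the attributes are quoted here.

```python
import xml.etree.ElementTree as ET

def assimilate(results):
    """Parse (id, xml) pairs; return (parsed, quarantined_ids)."""
    done, quarantined = [], []
    for result_id, xml_text in results:
        try:
            done.append((result_id, ET.fromstring(xml_text)))
        except ET.ParseError:
            quarantined.append(result_id)   # log it, move on
    return done, quarantined

results = [
    (1, '<pot length="211" encoding="x-csv">1,2,3</pot>'),
    (2, '<pot length="211" encoding71x-csv">1,2,3</pot>'),  # garbage bytes
]
done, bad = assimilate(results)
print(len(done), bad)   # 1 [2]
```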

Minor updates to the server status page: I changed references to "beam/polarization pair" to the more concise "channel." I then added a parenthetic numeric value to the end of each data file (representing total working/done channels for each file) so you don't have to count the little green squares. I also added total values at the bottom for all data files (mostly so we can see how long we have before we run out of data to split). Note how the "vertical" processing (i.e. splitting multiple files at once) has a negative side effect: we are forced to keep data files around much longer, which makes it difficult to keep a queue of data on disk. Some better "vertical" logic has been coded, to be rolled out in the next day or so.

- Matt

Posted 2008-1-10 17:28:36

tes

Hoping to find what I'm looking for here

OP | Posted 2008-1-11 12:27:10

10 Jan 2008 22:47:31 UTC

The public web site servers slowed to a crawl again this morning thanks to several robots/spiders scanning us at once. So I took another gander at my robots.txt file and used Google's webmaster tools to check how well this was being parsed. This uncovered a typo (a missing "s") and while I was at it I added some new rules to robots.txt. We'll see how this all fares.
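Python's stdlib can check robots.txt rules the same way well-behaved crawlers do, which makes a handy local sanity test before deploying an edit (a one-letter typo silently changes what crawlers may fetch). The rules below are illustrative, not the actual SETI@home file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep crawlers out of forum pages and result pages.
rules = """
User-agent: *
Disallow: /forum_
Disallow: /results.php
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())
print(rp.can_fetch("SomeBot", "/forum_thread.php?id=1"))  # False
print(rp.can_fetch("SomeBot", "/apod.html"))              # True
```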

Bob and I brought the BOINC/science database servers down briefly this morning to tweak some parameters and clean out logs - some of you may have noticed a brief data server/web site outage in the process. The only tweak of note was on the science database: we reduced the checkpoint intervals and increased the between-database-ping timeouts. Why? We've been seeing the secondary spuriously enter recovery mode due to being unable to reach the primary, when really the primary was simply busy doing checkpoints at the time. Anyway, outage recovery was slowed by a confluence of various stats/update scripts starting up while the database was busy flooding its memory buffers. We really need to optimize those stats queries someday. As well, a relatively new BOINC feature ("resend lost workunits") was eating up a lot of database resources too, so we turned that off for now. Actually that last thing helped immensely.

In the process of general disk cleanup, etc. I'm now forced to finally populate the credited_job table with three years' worth of purge archives. These archives are taking up 200GB on a 1TB filesystem which we really need to convert into workunit storage sooner rather than later, hence the push. Reminder: this is the table that contains the history of which users processed which workunits.

Just between you and me... In addition to the outbound traffic squeezing through our maxed-out router, I am now sneaking out an additional 5-10% over the campus net. This is thanks to the simple/useful "pound" load balancing utility. The campus net can definitely handle this tiny increase. In fact I might bump up the percentage. But don't tell anybody. Mwha ha ha. [edit: I brought that percentage back down to 0% an hour later - we'll keep this extra power in our back pocket for now.]

By the way, the optimized client discussion has been taken offline and is progressing. Turns out this may actually be a single bad host more than a bad client.

- Matt

Avatar blocked
Posted 2008-1-13 17:08:11
Note: the author has been banned or deleted; content automatically hidden

Posted 2008-1-15 06:53:45
Was that translated with Google?

OP | Posted 2008-1-15 09:31:31

14 Jan 2008 22:23:56 UTC

Things ran quite well over the weekend. Looks like we added the right index to the mysql database to reduce the slow "validator fix" queries. A note about general BOINC/mysql implementation/design: there are a lot of features in BOINC that are seemingly excessive from a single-project perspective, but are there because every project has different needs. Project-specific factors (server power, workunit processing times, number of active users, min quorum, etc.) make some features less helpful. In the case of "resend lost workunits" (see last thread), this feature, implemented mostly for the benefit of Einstein@home, was most definitely weighing down our database server. We turned this off and have been running smoothly since. There were assumptions this would lead to greater problems down the line (the fear being that many results would sit on disk longer waiting for their redundant pairs to return) but in fact our "results returned and waiting for validation" number has been stable (if not slowly decreasing) since I made the change. Nevertheless, at some point soon we will see if we can optimize/reimplement this code, and Eric is actually making adjustments to the splitter which will perhaps create fewer "fast runners."

Our new-hardware-to-obtain priorities are shifting. Namely, we need a router (we're not ignoring discussion about this on other threads, but we are limited in what we can use for various configuration/policy reasons). We also need a new KVM - our current one in the closet is maxed out and we'd like to get more stuff in there ASAP. We also need three new desktop systems. Dan's using an old, sloooow solaris system which is out of support. Bob is on a slightly faster solaris system, but needs a safe mysql test sandbox. Josh's old super-cheap windows/intel box is basically a glorified console server.

Had some minor issues due to the root drive on bruno filling up on Sunday. I scanned the drive and found only 4GB of stuff, while "df" was showing 40GB. Eric eventually found a deleted-yet-open file - an infinitely growing httpd log. Apparently httpd log rotation broke at some point, but we cleaned this up. Annoying, but harmless.
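The deleted-yet-open file trick is easy to reproduce in miniature, and it explains why du (which walks directory entries) and df (which asks the filesystem for allocated blocks) disagreed:

```python
import os
import tempfile

# Create a file, keep it open, then unlink it.
fd, path = tempfile.mkstemp()
os.write(fd, b"x" * 4096)
os.unlink(path)                  # directory entry gone: du stops seeing it
assert not os.path.exists(path)

st = os.fstat(fd)                # ...but the inode is still alive
print(st.st_size, st.st_nlink)   # 4096 0 -- space still counted by df

os.close(fd)                     # only now does the kernel free the blocks
```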

Due to increased load in general, I changed the server db stats to update every hour (instead of half hour). Actually it's becoming clearer as we increase active user load and I'm populating credited_job, etc. that the mysql database might be our bottleneck du jour any jour now. There were also some issues with the user-of-the-day selection process which I tracked down and fixed this morning.

- Matt

OP | Posted 2008-1-16 12:57:17

16 Jan 2008 0:37:05 UTC

Yeah... we're really pushing the boundaries of our mysql database these days. I'm finally catching up on several years of backlogged archives and inserting zillions of rows into credited_job, and this, on top of generally increased usage, is gumming up the works. In fact, optimizing this table alone during today's outage took three hours (normally only a few minutes) - which explains the extreme length of today's downtime. I guess we'll have to turn off credited_job optimization until we actually use the table.

This brings up several questions, the first of which was asked in a previous thread: Why are you guys using mysql instead of a more robust commercial product? Two main reasons: BOINC projects generally are small academic ventures with limited funds, and BOINC is an open-source project itself utilizing other open-source pieces of software. So all you need is a relatively cheap linux box which comes with php, apache, mysql, etc. and it's pretty much plug and play. Remember the project-specific data, i.e. the science database, can be whatever you want. In our case, it's Informix. Why Informix? We got it for free 10 years ago - we now have 10 years of experience using it as a group and it is still free to us. Would we consider changing to Oracle/SQL Server/etc.? If somebody wants to buy such a license and donate a man-year to change all our back-end software accordingly, then we would perhaps entertain the thought, but we have higher priorities, especially as Informix works perfectly well at this point. It's the BOINC/mysql part that needs help, and we're sticking with it for the reasons stated above, and with SETI@home being the flagship project of BOINC we don't want to diverge from the standard.

In other news, it seems every day there's a different reason our web sites are so darn slow. Yesterday afternoon we were getting hit by some seemingly nefarious activity which I was able to block quite easily once I discovered it. But we were also getting hit by some scraping of stats pages via a robot (called BoincBot) that was not obeying robots.txt. I blocked these hits as well. We don't allow such activity on our web sites. If you want BOINC stats you can download the daily xml dumps just like everybody else.

On the bright side, we obtained another server donation yesterday from a private party: a 1U dual-opteron (2.4GHz) server with 16GB memory. I installed FC8 on it just now, though there was a little bit of tweaking to get that to go. There's no DVD drive in the thing (only a CD drive) and for some reason there was some disconnect with the 3ware disk controller such that the linux installer couldn't see the two root drives. I ultimately took that out of the equation and plugged the drives straight into the SATA ports on the motherboard. All's well and it's getting all yummed up now.

So we're looking for a KVM-over-IP, at least 16 ports (24 preferable), easy-to-use but secure connections via a web browser, etc. Any thoughts? The Belkin Omniview seems the cheapest/easiest, but only allows one person to connect to the whole unit at a time - not a showstopper. Any suggestions, experience with such devices, etc. out there?

- Matt

China Distributed Computing Portal (中国分布式计算总站) ( 沪ICP备05042587号 )
