Thread starter: BiscuiT

SETI@home 2008 Technical News

Thread starter | Posted on 2008-10-7 08:59:05

6 Oct 2008 23:15:18 UTC

Let's see. No real major crises at the moment. We do have these network bursts which are entirely due to Astropulse workunits. Here's what happens, I think: an Astropulse splitter takes a long time to generate a set of workunits, and then dumps them on the "ready to send" pile. These get shipped to the next 256 clients looking for something to do, which in turn causes a sudden demand on our download servers as the average workunit size being requested goes from 375K to 8000K. We'll smooth this out at some point.
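
A quick back-of-the-envelope check, using only the figures above (256 clients, 375K vs 8000K average workunits), shows why these bursts hit so hard:

    # Rough size of one "dump" of freshly split work, using the numbers in the post.
    clients = 256
    astropulse_kb = 8000   # ~8000K per Astropulse workunit
    multibeam_kb = 375     # ~375K per ordinary workunit
    print(clients * astropulse_kb / 1024.0)   # ~2000 MB to push out in one burst
    print(clients * multibeam_kb / 1024.0)    # ~94 MB for the usual mix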

Lots of systems projects, mostly focused on improving mysql performance (Bob is researching better index usage in newer versions) and improving disk I/O performance (I'm aiming to convert all our RAID5 systems to some form of RAID1). Also lots of software projects, mostly focused on radar blanking (the sooner we clean up the data the better). Unfortunately the needs of the software radar blanker required us to break open working I/O code - Jeff implemented some new logic and we walked through the code together today. Hopefully soon we can get back to the NTPCker.

Thanks for your input about the "zero redundancy" plan. Frankly I'm a bit surprised how many are against it, though the arguments are all sound. As I said we have no immediate need to enact this feature. I still personally think it's worth doing if only for the reduction in power consumption - though I'd feel a lot better if we could buff up the validation methods to ensure we're not getting garbage from wrongly trusted clients.

- Matt

Thread starter | Posted on 2008-10-8 07:44:26

7 Oct 2008 22:30:02 UTC

Had our weekly outage for mysql database backup/compression. Reminder: by "compression" I mean that the rather large tables in the database (notably "workunit" and "result" tables) stay stagnant in size if you go by number of rows. That is, workunits and results are created/deleted at about the same rate. However, when you delete a result you can't reclaim that space in the database again until either (a) a whole page of results is deleted (due to random nature of the project this rarely happens) or (b) we actively do this "compression." Why is this a problem? Well, imagine a city where, once you leave a parking space, nobody can ever park in that spot ever again unless all spaces in that neighborhood are vacated. This would make hunting for parking quite a chore. As time goes on, we see a similar effect on the database I/O. Seems silly that the database has this issue, but consider how many endeavors around the world, commercial or otherwise, require a database as large as ours in which a million rows get deleted and added every day? It's not a common problem, to say the least. At least at our scope.
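
For readers wondering what the weekly "compression" involves mechanically, here is a minimal sketch in Python. The table names "workunit" and "result" come from the post; the connection details are invented, and OPTIMIZE TABLE stands in for whatever rebuild statement the project actually runs:

    import mysql.connector  # assuming the MySQL Connector/Python driver

    # Hypothetical connection parameters, for illustration only.
    cnx = mysql.connector.connect(host="db-host", user="boincadm",
                                  password="secret", database="boinc")
    cur = cnx.cursor()

    # Rebuilding a table copies its live rows into fresh pages, reclaiming the
    # "parking spaces" left behind by deleted rows. It locks the table while it
    # runs, which is why it happens during the weekly outage.
    for table in ("workunit", "result"):
        cur.execute("OPTIMIZE TABLE " + table)
        print(cur.fetchall())  # status rows reported by the server

    cnx.close()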

People seem to be experiencing slowness uploading/downloading work. I know why: I've been pumping raw data over our network to our offsite archive (HPSS) over the same network link as the uploads/downloads. Usually we don't, and in fact after the current batch is done (later tonight) I'll archive over the campus network (which is what we usually do).

- Matt

Thread starter | Posted on 2008-10-9 18:38:12

8 Oct 2008 22:01:36 UTC

Some nagging network issues, mostly due to the known liabilities/usual suspects. Very often we are maxing out our 100Mbit private connection to the world, due to peak periods (catchup after an outage, a spate of "fast" workunits, new client downloads added to the usual set of workunit downloads) or sending our raw data to offsite archival storage. This is why download/upload rates are abysmally slow at times - if you can get through at all.

One solution would be to increase our bandwidth - we do pay for a 1Gbit connection, but due to campus infrastructure can only use 100Mbit of that. Getting campus to improve this infrastructure is currently prohibitive due to cost (which includes new routers and dragging new cables up a mountainside to our lab), bureaucratic red tape, and the backlog of higher priority networking tasks campus wishes to tackle first. In other words as far as I can tell it ain't never gonna happen. Another solution would be to reduce our result redundancy, as already discussed in recent threads.

We also had our science db/raw data storage server choke a bit today - perhaps because of the recent swarm of fast workunits and therefore increased demand on the splitters. We do try to randomize the data to prevent such swarms but you can't win all of the time. And our web log rotating script sometimes barfs for one reason or another and fails to restart one server or another. For a moment there both the scheduler and upload server were off - I caught it fairly quickly though and restarted them.
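
As an aside, the failure mode described (rotation succeeds but the restart doesn't take) is usually handled with a guard like the sketch below; the apachectl commands and port are assumptions about the setup, not details from the post:

    import socket, subprocess, time

    # After rotating logs, ask httpd to restart gracefully, then make sure it
    # is actually answering again before declaring victory.
    subprocess.run(["apachectl", "graceful"], check=True)
    for _ in range(30):
        try:
            socket.create_connection(("localhost", 80), timeout=2).close()
            break                                   # server is back up
        except OSError:
            time.sleep(2)
    else:
        # Still not answering after a minute: give it the proverbial swift kick.
        subprocess.run(["apachectl", "start"], check=True)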

To clarify, result uploads and workunit downloads go over the private SETI net, along with scheduler traffic. Web pages and other stuff goes over the campus net (it's not that much - only a couple Mbits/sec at peak times). The archival storage (where we copy all our raw data offsite) sometimes goes over the campus net, sometimes over the private SETI net, sometimes both if we need to empty the disks as fast as possible to return to Arecibo.

Other than all that.. I fixed the fonts of the status pages, and Jeff elsewhere posted a quick note about NTPCker progress.

- Matt

Thread starter | Posted on 2008-10-10 09:12:41

9 Oct 2008 20:26:27 UTC

Let's see.. We had one of our download servers choke on NFS again, which caused its httpd server to die. I gave both autofs and httpd a swift kick on that machine (vader) and it's back up serving workunits again. Of course that means there's a backlog of clients trying to connect to it, and we'll be dropping various other connections while that queue clears out.
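
The "swift kick" is roughly the following, assuming SysV-style service scripts; the mount point used to test NFS below is made up for the example:

    import subprocess

    # Restart the automounter and the web server on the affected host.
    for svc in ("autofs", "httpd"):
        subprocess.run(["service", svc, "restart"], check=False)

    # Confirm the NFS-backed workunit area answers within a few seconds.
    try:
        subprocess.run(["ls", "/mnt/workunits"], timeout=10, check=True,
                       stdout=subprocess.DEVNULL)
        print("mount is responsive again")
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        print("still hung or unreachable - needs a harder kick")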

Our mysql research led us to discover we needn't upgrade our current mysql version after all to make use of automatic index merges. We haven't been seeing this logic being employed due to (a) low cardinality of certain indexes and (b) mysql refusing to use multi-dimensional indexes in their merges. Fair enough. We'll just have to change around our current constraints.sql (dropping some 2-dimensional indexes and making new single-dimensional ones) and see what sticks.
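
To make the idea concrete, a sketch of that kind of change follows; the index and column names are invented for illustration and are not the project's real schema:

    import mysql.connector  # connection details below are placeholders

    cnx = mysql.connector.connect(host="db-host", user="boincadm",
                                  password="secret", database="boinc")
    cur = cnx.cursor()

    # Replace a two-column index with two single-column ones...
    cur.execute("ALTER TABLE result DROP INDEX res_state_outcome")
    cur.execute("ALTER TABLE result ADD INDEX res_server_state (server_state)")
    cur.execute("ALTER TABLE result ADD INDEX res_outcome (outcome)")

    # ...then check whether the optimizer now combines them: EXPLAIN should
    # report type=index_merge when both columns are constrained.
    cur.execute("EXPLAIN SELECT id FROM result "
                "WHERE server_state = 2 AND outcome = 0")
    for row in cur.fetchall():
        print(row)
    cnx.close()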

Other than that.. today I've been working on LVM/xfs snapshots and making slow but steady progress on radar blanking testing.
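
For anyone curious what the LVM/xfs snapshot work involves, the basic recipe looks like the sketch below (volume group, size and mount point are placeholders; note that xfs needs the nouuid option when mounting a snapshot alongside its origin):

    import subprocess

    # Create a copy-on-write snapshot of the data volume...
    subprocess.run(["lvcreate", "--snapshot", "--name", "data_snap",
                    "--size", "20G", "/dev/vg0/data"], check=True)

    # ...and mount it read-only; nouuid avoids the duplicate-UUID complaint
    # xfs raises when the original filesystem is still mounted.
    subprocess.run(["mount", "-o", "ro,nouuid", "/dev/vg0/data_snap",
                    "/mnt/data_snap"], check=True)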

- Matt

Posted on 2008-10-10 11:01:29 (gengli00)
To the poster above:
Couldn't you at least give a one-sentence translation? Simply copying the posts over like this isn't very meaningful!

Thread starter | Posted on 2008-10-10 18:44:47

Reply to gengli00's post #140

The download server choked on NFS again, which killed its httpd server. Matt gave the machine a swift kick (?) and it's back up serving workunits again. The choke left a backlog of clients trying to connect, so various other connections will be dropped while that queue clears out..

He has also been working on LVM/xfs snapshots and making slow but steady progress on radar blanking testing..

===
To be honest I can't quite follow it all myself..

Thread starter | Posted on 2008-10-14 18:27:01

13 Oct 2008 21:30:21 UTC

Busy day today. Jeff came in and found the server closet air conditioner went dead around 5am. So the entire closet was running pretty hot. Turns out there was another coolant leak (a problem we seem to deal with a lot). At any rate, this was fixed pretty quickly and everything cooled down to 2 degrees colder than before this weekend.

Problems over the weekend. The mysql replica lost its connection - a known, common problem (hopefully will be fixed once the replica is on the same switch as the master db). I discovered that and gave it a kick. Hours later the upload server needed a kick as well. Eric discovered that in the morning and got it working again.

We're also fairly pegged at our network limit again, I think thanks to the workunit turnaround time being pretty low (i.e. fast). Plus I have to send extra raw data to our archive over the same link. Oh well. Expect data transfer headaches for the next while.

I also am planning for our last OS upgrade tomorrow on jocelyn, the master mysql database server. This means, like when we upgraded bruno, an extra long outage tomorrow.

- Matt


The air conditioning and the mysql servers ran into some problems, which have been dealt with. Tomorrow the OS on the master mysql database server gets upgraded; as with the bruno upgrade, expect an extra-long maintenance outage.

Thread starter | Posted on 2008-10-15 18:42:04
14 Oct 2008 23:21:23 UTC
Well here we are. I just had a long day mostly occupied with upgrading the last of the servers that required a long-overdue OS upgrade. This was our master mysql server. We started the outage early so we could compress/back up the database like we usually do, then allowing enough time in the afternoon for me to install the new OS and configure everything. It seems every time we install a new OS on a server, a completely random set of unexpected hurdles eats up a couple hours. Today was no different.

Hurdle 1: This system has two hardware RAID devices, which the OS saw as /dev/sda and /dev/sdb - the former being the root drive, the latter being the data drive. The installer recognized both devices but swapped names - the root drive was /dev/sdb and the data drive was /dev/sda. Fair enough, but I had to be extra careful not to blow the data drive away. This would have been okay, except upon reboot the names were swapped yet again, and grub's device map was pointing to the wrong drive (it doesn't use partition UUIDs). This led to some confusion and having to edit grub config files in rescue mode, etc.
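
One way to sidestep this sda/sdb shuffle (a sketch only, with example device paths and filesystem type rather than jocelyn's actual layout) is to refer to filesystems by UUID wherever the tooling allows it:

    import subprocess

    # Print fstab-style lines keyed on filesystem UUID, which stays stable no
    # matter how the installer or kernel orders the /dev/sdX names.
    for dev, mountpoint in (("/dev/sda1", "/"), ("/dev/sdb1", "/data")):
        uuid = subprocess.check_output(
            ["blkid", "-o", "value", "-s", "UUID", dev], text=True).strip()
        print("UUID=%s  %s  xfs  defaults  0 2" % (uuid, mountpoint))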

Hurdle 2: Actually this happens every time I install an OS, but each time it is slightly different. That is, despite entering the proper network info during the install process, things just don't work right out of the box. This time it took 45 minutes of hair pulling before I gave up, swapped the ethernet jack from eth1 to eth0 (it was working just fine in eth1 before the upgrade) and then, inexplicably, I could see the world using the exact same "broken" configuration on eth0 that I used on eth1. Very annoying.

Hurdle 3: I was able to get mysql to start up and see the data, but its master/replica configuration was messed up. I fixed it, but then the replica itself barfed for other reasons. Problem is it was still lodged on trying to replicate "alter table" commands which we do each week to compress the data. So every time I try to reset values an errant "alter table" seems to run, thus locking the database for 60-70 minutes. Makes debugging/progress very slow. In fact, the replica is still off - I just started the project running entirely on the master. I might get the replica working today. Maybe not.
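
For context, the standard way to nudge a replica past a single statement it should not replay looks roughly like this (a generic MySQL 5.x-era recipe with invented connection details, not necessarily what was done on the SETI replica):

    import mysql.connector  # placeholders throughout; run against the replica

    cnx = mysql.connector.connect(host="replica-host", user="repl_admin",
                                  password="secret")
    cur = cnx.cursor()

    cur.execute("STOP SLAVE")
    # Skip exactly one event group from the relay log - e.g. an errant
    # "alter table" - then resume replication and check the result.
    cur.execute("SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1")
    cur.execute("START SLAVE")
    cur.execute("SHOW SLAVE STATUS")
    print(cur.fetchall())
    cnx.close()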

- Matt

Today the master mysql server got its OS upgraded and reinstalled.

Thread starter | Posted on 2008-10-16 13:35:52

15 Oct 2008 21:51:36 UTC

This morning the other building in the Space Lab "complex" started having network issues on one of their subnets. For various reasons I shan't go into here, we have some servers on that subnet. Since some of these "foreign" servers were recently mounted, this reverberated into all kinds of NFS malaise on most of our local servers, some of which needed rebooting to break various network logjams (and then in one case fsck'ing after rebooting...). It's been that kind of day.

The good news is the mysql master/replica seemed to have survived the OS upgrade yesterday, though not without some confusion over unexpected behavior.

- Matt

Some servers on one of the Space Lab subnets ran into network problems, and several servers needed rebooting.
The good news is the mysql servers seem to have survived yesterday's OS upgrade, though not without some unexpected issues..

Thread starter | Posted on 2008-10-17 07:28:28

16 Oct 2008 16:46:38 UTC

Early note as I'm leaving after lunch today. Looks like the translation code on the web broke sometime during yesterday evening. How embarrassing. Code was updated on this site (not by me!) which messed things up. The problem with the translation stuff is that it takes a while to "percolate" - you update the proper .inc files, you look at the web site, it looks fine, so you move onto something else - and don't notice when it breaks 10-15 minutes later. Normally I check in regularly on "off hours" to catch such problems but I was busy last night. Anyway, I don't want to apologize for this, especially as it wasn't my fault, and in fact I fixed it when I got in by falling back to older code.

I believe everything else is more or less recovering from various mounting/network/reboot issues yesterday. Hope y'all are getting your workunits for the weekend!

- Matt

The website's language translation code ran into problems.

Thread starter | Posted on 2008-10-21 20:58:20

20 Oct 2008 23:09:08 UTC

Hello. So the weekend was a bit "noisy" on the network backend. This was mostly due to these network bursts we've been getting. It's still confusing to me why these bursts are happening - every few hours we get a bunch of Astropulse workunit downloads in quick succession that max out our bandwidth and wreak general havoc. And over time our workunit storage server filled up again, so queues are backing up and the splitters can't create work fast enough. Also the load on our upload server is unbearably high. I'm hoping during the usual weekly outage tomorrow we can give certain servers a rest to help clear the pipes. Until then, practice patience.

We also have continuing air conditioning problems in our server closet - apparently the temporary fix from last week is unfixing itself. It's not a disaster, but the temperatures are rising about a degree per day. I hope the facilities people will be checking it out tomorrow.

- Matt

Every few hours the servers get hit by a burst of Astropulse workunit downloads in quick succession that maxes out the network bandwidth; meanwhile the workunit storage server has filled up again, the splitters can't create work fast enough, and the upload server is under a very high load. Hopefully tomorrow's routine maintenance will let certain servers rest and help clear the pipes..

The air conditioning problem persists, with the temperature rising about a degree per day; hopefully the facilities people can check it out tomorrow..

Thread starter | Posted on 2008-10-22 09:03:28

21 Oct 2008 22:25:00 UTC

Today had the weekly outage for the mysql backup/compression/etc. Bob did some index manipulation on the beta project while we were down - to see if we can perform as well with fewer indexes (now that mysql merges indexes on its own where possible). During the outage one of bambi's 24 drives failed, or at least seemed to. A spare has been pulled in and is rebuilding the array now.

The forums were pretty slow yesterday - actually everything was. Queues were filling, storage was maxed out, and servers and databases were slowed by all the above, causing all kinds of headaches. However overnight the dams finally broke and everything more or less cleared up on its own. I like when that happens.

About our bandwidth.. We do have *two* 100Mbit connections to the world. First is Hurricane Electric (HE), which is what SETI pays for, and the other is the link supplied by campus which is shared by the entire lab. The HE traffic is strictly result uploads and workunit downloads, with occasional archival transfers to offsite storage. Everything else - most of the archival transfers, the public web sites, etc. - goes over the very underutilized campus link. So if there are web site connectivity problems, it has nothing to do with a maxed out link - it's probably due to the database server being overloaded, or something else.

- Matt


During today's routine maintenance, Bob did some index testing on the beta project while the servers were down.
After the outage, the servers and databases more or less cleared their queues on their own.
There are two 100Mbit links to the outside world: one (Hurricane Electric, HE) paid for by SETI, and one supplied by campus and shared by the whole lab. HE carries the result uploads and workunit downloads, while most of the archival transfers, the public web sites and so on go over the very underutilized campus link. So if the web site has connectivity problems, it's probably the database server being overloaded, or something else.

The air conditioning isn't mentioned this time.. probably no facilities people came to deal with it.

Thread starter | Posted on 2008-10-23 18:39:15

22 Oct 2008 21:00:31 UTC

Really busy day for me, but not much of it on the public-facing side of things. Jeff and I are revamping our current backend data pipeline in light of continual hardware and I/O headaches. I'm pulling a bunch of stuff out of the database for Josh so he can play some more of the "find the known pulsar and see if it looks like RFI" game in Astropulse. I enacted the "no redundancy" policy on beta - we're curious to see how well it works in practice, mostly for the sake of general BOINC testing. I had some updating/programming to do regarding our donations database - stuff that campus requested.

Still no signs of the air conditioner being fixed (though it is running cooler in the closet than earlier in the week). And we haven't yet replaced the bad drive on bambi (though we have a spare sitting on the shelf).

- Matt

Dealt with some hardware headaches, and updated the donations database at campus's request.
The air conditioning problem still hasn't been fixed.

Thread starter | Posted on 2008-10-24 08:30:56

23 Oct 2008 20:55:56 UTC

There's been some problems with the web server lately which are hard to track down. However, this morning I found we were being crawled fairly severely by a googlebot. I thought I took care of that ages ago with a proper robots.txt file! I then realized the bot was scanning all the beta result pages, and I only had "disallow" lines for the main project - so I had to add extra lines for beta-specific web pages. This may help.
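
A sketch of the kind of addition involved, checked with Python's standard robots.txt parser (the paths are made up for illustration; the real result pages live elsewhere):

    from urllib.robotparser import RobotFileParser

    # Hypothetical robots.txt with the extra beta lines added.
    rules = [
        "User-agent: *",
        "Disallow: /sah/result.php",
        "Disallow: /beta/result.php",
    ]

    rp = RobotFileParser()
    rp.parse(rules)
    # Should now print False: the crawler is told to keep out of beta results too.
    print(rp.can_fetch("Googlebot", "/beta/result.php?resultid=12345"))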

So we've been getting these frequent, scary, but ultimately harmless kernel warnings on bruno, our upload server. Research by Jeff showed a quick kernel upgrade would fix that. We brought the kernel in yesterday and rebooted this morning to pick it up. The new kernel was fine but exercised a set of other mysterious problems, mostly centered on our upload storage partition (which is software RAIDed). Lots of confusing/misleading fsck freakouts, mounting failures, disk label conflicts, etc., but eventually we were able to convince the system everything was okay - though not without a series of long, boring reboots.

Speaking of RAID, I still haven't put in the new spare on bambi. It's late enough in the week to not mess around with any hardware, especially after dealing with the above. Plus the particular RAID array in question is now 1 drive away from degradation (no big deal), and 2 drives away from failure. Plus it's a replica of the science database - and the primary is in good shape, and is backed up weekly. So no need to panic - we'll get the drive in there early next week.

Speaking of the science database, I'm finding our signal tables (pulse, triplet, spike, gaussian) are sufficiently large that informix is guessing that, for certain "expensive" queries, indexes aren't worth using, and is reverting to sequential scans which take forever. This has to be addressed sooner rather than later.
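
One standard Informix lever for that kind of misestimate is refreshing the optimizer statistics on the big signal tables. The sketch below assumes the stock dbaccess client and an invented database name, and the post doesn't say this is the fix that will actually be used:

    import subprocess

    # Table names come from the post; MEDIUM-level statistics are usually a
    # reasonable compromise between accuracy and runtime on very large tables.
    sql = "".join("UPDATE STATISTICS MEDIUM FOR TABLE %s;\n" % t
                  for t in ("pulse", "triplet", "spike", "gaussian"))

    # dbaccess reads SQL from stdin when given "-" as the script argument.
    subprocess.run(["dbaccess", "sah_science", "-"], input=sql, text=True, check=True)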

- Matt

Heavy crawling by a googlebot may be the main cause of the recent web server problems: it was scanning all of the beta project's result pages, while crawling had only been disallowed for the main project.
A kernel update was also installed on the upload server; the process hit plenty of problems.. but in the end the servers were back in working order.

Thread starter | Posted on 2008-10-28 08:12:39

27 Oct 2008 21:52:50 UTC

Bit of a weird weekend. Towards the end of last week we had some science database issues - apparently informix "runs out of threads" and needs to be restarted every so often. Around this time there were continuing mount problems on various servers. The usual drill. Then I headed to San Diego for a gig (only gone 28 hours) and Jeff went on a backpacking trip.

Things were more or less working in our absence, but - as it happens sometimes - sendmail stopped working on bruno. This wouldn't be a tragedy except for the fact that bruno wasn't able to send us the usual complement of alerts. For example: "the mysql replica isn't running!" So we didn't realize the replica was clogged all weekend. The obvious effect of this is our stats pages have flatlined. It's catching up now, but we'll probably just reload it from scratch during the outage tomorrow.

We also had more air conditioning problems last night. At least the repair guy returned today with replacement parts in tow. So that's being addressed, though not before Jeff got the alarm at midnight last night and Dan trudged up to the lab to open the closet doors and let things cool off. And the httpd process on bruno, once again, crapped out at random - meaning uploads weren't happening for a short while there. Jeff gave that a swift kick, too.

On the bright side, we're discovering ways to tweak NFS which have been vastly improving efficiency/reliability here in the backend. This may help most of the chronic problems like the ones depicted above.
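
The post doesn't say which NFS knobs were turned, but for the curious, client-side tuning usually means mount options along these lines (server name, export and mount point are placeholders):

    import subprocess

    # Typical candidates: TCP transport, hard mounts (block rather than return
    # bogus errors), larger read/write sizes, and lighter attribute-cache churn.
    opts = "tcp,hard,intr,rsize=32768,wsize=32768,noatime"
    subprocess.run(["mount", "-t", "nfs", "-o", opts,
                    "fileserver:/export/workunits", "/mnt/workunits"],
                   check=True)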

- Matt

The science database had some problems and sendmail stopped working on the upload server; hopefully tomorrow's maintenance will sort things out.
The air conditioning problem is still being worked on..
The good news is we've found ways to tweak NFS on the backend that greatly improve efficiency and reliability.
