Thread starter: vmzy

[BOINC] [Astronomy] SETI@home

Posted on 2010-5-21 00:11:59

May 12, 2010

David Anderson was interviewed about BOINC and SETI@home at www.seti.cl (a Spanish-language site).


Posted on 2010-5-21 08:38:59

May 21, 2010

Offline tonight from 18:00 to 22:00 PDT.


Posted on 2010-5-29 10:27:23

May 30, 2010

No work is being sent out; it should resume in about 8 hours.


Posted on 2010-6-3 19:47:31

June 3, 2010

Monthly report

NTPCkr has incorporated new radio-frequency interference (RFI) rejection code, which Jeff is testing.
Matt also wrote a science newsletter item about radar blanking.
Server load recently peaked: over the weekend a mass email was sent to the community, along with several video files totaling over 100 MB, which put some strain on the website and the lab's entire internet connection.
The rest is omitted; see http://setiathome.berkeley.edu/forum_thread.php?id=60174


Posted on 2010-6-8 20:01:18

June 8, 2010

A problem appeared while running the new scheduler with the cuda_fermi application; no work will be sent until it is resolved.


Posted on 2010-6-10 09:26:13

June 10, 2010

Reasons for the recent "no work" periods:

Let me address the "no work" issues as of late. We've been running low on work to send out (or had the schedulers turned off) for several reasons:

1. Each raw data file has to go through a local software-based radar analysis - a suite of programs that takes over 3 hours to run per file. This should keep up with the incoming data flow, but some nagging NFS/mounting bugs cause this suite to lock up several times a week. Each time it does, the whole system for getting new data on line is clogged until a human can figure out where it was in the process, clean it up, and start the broken file over again (resulting in many hours of lost processing time). For example, this morning we found it had all jammed up last night, cleaned it up around 9am, and finally around 12:30pm new workunits were available again. We're working on adding some band-aid solutions to this particular problem.
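A band-aid of the kind described above might take the form of a stall watchdog. The sketch below is purely illustrative, not the lab's actual tooling: it assumes the analysis suite periodically touches a progress file, and that a stalled run can be safely killed and the current raw data file restarted from the beginning (the paths and restart command are hypothetical stand-ins).

```python
"""Hypothetical watchdog sketch for a multi-hour per-file analysis pipeline.

Assumptions (not from the original post): the suite updates a progress
file as it works, and restarting analysis of the current file is safe.
"""
import os
import subprocess
import time

STALL_SECONDS = 30 * 60                  # assume 30 min of silence == locked up
PROGRESS_FILE = "radar_progress.log"     # hypothetical progress marker
ANALYZE_CMD = ["echo", "rerun-analysis"] # stand-in for the real suite

def seconds_since_progress() -> float:
    """Time since the pipeline last touched its progress file."""
    try:
        return time.time() - os.path.getmtime(PROGRESS_FILE)
    except FileNotFoundError:
        return float("inf")

def restart_current_file() -> int:
    """Kill and restart analysis of the current file (stand-in command)."""
    return subprocess.run(ANALYZE_CMD, check=False).returncode

def watch_once() -> bool:
    """One watchdog pass: restart the suite if it looks stalled."""
    if seconds_since_progress() > STALL_SECONDS:
        restart_current_file()
        return True
    return False
```

A cron job running `watch_once` every few minutes would turn a many-hour human-in-the-loop recovery into a bounded automatic restart, at the cost of redoing the broken file.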

2. Server crashes: mork and ptolemy are prone to crashing for no apparent reason. Either of them going down causes the project to halt until we recover. Sometimes it takes days to fully get back to a regular work-flow pace again. We're trying to shuffle services around to get ptolemy out of the picture. Why ptolemy instead of mork? Mork is a much bigger system and therefore much harder to replace - plus when it goes down the download servers are at least still able to work for a while.

3. Some data files error out pretty quickly due to noise or garbage data.

4. The CUDA clients sure burn through work fast.

5. Some CUDA clients were returning garbage. To combat this, a fix to the scheduler was put on line this Monday, but we were unable to start it without errors. It took Eric, Jeff, and me all day, and most of the next morning, to finally find the obscure problem - which was actually a misleading redirect in the apache config (put in many months ago). By the time we fixed it, we were already into the weekly outage.

So lots of battles on this front. In any case we are collecting data at this point (on 2TB drives, which means we'll lose less data waiting for the Arecibo operators to swap out the older 500/750GB drives), and still have a backlog of stuff to process in our archives. The lab is also getting a Gbit link to the world in July so the slow transfers to/from these archives will no longer be a bottleneck. Note this link is for the whole lab and our SETI specific data link will remain at 100MBit. Still, it's an improvement.

- Matt


Posted on 2010-6-17 19:12:01

June 17, 2010

Another day, another perfect storm.

We had our usual weekly outage yesterday (for database backups/maintenance/etc.), during which we take care of other hardware/project issues. Such as yesterday - we finally got our remote-controlled power strip configured and hoped to put one of our crashy servers (ptolemy) on it.

This meant bringing ptolemy down, which pretty much kills *everything*, including all the web sites/BOINC servers. We did so, only to find during the course of installation that the config on the power strip got reset somehow, so we had to fall back. All told, this meant an hour of delay/downtime, and we were once again at square one.

After that, Dave and Jeff were coordinating getting some new scheduler fixes online, which required some database updates. So we didn't start the backup until after noon, which in turn meant the projects wouldn't be ready to come back on line until well after 5pm. Jeff manned that from home, but it turns out a poorly behaved yum upgrade of httpd on anakin had in the meantime secretly broken the httpd config, which was impossible to diagnose/fix at the time. So we were down for the night until we could figure it out in the morning.

I guess one silver lining is that being down all night meant Jeff and I had an opportunity to retry installing the power strip on ptolemy with minimal interruption (as we were already in the middle of a major interruption!). This time: success - as far as we can tell after one test, if ptolemy now crashes the power strip will detect this within 30 minutes and power cycle it. Hopefully this will vastly reduce our downtime when this happens again (usually on the weekends).

As I type this Jeff is still getting most of the BOINC back-end pieces working one by one, but at least we're doling out work for the moment as fast as we can.

I know most of you who read these updates know this already, but it bears repeating: nobody working directly on SETI@home (all 5 of us) works full time, and we all have enough other things going on that make it impossible for us to be "on call" in case of outages/emergencies. In my case, I currently have four regular separate sources of income with jobs/gigs in four completely different industries (covering all the bases in case one or more dry up). As for last night, when the httpd problems arose, I was working elsewhere, and when I checked in again around 10:30pm everyone else was asleep and I didn't want to start up the scheduler processes without others' input, as they were still effectively on the operating table. We've pretty much given up any hope for 24/7 uptime, but BOINC takes care of that as long as you sign up for other projects.

On a more positive note: the "spike merge" is coming along, albeit slowly. May take one more whole week to complete. And we're still doing R&D regarding server shuffling to improve our science database throughput (and therefore speed up our candidate searching).

- Matt


Posted on 2010-6-24 17:59:46

June 24, 2010

Since last I wrote a lot has happened. Looking at the traffic graphs it's like feast or famine - either we are unable to create/send out workunits, or we're sending out as many as we can fit through the pipe. Mostly it's been the usual gremlins.

However regarding the past 24 hours it was a new problem: the result space on the upload server filled up unexpectedly, which would have been fine except this (perhaps) inspired some RAID freakout on the system. We couldn't really sort it out until this morning. From the looks of it we had something like a six drive simultaneous failure. Jeff and I beat on it for a while - we eventually assumed this was just a hardware blip, and the data was more or less intact on the drives, but the RAID metadata got a little screwed up. Long story short we were able to carefully bring down the RAID and recreate the meta devices from scratch with the data intact, and all was well. Phew. For the record we do have a virtually-up-to-date result storage backup at all times in case of catastrophic failure on this system.

In any case, the main culprit was our disks filling up, so as I write this we're keeping the project down until major queues drain and the constituent workunit/result files can be deleted.

On a happier (perhaps) note, yesterday the core group of us were in the same place at the same time (which is rare), and we had an ad hoc meeting about our current project status/plans, especially in light of many recent server problems, increasingly random schedules, and embarrassingly low funding. We're all kind of tired and beaten up and wanting some results already - so I like to think this paved the way for several large and ultimately positive changes in the future.

Also Jeff has been working on this nagging mysterious problem where some of our raw data files are only getting partially processed (which vastly increases our "burn rate" and leads to unexpected workunit shortages). He found some major clues today, and we brainstormed why this is happening and what the exact effect is. At least there's a smoking gun on that front.

- Matt


Posted on 2010-7-1 08:26:21

July 1, 2010

New server maintenance schedule

Starting next week, our servers will be shut down for several days each week, during the Tuesday-through-Thursday period.
This will let us devote more time to science development.


Posted on 2010-7-1 08:30:05

July 1, 2010

Arecibo telescope repair status

The telescope suffered a structural failure on February 3; partial repairs were made in March, and since then we have been operating with reduced motion while monitoring the situation.
The next phase of repairs is expected to begin on July 12 and, including full operation and testing, may take about six weeks.



Posted on 2010-7-12 00:07:00

July 11, 2010

Tasks are still flowing to clients; the current per-host limits are:

6 tasks per CPU
48 tasks per GPU
150 tasks in total

All limits will be lifted next Monday.
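As a rough illustration of how caps like these work (this is a sketch, not the actual BOINC scheduler code; the function and field names are hypothetical), a scheduler would refuse to send a task once any of the three limits is hit:

```python
# Hypothetical sketch of the per-host task caps described above.
# The limit values come from the post; everything else is illustrative.
from dataclasses import dataclass

CPU_LIMIT_PER_CPU = 6    # tasks per CPU
GPU_LIMIT_PER_GPU = 48   # tasks per GPU
TOTAL_LIMIT = 150        # tasks per host overall

@dataclass
class Host:
    cpus: int
    gpus: int
    tasks_in_progress: int   # all tasks currently assigned to this host
    cpu_tasks: int
    gpu_tasks: int

def may_send_task(host: Host, is_gpu_task: bool) -> bool:
    """Return True if this host is still under all applicable caps."""
    if host.tasks_in_progress >= TOTAL_LIMIT:
        return False
    if is_gpu_task:
        return host.gpu_tasks < GPU_LIMIT_PER_GPU * host.gpus
    return host.cpu_tasks < CPU_LIMIT_PER_CPU * host.cpus
```

Note how the total cap of 150 can bind before the per-device caps do on large hosts: a machine with 4 CPUs and 3 GPUs could otherwise hold 24 + 144 tasks.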


Posted on 2010-7-15 16:20:45

July 15, 2010

We are currently running a beta test of VLAR tasks on the GPU.
We hope to move this to the main project next week.


Posted on 2010-7-16 22:00:38

July 16, 2010

A problem has appeared somewhere between our router at the Space Sciences Laboratory (SSL) and the PAIX router; the link from PAIX to Hurricane Electric appears to be fine.
Several staff members are investigating.
The main project servers will be offline until this problem is resolved.


Posted on 2010-7-16 22:54:42

July 16, 2010

Project offline due to network problems

We have lost connectivity somewhere between two routers: one at the Space Sciences Laboratory and the other at the Peering And Internet eXchange (PAIX) in Palo Alto.
We cannot distribute new work until this is cleared up.

