Posted on 2008-6-14 11:39:59
Completing the thread; the full posts are too long, so I won't translate everything.
Nov 30, 2007
DK, our primary site developer, is recovering from surgery and is out of contact. When the troubles arose DK's status wasn't known to the IT group, resulting in confusion. Rhiju stepped in and provided the fix. We apologize for the outage...
David Kim is recovering from surgery?
Nov 29, 2007
The feeder is failing over and over again, so no new work is being sent out. The feeder application needs to be rebuilt to handle the large number of applications that R@H has accumulated. We know of the problem and will get this done as soon as possible.
The feeder keeps failing, so no new work units can be sent out.
Sep 18, 2007
The RAID is rebuilt and now we're chasing down the next bottleneck....
The RAID has been rebuilt.
Sep 17, 2007
We're experiencing another hardware problem that is affecting the overall project, causing the validator and assimilators to fall behind. The temporary fileserver to which we've switched has dropped one of its RAID disks. We've found a replacement and the RAID is rebuilding now. Once finished, the project should start catching up with itself...
Another hardware failure; the RAID dropped a disk.
Sep 08, 2007
Rosetta@home has experienced a horrendous hardware/firmware failure. We essentially lost the SAN partition upon which the project was running! The newest addition to our SAN hardware was shipped with a firmware revision that contained an insidious bug - one which caused the new SAN disks to vanish after roughly 45 days of service. We - or rather I (KEL) - apologize for the inconvenience, lost time and lost effort that you have endured during our outage. We know full well that your contribution hinges on the understanding that we make maximum use of your valuable resources - that we not waste your time, CPU cycles or good humor. We are planning to express our disappointment to our vendors in clear terms, specifically citing the importance of this project to our research effort. We'll keep you abreast of the outcome.
Aug 30, 2007
There was an unannounced service outage this morning at 8AM PST. The outage was planned - I just forgot to announce it to the R@H community. I (KEL) apologize. We had to upgrade the firmware on our SAN, which required the hardware to be offline for an hour. We don't anticipate any further issues along this line, but we'll work harder to ensure prior notification to our R@H partners - you.
July 31, 2007
We moved the backend of the BOINC system from our older fileserver (a dual 2.8 GHz Xeon with 2 GB RAM, running a 64-bit kernel and serving up a 655 GB RAID5 volume (6 x 146 GB 10K SCSI disks)) to our fiber SAN. The SAN is a clustered fileserver running PolyServe's Matrix Server, serving up sixteen 450 GB SATA disks as a single RAID5 LUN via two Dell 2950s from within our group's fiber SAN. I'll post more information on and pictures of the SAN setup in the near future....
Apr 10, 2007
The UW campus experienced a large power service failure this morning around 9:15 PST that lasted for roughly 45 minutes. 14 buildings in the South Campus area were affected, including the datacenter in which Rosetta@home is hosted. The battery backup for this facility provides only 10 minutes or so of power to the datacenter floor; there is no backup generator at this site. As a result we lost all power to all R@H hardware and had to bring everything back from a cold start. We are reviewing the situation.
Jan 11, 2007
Regarding the recent increase in graphics-related errors, we suspect a thread-synchronization problem. The Rosetta worker thread runs the simulation and updates all of the atom coordinates (which are stored in shared memory), while the graphics thread reads from that same memory to draw the graphics or screensaver. There is currently no locking mechanism to ensure the shared memory is accessed by only one thread at a time, which can cause conflicts or memory corruption and then trigger an error. On one of our local computers, with the screensaver or graphics turned on, errors occurred at a rate of at least one per day on average; without any graphics, it ran flawlessly. The errors we have observed include crashing (0xc0000005), hanging (0x40010004) and getting stuck (watchdog ending). None of the errors were reproducible with the same random number seeds, which we attribute to the randomness of the graphics process. A further piece of evidence is that showing sidechains requires accessing shared memory more often and more intensively; after turning off sidechains and rotation, the graphics error rate drops, though the problem is not solved completely. There does seem to be a correlation between the two. We are currently working on adding a locking mechanism and will post an update when it is ready to be tested. Meanwhile, if you have experienced frequent client errors, please temporarily disable the BOINC graphics/screensaver to reduce the problem. Thanks for everybody's support.
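The race described above - one thread writing atom coordinates while another reads them - is the classic case for a mutex. A minimal sketch of the kind of locking mechanism being added, assuming POSIX threads (the names `shared_coords`, `update_coords` and `snapshot_coords` are illustrative, not the actual Rosetta source):

```c
/* Illustrative sketch, not Rosetta code: guard shared atom
 * coordinates with a mutex so the worker (writer) and graphics
 * (reader) threads never touch them concurrently. */
#include <pthread.h>
#include <string.h>

#define N_ATOMS 4  /* small for illustration */

static double shared_coords[N_ATOMS][3];                 /* shared memory */
static pthread_mutex_t coords_lock = PTHREAD_MUTEX_INITIALIZER;

/* Worker thread: publish a new set of coordinates under the lock. */
void update_coords(double new_coords[N_ATOMS][3])
{
    pthread_mutex_lock(&coords_lock);
    memcpy(shared_coords, new_coords, sizeof shared_coords);
    pthread_mutex_unlock(&coords_lock);
}

/* Graphics thread: copy a consistent snapshot under the lock,
 * then render from the private copy without holding the lock. */
void snapshot_coords(double out[N_ATOMS][3])
{
    pthread_mutex_lock(&coords_lock);
    memcpy(out, shared_coords, sizeof shared_coords);
    pthread_mutex_unlock(&coords_lock);
}
```

Copying a snapshot, rather than drawing directly from shared memory, keeps the lock hold time short so the simulation thread is barely delayed by rendering.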
Oct 11, 2006
A batch of work units, the first protein-protein docking Rosetta@home experiments, were sent out yesterday that, unfortunately, were destined to fail. The actual work units were completely fine, having been tested thoroughly on Ralph, but a rather unfortunate circumstance caused them to fail on clients. The work units had input files whose names ended with "map", and our downloads web server was configured to handle such files as image map files (the default Apache setting), so the clients were not getting the actual content of the files and the jobs were failing with download errors. The web server's configuration has been updated, so the errors should no longer occur; however, the batch was cancelled, so if you have any work units queued on your client whose names start with "DOC_" AND contain the text "pert_bench_1263_", please abort them. For example, "DOC_4HTC_U_pert_bench_1263_1000_0". We are very sorry this happened. It's unfortunate, since there have been some recent exciting developments, which include an updated application and screensaver, and docking.
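The misconfiguration above comes from Apache's imagemap handler being bound to the "map" extension, so files ending in "map" are interpreted server-side instead of being served as plain bytes. A hedged sketch of the kind of fix (the directory path is illustrative, and the exact lines in the project's httpd.conf are not shown in the post):

```apache
# Stock httpd.conf files of that era shipped a handler line like:
#   AddHandler imap-file map
# which makes Apache treat any file with a "map" extension as a
# server-side imagemap. For a BOINC download directory, unbind the
# handler and force plain static serving:
<Directory "/path/to/boinc/download">
    RemoveHandler map
    SetHandler default-handler
</Directory>
```

Forcing `default-handler` ensures every file under the download tree is sent byte-for-byte, regardless of extension, which is what BOINC clients expect.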