Original poster: vmzy

[Project News] United Devices [Ended]

Original poster | Posted on 2006-2-14 16:27:57
Feb 13, 2006
We are currently experiencing an issue with agents not being able to connect to the UD servers. This is due to the huge number of results we have received for Ligandfit (as well as Rosetta). Since we have been looking into an issue with aborted WUs on the Ligandfit job, I have not been running the result rollup script. This is the process that consolidates the results and then allows us to delete all of the temporary files.

Since we have valid results for all of the current Cancer WUs, I will go ahead and run the rollup script. As soon as that finishes processing, we will be able to delete the temporary result files and dispatching will resume. I will upload a new chunk of cancer data later today.



[ Last edited by vmzy on 2006-2-28 at 16:04 ]



Original poster | Posted on 2006-2-21 14:51:51
Feb 19, 2006
I have uploaded another batch of cancer data. The last job will remain active for a few days to give members time to receive credit for any outstanding workunits. Since these new workunits are processing so quickly, I do not think we need to wait a whole week.


[ Last edited by vmzy on 2006-2-28 at 16:09 ]



Original poster | Posted on 2006-2-21 14:52:40
Feb 20, 2006
Cancer Data - We continue to have an issue with occasional aborts that is still under investigation. As I mentioned in the Member to Member Support thread, this issue has been identified as a problem with the Ligandfit application itself crashing. This has nothing to do with the UD servers dropping results or not giving due credit. The Ligandfit application has not changed in a very long time, since before the last batch of cancer data, which received no complaints. This points to something in the new data that is causing Ligandfit to be unstable.

Note that we are receiving successful results from each and every workunit so there are not "bad workunits" that will always fail. Some members have stated that retrying a workunit that just aborted will result in a successful completion.

Although the lost points are a bit frustrating, know that we are getting successful results that can be returned to Oxford to assist in the search for a cure for cancer, which is the ultimate objective of this project.

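Translation: Cancer data - We are still investigating an abnormal-exit problem. As I said in the Member to Member Support thread, the problem is a crash in the Ligandfit application itself. It has nothing to do with the UD servers losing results or withholding credit. The Ligandfit application has not been updated in a long time, yet the older cancer data ran without problems. This suggests that the new data is making Ligandfit unstable.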



[ Last edited by vmzy on 2006-2-28 at 17:00 ]



Original poster | Posted on 2006-2-28 15:09:26
Feb 27, 2006
Outage: There was a temporary outage over the weekend that caused connectivity problems with Grid.org servers. This was caused by a router failure in our datacenter and was unanticipated. The problem has since been rectified and everything should be functioning normally.

Cancer data: I have uploaded a new chunk of data and have sent the last result set to Oxford for analysis. Some members have noticed that this batch produces an awful lot of hits. I am asking whether this is to be expected and whether it could in any way be related to the occasional workunit aborts that some members are seeing.



[ Last edited by vmzy on 2006-3-1 at 13:41 ]



Original poster | Posted on 2006-3-7 22:08:57
Mar 06, 2006
Things are relatively quiet although we still have a few outstanding issues.

Cancer data - I have not heard back from our Oxford contact regarding the last batch of results I sent. This was an attempt to have them validate the many hits we are seeing on some of these workunits. It seems suspicious to me, but I am not well versed in computational chemistry and cannot speak to this. It will be up to Oxford, who provided the data, to determine if the results are valid for their research. Until then we will just keep crunching away at the new data.

There is still the issue of Ligandfit crashing. Although I do not know why, it appears that something about the new data causes Ligandfit to get into a bad state. I have also mentioned this to our Oxford contact and hope that he may be able to shed some light on this issue. Perhaps the many hits or the complexity of the new data is the contributing factor.


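Translation: Cancer data - Our Oxford contact has still not replied about the last batch of results I sent. The main purpose was to have them confirm whether the high hit rate on these workunits is normal. I suspect something may be wrong, but I am not familiar with computational chemistry and am in no position to say. Only Oxford, which provided the data, can decide whether the results are what they want. Until then we will keep crunching the new data.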


[ Last edited by vmzy on 2006-3-8 at 14:54 ]



Original poster | Posted on 2006-3-9 16:09:09
Mar 08, 2006
I have received a response from our Oxford contact regarding some of the issues we have seen with the new cancer data.

High number of hits - It was stated that a determination could not be made at this stage as to whether the data was generating too many hits or not. After we process all of the data and send it back, they will apply cutoff criteria in order to isolate the best hits. It may be that once the data is analyzed, they will suggest a change to the Ligandfit input parameters and a rerun of the data. Remember, research implies trial and error.

Aborts - While nothing specific has been identified yet, they stated that it was possible for the data to cause Ligandfit to error out. It was mentioned that some of the data are not discrete organic molecules, which may be different from the data that was processed in the past. It may be that under certain circumstances this data causes Ligandfit to abort, although we have seen that a replay of the data usually produces a good result, so we are not talking about a bad WU per se. They also stated that in the future they will filter out anything that they think may cause an issue. The bottom line is that this issue remains elusive and we do not have the answer yet.


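Translation: High hit rate - It cannot yet be determined whether the hit rate of the data is too high. After we have processed all of the data and returned the results, they will set a cutoff to identify the best hits. Once the data has been analyzed, they may suggest changing the Ligandfit input parameters and rerunning the data. Remember, research means repeated trial and error.

Translation: Aborts - The specific cause cannot be confirmed yet; they said the data may well be causing Ligandfit to fail. They said that some of the data differs from the discrete organic molecules processed in the past. Under certain circumstances this data may cause Ligandfit to exit; we know that rerunning the data usually produces a good result, so we are not talking about bad data as such. They also said that in the future they will filter out anything they think may cause errors. The bottom line is that the root cause has not been found and we cannot fix it yet.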

[ Last edited by vmzy on 2006-3-9 at 16:51 ]



Original poster | Posted on 2006-3-10 12:23:36
Mar 10, 2006

We have again received too many results for the servers to hold. I am in the process of clearing some space now. Please be patient while I get the situation remedied. The root problem is that these workunits are processing much more quickly than the previous batch. This produces many more results in a much shorter period of time, so disk space fills up much more quickly.

It is obvious that our process needs to be changed to limit the time a job runs. I thought two weeks would be sufficient, but I think now I need to turn these around every week.
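The reasoning above (faster workunits mean more results per day, which means the disk fills sooner) can be put into rough numbers. The following is only a back-of-the-envelope sketch: the agent count, result size, and free disk space are made-up figures for illustration, not values from Grid.org, and the function names are hypothetical.

```python
def results_per_day(agents, hours_per_workunit):
    """Results returned per day if every agent crunches around the clock.

    Assumes one result per completed workunit (an illustrative simplification).
    """
    if agents <= 0 or hours_per_workunit <= 0:
        raise ValueError("agents and hours_per_workunit must be positive")
    return agents * 24 / hours_per_workunit


def days_until_full(free_bytes, daily_results, bytes_per_result):
    """Days of incoming results that the remaining free space can absorb."""
    if daily_results <= 0 or bytes_per_result <= 0:
        raise ValueError("daily_results and bytes_per_result must be positive")
    return free_bytes / (daily_results * bytes_per_result)


# Hypothetical figures, chosen only to show the effect of shorter workunits.
FREE_BYTES = 200_000_000_000   # 200 GB free (assumed)
BYTES_PER_RESULT = 100_000     # 100 KB per result (assumed)
AGENTS = 100_000               # active agents (assumed)

slow = days_until_full(FREE_BYTES, results_per_day(AGENTS, 24), BYTES_PER_RESULT)
fast = days_until_full(FREE_BYTES, results_per_day(AGENTS, 3), BYTES_PER_RESULT)
print(slow)  # 20.0 days with 24-hour workunits
print(fast)  # 2.5 days with 3-hour workunits
```

With these assumed numbers, a two-week job cycle fits comfortably when workunits take a day each, but not when they finish in a few hours, which is the situation described in this post.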

I am currently uploading a new job. It has already started dispatching and I have verified a new workunit. Since we got a little backed up, it may take a while for everyone to connect and get a new workunit. I will post some info tomorrow on our new (shorter) process for turning these jobs around.





[ Last edited by vmzy on 2006-3-10 at 13:06 ]



Original poster | Posted on 2006-3-24 21:23:05
Mar 13, 2006
Cancer job - As you probably noticed, we had a small outage last week due to the number of results being returned for the new cancer jobs. These WUs finish within a few hours each, probably because of the low complexity of the new protein. This translates to lots of disk space being used up very quickly. Because of this, I will need to roll up the results of these jobs much faster than previous ones in order to stay below maximum disk space. Here is what I propose:

On Wednesdays I will submit a new cancer job. Once it has been activated, I will run the roll up script which should only affect the previous job. The roll up script will mark the previous job inactive as part of its processing. Once the roll up script has finished, I will reset the job to be active. This will allow outstanding results to be credited.

On Fridays, I will delete the older job that has already had the results rolled up so any outstanding workunits will not be credited. You would have had 2 days to return your results which should be plenty considering the WUs complete within hours.

There may be a small window while the results are being rolled up during which newly returned results will be rejected. This would be due to the job being temporarily marked as inactive. Unfortunately, I see no way around this. The impact should be minimal. If you are that concerned about a lost workunit, shut down your agent on Wednesdays until I post that the job has been reactivated.
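The proposal above amounts to a small state machine for each job: active, briefly inactive during the rollup, active again, then deleted. The sketch below models only that sequence as described in this post. The class and function names are hypothetical and say nothing about how the UD server software is actually implemented.

```python
from dataclasses import dataclass
from enum import Enum


class JobState(Enum):
    ACTIVE = "active"
    INACTIVE = "inactive"  # set temporarily while the rollup script runs
    DELETED = "deleted"


@dataclass
class Job:
    name: str
    state: JobState = JobState.ACTIVE
    rolled_up: bool = False


def start_rollup(job):
    """The rollup marks the job inactive as part of its processing."""
    job.state = JobState.INACTIVE


def finish_rollup(job):
    """After the rollup finishes, the job is reset to active so that
    outstanding results can still be credited."""
    job.rolled_up = True
    job.state = JobState.ACTIVE


def delete_job(job):
    """Only a job whose results were already rolled up gets deleted."""
    if not job.rolled_up:
        raise ValueError("roll up results before deleting the job")
    job.state = JobState.DELETED


def result_is_credited(job):
    """A returned result is accepted only while the job is active."""
    return job.state is JobState.ACTIVE


if __name__ == "__main__":
    previous = Job("previous cancer job")
    start_rollup(previous)                # Wednesday: rollup begins
    print(result_is_credited(previous))   # False: the small rejection window
    finish_rollup(previous)               # Wednesday: job reset to active
    print(result_is_credited(previous))   # True: outstanding results credited
    delete_job(previous)                  # deletion day
    print(result_is_credited(previous))   # False: no more credit
```

The first `False` corresponds to the window mentioned above, when the job is temporarily inactive and newly returned results are rejected.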

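Translation: Cancer job - As you may have noticed, we had a short outage last week because too many results were returned for the new cancer jobs. Perhaps because the new protein is not very complex, these workunits run very quickly, which uses up disk space quickly. I will therefore need to roll up the results of these jobs more often in order to keep as much disk space free as possible. The plan is as described above.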




[ Last edited by vmzy on 2006-4-17 at 11:20 ]



Original poster | Posted on 2006-3-24 21:23:29
Mar 15, 2006
As per our new process described on Monday, I have uploaded a new cancer job and rolled up results from the previous one. On Friday I will delete the previous job, after which no more credit will be given.

Hopefully this process will keep results at a minimum to conserve disk space, but still allow everyone to participate and receive credit for their work.



[ Last edited by vmzy on 2006-4-17 at 11:25 ]



Original poster | Posted on 2006-3-24 21:23:54
Mar 20, 2006
Cancer data - Things are running relatively smoothly right now. I hope the new cancer job process is working for everyone. As soon as we crunch through all of the new data against the current new protein, Oxford would like us to crunch the new data against the last older protein as well. It will be interesting to see if the number of aborts we are seeing goes down or remains the same.

Please note that for UDMon users the ud_mon.ini file must be updated with the new protein in the Proteins section. If this is not done, you will see many aborts for WU 8581771 in the log, which is a protein not a workunit. Please remind newer members of this if you see them post about the aborts. We do have a random abort problem, but those have nothing to do with WU 8581771. The line should look like this:


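Translation: Cancer data - Everything is running normally. I hope the new cancer job process is working for everyone. Once we have finished all of the new data against the current new protein, Oxford would like the new data run against the previous protein as well. We will see whether the aborts still occur.

Translation: UDMon users, please update the ud_mon.ini file by adding the new protein to the Proteins section. If you do not, you will see many aborts for WU 8581771 in the log. If you see newer members post about this, please remind them. We do have a random abort problem, but it has nothing to do with WU 8581771. The updated line is as follows: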

[ Last edited by vmzy on 2006-4-17 at 10:51 ]



Original poster | Posted on 2006-3-24 21:24:12
Mar 22, 2006
I have rolled up the results from the current Cancer job and submitted a new one. Credit for the previous job will be given until Friday morning. If you are running UDMon and are concerned about getting credit for all workunits, this would be a good time to dump/refresh your cache with the new ones.

Translation: I have rolled up the results of the current cancer job and issued a new one. Credit for the previous job will be given until Friday morning. If you are running UDMon and care about receiving credit for your workunits, now is a good time to refresh your workunit cache.

[ Last edited by vmzy on 2006-4-17 at 10:40 ]



Original poster | Posted on 2006-3-25 22:57:45
Mar 24, 2006
There have been a few suggestions to extend the time before the previous cancer job is deleted in order to give slow machines a chance to finish crunching. We want to accommodate all members regardless of the speed of their machines, so I am going to change the deletion day from Friday to Monday. New jobs will be submitted on Wednesdays and old jobs will be deleted on Mondays.
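To see what the schedule change means in days, here is a small sketch that computes the deletion date for a job replaced on a given Wednesday. The helper names are hypothetical; only the weekdays come from this thread, and the example dates are ordinary 2006 calendar dates.

```python
from datetime import date, timedelta

# Python's date.weekday() numbering: Monday is 0, Sunday is 6.
MONDAY, FRIDAY = 0, 4


def next_weekday(start, weekday):
    """Return the first date strictly after `start` that falls on `weekday`."""
    days_ahead = (weekday - start.weekday()) % 7
    if days_ahead == 0:
        days_ahead = 7
    return start + timedelta(days=days_ahead)


def grace_days(replaced_on, deletion_weekday):
    """Days between a job being replaced and the old job being deleted."""
    return (next_weekday(replaced_on, deletion_weekday) - replaced_on).days


wednesday = date(2006, 3, 29)             # a Wednesday submission day
print(next_weekday(wednesday, FRIDAY))    # 2006-03-31 under the old schedule
print(next_weekday(wednesday, MONDAY))    # 2006-04-03 under the new schedule
print(grace_days(wednesday, FRIDAY))      # 2
print(grace_days(wednesday, MONDAY))      # 5
```

Moving the deletion day to Monday stretches the window for returning outstanding workunits from two days to five.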

[ Last edited by vmzy on 2006-4-17 at 10:34 ]



Original poster | Posted on 2006-3-28 20:56:01
Mar 27, 2006
Grid.org status

Cancer job - There is no new information to post at this time other than to restate that each week the previous cancer job will be deleted on Mondays instead of Fridays. This will give members running on slower machines a chance to return their outstanding workunits for credit.

Cancer job deleted

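The previous cancer job has been deleted. If you have any outstanding workunits that have not yet been processed, they will not be given credit and should be dumped.

Translation: Cancer job - There is no new information to post, other than to restate that each week the previous cancer job will be deleted on Mondays rather than Fridays. This gives members with slower machines a chance to return their workunits.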

[ Last edited by vmzy on 2006-4-12 at 17:36 ]



Original poster | Posted on 2006-4-4 23:13:57
Apr 03, 2006
Cancer data - I have deleted the previous Cancer job so no more credit can be given for outstanding workunits. We continue to process a new job each week, which means things are going pretty well. The Ligandfit aborts still continue and probably will through the entire new batch of data that Oxford provided. We will see what the behavior is when we run this data against the previous protein, which will be the next task.

Translation: Cancer data - I have deleted the previous cancer job, so no more credit will be given for it. We continue to process a new job each week, which means things are going well. The Ligandfit aborts are still unresolved and will probably affect all of the new data Oxford provided. We will see how this data behaves when it is run against the previous protein.

[ Last edited by vmzy on 2006-4-12 at 17:29 ]



Original poster | Posted on 2006-4-7 20:50:35
Apr 06, 2006
Backing off

We experienced a brief outage again due to the large number of results. I thought that rolling up once a week was sufficient, but apparently we still ran out of space. The system has been recovered and everyone should be able to connect now.

There may be some lost workunits due to the nature of this problem. I apologize for that. A new job has been submitted so if you are running UDMon, you should dump all of your cached workunits and reload. Credit will not be given for the previous outstanding workunits because the job had to be deleted as part of the recovery procedure.

I will be adding some additional disk monitoring tools that will hopefully notify me before this issue can recur. Again, I thought that our weekly rollup process was sufficient. Please bear with me as we continue to tweak the process.
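The post does not say which monitoring tools were added. As one possible illustration of the idea, the sketch below checks how full a filesystem is with Python's standard `shutil.disk_usage` and returns a warning once usage crosses a threshold. The 80% threshold, the path, and the function names are assumptions made for this example.

```python
import shutil


def usage_fraction(total_bytes, used_bytes):
    """Fraction of the filesystem that is in use, between 0.0 and 1.0."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes


def disk_warning(path, warn_fraction=0.80):
    """Return a warning string if `path` is at least `warn_fraction` full,
    otherwise None. The 0.80 default is an arbitrary example value."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    fraction = usage_fraction(usage.total, usage.used)
    if fraction >= warn_fraction:
        return f"WARNING: {path} is {fraction:.0%} full"
    return None


if __name__ == "__main__":
    # "." stands in for wherever temporary result files are stored (assumed).
    message = disk_warning(".")
    if message:
        print(message)
```

Run on a schedule (for example from cron), a check like this would give some notice before the results directory fills up, instead of the first sign being agents failing to connect.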



[ Last edited by vmzy on 2006-4-12 at 17:21 ]

