opm-playboy posted on 2007-5-7 16:44:39

What a pity! Another error at 92%!

After about 3600 hours of running, the model hit another error at 92.58%:
2007-5-7 16:07:35|climateprediction.net|Aborting task hadcm3lbm_btu5_05321152_1: exceeded CPU time limit 12147899.159664
2007-5-7 16:07:35|climateprediction.net|Deferring communication 1 min 0 sec, because Unrecoverable error for result hadcm3lbm_btu5_05321152_1 (Maximum CPU time exceeded)
2007-5-7 16:07:40|climateprediction.net|Computation for task hadcm3lbm_btu5_05321152_1 finished

zglloo posted on 2007-5-7 18:21:50

That really is a pity!

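tcogh327 posted on 2007-5-7 18:40:55

Was there too little free memory at the time, and that caused the error? Or had it already passed the report deadline?
I can't figure it out. I've never even gotten past 90% myself. Deepest sympathies...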

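Kaoh posted on 2007-5-7 20:47:45

Originally posted by tcogh327 on 2007-5-7 18:40
Or had it already passed the report deadline?

Surely not...
They clearly said over there that the deadline can be ignored...
If that's really the case, I'm only at 5.6% right now, and there's no way I'll finish by 1/7 next year...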

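opm-playboy posted on 2007-5-7 21:16:32

That WU's report deadline was 2 Jan 2008 9:50:40 UTC.
When the error happened the machine wasn't being used for anything else. Only BOINC was running, with one CPDN task and one E@H task, and there was still at least 500 MB of physical memory free. The machine froze at around 92%, and the error was reported after the reboot.
I suspect it's related to E@H: the machine froze while crunching the first S5R2C WU, and after the reboot that WU errored out as well.
Nothing to do now but start over. Once those few E@H WUs are finished I'll put all my computing power into CPDN. I refuse to believe I can't beat this thing.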

alexpon posted on 2007-5-7 21:23:34

I've posted the question on the CPDN forum on your behalf.
This is a problem everyone here probably cares about.

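tcogh327 posted on 2007-5-7 22:26:26

Originally posted by opm-playboy on 2007-5-7 21:16
That WU's report deadline was 2 Jan 2008 9:50:40 UTC.
When the error happened the machine wasn't being used for anything else. Only BOINC was running, with one CPDN task and one E@H task, and there was still at least 500 MB of physical memory free. The machine froze at around 92%, and the error was reported after the reboot.
I suspect it's related to E@H: the machine froze while crunching the first S5R2C WU, and after the reboot that ...
Where there's a will, there's a way.

This project demands too much of the machine.

Also, there's no way I'll finish a WU within the deadline now. I'll crunch as much as I can; it still counts as a contribution.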

alexpon posted on 2007-5-8 02:23:22

Here is the reply, for reference:

Message 28512 - Posted 7 May 2007 14:36:03 UTC
Last modified: 7 May 2007 14:49:17 UTC
Hi Alexpon

A day or two ago I wrote a post about this error for inclusion in the cpdn READMEs. It isn't in the READMEs yet, but here's the draft. If you have a backup you can use this method to save your model. If you haven't got a backup, it's impossible and you'll need to get a new model.

(In other words: if you never made a backup, accept your fate and start a new model.)
--------------------------------------------------------------------------

When boinc runs its benchmarks, a work unit (climate model) has an estimated number of Floating Point Operations (fpops) assigned to it. Boinc assigns a maximum number (bound) of fpops that the model will be allowed to use. This is similar to imposing a time limit depending not on calendar dates but on how long the model is allowed to actively run on a particular computer. On a slow computer a model will be allowed to run for longer, but on a fast computer the limit (bound) is lower. Under normal conditions a model will never hit the limit (fpops bound). The limit is imposed to ensure that a looping model forgotten by its owner cannot run indefinitely. A model that hits the bound/limit will crash with a message like 'CPU time exceeded'.

When a model is transferred from a slow to a much faster computer, or a computer's CPU is changed for a faster one, boinc runs new benchmarks. It recalculates the fpops bound as if the model had spent its whole life on the faster computer, and does not take the move into account. The model may therefore hit its new reduced limit before it completes. The longer the model crunched on the slow computer and the more it has been speeded up, the more likely this problem becomes. A crash with this error is more likely if the model runs at least twice as fast on the new computer and the model is more than half completed when transferred. In the case of a model already ¾ completed and speeded up 3x or 4x, this error is very probable. But the fpops limit imposed includes a safety margin which allows many transferred models to complete without problems.
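To make the arithmetic in the two paragraphs above concrete, here is a small sketch. It assumes the CPU time limit is simply the fpops bound divided by the benchmarked speed of the current computer, which is how the explanation reads; all numbers and names (`cpu_time_limit`, `FPOPS_BOUND`, and so on) are invented for illustration and are not taken from any real workunit. For scale, the "CPU time limit 12147899.159664" in the log at the top of the thread, if it is in seconds, comes to roughly 3,374 hours.

```python
def cpu_time_limit(fpops_bound: float, benchmark_fpops_per_sec: float) -> float:
    """CPU seconds allowed, ASSUMING limit = fpops bound / benchmarked speed."""
    return fpops_bound / benchmark_fpops_per_sec


# Hypothetical figures, chosen only to make the arithmetic easy to follow.
FPOPS_BOUND = 1.2e16   # bound assigned to the model
SLOW_HOST = 1.0e9      # benchmark result on the old computer (fpops/s)
FAST_HOST = 2.5e9      # benchmark result on the new computer (fpops/s)
CPU_USED = 6.0e6       # CPU seconds already spent on the old computer

limit_slow = cpu_time_limit(FPOPS_BOUND, SLOW_HOST)
limit_fast = cpu_time_limit(FPOPS_BOUND, FAST_HOST)
limit_doubled = cpu_time_limit(2 * FPOPS_BOUND, FAST_HOST)

for label, limit in [("old computer", limit_slow),
                     ("new computer", limit_fast),
                     ("new computer, bound doubled", limit_doubled)]:
    status = "over the limit" if CPU_USED > limit else "within the limit"
    print(f"{label}: limit {limit:,.0f} s, used {CPU_USED:,.0f} s -> {status}")
```

With these made-up numbers the model is comfortably within its limit on the old computer, already over the recalculated limit on the faster one, and back within the limit once the bound is doubled.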

PRECAUTIONS

Before moving the contents of the boinc folder to the new computer, the complete contents of this folder should be backed up. If you forgot to do this, make a backup as soon after the move as possible. Here is a selection of backup and restore methods.

FIX

The fpops bound value (number) has to be changed in a boinc file. One can do this as soon as boinc has run its benchmarks on the faster computer, or run the model in the hope that the problem will not occur. If one chooses the second option, it is essential to make regular backups. If the model does crash with this error, the fpops value can be changed immediately after restoring a backup and letting boinc run benchmarks again.

This fix is not intended for workunits from other projects that are too short to be worth backing up and restoring if they crash. If this error occurs repeatedly with short workunits, it should be reported on that project's forum. The project's programmers can then correct the code for this faulty batch of workunits.

HOW TO FIX IT (surprisingly easy!)
(Roughly: this explains how to restore the model if you do have a backup. Just follow the steps.)

*If your model has already crashed with this error, restore your backup.

*Start boinc.

*Open boinc manager and look at the Messages tab to check that boinc has run its new benchmarks.

*Write down the long name and number of the model. It will begin with something like 'hadsm3ln......'.

*Suspend the model if it is running, using the boinc manager Activity menu.

*Exit from boinc.

*In Windows Explorer, go to C:\Program Files\BOINC (or Climate Change Experiment).

*Double-click on BOINC.

*Find the client_state.xml file. Right-click on it.

*Select Edit.

*The Client_state file will open up in Notepad, which allows it to be edited.

*Find the <Workunit> block containing the name of the model.

*Double the value (number) for <rsc_fpops_bound>.

*Click Save.

*Close Notepad.

*Restart boinc and the model.

Please note that it is not necessary to edit the client_state_previous.xml file.
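For anyone who would rather script the edit than do it in Notepad, the doubling step can be sketched as below. This is only an illustration: it assumes the file stores each model in a lowercase `<workunit>` block that contains a `<name>` element and an `<rsc_fpops_bound>` element, and the function name `double_fpops_bound` and the sample workunit names are made up. It works on the file's text so that nothing except the one number changes.

```python
import re


def double_fpops_bound(state_text: str, workunit_name: str) -> str:
    """Return state_text with rsc_fpops_bound doubled for one workunit.

    Assumes each <workunit> block holds a <name> and an <rsc_fpops_bound>.
    """
    block_re = re.compile(r"<workunit>.*?</workunit>", re.DOTALL)
    bound_re = re.compile(r"(<rsc_fpops_bound>)([^<]+)(</rsc_fpops_bound>)")
    name_tag = f"<name>{workunit_name}</name>"

    def patch_block(match: re.Match) -> str:
        block = match.group(0)
        if name_tag not in block:
            return block  # a different workunit: leave it untouched
        return bound_re.sub(
            lambda m: f"{m.group(1)}{float(m.group(2)) * 2:f}{m.group(3)}",
            block,
            count=1,
        )

    return block_re.sub(patch_block, state_text)


# Tiny hypothetical sample, not a real client_state.xml.
SAMPLE = """<client_state>
<workunit>
    <name>hadsm3_example_wu</name>
    <rsc_fpops_bound>1000.000000</rsc_fpops_bound>
</workunit>
<workunit>
    <name>other_example_wu</name>
    <rsc_fpops_bound>500.000000</rsc_fpops_bound>
</workunit>
</client_state>"""

patched = double_fpops_bound(SAMPLE, "hadsm3_example_wu")
print(patched)
```

To use something like this on a real file, the same precautions as in the manual steps apply: exit boinc first, keep a copy of the original client_state.xml, then read the file's text, pass it through the function with the model's name, and write the result back.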


(Thanks also to Thyme Lawn)
____________
Cpdn news, announcements and the 4 project READMEs

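alexpon posted on 2007-5-8 02:33:48

http://bbc.cpdn.org/forum_thread.php?id=2748 <--- all of the English text below comes from here
The page above is the documentation for the BBC backup tool.
http://www.rakhal.com/cpdn/BBC-backup.exe
That is the download link. It was written by a helpful community member for Windows only; there is no Linux version.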
For example, if your BOINC directory is C:\Program Files\Climate Change Experiment or ..\BOINC and you created C:\CCE-bu for your backups, go to Start | Run and you'd type:
==================================================================
BBC-backup "C:\Program Files\Climate Change Experiment" "C:\CCE-bu"
--------------------------------------------------------------------------------------
BBC-backup "C:\Program Files\BOINC" "C:\CPDN-bu"
==================================================================
Please note the two required spaces and the four double quote symbols( " ).

and press Enter.
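Roughly what the above says:
Create a XXXXX.bat file (pick whatever name you like).
For reference, this is basically the command line I use:
BBC-backup "C:\Program Files\BOINC" "C:\CCE-bu" -t 720
(program name) (source path) (backup destination path)
For the -t parameter, see the original English text.
I only just noticed that my BOINC directory is 1.x GB. Nice.
Once started, it stays running as a process and backs up on a timer; in Task Manager you will see BBC-backup listed among the running processes.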
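
For anyone who cannot run BBC-backup (there is no Linux version, for example), the core idea of a backup is just a full copy of the BOINC folder. The sketch below is not how BBC-backup works internally; it is a plain folder copy using Python's `shutil.copytree`, it does not stop or restart BOINC, and BOINC is assumed to have been exited before copying. The function name `backup_folder` and the folder names are made up, and the demo runs on throwaway temporary folders rather than a real BOINC directory.

```python
import shutil
import tempfile
import time
from pathlib import Path


def backup_folder(source: Path, backup_root: Path) -> Path:
    """Copy source into a new timestamped folder under backup_root."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"{source.name}-{stamp}"
    shutil.copytree(source, target)  # also creates backup_root if it is missing
    return target


# Demo on temporary folders standing in for the BOINC and backup directories.
with tempfile.TemporaryDirectory() as tmp:
    fake_boinc = Path(tmp) / "BOINC"
    fake_boinc.mkdir()
    (fake_boinc / "client_state.xml").write_text("<client_state/>")

    copy = backup_folder(fake_boinc, Path(tmp) / "CPDN-bu")
    copied_names = sorted(p.name for p in copy.iterdir())
    print(copy.name, copied_names)
```

Each run creates a separate timestamped copy, so old backups pile up and need to be deleted by hand when disk space gets tight.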

opm-playboy posted on 2007-5-8 14:04:51

Isn't backing up BOINC's own job? Why should users have to do it by hand? I don't have time for that.

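alexpon posted on 2007-5-8 14:48:04

Up to you.
For some people, failing at 92% brings a gloom beyond words: 2XXX to 3XXX hours of computation, and just one last step short of finally getting off zero completed WUs.
Sure, you still get credit, but it's just XXXX, to XX with its OOXX "blossoms and prosperity" (I've watched too many Stephen Chow movies).
Anyway, all I do is double-click it, and every 12 hours it stops BOINC and makes a backup.
When the backup is done it starts BOINC up again.
For reference: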
##########################
D:\tool\BOINC>BBC-backup "C:\program Files\BOINC" "D:\tool\boinc\CPDN-bu" -t 720

BBC-Backup 1.05

Shutting down BOINC
Performing backup
Restarting the experiment
Waiting for checkpoint. Checkpoints to go:1

Shutting down BOINC
Performing backup
Restarting the experiment
##########################
-t 720 <-- run at 720-minute intervals

The above is the output from a test run in DOS command mode.

[ Last edited by alexpon on 2007-5-8 14:49 ]

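Kaoh posted on 2007-5-8 17:06:39

Haha, I'm not in the habit of backing up.
For one thing it takes up space; for another... if it really does fail, that's fate, so I'll just accept it.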

zglloo posted on 2007-5-8 17:34:41

By the way, a question for you all: do you still get credit even if the model never reaches 100%?

tcogh327 posted on 2007-5-8 22:01:51

Originally posted by zglloo on 2007-5-8 17:34
By the way, a question for you all: do you still get credit even if the model never reaches 100%?
Yes.

Please see the beginner's guide.

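opm-playboy posted on 2007-5-15 22:03:19

I finally found the cause of the freezes: my Leadtek A6600GT graphics card was faulty. I swapped in an antique Leadtek MX400 that had been sitting at the bottom of a box, and everything is OK now.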