[BOINC] [Life Sciences and Others] World Community Grid

vmzy · posted on 2012-10-3 10:10:44

Oct 2, 2012 1:46:48 PM
Beta Test for Help Conquer Cancer - GPU v6.56
The beta test 6.55 ran well, but upon discussions with Berkeley there is a slightly different implementation of the fix for the hung workunits. This test is to check that version of the fix.

There were a couple of reports of hung workunits in the 6.55 test so we are hoping that this improved fix will avoid even those.
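Summary:
HCC-GPU 6.56 beta test started.
The 6.55 test went fairly smoothly, but a few users still ran into hung work units. After discussing with Berkeley, we changed the fix and are now testing the new version.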

vmzy · posted on 2012-10-11 09:26:30

Last edited by vmzy on 2012-10-11 17:02

Oct 10, 2012 3:57:24 PM
Server Maintenance - Friday Oct 12 starting at 15:00 UTC
We will be updating our servers to complete the final bit of work to firmly put our filesystem issues from earlier this year behind us (see https://secure.worldcommunitygri ... thread_thread,33481). This maintenance work will consist of changing how our servers are connected to the storage devices in order to eliminate one SAN switch that was a bottleneck.

Due to the redundancy in the environment, we will be able to complete the majority of the work without impacting volunteer access to the environment. However, when the database servers are being worked on we will need to stop the website database and BOINC database for a period of up to 60 minutes.

We will be starting the work at 15:00 UTC on Friday Oct 12. The work will be done over a 12 hour period. We cannot provide precise times when the database servers will be stopped as it depends on how quickly we are able to complete the work. We will post here when the work is completed.

Thank you for your patience and for your contribution.
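Summary:
Server maintenance is scheduled to start at 23:00 Beijing time on October 12 (15:00 UTC) and should take about 12 hours. During that window the website and BOINC databases will be shut down for up to 1 hour.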

Oct 10, 2012 8:43:59 PM
Launch of Graphics Processing Unit (GPU) capability for the Help Conquer Cancer Research Project
World Community Grid is pleased to announce the launch of Graphics Processing Unit (GPU) processing capability for the Help Conquer Cancer project. This will allow members to contribute to this project using their computer's graphics card. Please click here for more information about this exciting announcement.

Thank you to all of our members for your continued support!
Summary:
GPU tasks for HCC have been officially released.

vmzy · posted on 2012-10-17 09:25:07

Oct 16, 2012 6:16:38 PM
Periodic Out of Work Messages/BOINC Outage Wed Oct 17 starting at 17:00 UTC
Volunteers contributing to the Help Conquer Cancer project are periodically getting messages that no jobs are available to send.

The cause of these periodic messages is a slow response in a database query that fills the buffer that stores the jobs that are examined when a client requests new work.

It turns out the database index used by that query is not optimal and it is resulting in the slow response. We will need to alter that index to make it include the required keys. However, it will take up to 3 hours for that index to be altered. During that time access to the table will need to be stopped.

We are going to make this change at 17:00 UTC on Wednesday October 17th. During this time, volunteers will not be able to report completed jobs or obtain new work.

We apologize for the inconvenience.
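Summary:
HCC keeps reporting that no work is available; the cause turned out to be a poorly suited index on a database table. Database maintenance is scheduled to start at 1:00 AM Beijing time on the 18th, and the project will be paused for up to about 3 hours.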
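
The summaries in this thread keep converting the announced UTC times to Beijing time. A minimal sketch of that conversion, assuming a fixed UTC+8 offset (the helper name `utc_to_beijing` is made up for illustration):

```python
from datetime import datetime, timedelta, timezone

# Beijing time is a fixed UTC+8 offset with no daylight saving time.
BEIJING = timezone(timedelta(hours=8), "UTC+8")


def utc_to_beijing(year, month, day, hour, minute=0):
    """Return the Beijing-time datetime for a wall-clock time given in UTC."""
    utc_time = datetime(year, month, day, hour, minute, tzinfo=timezone.utc)
    return utc_time.astimezone(BEIJING)


# The index change announced for 17:00 UTC on Wednesday October 17, 2012.
start = utc_to_beijing(2012, 10, 17, 17)
print(start.strftime("%Y-%m-%d %H:%M"))  # 2012-10-18 01:00
```

The date rolls over to the 18th, which is why the summary gives the start as 1:00 AM on the 18th.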

vmzy · posted on 2012-12-4 20:20:08

Dec 3, 2012 12:18:18 PM
FightAIDS@Home project webcast: replay now available!
If you missed the live FightAIDS@Home webcast with Dr. Alex Perryman on November 29, 2012, you may now view the replay online here.

The webcast provided the latest updates about the FightAIDS@Home project, and showed how the simulations that you help run are enabling researchers to identify allosteric compounds and ultimately produce more effective anti-HIV medications.

Thank you to the members who joined the webcast, and to those who participated in the Q&A session!
Summary:
If you missed the live FAAH webcast, you can watch the replay.

vmzy · posted on 2012-12-5 09:19:53

Dec 4, 2012 8:36:23 PM
HCC1 Beta Test for Linux GPU
We are releasing a batch of work units for beta test on Linux GPU only. This will include 64 bit ATI and NVIDIA as well as 32 bit ATI and NVIDIA.

There will be 5 batches released with a total of 3840 work units sent with a quorum of 2. These will have double images in them similar to how they are being run on production right now.

Thanks,
-Uplinger
Summary:
Beta testing of the HCC Linux GPU version has started.

vmzy · posted on 2012-12-6 09:23:54

Dec 5, 2012 3:35:09 PM
HCC1 Beta Test for Mac Intel
Note: This is a CPU-only test, not a GPU test. It is meant to fix some of the issues members are noticing with older Mac OS versions. We will be using the same work units that the Linux GPU beta is using.

Thanks,
-Uplinger
Summary:
Beta testing of the HCC1 Mac Intel CPU version has started. It mainly fixes problems seen on some older Mac OS versions.

qscgy44 · posted on 2012-12-7 22:27:25

A very meaningful project, I support it!

vmzy · posted on 2012-12-14 09:16:24

Dec 13, 2012 7:54:00 PM
Reduced HFCC project priority
We have reduced the feeder priority of the Help Fight Childhood Cancer project as we are nearing the end of target-10. This means that users with multiple projects selected are less likely to get HFCC work units, however users with only HFCC selected should continue to get work units without interruption. The next target is estimated to be available in April (this is only an estimate) and by reducing the feeder priority, users with only HFCC selected will continue to get work units for a longer period of time.

Seippel
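Summary:
HFCC target-10 is almost finished, so the project's feeder priority has been lowered. The next target is expected to be released around April next year.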

vmzy · posted on 2012-12-29 20:03:46

Dec 28, 2012 2:22:34 PM
C4CW Target 07 Beta Test
Greetings,

We will be starting a small beta test for target07 for Computing for Clean Water. This will just be a test on the work units themselves and not the science application. Please let us know in the issues thread if you notice anything strange with these new work units.

There will be 5000 work units sent out in total. I will do it in two stages, first 1000 to make sure there aren't any major issues that appear quickly, then a few hours later release the remaining 4000 work units.

Thanks,
-Uplinger
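Summary:
The C4CW Target 07 beta test has started. 1,000 work units go out first; once no major problems show up, the remaining 4,000 will be released.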

vmzy · posted on 2013-1-10 09:30:23

Last edited by vmzy on 2013-1-16 11:02

Jan 9, 2013 6:37:36 PM
Degraded database performance
We have been experiencing poor database performance. We were investigating how to improve it and we are running the MySQL optimize command on two tables. Unfortunately, this is proceeding much slower than we expected so the BOINC grid is stopped for a couple of hours. We will inform you here when things are started again.

Jan 9, 2013 8:45:26 PM
We have started things up again. Unfortunately we didn't get as much performance increase as we hoped. We have ordered additional memory for the servers which we expect to get installed in the next 2-3 weeks.
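Summary:
Database performance has degraded. We optimized two tables, but it did not help much. We have ordered additional memory, which should be installed in 2-3 weeks.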

Jan 13, 2013 5:19:48 PM
We are currently running backups of the database starting at 7:30 UTC on Sunday and Wednesday mornings. In order to allow them to finish relatively quickly while we wait for the additional RAM, we are stopping the backend processes (i.e. validation) while the backup runs.

This means that a large backlog of work needing validation will build up during that timeframe. It will resume around 15:00 UTC and take 12-18 hours to catch up.
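Summary:
Database backups run on Sundays and Wednesdays starting at 15:30 Beijing time (7:30 UTC). While we wait for the additional memory, the backend processes such as validation are stopped during the backup. Validation resumes around 23:00 Beijing time (15:00 UTC) and takes 12-18 hours to clear the backlog, so stats will lag noticeably.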

Jan 14, 2013 7:22:21 PM
We have to modify one of the indexes on the 'result' table. As a result we will have to stop the schedulers for about 75 minutes.
An index on the result table is being modified; the schedulers will be stopped for about 75 minutes.

Jan 14, 2013 10:05:33 PM
Work is flowing again and we have started all backend processes up EXCEPT for hcc1 validation. We want to let everything finish catching up before we start hcc1 validation again. We appreciate your patience.
All services are running again except the hcc1 validator, which will be started later.

Jan 15, 2013 9:36:31 PM
Ok - I'm overdue for an update here, but I finally have some good news.

All processes are running again and we have put the system under max load and everything is behaving well. Here are some of the things that we had to change:

1) The bufferpool we have for MySQL is 44GB. However, we had innodb_buffer_pool_instances set to 1. This resulted in heavy contention for locks. MySQL has threads use spin locks for a short period of time while waiting for a lock. This caused very high cpu use on the server which resulted in a degradation in performance for all database activity. We are now using innodb_buffer_pool_instances=8 which has significantly reduced contention for locks which has lowered the cpu use. This is allowing better performance for all transactions.

2) We lowered the transaction isolation level for some backend daemons. There was contention in particular on some of the indexes for the result table (it is a 96GB table on disk). By changing the isolation level, fewer locks are held on the indexes which has increased throughput.

3) We have changed the way the jobs that load work compute how much work needs to be loaded. In particular, there was one query that computes the average jobs completed per day over the past 4 days. This value is part of the calculation used to determine how much work needs to be loaded. This query was executed repeatedly when work was being loaded and is particularly resource intensive. We have replaced the dynamically computed value with a manually computed value for now. Sometime in the future we will replace this with a job that runs once per day.

4) In a few cases we have improved the SQL used to make the queries more efficient.

5) We have added additional instances of daemons for the very backend processes in order to ensure that data is deleted from the database and filesystems promptly so that we don't fall behind again.
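
Item 3 above could be sketched roughly as follows. This is only an illustration of the "compute once per day" idea, not the WCG implementation: the class name, the `fetch_counts` callable (standing in for the expensive query), and the data shape are all assumptions.

```python
from datetime import date, timedelta


def average_daily_completions(completed_by_day, today, window_days=4):
    """Average jobs completed per day over the `window_days` days before `today`."""
    days = [today - timedelta(days=i) for i in range(1, window_days + 1)]
    return sum(completed_by_day.get(d, 0) for d in days) / window_days


class DailyCachedAverage:
    """Recompute the average at most once per calendar day and reuse it in between."""

    def __init__(self, fetch_counts):
        # fetch_counts is a hypothetical callable that runs the expensive query
        # and returns a dict mapping date -> jobs completed on that date.
        self._fetch_counts = fetch_counts
        self._computed_on = None
        self._value = None
        self.query_count = 0  # how many times the expensive query actually ran

    def get(self, today):
        if self._computed_on != today:
            self.query_count += 1
            self._value = average_daily_completions(self._fetch_counts(), today)
            self._computed_on = today
        return self._value


# Made-up counts purely for illustration: 100, 200, 300, 400 jobs on the 4 previous days.
today = date(2013, 1, 15)
counts = {today - timedelta(days=i): 100 * i for i in range(1, 5)}
cache = DailyCachedAverage(lambda: counts)
print(cache.get(today), cache.get(today), cache.query_count)  # 250.0 250.0 1
```

However many times work is loaded during the day, the expensive query runs once; every other call reuses the stored value.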

At this time the backend processes are catching up quickly. There is only a backlog of 98,000 workunits for hcc1 validation. It has been catching up at a rate of 20,000 workunits/hour so we will almost be caught up by end of day.
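
As a quick check of the estimate above, assuming the catch-up rate stays constant (the function name is made up for illustration):

```python
def hours_to_clear(backlog_workunits, catch_up_per_hour):
    """Hours needed to clear a backlog at a constant net catch-up rate."""
    if catch_up_per_hour <= 0:
        raise ValueError("catch-up rate must be positive")
    return backlog_workunits / catch_up_per_hour


# Figures from the post: 98,000 workunits pending, 20,000 workunits/hour.
print(hours_to_clear(98_000, 20_000))  # 4.9
```

So roughly five hours, which matches "almost caught up by end of day".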

Jan 15, 2013 10:14:00 PM
I should mention that we will still be stopping the backend daemons while the database backups are run. We will wait until the new memory is installed before we change this.
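Summary:
Good news: the servers are back to normal. These are the changes we had to make on the database side:
1) The MySQL buffer pool is 44GB, but it was configured as a single instance (innodb_buffer_pool_instances=1). Under load this caused heavy lock contention, high CPU use, and degraded performance. We now use 8 buffer pool instances, which has significantly reduced lock contention and CPU use.
2) The transaction isolation level was lowered for some backend daemons so that fewer locks are held (mainly on the indexes of the 96GB result table), which increases throughput.
3) Every time work was loaded, the amount to load was computed dynamically from the average number of jobs completed per day over the past 4 days, which is resource intensive. For now that value is set manually; later it will be computed by a job that runs once per day.
4) Some SQL statements were rewritten to be more efficient.
5) More daemon instances were added for the final backend processes so that data is deleted from the database and filesystems promptly and we do not fall behind again.
The backend daemons will still be stopped during database backups until the new memory is installed.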

vmzy · posted on 2013-1-18 09:42:27

Last edited by vmzy on 2013-1-24 10:12

Jan 17, 2013 11:06:32 PM
Say No to Schistosoma project on hold
Due to an outage on the research server, we have temporarily disabled sending new work for this project.

Jan 17, 2013 11:56:38 PM
An update/correction to the above:
New work which is already loaded will continue to be sent out over the next few days at a reduced pace until the supply is exhausted. After that point, only resend work units will be sent out.
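Summary:
The SN2S research server is down, so the project cannot supply new work. Work that is already loaded will keep going out at a reduced pace until it runs out; after that only resends will be sent.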

Jan 23, 2013 10:55:34 PM
The outage on the research server is over and this project is now back to normal status.
The research server has recovered and the project is back to normal status.

vmzy · posted on 2013-1-18 22:14:35

Jan 18, 2013 2:43:13 AM
Re: Degraded database performance
The good news is that over the past few days we have seen the back end server daemons catch up and work has been flowing steadily to the users.

The bad news is that a number of our backend processes that read large numbers of records for reports or for large batch updates of information have been performing worse and worse. We have identified the continuing cause of this issue.

I've included some articles below that help explain the issue.

Briefly:

MySQL maintains an 'undo log' that allows multiple concurrent transactions to proceed with different isolation levels. Adding records to this undo log is done as part of ongoing transactions. However, cleaning up the 'undo log' is performed on a delayed basis when the server has time and is able to perform the clean up (known as purge). If the server is under heavy load, then this delay can be very long and the 'history list length' can become very large.

By default, the purge process is performed as part of MySQL's 'master thread'. This means that the purge process competes with other activities for time to perform purge operations.

While the undo log is small and the log can remain in memory, the purge process can operate quickly. However, if the process starts to fall behind, the log can grow large and eventually it has to be moved to disk. Once this happens the process takes considerably longer and it is likely to fall further behind.

Due to the nature of the undo log, its size directly impacts the performance of different types of queries. As it gets longer, the performance of these queries becomes slower and slower.
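
For anyone who wants to watch this value on their own server: the TRANSACTIONS section of the `SHOW ENGINE INNODB STATUS` output contains a `History list length` line. A minimal sketch that pulls the number out of that text; the sample below is shortened and its numbers are made up, and how you fetch the status text from the server is left out.

```python
import re


def history_list_length(innodb_status_text):
    """Return the history list length from InnoDB status text, or None if absent."""
    match = re.search(
        r"^History list length\s+(\d+)\s*$", innodb_status_text, re.MULTILINE
    )
    return int(match.group(1)) if match else None


# Shortened, illustrative status text; the numbers are not from the WCG servers.
sample = """\
------------
TRANSACTIONS
------------
Trx id counter 12345
History list length 6789
"""

print(history_list_length(sample))  # 6789
```

A value that keeps climbing between checks is the "falling behind" situation described above.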

At this time, our undo log contains over 51 million entries. It is significantly behind and it is falling further behind daily. The database averages about 3,500 transactions per second. The undo log only stores entries that modify the database which are about half our transactions. This means that the undo log is over 8 hours behind and falling further behind.
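
The "over 8 hours behind" figure can be reproduced from the numbers in the post. The sketch assumes one undo log entry per modifying transaction, which is what the estimate above implies; the function name is made up for illustration.

```python
def purge_lag_hours(history_list_length, transactions_per_second, write_fraction):
    """Estimate how far behind purge is, in hours of write traffic."""
    writes_per_second = transactions_per_second * write_fraction
    return history_list_length / writes_per_second / 3600


# Figures from the post: 51 million entries, 3,500 transactions/s, about half are writes.
lag = purge_lag_hours(51_000_000, 3_500, 0.5)
print(f"{lag:.1f} hours")  # 8.1 hours
```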

Yesterday and this morning we made a number of changes to the database in order to improve the performance of the purge process (including using the new MySQL 5.5 option to run the purge process in its own thread). Unfortunately, with the current size of the undo log and it not residing entirely within memory, these additional options are not able to allow the process to catch up during normal operations.

As a result, we are going to have to take unusual action in order to clear out the undo log. Specifically, we are going to have to stop all access to the database, take a complete backup, delete the existing database, and then restore the database from backup. This is the process that we used when we migrated from MySQL 5.1 to MySQL 5.5. Unfortunately, we estimate that with the current database size, this outage could take up to 24 hours. I will be posting details about that outage shortly.

This will restore the database to its normal behavior. We have high confidence that the changes that we have put in place this week and last week will allow the purge process to keep up with the database transactions. However, as we plan to continually recruit additional volunteers to help us grow bigger, we are examining more substantive changes to accommodate this growth. These changes range from repacking workunits with various sizes in order to reduce the total number of results per day, migrating to MySQL 5.6 when it is released later this year, migrating to Percona Server, as well as adding a replica database for the purpose of performing read only transactions against it (reports, backups, result status page on website, etc). Any one of these options will ensure that we do not face this issue again in the future. We need to investigate further against our long term growth plans to make the proper changes.

We appreciate the patience you have shown while we investigate and we look forward to returning to normal operations soon.

http://www.pythian.com/news/3257 ... mysql-history-list/
https://mysqlquicksand.wordpress ... ade-blues-part-one/
http://www.mysqlperformanceblog. ... -innodb-tablespace/
http://dev.mysql.com/doc/refman/5.5/en/glossary.html#glos_purge
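Summary:
Bad news: under heavy load the purge of MySQL's undo log has fallen far behind, and backend processes that read large numbers of records keep getting slower.
We are going to back up, delete, and restore the database to clear the undo log. The outage could take up to 24 hours.
Longer term, we are looking at options such as migrating to MySQL 5.6 when it is released later this year, migrating to Percona Server, and adding a replica database for read-only transactions.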

vmzy · posted on 2013-1-23 14:31:14

Jan 23, 2013 1:31:30 AM
Re: Degraded database performance
Ok - we have now handled the very significant surge in scheduler requests following the outage. The system handled the load, but it was definitely under pressure.

Now that we have been up and running for a few hours, the system is running very well. The end of day stats and other processes are running now and they are running quickly. Additionally, some of our backend processes, such as the reports that we use for monitoring batch progression, are working properly and quickly again.

We are going to be allowing the system to continue to run the backend server daemons while the database backup runs in a few hours. If all goes as planned, there will not be significant interference.

The next item on our agenda is the installation of the additional memory which will occur about 36 hours from now.
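Summary:
The database rebuild is complete and database performance is basically back to normal. Next, the additional memory for the database servers will be installed in about 36 hours.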

vmzy · posted on 2013-1-26 22:07:39

Jan 25, 2013 11:37:41 PM
FightAIDS@Home Beta Test Jan 25, 2013
Greetings,

We are starting a beta test on FightAIDS@Home to test an updated version of the science application. This test will be done on the same CPU platforms as production. The first test will have 3710 work units to be sent out.

Also, I will not have the validators turned on for this beta test and will probably leave them off for 24-48 hours. Members will notice pending validation results.

Due to some of the changes in the application, members may see a higher than usual number of results that are not valid. Please do not be alarmed, as this is the information I am going to collect when I turn the validators back on. There may be a few kinks in these work units, but that is part of testing.

Thanks,
-Uplinger
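Summary:
A new version of the FAAH science application is being tested with 3710 work units. These work units may show a higher than usual rate of invalid results; there is no need to worry.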

vmzy · posted on 2013-1-27 21:25:15

Last edited by vmzy on 2013-1-29 23:04

Jan 26, 2013 3:58:10 PM
Help Conquer Cancer Mac GPU Beta Test Jan 26, 2013 [ON HOLD]
Greetings,

We will be conducting a Beta test for Mac GPU only. The version you will download will be shown as 7.10. Also, this test will involve 3840 work units.

Thanks,
-Uplinger

Updated to show as On Hold -U

Jan 26, 2013 6:59:31 PM
I have put the beta test on hold as I am investigating why machines that should be able to run the beta are not getting work. I will resume the beta test when this is fixed. Right now, there is no ETA on the fix.

Thanks,
-Uplinger
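Summary:
Beta testing of HCC GPU application version 7.10 for Mac has started, with about 3840 work units.
Some machines that should be able to run the beta are not getting work, so the test is on hold and will resume once the cause is found.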

Jan 28, 2013 6:26:29 PM
This beta test has been resumed.

Thanks,
-Uplinger
Testing has resumed.
