Thread starter: 碧城仙

[BOINC] [Life Sciences and Others] World Community Grid

Posted on 2012-7-20 09:36:26
Jul 19, 2012 6:20:07 PM
Re: Reduced Help Fight Childhood Cancer project priority
This project has been returned to its normal priority.

Seippel
Summary:
The HFCC project's work-distribution priority has been restored to normal.

Posted on 2012-7-22 10:20:28
Jul 20, 2012 8:15:59 PM
Periodic Issues with File Uploads and Downloads

We are currently experiencing an intermittent issue that causes a significant slowdown on the filesystems that support file downloads and file uploads. As a result, file uploads and downloads will be intermittently unavailable.

Jul 20, 2012 2:23:33 PM
Currently, when this issue occurs, it impacts scheduler requests, website/forum access and file upload/downloads. However, it is only file upload/downloads that are involved in the actual root cause of the issue.

As a result, we are going to change things so that website/forum traffic and scheduler requests go through one set of servers while file upload/download requests go through a separate set of servers.

This change will be largely transparent. However, we will be changing the scheduler URL to https://scheduler.worldcommunitygrid.org/boinc/wcg_cgi/fcgi from https://grid.worldcommunitygrid.org/boinc/wcg_cgi/fcgi. In order to force the clients to change their setting, we will disable the scheduler at https://grid.worldcommunitygrid.org/boinc/wcg_cgi/fcgi. This means that your client will try 10 times to connect before querying the website again for the current location of the scheduler. It will then get the new location and connect properly. No action is needed on your part, but you will see messages in the software client.
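The client behavior just described, retrying the cached scheduler URL and then re-querying the website for the current location, can be sketched as below. This is an illustration only, not actual BOINC client code; `try_scheduler` and `fetch_master_url` are hypothetical stand-ins for the client's scheduler RPC and its re-read of the project master page.

```python
# Illustration of the retry-then-refetch recovery described in the post.
MAX_RETRIES = 10  # the post says the client tries 10 times before falling back

def resolve_scheduler(cached_url, try_scheduler, fetch_master_url):
    """Try the cached scheduler URL; on repeated failure, ask the website."""
    for _ in range(MAX_RETRIES):
        if try_scheduler(cached_url):
            return cached_url
    # all retries failed: re-read the master page for the current location
    return fetch_master_url()
```

No user action is needed in the real client; this only models why the "failed" messages stop on their own once the fallback fires.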


Jul 20, 2012 8:11:28 PM
Scheduler request failed: HTTP file not found
Your BOINC client will now start seeing this message as we transition from having our schedulers located at https://grid.worldcommunitygrid.org/boinc/wcg_cgi/fcgi to having them located at https://scheduler.worldcommunitygrid.org/boinc/wcg_cgi/fcgi.

Unfortunately, there is no way to do this 'gracefully'. Instead, your client will attempt 10-15 times to connect to the old scheduler before contacting the main website page to find out the current location of the scheduler. Once that happens, it will resume communicating properly. This will all happen automatically if left alone.

If you wish to complete this process faster on your machine, go to the 'Advanced View' of the software, open the 'Projects' tab, select 'World Community Grid' and click the 'Update' button. If you do this repeatedly (10-15 times) it will eventually fetch the new location.

If you are more technically inclined, then you can stop the BOINC client, edit the client_state.xml and find the line that says:

<scheduler_url>https://grid.worldcommunitygrid.org/boinc/wcg_cgi/fcgi</scheduler_url>

and replace it with

<scheduler_url>https://scheduler.worldcommunitygrid.org/boinc/wcg_cgi/fcgi</scheduler_url>


Save the file and then start the BOINC client again. You should now be able to connect successfully.
Summary:
A filesystem fault on the servers broke uploads and downloads, so the scheduler has been moved to a new server address. The client can pick up the new address in any of three ways:
1. Do nothing: after 10-15 failures the client will automatically query the main site and update the scheduler URL.
2. Manually update the WCG project: after the project status shows 'communication deferred', click Update again; after 10-15 rounds the client will fetch the new scheduler URL from the main site.
3. Edit the configuration file directly: exit the BOINC client, find client_state.xml in the BOINC data folder, change WCG's scheduler_url to https://scheduler.worldcommunitygrid.org/boinc/wcg_cgi/fcgi, then restart the BOINC client.
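Option 3 can be scripted. A minimal sketch, assuming a standard client_state.xml layout and that the BOINC client has been stopped first; back up the file before editing the real one:

```python
# Swap the old World Community Grid scheduler URL for the new one in
# client_state.xml. Plain text replacement keeps the rest of the file intact.
from pathlib import Path

OLD = "https://grid.worldcommunitygrid.org/boinc/wcg_cgi/fcgi"
NEW = "https://scheduler.worldcommunitygrid.org/boinc/wcg_cgi/fcgi"

def update_scheduler_url(state_file: Path) -> bool:
    """Replace the old scheduler URL in place; return True if a change was made."""
    text = state_file.read_text(encoding="utf-8")
    needle = f"<scheduler_url>{OLD}</scheduler_url>"
    if needle not in text:
        return False
    text = text.replace(needle, f"<scheduler_url>{NEW}</scheduler_url>")
    state_file.write_text(text, encoding="utf-8")
    return True
```

Run it against the client_state.xml in your BOINC data folder, then start the client again.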

Posted on 2012-7-27 09:33:41
Jul 24, 2012 6:53:05 PM
Re: Unstable Server Environment
The issue with uploading/downloading is not yet resolved. The change that we made on July 20th was to ensure that our volunteers could reliably access the website and forums. We have also moved scheduler requests, since they are independent of the filesystem issue.

We are working with the support groups for Red Hat Linux and IBM GPFS (the filesystem) to resolve this issue.

Users will unfortunately continue to see messages such as the following until the issue is resolved:
24/07/2012 12:43:58 | World Community Grid | Started upload of c4cw_target06_088990280_0_0
24/07/2012 12:43:58 | World Community Grid | Started upload of c4cw_target06_088989962_0_0
24/07/2012 12:44:20 | World Community Grid | Temporarily failed upload of c4cw_target06_088990280_0_0: connect() failed
24/07/2012 12:44:20 | World Community Grid | Backing off 1 hr 29 min 3 sec on upload of c4cw_target06_088990280_0_0
24/07/2012 12:44:20 | World Community Grid | Temporarily failed upload of c4cw_target06_088989962_0_0: connect() failed
24/07/2012 12:44:20 | World Community Grid | Backing off 11 min 18 sec on upload of c4cw_target06_088989962_0_0
24/07/2012 12:44:21 | | Project communication failed: attempting access to reference site
24/07/2012 12:44:24 | | Internet access OK - project servers may be temporarily down.


Jul 26, 2012 9:04:34 PM
We continue to experience this issue. We have investigated a few items suggested by the support groups, but nothing that has resolved the issue or led us to understand the root cause.

In the meantime, we are implementing some workarounds that will allow us to limit the duration of this issue when it occurs. Although this won't eliminate the issue, the goal is to minimize the impact on end users. Part of the workarounds includes disabling scheduler requests and file uploads when the issue occurs. As a result, there will be times when both file uploads/downloads and scheduler requests are disabled.
Summary:
Even though the scheduler has been moved to a new server, uploads/downloads still fail regularly (e.g. "Internet access OK - project servers may be temporarily down."). Red Hat Linux and IBM GPFS support have been engaged to resolve the problem.
Support has made some suggestions, but the problem persists. For now the team is monitoring the servers by hand and temporarily shutting the BOINC servers down whenever the issue occurs, so expect frequent WCG outages in the near term and plan accordingly.
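For volunteers who want to gauge how often the outage hits them, the event-log excerpt quoted above can be tallied with a short script. The regex matches the message format shown in this post; other client versions may phrase it differently.

```python
# Count "Temporarily failed upload" messages per result name in a saved
# BOINC event log, e.g. to see how hard the server outage is hitting uploads.
import re
from collections import Counter

# matches e.g. "Temporarily failed upload of c4cw_target06_088990280_0_0: connect() failed"
FAILED = re.compile(r"Temporarily failed upload of (\S+?):")

def count_failed_uploads(log_lines):
    """Tally failed upload attempts per result name."""
    hits = Counter()
    for line in log_lines:
        m = FAILED.search(line)
        if m:
            hits[m.group(1)] += 1
    return hits
```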


Posted on 2012-8-26 12:00:11
Aug 24, 2012 8:58:56 PM
Re: Unstable Server Environment
It has been a while since we posted an update on this issue.

We have continued to engage the resources of the GPFS support team and they have been digging through trace files created while one of the incidents was occurring. They have found two points of interest that point to how we will reach final resolution of this issue.

First a little background:

We use the IBM GPFS software so that we can mount SAN storage on multiple servers at once. This allows us to scale horizontally (meaning that as more volunteers participate and we grow, we can add more servers to keep up with the load). We have two major directory locations that we use. Your computers connect to the first one to download input files. The second is where your computers upload completed result files. Both the download and upload directories have 1024 sub-directories each in order to balance load and prevent some inefficiencies that occur with large directories.
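As an illustration (not WCG's actual server code), a 1024-way fan-out like the one described can be built by hashing the file name to a stable sub-directory index, which spreads files evenly and keeps each directory small:

```python
# Map a result file name to one of 1024 sub-directories, as in the
# download/upload layout described above. The hash choice is an assumption.
import hashlib

FANOUT = 1024

def subdirectory_for(filename: str) -> str:
    """Return a stable, zero-padded sub-directory name in [0000, 1023]."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    return f"{int(digest, 16) % FANOUT:04d}"
```

The same name always lands in the same sub-directory, so both the uploader and the reaper can locate a file without a lookup table.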

What they found:

1) The GPFS filesystem is implemented in a way that when storage is allocated to record the list of files in a directory, that storage can never be reduced. We had a situation last year where we accidentally created a significant number of files in some of the download sub-directories. Because directory storage blocks are never shrunk, this left us with a lot of extra directory blocks in use. GPFS caches these directory blocks in memory in order to optimize performance, but in some cases a single directory was using up to 14 MB of storage for its directory blocks (and thus 14 MB of RAM). About half the download sub-directories had this issue, even though our testing confirmed that such a sub-directory actually requires only 1 MB of storage. Since 512*14 MB is a little over 7 GB, the cache we have assigned (2 GB) needs to evict data frequently to bring in different data blocks. There are two fixes for this:

a) Add more RAM
b) Recreate the directories

We have already created and tested a script to recreate the directories. We ran it on 24 directories; it worked as expected and produced a significant reduction in the directory blocks used. Next week we plan to run it on all of the oversized directories. We expect this on its own to mitigate the issue. However, we are also evaluating whether adding more RAM would provide additional value, room for growth and further stability.
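A sketch of how such oversized directories could be located: on most Unix filesystems a directory's own `st_size` reflects the space allocated for its entry blocks, which GPFS never shrinks. The 1 MB threshold comes from the post's figure for a normal sub-directory; this approach is an assumption, not the team's actual script.

```python
# Flag sub-directories whose entry blocks have grown far beyond what a
# normal sub-directory needs (~1 MB per the post).
import os

NORMAL_SIZE = 1 * 1024 * 1024  # ~1 MB, the post's figure for a healthy directory

def oversized_dirs(root):
    """Yield (path, st_size) for sub-directories exceeding NORMAL_SIZE."""
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            size = entry.stat(follow_symlinks=False).st_size
            if size > NORMAL_SIZE:
                yield entry.path, size
```

With 512 directories bloated to 14 MB each, the cache demand is 512 * 14 MB, a little over 7 GB, against the 2 GB actually configured, which is exactly the arithmetic in the post.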

2) In the traces, the GPFS support team also saw that the latencies on read and write requests were normally about 1 ms on average. During these incidents, however, the latencies rose to 10 ms and severely impacted the ability of the cache to write dirty blocks to disk and free up cache space to load in new directory blocks. The team is now looking into what is causing this. We now have a way to reproduce the latencies without causing an incident, so we are hopeful that we can get to the bottom of this issue without further disruption to our end users.


We believe that both 1 and 2 had to occur together to cause the filesystem lock-up that we have been experiencing. We are quite confident that once we run the scripts to recreate the directories, the issue will be resolved. However, we are also going to resolve the read/write latency issue and examine adding RAM so that we can ensure that this issue is firmly behind us.

THANK YOU for your ongoing patience and support while we have worked through this problem.
Summary:
Working with IBM's GPFS team, the cause has been narrowed down.
Issue 1: For performance, directory data blocks are cached in RAM. Normally each sub-directory needs only about 1 MB, but a mishap last year created a huge number of files in some download sub-directories, and GPFS never shrinks a directory's allocated blocks even after files are deleted. The affected directories ballooned to 14 MB each, pushing the total cache requirement to over 7 GB while only 2 GB of cache was configured. This destabilized the filesystem.
Two fixes:
a. Add more RAM and cache
b. Recreate the directories to shrink their blocks
Both are being pursued in parallel.
Issue 2: IBM also found that read/write latency, normally about 1 ms, rises to about 10 ms when the problem occurs, severely hurting performance and locking up the filesystem. IBM's experts are still investigating the cause and the fix.


Posted on 2012-8-27 09:40:56
Aug 26, 2012 2:04:31 PM
Errors on Computing for Sustainable Water
There is an issue with the current set of workunits that we are processing for the Computing for Sustainable Water project. We have temporarily disabled the project until we can identify and resolve the issue. We expect this to only take a day or two.

We appreciate your patience while we work through this.
Summary:
The current CFSW workunits have a problem; the project is paused while the issue is identified and fixed, hopefully within a day or two.

Posted on 2012-8-28 09:24:00
Aug 27, 2012 7:22:24 PM
Re: Errors on Computing for Sustainable Water [Resolved]
We have fixed this problem and work is available for the project again.
Summary:
The CFSW workunit problem has been fixed and the project is back to normal.


Posted on 2012-8-28 19:59:20
At least that last one counts as good news.

Posted on 2012-8-29 12:49:08
Aug 28, 2012 4:19:23 PM
Re: Unstable Server Environment
We started running the scripts to rebuild the directories as identified in 1b) above. We have completed running these against the significantly oversized directories and have seen a significant stability improvement since this finished. In fact, we have now gone 36 hours without the issue re-occurring, the longest period without it since early June. As a result, we have added redundancy back into the environment and are watching it closely to ensure that it remains stable. We are very optimistic at this point.

We are now running the script against the directories that are only modestly oversized (a factor of 2-3 larger than required). This will save an additional 700 MB or so of the cache needed for these directory structures, improving performance and further reducing the chance of a repeat.

Additionally, we are moving to install additional RAM on these servers. This will allow us to increase the cache size to further improve the performance of GPFS and provide additional insurance against this issue re-occurring in the future. We hope to install the memory by the end of the week and will be able to complete this without impacting users.
Summary:
The directories have been rebuilt and things look stable: 36 hours now without a failure, the longest stretch since early June.
Next, RAM will be added to guard against future recurrences.


Posted on 2012-8-30 22:44:19
Aug 30, 2012 12:34:25 PM
Re: Reduced FAAH and GFAM project priority
This has been returned to its normal priority.
Summary:
The FAAH and GFAM projects' work-distribution priority has been restored to normal.

Aug 30, 2012 12:37:46 PM
Re: Unstable Server Environment
We have finished running the script to rebuild the directories and have also restored our redundant configuration for the servers in the environment. We have not seen a re-occurrence of this issue since Sunday so we are pleased to see that the steps we have taken are addressing the issue.

Today, starting in a short while, we will be increasing the RAM on the servers and giving GPFS more cache space to work with. While not strictly required to resolve this issue, investigation during the issue revealed that the cache was barely sufficient for the performance we need. Adding this additional RAM is prudent to ensure high performance as we continue to grow.
Summary:
The directory rebuild is complete, and there has been no recurrence since Sunday; the problem looks fully resolved.
Shortly, RAM will be added and the cache allocation enlarged.


Posted on 2012-9-6 09:58:40
Sep 5, 2012 5:33:28 AM
Re: Unstable Server Environment
We have completed the work to add more RAM on the servers and give the GPFS filesystem more cache space to work with. This in addition to the directory size changes has resolved this issue.

We continue to investigate the issues detected with the SAN performance and latency. However, as we had hoped, resolving one of the two identified issues allowed the system to resume normal operations so we are marking this issue as resolved.
Summary:
The RAM has been installed and the cache adjusted; the intermittent filesystem outages should now be resolved.
Next up: investigating the storage (SAN) performance and latency issues.


Posted on 2012-9-12 09:08:25
Sep 11, 2012 3:52:38 PM
Beta Test for Help Fight Childhood Cancer ( Sept 11, 2012 )
Hello,

We are starting a short beta test for Target 10 on Help Fight Childhood Cancer. There will be about 1,000 work units sent out to all platforms.

Thanks,
-Uplinger
Summary:
A short beta test of HFCC Target 10 has begun: about 1,000 workunits, sent to all platforms.


Posted on 2012-9-14 09:22:56
Sep 13, 2012 9:25:31 PM
Beta Test for Help Conquer Cancer - GPU v6.50
We are starting a new round of beta testing. This round is to accomplish three goals:

1) We have broadened the types of graphics cards that are able to download work. We are using this to confirm which cards should be allowed to receive work when we go to production.

2) We are confirming that the server change we made last week works as expected.

3) We are looking for other issues that we might have missed in previous rounds due to the bigger issues we were looking at.

thanks,
Kevin
Summary:
Beta testing of the HCC GPU 6.50 application has begun. The graphics-card whitelist has been broadened, last week's server change is being verified, and the team is looking for bugs that earlier rounds may have missed.


Posted on 2012-9-18 10:49:52
Sep 17, 2012 5:57:24 PM
Re: Beta Test for Help Conquer Cancer - GPU v6.50
We have our results from this round of beta testing.

Outcome #1:

The following cards/card types are confirmed to not be able to run the project:

AMD/ATI:

ATI Radeon HD 2300/2400/3200 (RV610)
ATI Radeon HD 2600 (RV630)
ATI Radeon HD 3800 (RV670)
ATI Radeon HD 4350/4550 (R710)
ATI Radeon HD 4600 series (R730)
ATI Radeon HD 4700/4800 (RV740/RV770)

NVIDIA:

GeForce 210
GeForce 310
GeForce 310M
GeForce 405
GeForce 610M
GeForce 8400
GeForce 8400GS
GeForce 8400 GS
GeForce 8500 GT
GeForce 8600 GS
GeForce 8600 GT
GeForce 8700M GT
GeForce 8800 GT
GeForce 8800 GTS 512
GeForce 8800M GTS
GeForce 9300 GE
GeForce 9400 GT
GeForce 9500 GT
GeForce 9600 GSO
GeForce 9600 GSO 512
GeForce 9600 GT
GeForce 9800 GT
GeForce 9800 GTX+
GeForce 9800 GTX/9800 GTX+
GeForce 9800 S
GeForce G210
GeForce GT 120
GeForce GT 130
GeForce GT 130M
GeForce GT 230M
GeForce GT 330M
GeForce GT 420
GeForce GT 520
GeForce GT 520M
GeForce GT 630
GeForce GTS 240
GeForce GTS 250
GeForce GTX 260M
GeForce GTX 280M
GeForce GTX 660M
NVS 3100M
NVS 4200M
NVS 5100M
Quadro 400
Quadro FX 1600M
Quadro FX 1800
Quadro FX 2700M
Quadro FX 2800M
Quadro FX 3700
Quadro FX 380
Quadro FX 570M
Quadro FX 580
Quadro FX 770M
Quadro FX 880M
Quadro NVS 290

Outcome #2:

We did not see a re-occurrence of the device-id issue (running on CPU vs. running on GPU), so this issue is fixed.

Outcome #3:

Too much noise from outcome #1 above.

We will be clearing out this run later today (in about 3 hours) and preparing for another run to look for additional issues without the noise of bad cards.
We have released more workunits to continue this test, but this time excluding the cards mentioned above.
Summary:
Results of the first round of HCC GPU v6.50 beta testing:
1. Certain cards are confirmed unsupported; see the list in the original post.
2. The device-id issue (work running on CPU vs. GPU) has been fixed.
3. The unsupported cards generated a flood of errored results, so a second round will run with those cards excluded: the first round focused on compatibility, the next one on application bugs.


Posted on 2012-9-22 09:28:25
Sep 21, 2012 7:02:40 PM
Beta Test for Help Conquer Cancer - GPU v6.51
We have confirmed that the set of graphics cards to exclude is correct. However, during the last test we observed that about 3.5% of results 'hung' at some point during processing. This beta test includes additional logging to help us identify what is going on.
Summary:
Beta testing of the HCC GPU 6.51 application has begun.
The card blacklist from the last round is confirmed effective, but about 3.5% of tasks hung at random during processing. This round adds extra logging to track down the cause.


Posted on 2012-9-29 09:50:27
Sep 28, 2012 8:45:07 PM
Beta Test for Help Conquer Cancer - GPU v6.55
We were able to identify the issue that was causing workunits to hang at the end of processing. We have implemented a fix that we believe will resolve the issue and that worked in the test cases we ran in alpha testing. Because the fix resides in the BOINC API layer that handles communication from the core client to the research application, we are discussing what we did with the BOINC development team. This round of beta testing will give us good insight into the behavior at large scale.

Please note that there is a natural 30-second to 3-minute span at the end of each workunit where the % complete does not increment. This is not the hang. Only if it has been stuck at this spot for more than 10 minutes should you consider it hung.

Please note that the GPU option should be fixed now, and the option to choose whether to run GPU work all the time or only when you are not using your computer is now available.
Summary:
HCC GPU 6.55 beta has begun.
The task-hang bug has been fixed and the fix held up in alpha testing; public beta is now starting. The fix lives in the BOINC API layer that handles communication between the core client and the research application, so the team is discussing the change with the BOINC developers.
Note: the progress bar normally pauses for 30 seconds to 3 minutes at the end of a task; that is expected. A "hang" means the progress bar has not moved for more than 10 minutes.
The GPU preference (run GPU work all the time, or only when the computer is idle) now takes effect; feel free to test it.
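The 10-minute rule above can be expressed as a small watchdog. This is a sketch, not part of the official software: samples are assumed to be (unix_timestamp, fraction_done) pairs polled from the client by whatever means you have.

```python
# Flag a task as hung only if its % complete has not moved for more than
# 10 minutes; shorter end-of-task stalls (30 s - 3 min) are normal.
HANG_SECONDS = 10 * 60

def is_hung(samples):
    """samples: chronological (timestamp, fraction_done) pairs for one task."""
    if len(samples) < 2:
        return False
    last_t, last_frac = samples[-1]
    # walk backwards to find when the current fraction was first seen
    stall_start = last_t
    for t, frac in reversed(samples[:-1]):
        if frac != last_frac:
            break
        stall_start = t
    return last_t - stall_start > HANG_SECONDS
```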

