This post was last edited by Baiqing_Lyu on 2020-4-16 11:32
As the title says, the original post is here: https://foldingforum.org/viewtopic.php?f=18&t=34224&start=15#p325911
Hi folks! I just wanted to step in to apologize for the persistent issues with plfah1-*.mskcc.org and plfah2-*.mskcc.org over the past couple of weeks.
Our poor little servers at MSKCC were purchased in 2007 and have been chugging away serving the majority of GPU work units probably up to where FAH hit nearly 1 exaflop before they started to fall over.
Also, I apologize for the perceived lack of transparency here---it's really been just a matter of putting out tons of other fires and focusing our energies on providing critical scientific support for active COVID-19 drug discovery efforts. Huge thanks to @bruce for pointing me here to at least dash off a quick update.
We've had a few issues that have appeared:
* Throughput limits with the server code: This is now addressed, or at least much improved!
* Drive failures: Now fixed, with more shelf spares that have just shown up!
* Disk space that was burned through something like 5-20x faster due to all of you wonderful people completing WUs so quickly: We're rapidly evacuating completed projects to a new external storage unit that just came online days ago, and have 14T free on plfah1 now. We're also going to release a new core22 version this week that will allow us to send back only the solute structure data and thus solve these storage issues for good.
* Server software stability: Occasionally, it looks like the server can't operate on specific WUs because the files are reported by the OS as "busy". We don't have a solution for this yet, but so far have been restarting the server when we encounter this. We could use a rapid warning system, maybe, instead of having to have @bruce ping us when this happens for a couple of hours.
TL;DR: I think it's OK to go ahead and keep the blacklist for a few more days. We've spun up a bunch of new external servers which are going to be much more performant.
We're bringing new (actual, brand new!) physical servers online at MSK in the next few days. We got them just before the COVID-19 emergency, and it's been slower to roll them out as a result, but we hope we can get them into service to replace plfah1/2 in the next week!
Huge thanks to all of you folks, and stay safe and healthy. We love you all.
~ John Chodera @ MSKCC ~
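The gist is that the servers plfah1-1.mskcc.org and plfah2-1.mskcc.org, bought in 2007, are finally about to retire, and new software and hardware with higher performance will soon take over their work! For now, these two servers supply work to the vast majority of GPUs; their performance is as follows:
(IP, address, server type, server software version, contact person, work units assigned per hour, errors, warnings, has work, work type, public WU count, Beta WU count, algorithm, space, uptime, last check time)
Looking forward to the new assignment rate after the upgrade!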