The Ranger Supercomputer

BiscuiT · posted 2008-1-21 16:48:46

http://huichen.org/66
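UT Austin and Sun have jointly built Ranger, a supercomputer with 62,976 cores. (The name literally translates as "巡游者"; I have given it the Chinese name "润哲", and my wife just calls it "the big computer".) It sits on UT's Pickle research campus, a few hundred steps from my office. Its theoretical peak is 504 trillion floating-point operations per second, and it is scheduled to go into full production on February 1 this year. Ranked by theoretical peak, it is second only to the IBM Blue Gene/L at LLNL, operated for the US National Nuclear Security Administration (theoretical peak of 593 trillion floating-point operations per second, according to last November's top500 report).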

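The Ranger ("润哲") supercomputer. Image from the TACC website.
Because of my work I am a user of Lonestar (孤星), another UT supercomputer with a theoretical peak of 62 TFlops, and I may well use Ranger in the future. So on Thursday and Friday this week I attended a two-day training course at UT's high-performance computing center (TACC), became one of the first trial users of Ranger, and went into the machine room to see this world number two up close.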

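First, Ranger's hardware:
A supercomputer is not, as many people imagine, a PC-like machine with an enormously high clock rate. It is a cluster: thousands of CPUs connected by a network (InfiniBand, Ethernet, and so on). Ranger has 3936 compute nodes. Each node's board has 4 sockets, each socket holds one AMD quad-core Barcelona processor, and each processor has 8 GB of memory. In total that gives 3936 × 4 × 4 = 62,976 cores (each clocked at 1.994 GHz) and 8 GB × 4 × 3936 = 125,952 GB of memory, which is about 123 TB.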
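The totals above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the node, socket, core and memory counts come from the post, while the figure of 4 double-precision floating-point operations per core per cycle and the nominal 2.0 GHz clock are assumptions added here to show how a peak of roughly 504 TFlops could be reached.

```python
# Per-node layout as described in the post.
NODES = 3936
SOCKETS_PER_NODE = 4
CORES_PER_SOCKET = 4
MEM_GB_PER_SOCKET = 8

# Assumptions (not stated in the post): 4 double-precision flops per core
# per cycle, and a nominal 2.0 GHz clock (the post quotes 1.994 GHz).
FLOPS_PER_CYCLE = 4
NOMINAL_CLOCK_GHZ = 2.0


def total_cores():
    """Number of cores in the whole machine."""
    return NODES * SOCKETS_PER_NODE * CORES_PER_SOCKET


def total_memory_gb():
    """Total memory in GB, summed over all sockets."""
    return NODES * SOCKETS_PER_NODE * MEM_GB_PER_SOCKET


def peak_tflops(clock_ghz=NOMINAL_CLOCK_GHZ):
    """Theoretical peak in TFlops: cores x flops per cycle x clock."""
    return total_cores() * FLOPS_PER_CYCLE * clock_ghz / 1000.0


if __name__ == "__main__":
    print("cores:       ", total_cores())
    print("memory (GB): ", total_memory_gb())
    print("memory (TB): ", total_memory_gb() / 1024)
    print("peak TFlops: ", peak_tflops())
    print("peak TFlops at 1.994 GHz:", peak_tflops(1.994))
```

Dividing 125,952 GB by 1024 gives exactly 123, which agrees with the 123TB listed in the TACC spec sheet quoted later in the thread.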

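The board of a compute node. Image from the TACC website.
AMD's Barcelona quad-core differs from Intel's "pseudo" quad-core in that the latter has no independent L2 cache per core, so communication between core0/1 and core2/3 has to go through memory. Barcelona adds an L3 cache between L2 and memory, which lets each core keep its own L2. Keep in mind that a core's access speed drops by roughly an order of magnitude at each step from L2 to L3 to memory, so inter-core communication on Barcelona is much faster. In addition, the CPUs on each node's board are linked by an asymmetric HyperTransport layout (CPU0 and CPU3 have no direct link), and communication between CPUs does not need to go through "local" memory, which speeds up communication among the four CPUs.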

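The world's largest InfiniBand switch. The empty slots at the top suggest that Ranger's CPU count can be expanded further. Image from the TACC website.
For a supercomputer, the interconnect speed between compute nodes matters more than the clock rate of a single CPU. Ranger uses two topspin270 InfiniBand switches in a P2P fat-tree topology. With the Pallas MPI benchmark (IMB, MPI-1 suite) I measured more than 1 GB/s, which is already very close to the hardware limit: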

#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
# ( 62 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
      #bytes #repetitions      t[usec]   Mbytes/sec
           0         1000         1.86         0.00
           1         1000         2.16         0.44
           2         1000         2.21         0.86
           4         1000         2.29         1.67
           8         1000         2.17         3.52
          16         1000         2.24         6.82
          32         1000         2.29        13.31
          64         1000         2.36        25.87
         128         1000         2.47        49.41
         256         1000         2.95        82.70
         512         1000         3.16       154.44
        1024         1000         3.89       251.30
        2048         1000         5.56       351.54
        4096         1000         8.04       486.15
        8192         1000        16.81       464.85
       16384         1000        27.81       561.80
       32768         1000        49.79       627.63
       65536          640        92.94       672.44
      131072          320       140.81       887.70
      262144          160       240.18      1040.87
      524288           80       445.92      1121.28
     1048576           40       883.92      1131.32
     2097152           20      2007.03       996.50
     4194304           10      3950.14      1012.62
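For readers who want to see how the last column relates to the first and third, here is a small Python sketch. The numbers in the table are consistent with a Mbyte being 2**20 bytes, so that is what the function assumes; the function name and the choice of sample rows are mine, not part of the benchmark.

```python
MBYTE = 2 ** 20  # the table's figures are consistent with 1 Mbyte = 2**20 bytes


def mbytes_per_sec(nbytes, t_usec):
    """Bandwidth in Mbytes/sec for a message of nbytes taking t_usec microseconds."""
    if t_usec <= 0:
        raise ValueError("time must be positive")
    return (nbytes / MBYTE) / (t_usec * 1e-6)


# (bytes, t[usec], reported Mbytes/sec) copied from three large-message rows.
SAMPLES = [
    (524288, 445.92, 1121.28),
    (1048576, 883.92, 1131.32),
    (4194304, 3950.14, 1012.62),
]

if __name__ == "__main__":
    for nbytes, t_usec, reported in SAMPLES:
        computed = mbytes_per_sec(nbytes, t_usec)
        print(f"{nbytes:>8} bytes: computed {computed:8.2f}, reported {reported:8.2f}")
```

For small messages the recomputed value can differ slightly from the reported one, because the times in the table are rounded to two decimals. The zero-byte row gives the latency: about 1.86 microseconds.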

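A machine this large (a footprint of roughly 13 m × 13 m) consumes a great deal of electricity for power and water cooling; the electricity bill is about one million US dollars a year. Standing among the racks, I could hear the roar and feel the hot exhaust blowing in my face. The compute nodes are designed to be easy to pull out and plug in, so replacing one is simple. The maintenance staff showed us a blade that had just failed and been pulled out. One very cool design touch is a blue button in the middle of the board: press it, and the light next to the failed component comes on. In this case it turned out that the memory had died.

Next, the software:
Logging in to Ranger's login node is no different from logging in to any other Linux system. Here is what I saw when I logged in: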

chen@chen:~$ ra
[email protected]'s password:
Last login: Thu Jan 17 15:00:02 2008 from xxxxxxx.utexas.edu
------------------------------------------------------------------------------
         Welcome to the Ranger Opteron Linux Cluster
Texas Advanced Computing Center, The University of Texas at Austin
         ** Unauthorized use/access is prohibited. **
------------------------------------------------------------------------------
** Welcome Early Users!
--> Ranger uses the modules program to control your user environment. To see
what packages are available, issue: "module avail"
--> Draft User Guide: http://www.tacc.utexas.edu/services/userguides/ranger/
--> Please contact your assigned early-user support liaison with any questions
--> Example SGE job scripts available in /share/doc/sge
--> ** NEW **: the OpenMPI stack is available for testing.  To access,
--> issue: "module swap mvapich2 openmpi" from a default login shell.
--> ** System Maintenance **: Ranger will be down beginning Friday,
--> January 11th at 11am (Central) to prepare for final acceptance
--> testing.
--> ** normal queue is currently *closed* to all users.
------------------------------------------------------------------------------
--> ** Notes for OSU MVAPICH Team: (1/15/08): systest queue is now open
------------------------------------------------------------------------------
login4$

The system runs CentOS Linux (based on Red Hat). Some system information from the login node:

login4$ cat /proc/version
Linux version 2.6.9-55.0.9.EL_lustre.1.6.3smp([email protected]) (gcc version 3.4.6 20060404 (RedHat 3.4.6-8)) #2 SMP Mon Dec 17 18:34:43 CST 2007

login4$ cat /proc/meminfo
MemTotal:    32916688 kB
MemFree:    32062484 kB
Buffers:       91164 kB
Cached:       423900 kB
SwapCached:       0 kB
Active:       270068 kB
Inactive:    265000 kB
HighTotal:          0 kB
HighFree:          0 kB
LowTotal:    32916688 kB
LowFree:    32062484 kB
SwapTotal:    4192824 kB
SwapFree:    4192824 kB
Dirty:          392 kB
Writeback:          0 kB
Mapped:       29768 kB
Slab:          161024 kB
CommitLimit:  20651168 kB
Committed_AS: 56416 kB
PageTables:    2292 kB
VmallocTotal: 536870911 kB
VmallocUsed:    11888 kB
VmallocChunk: 536858999 kB
HugePages_Total:    0
HugePages_Free:    0
Hugepagesize:    2048 kB

Software modules installed on the system:

login4$ module avail
------------ /opt/apps/intel10_1/modulefiles ----------
acml/4.0.1 fftw3/3.1.2 hdf5/1.6.5 mvapich/0.9.9 mvapich2/1.0  netcdf/3.6.2  openmpi/1.2.4
---------- /opt/apps/pgi7_1/mvapich2_1_0_1/modulefiles ----------
fftw2/2.1.5             petsc/2.3.3-complexdebug petsc/2.3.3-debug       tao/1.9(default)
petsc/2.3.3(default)    petsc/2.3.3-cxx       slepc/2.3.3(default)    tao/1.9-debug
petsc/2.3.3-complex    petsc/2.3.3-cxxdebug    slepc/2.3.3-debug
--------- /opt/apps/pgi7_1/modulefiles ----------
acml/4.0.1       hdf5/1.6.5       mvapich-devel/0.9.9 netcdf/3.6.2
fftw3/3.1.2       mvapich/0.9.9    mvapich2/1.0       openmpi/1.2.4
--------- /opt/modulefiles ----------
Linux             TACC             cluster          java/1.4.2       java/1.5.0 java/1.6.0(default)
--------- /opt/apps/modulefiles ---------
binutils-amd/070220 gsl/1.10          intel/9.1          pgi/7.1
gotoblas/1.22    intel/10.1(default) mkl/10.0          sun/12
---------- /opt/apps/teragrid/modulefiles --------
apache-ant/1.6.5    globus/4.0.1(default)  gsissh/4.1          teragrid-basic       tgresid/2.0.0
condor/6.7.18(default) globus/4.0.5          gx-map/0.5.3.2       teragrid-dev          tgresid/2.0.3(default)
condor/6.9.1          globus/4.1.3          pacman/3.20          tg-policy/0.2       tgusage/2.9
condor-g/6.7.18       globus-4.0          srb-client/3.4.1    tgproxy/0.9.1

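The main compilers on Ranger are the Intel compilers (C/C++/Fortran) and the Portland Group C/C++/Fortran compilers. Parallel programming uses OpenMP and MPI (mvapich2), and the numerical libraries include MKL and ACML. Since each Ranger node has 16 cores and 32 GB of memory, there is real scope for hybrid OpenMP/MPI programming, and the second morning of the training included a lecture and a lab on hybrid parallel programming.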
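To make the hybrid idea a little more concrete, the sketch below lists the ways 16 cores on one node can be split into MPI tasks and OpenMP threads per task, along with the memory each task would get if the 32 GB were shared evenly. It is only an illustration of the arithmetic; the function name is hypothetical and nothing here comes from the training material.

```python
CORES_PER_NODE = 16   # from the post
MEM_GB_PER_NODE = 32  # from the post


def hybrid_layouts(cores=CORES_PER_NODE, mem_gb=MEM_GB_PER_NODE):
    """Return (mpi_tasks, omp_threads_per_task, mem_gb_per_task) tuples
    for every split that uses all cores on a node exactly once."""
    layouts = []
    for tasks in range(1, cores + 1):
        if cores % tasks == 0:
            layouts.append((tasks, cores // tasks, mem_gb / tasks))
    return layouts


if __name__ == "__main__":
    print("MPI tasks  OMP threads  GB per task")
    for tasks, threads, mem in hybrid_layouts():
        print(f"{tasks:>9}  {threads:>11}  {mem:>11.1f}")
```

With 4 sockets of 4 cores each, the 4 tasks × 4 threads split corresponds to one MPI task per socket, which is one natural choice to try; whether it is the fastest depends on the application.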
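Job scheduling (the batch system) uses SGE and NUMA control.
Worth mentioning: at the training I met the legendary Goto, author of the numerical library gotoblas. The lab code we used included a timer routine he wrote in C and assembly, which reportedly brings the measurement precision down to 8 clock cycles.
Other notes:
The TACC staff were extremely friendly and answered every question; they even rushed to throw away my used paper cup for me.

The machine room is also very open. One phone call is enough to be taken inside for a tour. The building that houses Ranger has no security guards, and the machine room walls are glass, so you can see inside from the corridor. The whole place has a "serve the people" air about it.

Parallel computing is a lot of fun. When I have time I will definitely write an introduction to it.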

BiscuiT · posted 2008-1-21 16:53:46

Sun Constellation Linux Cluster

System Name:	Ranger
Operating System:	Linux
Number of Nodes:	3,936
Number of Processing Cores:	62,976
Total Memory:	123TB
Peak Performance:	504TFlops
Total Disk:	1.73PB
Description:
TACC, in partnership with Sun Microsystems, will build a supercomputer system specifically designed to support very large science and engineering computing requirements. The new supercomputer system, named Ranger, will have a peak performance in excess of 500 trillion floating point operations per second (Teraflops), making it one of the most powerful supercomputer systems in the world. It will also provide over 100 Terabytes of memory and 1.7 Petabytes of raw disk storage. The system is based on next-generation Sun blade servers using AMD's forthcoming quad-core processors. The Ranger system will be housed in TACC's new building on the J.J. Pickle Research Campus in Austin, Texas. Ranger will go into initial production in December 2007 using Linux (based on a CentOS distribution) and all the system components will be connected via a full-CLOS InfiniBand interconnect. Eighty-two compute racks will be used to house the quad-socket compute infrastructure with additional racks housing login, I/O, and general management hardware. Each compute node will be provisioned using local storage and global, high-speed file systems will be provided using the Lustre file system running across 72 I/O servers. Users will interact with the system via four dedicated login servers and a suite of eight high-speed data servers. Resource management for job scheduling will be provided with Sun Grid Engine (SGE).

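Tynox · posted 2008-1-21 17:17:20

Very nice, very powerful!
Just imagine running BOINC on a machine like this: how many credits could it earn in a day? How many tasks would it get through?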

gongmao1_2000 · posted 2008-1-21 19:54:11

Is this something the original poster is experiencing in person?

BiscuiT · posted 2008-1-21 21:10:26

The source address is given right at the top of the post... http://huichen.org/66

Sorry... I guess it was not marked very prominently... (embarrassed)

zglloo · posted 2008-1-22 20:17:41

It looks like huichen really did visit this machine room in person!

refla · posted 2008-10-14 18:26:27

Your smiley faces don't seem to be the same as the ones on this forum!?

lfk · posted 2008-10-14 19:07:24

Dream on.........
Still, the odds of a private individual building a cluster like this seem very, very small....
