1. Cluster Overview-Hardware and Set Up 


The Computational Molecular Science (CMS) cluster computing facility has been built up with funding from a variety of sources at The University of Queensland, including DVC(Research), BACS and EPSA faculties, IMB, AIBN, and core user group contributions. The strategy for its usage has been to advance research at UQ in the general area of molecular computations. In line with this strategy, the facility is generally available for usage by the research groups of the Associate Investigators of CCMS.

The CMS cluster facility consists of three distributed memory clusters and two shared memory machines. Below are the technical specifications of the currently installed hardware. The first cluster is used by the CCMS and the research groups of its associate investigators at UQ. The second cluster was funded primarily through a Qld Government Smart State Grant, applied for by the Hypersonics centre at UQ with Smith as a CI, and is mainly used by researchers from the Hypersonics centre and the CCMS. The third cluster is a 90-node version of the second cluster, with upgraded Opteron processors (2.4GHz) and larger disks (73GB per node). 63 of these nodes are being acquired by the CCMS and 27 are being acquired for the research group of Federation Fellow Professor Alan Mark.

1.1 Cluster Overview-Hardware

The specifications of the currently installed machines are as follows:

 

  Xeon Cluster:

  • 128 Sun Fire v60x servers

  • CPU: 2 x Xeon (32-bit) Processor (2.8GHz)

  • Physical memory: 3 GB RAM

  • Disk space: 1 x 36GB SCSI Hard Disk

  • Network connection: 2 x Gigabit Ethernet interfaces

  2.2GHz Opteron Cluster:

  • 66 Sun Fire v20z servers

  • CPU: 2 x AMD Opteron 248 Processors (2.2GHz)

  • Physical memory: 4GB RAM

  • Disk space: 1 x 36 GB SCSI Hard Disk

  • Network connection: 2 x Gigabit Ethernet interfaces

  • Service Processor: basic remote administration through IPMI (Intelligent Platform Management Interface).

  2.4GHz Opteron Cluster:

  • 90 Sun Fire v20z servers

  • CPU: 2 x AMD Opteron 250 Processors (2.4 GHz)

  • Physical Memory: 4GB RAM

  • Disk space: 1x 73 GB SCSI Hard Disk

  • Network connection: 2x Gigabit Ethernet interfaces

  • Service Processor: basic remote administration through IPMI (Intelligent Platform Management Interface).

The above hardware has been arranged into the following four clusters:

HPC Cluster: Giza

Giza is used to run Gaussian, Molpro and Gamess calculations and mainly single processor jobs using home-grown Fortran and C/C++ codes.

  • Compute nodes: 92 Sun Fire v60x servers

  • CPU: 2 x Xeon (32-bit) processor (2.8GHz) 

  • Physical memory: 3GB RAM

  • Disk space: 1 x 36GB SCSI Hard Disk 

  • Network connection: 2 x Gigabit Ethernet Interfaces

  • Batch and scheduling system: Sun Grid Engine, Enterprise edition 5.3

  • Operating System: Linux

  • 32 compute nodes have been taken out of the cluster due to limitations in space and cooling. They can be used individually if the need arises but not as part of the cluster. This current setup is likely to change in future when more space and appropriate cooling facilities become available.

 

HPC Cluster: Blackhole

The calculations run on Blackhole are mainly home-grown parallel MPI codes in Fortran and C/C++. A few nodes are set aside for single processor calculations like Molpro, Gamess and home-grown codes.

 

  • Compute nodes (rack 1): 33 Sun Fire v20z servers
  • CPU: 2 x AMD Opteron 248 Processors (2.2GHz)
  • Physical memory: 4GB RAM
  • Disk space: 1 x 36 GB SCSI Hard Disk
  • Compute nodes (racks 2 and 3): 60 Sun Fire v20z servers
  • CPU: 2 x AMD Opteron 250 Processors (2.4GHz)
  • Physical memory: 4GB RAM
  • Disk space: 1 x 73 GB SCSI Hard Disk
  • Network connection: 2 x Gigabit Ethernet Interfaces
  • Service Processor: basic remote administration through IPMI (Intelligent Platform Management Interface)
  • Batch and scheduling system: Sun Grid Engine, Enterprise edition 5.3
  • Operating System: Linux

HPC Cluster: Grape      

  • Compute nodes: 30 Sun Fire v20z Servers

  • CPU: 2 x AMD Opteron 250 Processors (2.4 GHz)

  • Physical memory: 4 GB RAM

  • Disk space: 1 x 73 GB SCSI Hard Disk

  • Network connection: 2 x Gigabit Ethernet Interfaces

  • Service Processor: basic remote administration through IPMI (Intelligent Platform Management Interface)

  • Operating System: Linux

HPC Cluster: Baby-Blackhole

This is a separate cluster housed at the Hypersonics Centre. It will serve as a test cluster, not just for codes but also for new operating systems and cluster configurations, so that we can test new systems before we port them to the other 3 clusters.

  • Compute nodes: 33 Sun Fire v20z Servers

  • CPU: 2 x AMD Opteron 248 Processors (2.2 GHz)

  • Physical memory: 4 GB RAM

  • Disk space: 1 x 36 GB SCSI Hard Disk

  • Network connection: 2 x Gigabit Ethernet Interfaces

  • Service Processor: basic remote administration through IPMI (Intelligent Platform Management Interface)

  • Operating System: Linux
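
The node counts of the four clusters can be cross-checked against the hardware totals in section 1.1. The following is a minimal sketch of that arithmetic; the dictionary keys are labels chosen here for illustration, and all numbers are taken from the lists above.

```python
# Node totals per hardware type, as listed in section 1.1.
hardware = {"xeon_2.8": 128, "opteron_2.2": 66, "opteron_2.4": 90}

# Compute nodes per cluster and hardware type, as listed above.
clusters = {
    "Giza": {"xeon_2.8": 92},
    "Blackhole": {"opteron_2.2": 33, "opteron_2.4": 60},
    "Grape": {"opteron_2.4": 30},
    "Baby-Blackhole": {"opteron_2.2": 33},
}


def allocated(hw_type):
    """Sum the compute nodes of one hardware type over all clusters."""
    return sum(nodes.get(hw_type, 0) for nodes in clusters.values())


# Machines of each type that are not counted as compute nodes of any cluster.
unallocated = {hw: total - allocated(hw) for hw, total in hardware.items()}
print(unallocated)
```

Both Opteron types are fully allocated. The 36 remaining Xeon machines correspond to the 32 nodes taken out of Giza plus the 4 additional rack3 nodes used for testing and network services (see section 1.2).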


Shared Memory Workstation: Redback

  • Sun Fire 280R
  • CPU: 2 x UltraSPARC III 1.2GHz 
  • Physical memory: 8GB
  • Disk space: 2 x 73GB SCSI Hard Disk 
  • Network connection: Gigabit Ethernet Interface
  • Remote System Control (RSC) module 
  • Operating system: Solaris 9


Shared Memory Workstation: Huntsman

  • Sun Fire V880
  • CPU: 8 x UltraSPARC III 1.2GHz
  • Physical memory: 32GB
  • Disk space: 6 x 73GB SCSI Hard Disk
  • Network connection: Gigabit Ethernet Interface
  • Remote System Control (RSC) module 
  • Operating system: Solaris 9


Other Hardware

  • Cisco Catalyst Gigabit-Ethernet switch

  • Firewall

  • 2 x front end servers for Giza and Blackhole

  • Desktop to drive queue

 

1.2 Cluster Overview-Set Up

Giza, Blackhole, Grape, Huntsman and Redback are located at the CCMS in the Chemistry building (Baby-Blackhole is housed in the Hypersonics centre in the Mechanical Engineering building). The clusters and shared memory machines are housed in two separate neighbouring rooms. The first room houses the 3 racks of Blackhole, Redback and Huntsman. The second room houses the 3 racks of Giza, the one rack of Grape, the switch, the 3 desktops running the queues, the firewall and the front ends for Giza, Blackhole and Grape.

 

            

 

The clusters are on an internal network behind a firewall, which controls access to our clusters from the outside world. The only access from the outside world is to the front ends of Giza, Blackhole and Grape, or to Redback. There is no direct access from the outside to any of the other hardware on the internal network. Users log in to the front end of Giza, Blackhole or Grape. The front end of each cluster is where users have their home directories, edit and compile their codes, and submit their jobs to the cluster. Users have no access to the compute nodes.

 

       

 

Giza has 92 compute nodes, 32 to each rack (rack3 has 4 additional nodes that are used for testing and for network services like DHCP and image server), which are named and numbered Sphinx001 to Sphinx092. Blackhole has 93 compute nodes, 33 in rack1 (these are the nodes with the 2.2GHz processors and the 36GB SCSI disks) and 30 in each of rack2 and rack3 (these are the nodes with the 2.4GHz processors and the 73GB SCSI disks), which are named and numbered Star01 to Star93. Grape has 30 compute nodes which are named and numbered Merlot01 to Merlot30. Each cluster is managed by a separate queueing system running the Sun Grid Engine Enterprise version 5.3 software. Below is a sketch of the principal setup of the three clusters and the two SMP machines.

 

 

Users can only submit jobs to the Sphinx compute nodes from Giza, to the Star compute nodes from Blackhole and to the Merlot compute nodes from Grape. The home directories on Blackhole are mounted on all the Star nodes, the home directories on Giza are mounted on all the Sphinx nodes and the home directories on Grape are mounted on all the Merlot nodes. This enables users to read and write from and to their home directories during a calculation. For large temporary files all nodes (Sphinx, Star and Merlot) have 21GB (or 53GB) of local scratch space available which can be accessed by a job running on the node.
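
The naming scheme and the rule that each front end only submits to its own nodes can be written down compactly. This is a minimal sketch for illustration only: the function names are hypothetical, and the check mirrors the rule described in the text rather than how the queueing system enforces it.

```python
# Front end -> (node name prefix, digits in the node number, node count),
# all taken from the text above.
CLUSTERS = {
    "Giza": ("Sphinx", 3, 92),
    "Blackhole": ("Star", 2, 93),
    "Grape": ("Merlot", 2, 30),
}


def node_names(front_end):
    """Return the compute node names served by one front end."""
    prefix, width, count = CLUSTERS[front_end]
    return [f"{prefix}{i:0{width}d}" for i in range(1, count + 1)]


def may_submit(front_end, node):
    """True if jobs submitted on this front end can run on the given node."""
    return node in node_names(front_end)


print(node_names("Giza")[0], node_names("Giza")[-1])
print(may_submit("Blackhole", "Star93"), may_submit("Giza", "Star01"))
```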

Huntsman and Redback are stand-alone machines. Users have their home directories locally on each machine and run jobs directly on Huntsman or Redback. There is no queueing system, as both machines have only a very small and exclusive user community. The SMP machines are used for large memory calculations employing Gaussian, Molpro or Gamess.

We recently acquired a server (funnelweb) which houses additional temporary file space for Huntsman and Redback. This server is currently being set up.

2. Operating System

Linux is installed on the clusters and Solaris on the SMP machines. When installing the clusters we wanted a 2.6 kernel because of its better support for SMP (the Xeon processors have hyper-threading technology). The Linux distribution Fedora Core 2, which had just come out at the time of the first setup of the clusters, has a 2.6 kernel. We were also familiar with Red Hat type Linux distributions, which was one more reason why we decided to use Fedora as our main operating system. For a while we considered changing to an enterprise Linux distribution, but in view of the number of compute nodes plus the additional front end servers the cost would have been too high.

The first installation involved only the Xeon and the 2.2GHz Opteron clusters, as the 2.4GHz Opteron cluster had not been acquired at that time (September 2004). We installed Fedora Core 2 on both front ends and all the nodes: the standard distribution on Giza and its compute nodes and a 64-bit version on Blackhole and the Star nodes. After the full installation we unfortunately found out that Gaussian does not support 2.6 kernels, and Gaussian would not run on the Sphinx nodes with Fedora Core 2 installed. We therefore installed Red Hat 9 on two racks of the Giza cluster. This and the start of the upgrade have led to the following distribution of operating systems on the clusters (in the time from September 2004 until May 2005).

 

After the new 2.4GHz Opteron cluster arrived we re-arranged our clusters (starting June 2005). Along with the installation of the new cluster Grape and the 2 new racks for Blackhole, the reshuffle of Giza and Blackhole and the creation of Baby-Blackhole, we also decided to upgrade the operating systems on all clusters. Fedora Core 4 had come out just a few days before the upgrade, and we decided not to go with the newest Fedora Core version but with Fedora Core 3, as this seemed to have been a very stable distribution.

We installed Fedora Core 3 on Giza's front end, on one of Blackhole's front ends and on Blackhole's and Grape's compute nodes. In view of its number of compute nodes (93), Blackhole now has two front end servers, Blackhole and Blackhole2, sharing the load. Users can log in to either of them and home directories are cross mounted. Blackhole2, which is a new front end server, had severe problems with Fedora Core 3, and we therefore installed Fedora Core 4 on Blackhole2, which works fine.

In view of Gaussian's troubles running under a 2.6 kernel (see the old configuration above) we first installed a Sphinx test node with Fedora Core 3 to test Gaussian, and to our amazement Gaussian runs happily under Fedora Core 3 (this is a version that has been compiled under Fedora Core 2 x86_64 with the Portland compiler for an x86 system).

The queue for Blackhole runs Fedora Core 3 and the queue for Giza is still running Fedora Core 2. The Giza queue machine housed the old queue (in the old configuration Giza and Blackhole were served by one queue, but with the increase in nodes we thought it better to have separate queues for each of the clusters), and we wanted to keep the old config files and a working queue in case something went wrong. It will be upgraded in the next round.

The new distribution of operating systems on the clusters now looks like this:

3. Challenges Scaling Up

During the installation of the clusters we have faced several challenges. The operation of small clusters, 2-6 nodes, with a small user community (usually one research group in the same office location) works in most cases by applying common sense and consideration for other users. When scaling up to large clusters with a wider user community, comprising several different research groups in different physical locations, administrators are presented with several problems due to the scale of the system.

 

3.1 Installation of the operating system 

The first challenge was the effective installation and maintenance of the operating system. With the large number of compute nodes, installing each node individually is not feasible. We therefore had to look into imaging tools to install the operating system across several nodes with as little effort as possible. We finally decided to use SystemImager; more details are given in section 4.1.

 

3.2 Assignment of IP addresses

On small clusters IP addresses are usually assigned statically during installation of the operating system. On large clusters, where the installation of the operating system is handled automatically, the assignment of IP addresses also has to be performed automatically. One would also like the range and order of the IP addresses to reflect the order and number of the nodes they are assigned to.

The assignment of IP addresses during imaging is achieved via a DHCP server. The DHCP server assigns an IP address from a pool of available addresses pre-defined by the administrator. To assign the IPs in order (according to node number) we imaged the nodes in order (one after the other, with enough time in between to make sure that an IP had been assigned) and found that IPs are given out of the pool in reverse order. So if a range of 140.1.0.100 to 140.1.0.200 has been given to the DHCP server, the IP 140.1.0.200 would be given out first. As a result, the Star nodes are numbered bottom to top in the rack and the Sphinx nodes top to bottom.
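
The reverse-order behaviour can be illustrated with a short sketch. It only models what we observed (the highest address in the pool is handed out first) and is not a description of how any particular DHCP server is implemented; the function name and the node labels are hypothetical.

```python
import ipaddress


def reverse_pool_assignment(first_ip, last_ip, nodes_in_imaging_order):
    """Map nodes to IPs when the pool is handed out from the top down."""
    first = ipaddress.IPv4Address(first_ip)
    last = ipaddress.IPv4Address(last_ip)
    pool_size = int(last) - int(first) + 1
    if len(nodes_in_imaging_order) > pool_size:
        raise ValueError("more nodes than addresses in the pool")
    # The first node imaged receives the last address of the pool, and so on.
    return {
        node: str(last - offset)
        for offset, node in enumerate(nodes_in_imaging_order)
    }


# Hypothetical node labels, imaged one after the other.
nodes = [f"node{i:02d}" for i in range(1, 4)]
assignment = reverse_pool_assignment("140.1.0.100", "140.1.0.200", nodes)
print(assignment)
```

With the pool from the example above, the first node imaged receives 140.1.0.200, the second 140.1.0.199 and the third 140.1.0.198, so the imaging order has to be chosen with this reversal in mind.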


The DHCP server is only needed for the initial installation of the operating system and for re-imaging. The imaging tool is set up to make the IPs assigned to the nodes static so the DHCP server is not needed for normal operation of the clusters.

 

3.3 Highly-loaded front end servers

The front end servers of Giza and Blackhole have been our biggest challenge so far. Previously, when operating a small cluster, user workstations were the compute nodes and there was in general no central point for login. This meant that there was no excessive load on one machine but approximately the same load on all machines.

Now with a front end for Giza and Blackhole there is one dedicated server for the whole cluster and the load on it can get very high. Therefore users are not allowed to run jobs, even small ones, on the front ends. They are exclusively for compiling, editing and submitting. Graphical post processing of results on the front end is also strongly discouraged.


3.4 Cluster environment

The clusters have to be operated in the right environment. The temperature has to be stable around 21 degrees Celsius and a reliable cooling system is essential. Chilled water supply failures still account for nearly all of our cluster down times.

Another environmental issue is the noise level. Not only the clusters themselves but also the air conditioning creates a great amount of noise.

When operating clusters in a very small space, as here at CCMS, making the most of the available space is essential. The racks of our cluster came pre-assembled, with the network switches installed in the racks. The racks therefore occupy very little space and there is no need to run long cables under the floor or in the ceiling.

 

3.5 User management

One of the biggest problems administrators are faced with is the management of users. User experience and especially user expectation can vary greatly. To communicate effectively with users we have set up two dedicated e-mail addresses for each cluster. We also set up a web page with all the information needed to get started, as well as general information on the hardware and setup.

There are four administrators at present, associated with two different research centres and located in two different buildings. The staff from CCMS are mainly responsible for the Giza cluster and the SMP machines, while Hypersonics staff are responsible for the Blackhole cluster.

  • Marlies Hankel, Postdoc (CCMS)
  • Hong Zhang, Research Fellow (CCMS)
  • Rowan Gollan, PhD (Hypersonics)
  • Andrew Denman, PhD (Hypersonics)

We also needed an effective way for the administrators to communicate about the configuration of the clusters. For this we set up an electronic log book to which each of the administrators has access. There we log every event concerning the front ends, the nodes, the SMP machines and the infrastructure. Information on users is also kept there.

4. Cluster Management Tools

The cluster management tools described here have been essential in solving the problems we faced over the past year of operation.

To issue commands cluster-wide we use the C3 (Cluster Command and Control) tools.


4.1 Cluster imaging tool: SystemImager (v3.2)

The imaging tool is one of the most important tools we use. When we set up the clusters we desired a 2.6 kernel. Imaging tools such as Rocks or Oscar did not support a 2.6 kernel at the time of set up. SystemImager is independent of the Linux distribution and is installed via RPMs.

One can install one node and configure it for the cluster's needs. This node then serves as the so-called golden client. An image of the golden client is then pulled onto the server and from there pushed out to all other nodes. We have set up the server to push out the image via the network, so there is no need to boot nodes from floppy or CD. We keep an image for each operating system installed on the nodes, Fedora Core 3 (32-bit) and Fedora Core 3 (64-bit). The server is set up so that it knows which image to push out to which node.
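
The idea of the server knowing which image belongs to which node can be sketched as a simple lookup by node name. This is an illustration only and not SystemImager's configuration format: the image names are hypothetical, and the use of the 64-bit image for the Merlot nodes is an assumption based on their Opteron processors.

```python
# Hypothetical image names; the text only states that one 32-bit and one
# 64-bit Fedora Core 3 image are kept.
IMAGE_BY_PREFIX = {
    "Sphinx": "fc3-32bit",  # Xeon (32-bit) nodes of Giza
    "Star": "fc3-64bit",    # Opteron nodes of Blackhole
    "Merlot": "fc3-64bit",  # Opteron nodes of Grape (assumed 64-bit image)
}


def image_for(node_name):
    """Pick the image for a node from the prefix of its name."""
    prefix = node_name.rstrip("0123456789")
    try:
        return IMAGE_BY_PREFIX[prefix]
    except KeyError:
        raise ValueError(f"no image configured for {node_name!r}") from None


print(image_for("Sphinx001"), image_for("Star93"))
```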


4.2 Cluster file sharing: NFS (Network file system)

The home directories of the users on Blackhole, Giza and Grape are hard mounted on the nodes via NFS. NFS is part of the Linux distribution and is very easy to set up. Each front end is an NFS server for its nodes, and the number of requests, such as reads and writes from and to the home directories, can put considerable stress on the front ends. This can be a point of failure for the cluster as a whole. I/O intensive jobs can overload the front end and, in the worst case, crash it. Therefore I/O intensive jobs are required to use the local scratch space on the nodes. If this is not possible the job cannot be run on the clusters.
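
A job that follows this rule typically copies its input to the local scratch space, does all heavy I/O there, and copies only the results back to the NFS-mounted home directory. The following is a minimal sketch of that pattern; the function name, its parameters and the scratch location passed in are assumptions for illustration and do not describe the actual paths on our nodes.

```python
import shutil
import tempfile
from pathlib import Path


def run_in_scratch(input_file, work, scratch_root, result_dir):
    """Copy input to node-local scratch, run `work` there, copy results back."""
    input_file, result_dir = Path(input_file), Path(result_dir)
    result_dir.mkdir(parents=True, exist_ok=True)
    # A private directory under the scratch root; removed when the block exits.
    with tempfile.TemporaryDirectory(dir=scratch_root) as tmp:
        scratch = Path(tmp)
        local_input = scratch / input_file.name
        shutil.copy2(input_file, local_input)
        # `work` does its heavy I/O on local disk and returns its output files.
        outputs = work(local_input, scratch)
        # Only the final results travel back over NFS.
        return [Path(shutil.copy2(out, result_dir)) for out in outputs]
```

Here `work` stands for whatever the job actually computes: a callable that takes the local input file and the scratch directory and returns the list of output files it wrote.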

To address this problem we upgraded NFS from version 3 to version 4. NFSv4 is said to be more stable and better suited to a cluster. The setup went well and everything worked fine for a while, but then some problems arose. Some of the users could no longer access parts of their home directories during their jobs. A large number of error messages also appeared in the logs and caused problems. For this reason we went back to NFSv3 but now use NFS over TCP. This seems to work fine so far.

We also tried to install a parallel virtual file system (PVFS) but again encountered problems. For the PVFS several Star nodes were set up as servers to create a large file system from the local scratch space available on each node. This space could then be accessed from all Star nodes without putting any stress on the front end, as the file system is maintained by the server nodes. Here a parallel job that usually ran fine on Blackhole managed to crash all the server nodes, which caused severe problems. As we had no time to look more closely into this we uninstalled PVFS and now rely only on NFS again.

5. Queueing System

The queueing system software we use on the clusters is the Sun Grid Engine Enterprise Edition 5.3 software from Sun. Each cluster is served by its own queue. With the version of SGE we are using at the moment, each node is represented as a separate queue. So we have 92 queues on Giza, 93 queues on Blackhole and 30 queues on Grape. There is no 'long', 'short' or 'express' queue which might contain several nodes, but these kinds of attributes can be assigned to the queues represented by the nodes.

The Sphinx and Star queues have 2 slots each, so 2 jobs can be run on each node. (The queue for Grape has not been set up yet as this cluster is still being configured, but will possibly be similar to the other queues.)

On Blackhole we had to set up exclusive queues for batch and parallel jobs. We encountered problems when a parallel job was occupying one slot on a node and the other slot was occupied by a single processor job. The single processor job would often use too many resources and would 'starve' the parallel job, which would then usually crash.
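
The effect of exclusive queues can be sketched as follows: each node queue is dedicated to one kind of job, so a single processor job can never take the second slot on a node that runs part of a parallel job. This is a simplified model for illustration, not Sun Grid Engine's configuration or API; the class, the function and the assignment of kinds to particular Star nodes are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NodeQueue:
    name: str
    kind: str       # "parallel" or "batch"; each queue serves one kind only
    slots: int = 2  # two slots per node queue, as described above
    used: int = 0


def place(queues, job_kind):
    """Put a job on the first queue of matching kind with a free slot."""
    for queue in queues:
        if queue.kind == job_kind and queue.used < queue.slots:
            queue.used += 1
            return queue.name
    return None  # no suitable slot free; the job has to wait


# Hypothetical split of two nodes between the two kinds of jobs.
queues = [NodeQueue("Star01", "parallel"), NodeQueue("Star02", "batch")]
print(place(queues, "batch"), place(queues, "batch"), place(queues, "batch"))
```

In this model the third batch job is not placed on Star01 even though that node has free slots, which is exactly the separation that keeps single processor jobs from starving parallel ones.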


Memory and disk space have to be specified by the user when submitting a job. This is important because, with two jobs per node, jobs have to share the available resources.
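
Why these requests matter can be seen from a small bookkeeping sketch: a second job fits on a node only if its requests stay within what the first job left free. The function names and resource keys are hypothetical and this is not how Sun Grid Engine is configured; the 4 GB of memory and 21 GB of scratch space are figures given earlier in this document.

```python
def fits(node_free, request):
    """True if every requested resource is within what the node has free."""
    return all(request[key] <= node_free[key] for key in request)


def reserve(node_free, request):
    """Return the resources left on the node after accepting the request."""
    if not fits(node_free, request):
        raise ValueError("request exceeds free resources on this node")
    return {key: node_free[key] - request.get(key, 0) for key in node_free}


# A node with 4 GB RAM and 21 GB of local scratch space.
free = {"mem_gb": 4, "scratch_gb": 21}
free = reserve(free, {"mem_gb": 3, "scratch_gb": 10})
print(free)
print(fits(free, {"mem_gb": 2, "scratch_gb": 5}))
```

After the first job has reserved 3 GB of memory, only 1 GB is left, so a second job asking for 2 GB does not fit on this node even though a slot is still free.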


Apart from this there are no other limits to jobs in place at the moment. Even with a queueing system in place we still heavily rely on the courtesy, common sense and consideration of the users.

6. Application Software

The application software installed and currently used on the clusters includes the following:

Solid State Materials:

DL_POLY (DL_PROTEIN), VASP, PWSCF, CPMD


Small Molecules Quantum Chemistry:

MOLPRO, GAUSSIAN, GAMESS


Fluid Dynamics:

CFD-FASTRAN


Biosimulations:

Gromacs


Compilers and Others:

Intel 8.0 Fortran and C/C++ compilers, Portland Group Workstation Compilers, LAM-MPI


Home-Grown Codes:

mb_cns - Multi-Block Compressible Navier-Stokes solver (planar and axisymmetric), elmer - 3-D Compressible Navier-Stokes solver, parallel and serial quantum dynamical methods for reactive scattering
