Dove parallel cluster
dove.ch is a new (April 2006) Opteron-based compute cluster intended for parallel work. The head node has dual Opteron 246 CPUs, 4GB of RAM and mirrored 160GB SATA disks. There are 8 compute nodes, each with two dual-core Opteron 265 CPUs (four cores per node), 4GB of RAM and an 80GB IDE disk. The head node provides the cluster's only external connectivity (via ssh), and almost everything should be done on this node. You can log into the compute nodes from the head node, though not from outside the cluster, but you should not normally need to.
The OS is SuSE Linux 9.3. Also installed are the Torque resource management software and Maui scheduler, the low-latency SCore parallel environment, MPICH and LAM and various scientific libraries (ACML, ScaLAPACK etc.). There are both 32- and 64-bit versions of the Portland compilers (v6.1.2). The use of modules makes it easy to swap between these different environments.
The /home filesystem is 85GB in size and is subject to quotas (5GB soft limit, 7.5GB hard limit); it is NFS-exported to all the compute nodes. /scratch is local to each node and should therefore be used for temporary files during the execution of a job. There is no quota restriction on /scratch, but it is not backed up and should not be used for long-term file storage. Files are likely to be deleted from /scratch with little warning if the filesystem fills up and causes problems. There is also a /sharedscratch filesystem, 688GB in size, which is NFS-exported to all nodes. Again, this is not subject to quotas, so please be considerate in how much of it you use.
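As an illustration of the /scratch advice above, here is a minimal Python sketch of one way a job might keep its temporary files in a private directory on node-local scratch, copy the results back to NFS-mounted storage, and tidy up afterwards. The function name run_in_scratch and its parameters are hypothetical and are not part of any software installed on dove; only the /scratch path comes from this page.

```python
import shutil
import tempfile
from pathlib import Path


def run_in_scratch(work, results_dir, scratch_root="/scratch"):
    """Run work(job_dir) in a private directory under scratch_root.

    Regular files left in job_dir are copied to results_dir when work()
    returns normally; the scratch directory is removed in every case.
    (Hypothetical helper, for illustration only.)
    """
    results_dir = Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)

    # A unique per-job directory avoids clashes with other jobs on the node.
    job_dir = Path(tempfile.mkdtemp(prefix="job-", dir=scratch_root))
    try:
        work(job_dir)
        # Copy results back to permanent (for example NFS-mounted) storage.
        for item in job_dir.iterdir():
            if item.is_file():
                shutil.copy2(item, results_dir / item.name)
    finally:
        # Clean up even if work() fails: /scratch has no quota and is
        # shared with every other job that runs on this node.
        shutil.rmtree(job_dir, ignore_errors=True)
    return results_dir
```

A job would pass in its own function that writes into the directory it is given, together with a destination under /home or /sharedscratch. If the job fails, nothing is copied back, but the scratch directory is still removed.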
dove is mainly intended for parallel work but can also be used for serial task-farming. The SCore parallel environment provides a low-latency MPI, running over a dedicated network. To use SCore you must compile with the SCore compilers (mpicc, mpif90, mpif77) and run your job using the SCore commands. MPICH performs much worse than SCore and so should not be used without an excellent reason. There are example scripts for SCore, MPICH, LAM and serial jobs on the head node in /info/pbs/; you should copy these and modify them for your own use rather than trying to write your own from scratch.
All compute jobs must be run through the queueing system. The queueing system will assign a number of nodes to you and run your script on the first node, copying the output back to a user-specified file at the end of the job. The queueing system is Torque with the Maui scheduler. Torque is identical to OpenPBS from the user's point of view, so the system will be familiar to anyone who has used kellogg. However, please note that the queue names here are different and the system is optimized for parallel job throughput.
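Because the job script runs only on the first assigned node, a task-farming job has to find out for itself which nodes it was given. Torque lists them in the file named by the PBS_NODEFILE environment variable, one hostname per line, with a hostname repeated once for each processor slot allocated on it. The Python sketch below shows one possible way to read that file and share out a list of serial tasks round-robin. The function names are hypothetical, and how each task is actually launched on its node is deliberately left out; the example scripts in /info/pbs/ remain the reference for that.

```python
from collections import defaultdict


def read_slots(nodefile_path):
    """Return the processor slots in a Torque node file (one hostname per line)."""
    with open(nodefile_path) as fh:
        return [line.strip() for line in fh if line.strip()]


def farm_out(tasks, slots):
    """Assign tasks to slots round-robin and return {hostname: [task, ...]}.

    A hostname that appears several times in slots (one entry per
    allocated processor) receives a proportionally larger share.
    (Hypothetical helper, for illustration only.)
    """
    if not slots:
        raise ValueError("no processor slots were allocated")
    plan = defaultdict(list)
    for i, task in enumerate(tasks):
        plan[slots[i % len(slots)]].append(task)
    return dict(plan)
```

Inside a job, the slots would typically be obtained with read_slots(os.environ["PBS_NODEFILE"]) after importing os. Round-robin assignment is the simplest choice and suits tasks of similar length; tasks of very uneven length would be better served by a work queue.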
Further information and documentation:
