UCC Condor Pool
Configuration etc. of UCC Condor pool
Overview
The UCC Condor pool forms part of the University-wide CamGRID environment. CamGRID is a collaborative project between the Cambridge eScience Centre and various Departments/Institutions including the UCC, the School of Biological Sciences, High Energy Physics, Semiconductor Physics, Astrophysics, the Department of Oncology, the National Institute for Environmental eScience (NIEeS) and the eMinerals project at the Department of Earth Sciences.
The pool consists of a central manager/submit host, plus the iPaqs that were used as PWF workstations in the library until summer 2004, and any other unemployed hardware. Thirty new Dell PowerEdge SC1435 (dual, dual-core Opteron) nodes were purchased through the SRIF scheme and commissioned in March 2007. All these machines are set up as dedicated Condor hosts and are not used for interactive work.
Communications about CamGRID are generally sent to the mailing list (ucam-camgrid-users@lists.cam.ac.uk), to which you are advised to subscribe (you can do this online via the Mailman interface).
Guide for Users
Condor provides a high-throughput computing environment for running computationally intensive programs. The programs best suited to Condor are those where a single program needs to be run many times. There are two main situations where this applies. The first is where the program generates results in some random fashion, so many runs are needed to generate accurate statistics. The second is where a large parameter space needs to be explored and the program will be run with many different inputs, either by reading in different input files or by changing command line parameters.
All Condor users must subscribe to the CamGrid mailing list. To do this, go to https://lists.cam.ac.uk/mailman and request membership of the ucam-camgrid-users list.
Publications
You should be sure to acknowledge your use of these resources in any relevant publications. A sentence in the acknowledgements similar to "We acknowledge the use of computing facilities provided by CamGrid" is all that's needed - the exact wording is unimportant, so long as CamGrid is mentioned. Please also email Mark Calleja with details of the publication.
Some restrictions on what you can run using Condor
- Condor cannot simply take a single long-running program and execute it in parallel across many machines.
- Please bear in mind that if you intend to run a commercial program, there may be restrictions on how many computers can simultaneously run the program due to license issues.
Using Condor
The Condor manual provides extensive details about using Condor. You should begin by reading some sections of the manual, and especially the sections that describe a road-map for running jobs, job submission and job management. The section on job submission is particularly important, and includes some examples of how to submit jobs. Condor provides different universes in which to run jobs, but it is unlikely that you will wish to use anything other than the vanilla and standard universes, which are described in the road-map. Although there is a lot of information in the manual, in practice you will not need to know everything there. A sensible strategy might be to skim-read the sections linked above and to then ensure you understand the examples below. The Condor master is condor.ch.cam.ac.uk and any jobs should be submitted from there.
Job submission
The steps you will need to go through to submit a job to Condor are as follows.
- Have a Linux executable that you can run on an ordinary machine, e.g. by typing ./programname at a Linux command prompt.
- Create a job submission file (see below), which as an example we will call job.sub.
- ssh to condor.ch.cam.ac.uk, and submit your job by running the command condor_submit job.sub. After a brief wait, you should be told that the job has been successfully submitted. If not, either refer to the manual or contact the Computer Officers (COs).
- Wait for the job to finish! There is some information below about monitoring the job you have submitted.
Examples of how to submit jobs to Condor
To submit jobs to Condor, you need to create a job submission file that contains details about the job. The contents of an example input file are shown here, and a discussion of the things you might want to change in it will follow. For a full discussion of job submission and submission files, please refer to the Condor manual.
    # Example 1 of a Condor submission file
    universe = vanilla
    executable = programname
    arguments = -arg1 -arg2
    log = program.log
    output = program.out
    error = program.err
    should_transfer_files = yes
    when_to_transfer_output = on_exit
    transfer_input_files = input1.dat, input2.dat, input3.dat
    notify_user = spqr1@cam.ac.uk
    queue
- You should change programname on the executable = line to whatever your program is called.
- The arguments = line contains any command line arguments your program takes. The above example is for a program that you would run by issuing the command ./programname -arg1 -arg2 at a Linux command prompt. If your program requires no arguments, delete this line.
- Condor will log useful information to the file referred to on the log = line. If you think your job is running incorrectly, this is a good place to look.
- Anything that would normally be output to the screen (stdout) by your program will be redirected to the file referred to on the output = line.
- Anything that would normally be output as an error message (stderr) by your program will be redirected to the file referred to on the error = line.
- The two lines that begin should_transfer_files = and when_to_transfer_output = should generally be left unchanged from this example, unless you have read the Condor manual and have good reason to change them.
- If your program needs to read any input files whilst running, they should be listed on the transfer_input_files = line. If no such files are needed, remove the line that begins transfer_input_files =.
- The notify_user = line should contain your email address, as Condor will email you either when your job has finished successfully, or if there is a problem.
- The final queue command tells Condor to run a single copy of your program. If your program creates any output files, they must be created in the same directory as your executable, not in a subdirectory.
The above example will run your program once, and is not very different to you simply running the program on your computer. However, more usually you will want to run your program many times. One scenario is that your program produces output based on a random number generator, so you want to run it many times and generate multiple outputs. Perhaps the simplest case is that you want to run the program with the same inputs many times, with each instance of the program producing a different output which is then printed to screen. Example 2 is a submission file for this scenario. It is followed by Example 3, which explains how to run the program many times with different inputs. Example 3 also covers the situation where you don't wish to change the inputs, but the program outputs some file(s) rather than only printing to the screen as in Example 2. A discussion of the differences between Example 1 and Examples 2/3 then follows.
    # Example 2 of a Condor submission file (changed from Example 1: the output and queue lines)
    universe = vanilla
    executable = programname
    arguments = -arg1 -arg2
    log = program.log
    output = program.out.$(PROCESS)
    error = program.err
    should_transfer_files = yes
    when_to_transfer_output = on_exit
    transfer_input_files = input1.dat, input2.dat, input3.dat
    notify_user = spqr1@cam.ac.uk
    queue 20
In Example 2, the output of each job (which would normally be printed to screen) will be stored in the files program.out.0, program.out.1, program.out.2 etc. This will be explained in more detail following Example 3.
    # Example 3 of a Condor submission file (changed from Example 2: the output and initialdir lines)
    universe = vanilla
    executable = programname
    arguments = -arg1 -arg2
    log = program.log
    output = program.out
    error = program.err
    initialdir = dir.$(PROCESS)
    should_transfer_files = yes
    when_to_transfer_output = on_exit
    transfer_input_files = input1.dat, input2.dat, input3.dat
    notify_user = spqr1@cam.ac.uk
    queue 20
There are two differences from the first example here. The last line now reads queue 20, which simply tells Condor to run 20 copies of your program instead of one. There is also a new line referring to initialdir. $(PROCESS) is a variable that will take the values 0, 1, 2... up to N-1 when you instruct Condor to queue N. You should therefore create directories called dir.0, dir.1 etc. depending on how many jobs you wish to run. You must make sure these directories exist before you submit your job. Condor will not create them for you. Prior to submitting your job, you should ensure that each of these directories contains any input files your program needs. This allows you to run your job for many different inputs. On completion, any files that your program creates will be in these directories, so you will have files called (for example) dir.0/output.dat, dir.1/output.dat and so on.
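Creating the dir.N directories by hand becomes tedious for large N. The following is a minimal sketch of one way to prepare them before submission; the function name prepare_run_dirs is hypothetical, and it assumes that every job starts from the same set of input files, which you can then edit per directory if each run needs different inputs.

```python
import shutil
from pathlib import Path


def prepare_run_dirs(base, n_jobs, input_files):
    """Create base/dir.0 ... base/dir.(n_jobs - 1) and copy the inputs into each.

    The directory names match 'initialdir = dir.$(PROCESS)' in Example 3.
    """
    base = Path(base)
    created = []
    for process in range(n_jobs):
        run_dir = base / f"dir.{process}"
        # Condor will not create these directories, so make them up front.
        run_dir.mkdir(parents=True, exist_ok=True)
        for src in input_files:
            shutil.copy(src, run_dir / Path(src).name)
        created.append(run_dir)
    return created
```

For the 20-job case in Example 3 this might be called as prepare_run_dirs(".", 20, ["input1.dat", "input2.dat", "input3.dat"]) from the directory that holds the submit file.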
Running on Chemistry machines only
You may want to restrict your jobs to run only on the machines in Chemistry (rather than allowing them to run on machines in other groups), for example because your program requires software that is only installed on the Chemistry machines. You can achieve this by adding the requirement IS_CHEM==TRUE to your submit file.
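In a submit file, requirements are normally given with the requirements command, so the restriction above would presumably appear as a line such as requirements = (IS_CHEM == TRUE). The sketch below generates a submit description with that line optionally included; the helper submit_description and the exact spacing and bracketing of the expression are assumptions for illustration, not taken from the pool's configuration.

```python
def submit_description(executable, n_jobs=1, chemistry_only=False):
    """Return the text of a simple vanilla-universe submit file.

    Hypothetical helper: the file names follow the examples in this guide.
    """
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        "log = program.log",
        "output = program.out.$(PROCESS)",
        "error = program.err.$(PROCESS)",
        "should_transfer_files = yes",
        "when_to_transfer_output = on_exit",
    ]
    if chemistry_only:
        # Assumed spelling of the IS_CHEM requirement described in the text.
        lines.append("requirements = (IS_CHEM == TRUE)")
    lines.append(f"queue {n_jobs}")
    return "\n".join(lines) + "\n"
```

Writing the returned text to job.sub and running condor_submit job.sub would then submit the jobs as described earlier.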
Monitoring the job after submission
There are a couple of useful tools you can run (from condor.ch.cam.ac.uk) to monitor your job. The condor_status command will show you the current status of all the machines in the Chemistry pool. The columns likely to be of most interest are the State and Activity columns. The State of a machine will be one of
- Unclaimed: the machine is not currently in use by anyone
- Claimed: Condor is running a job on the machine
- Owner: someone is using the machine, so it is unavailable for use by Condor
The two most important values taken by the Activity column are Busy and Idle, which mean respectively that Condor either is or is not using the machine. Each of the Dells has 4 processing cores which are, to all intents and purposes, individual computers (or in Condor's terminology, slots). These show up in condor_status as slotX@gridlockY, where X ranges from 1 to 4.
The condor_q command will display information about any jobs currently queued. Each job has an ID number of the form x.y. If you submitted just one job (your submit file finished with queue) you will see just one entry of the form 10.0. If you submitted multiple jobs (your submit file finished with queue N) you will see N entries which will take the form 10.0, 10.1, 10.2 etc. You will probably be most interested in the column headed ST which tells you the job status. Generally, this will be either I or R for idle and running.
If you wish to remove your job, first check the job ID using condor_q. To remove the batch of jobs with IDs N.0, N.1, N.2 etc., issue the command condor_rm N.
Self-checkpointing programs
Without check-pointing, if your job is evicted by a higher-priority user you will lose the progress made by the computation. If possible, you should make your program save information about its current status that can be read back in to continue from where the program left off. In order to make use of this you need to alter the when_to_transfer_output = line of the submit file to read
when_to_transfer_output = on_exit_or_evict
Any files generated by the program will then be transferred back to the submit machine on eviction, and return with the job to the new execute machine when it resumes. The rough layout of your program would need to be something like
- Check if file containing information about a previous calculation exists
- if no, start program from beginning
- if yes, parse file and start calculation from the middle
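The layout above can be sketched as follows. This is an illustrative outline only: the checkpoint file name, the JSON format and the stand-in calculation are all assumptions, and a real program would save whatever state it needs to resume.

```python
import json
import os
from pathlib import Path

CHECKPOINT = Path("checkpoint.json")  # hypothetical file name


def load_state():
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())
    return {"step": 0, "total": 0.0}


def save_state(state):
    """Write to a temporary file, then rename, so a partial write never replaces a good checkpoint."""
    tmp = CHECKPOINT.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    os.replace(tmp, CHECKPOINT)


def run(n_steps, checkpoint_every=100):
    """Run (or resume) a calculation of n_steps steps and return the result."""
    state = load_state()
    for step in range(state["step"], n_steps):
        state["total"] += step * 0.5  # stand-in for the real work
        state["step"] = step + 1
        if state["step"] % checkpoint_every == 0:
            save_state(state)
    save_state(state)
    return state["total"]
```

Because the checkpoint is written in the job's working directory, it is among the files that on_exit_or_evict transfers back. How often to checkpoint is a trade-off between the work lost on eviction and the time spent writing the file.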
Flocking with Condor
A cluster of computers running Condor (i.e., the computers in Chemistry) is known as a pool. Condor is able to connect multiple pools together using a mechanism known as flocking. This allows a job submitted in one Condor pool to run on other Condor pools. The Chemistry pool is part of CamGrid Environment 2, which is a project co-ordinated by the Cambridge eScience Centre to link together different Condor pools across the university. The main consequence of this is that there are lots more computers available to run your program on! You don't need to do anything to take advantage of this - if all of the machines in Chemistry are being used, your program will then automatically attempt to flock to one of the other pools. A list of the pools currently linked together by CamGrid, along with all the computers available for use, can be found here.
Technical details
Hardware, networking etc.
The Central Manager (and submit host) is called eoarchean.ch.cam.ac.uk, or alternatively eoarchean--ch.grid.private.cam.ac.uk. Its IP address is 172.24.116.84 and its RSA key fingerprint is ff:a3:bd:0f:b1:6e:25:b1:24:b3:c6:60:96:58:be:6a. It is a Dell OptiPlex recycled for this purpose and runs Ubuntu.
The submit host is the only machine with user accounts. It provides secure (ssh) access to the rest of the pool (though normally you should not have to use this, since Condor should take care of your data transfers in the course of your job). The configuration of CamGRID is such that when you submit your job(s) from your local submit host(s), they may either be run on a machine in the local pool or 'flocked-out' to another pool in another Department. There is more disk space on the submit host than on the execute hosts, but you should still not use it for long-term storage of data, since it is not backed up. A partition, /sharedscratch, is NFS-exported from the head node.
Accounts on this machine are managed through the Admitto authentication system.
Domain: grid.private.cam.ac.uk
Gateway: 172.24.116.126
Netmask: 255.255.255.192
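A netmask of 255.255.255.192 corresponds to a /26 prefix, i.e. a block of 64 addresses. As a quick check of how the addresses quoted on this page fit together, the following sketch uses Python's standard ipaddress module; the variable and function names are illustrative only.

```python
import ipaddress

# Values quoted in the text above.
NETMASK = "255.255.255.192"
GATEWAY = ipaddress.ip_address("172.24.116.126")
SUBMIT_HOST = ipaddress.ip_address("172.24.116.84")

# strict=False allows a host address (rather than the network address) as input.
network = ipaddress.ip_network(f"{SUBMIT_HOST}/{NETMASK}", strict=False)


def same_subnet(address):
    """Return True if the address lies in the same /26 block as the submit host."""
    return ipaddress.ip_address(address) in network
```

Here network evaluates to 172.24.116.64/26, which contains both the gateway and the execute hosts and matches the UCC/Chemistry block in the list of CamGRID subnets further down this page.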
The SC1435 nodes have 8GB memory, a 160GB hard drive and Opteron 2200 2.4GHz CPUs.
| Hostname | IP address | Function |
| eoarchean--ch.grid.private.cam.ac.uk | 172.24.116.84 | Central manager, submit host |
| gridlock29--ch.grid.private.cam.ac.uk | 172.24.116.96 | Execute host |
| gridlock28--ch.grid.private.cam.ac.uk | 172.24.116.97 | Execute host |
| gridlock27--ch.grid.private.cam.ac.uk | 172.24.116.98 | Execute host |
| gridlock26--ch.grid.private.cam.ac.uk | 172.24.116.99 | Execute host |
| gridlock25--ch.grid.private.cam.ac.uk | 172.24.116.100 | Execute host |
| gridlock24--ch.grid.private.cam.ac.uk | 172.24.116.101 | Execute host |
| gridlock23--ch.grid.private.cam.ac.uk | 172.24.116.102 | Execute host |
| gridlock22--ch.grid.private.cam.ac.uk | 172.24.116.103 | Execute host |
| gridlock21--ch.grid.private.cam.ac.uk | 172.24.116.104 | Execute host |
| gridlock20--ch.grid.private.cam.ac.uk | 172.24.116.105 | Execute host |
| gridlock19--ch.grid.private.cam.ac.uk | 172.24.116.106 | Execute host |
| gridlock18--ch.grid.private.cam.ac.uk | 172.24.116.107 | Execute host |
| gridlock17--ch.grid.private.cam.ac.uk | 172.24.116.108 | Execute host |
| gridlock16--ch.grid.private.cam.ac.uk | 172.24.116.109 | Execute host |
| gridlock15--ch.grid.private.cam.ac.uk | 172.24.116.110 | Execute host |
| gridlock14--ch.grid.private.cam.ac.uk | 172.24.116.111 | Execute host |
| gridlock13--ch.grid.private.cam.ac.uk | 172.24.116.112 | Execute host |
| gridlock12--ch.grid.private.cam.ac.uk | 172.24.116.113 | Execute host |
| gridlock11--ch.grid.private.cam.ac.uk | 172.24.116.114 | Execute host |
| gridlock10--ch.grid.private.cam.ac.uk | 172.24.116.115 | Execute host |
| gridlock09--ch.grid.private.cam.ac.uk | 172.24.116.116 | Execute host |
| gridlock08--ch.grid.private.cam.ac.uk | 172.24.116.117 | Execute host |
| gridlock07--ch.grid.private.cam.ac.uk | 172.24.116.118 | Execute host |
| gridlock06--ch.grid.private.cam.ac.uk | 172.24.116.119 | Execute host |
| gridlock05--ch.grid.private.cam.ac.uk | 172.24.116.120 | Execute host |
| gridlock04--ch.grid.private.cam.ac.uk | 172.24.116.121 | Execute host |
| gridlock03--ch.grid.private.cam.ac.uk | 172.24.116.122 | Execute host |
| gridlock02--ch.grid.private.cam.ac.uk | 172.24.116.123 | Execute host |
| gridlock01--ch.grid.private.cam.ac.uk | 172.24.116.124 | Execute host |
| gridlock00--ch.grid.private.cam.ac.uk | 172.24.116.125 | Execute host |
The other Departments and institutions collaborating in CamGRID are as follows:
| Subnet | Department/Institution |
| 172.24.116.0/26 | CeSC |
| 172.24.116.64/26 | UCC/Chemistry |
| 172.24.116.128/26 | HEP |
| 172.24.116.192/26 | Earth Sciences |
| 172.24.89.0/26 | NIEeS |
| 172.24.89.64/26 | Biological Sciences |
| 172.24.89.128/26 | Semiconductors |
| 172.24.89.192/26 | Astrophysics |
| 172.24.189.0/26 | Materials Science |
| 172.24.189.64/26 | UCS |
| 172.24.189.128/26 | Earth Sciences |
| 172.24.189.192/26 | Radio Astronomy |
| 172.24.252.0/26 | Semiconductors |
| 172.24.252.64/26 | Oncology |
| 172.24.252.128/26 | Currently free |
| 172.24.252.192/26 | Currently free |
Condor configuration
Each node has one or more virtual machine(s) (VMs) depending on the processor (the iPaqs have a single VM, the Dell nodes have four). In general, the VMs are defined to be symmetrical, e.g. the same amount of memory, number of processors, amount of swap space etc. It is trivial to reconfigure nodes with the following setup:
    VIRTUAL_MACHINE_TYPE_1 = cpus=1, ram=3000, swap=25%
    NUM_VIRTUAL_MACHINES_TYPE_1 = 2
    VIRTUAL_MACHINE_TYPE_2 = cpus=1, ram=1000, swap=25%
    NUM_VIRTUAL_MACHINES_TYPE_2 = 2
to allow 'large memory' (>2GB) jobs to run. You will need to include the requirement "Memory > 2000" in your submit script to ensure your job runs on one of the 'large memory' VMs. Please email the Computer Officers on the usual address if you need to take advantage of this functionality.
The submit host will have the requirement 'kill running jobs with memory usage > 2GB' (SYSTEM_PERIODIC_REMOVE = ((JobStatus == 2) && (TARGET.ImageSize >2000000)) ) temporarily relaxed as a consequence.
Software, applications etc.
hadean--ch has some compilers installed: gcc, gcc-c++, gcc-fortran and gfortran. Please ask if there is anything else you require and we will endeavour to accommodate your request (within the constraints of licenses, filespace etc.).
The new Dell machines run SLES10 (SP1); the kernel version is 2.6.16.54-0.2.3-smp and the glibc version is 2.4-31.30. GLIBC (major.minor) is advertised in the ClassAd. The Dells are 64-bit machines but your existing (32-bit) code should run without re-compilation. Please report problems to the usual address and we will try to sort them out.
All machines in the pool run the latest stable release of Condor. The Condor installation will generally be updated within a week or two of a new version being released. At present, we only support the vanilla universe of Condor - if you need any of the features of a different Condor universe, please come and talk to us. Note that in order to use the checkpointing feature in the standard universe you need to re-compile your code against the Condor libraries.
Some frequently-asked questions (and hopefully answers).
There is an on-line manual available (N.B. local mirror). The development versions (indicated by the second digit being odd) are not supported in the CamGRID environment.
