UCC Condor FAQ
User-centric FAQ for the Condor implementation in CamGRID
- What is CamGRID?
- What is Condor?
- How does Condor work?
- What resources are available in Cambridge?
- How do I cite use of CamGRID in my publications?
- I'd like to use CamGRID - who do I contact?
- What if I need a particular application?
- How do I submit a job?
- How can I monitor my jobs?
- What about parallel jobs?
- Condor-users mailing list archives
- Other online resources
What is CamGRID?
CamGRID is a University-wide grid based on the Condor middleware, co-ordinated by the University Computing Service and the Cambridge eScience Centre (CeSC).
What is Condor?
Condor is a workload-management system (cf. a queueing system) for compute-intensive jobs that specialises in harvesting spare or idle CPU cycles. It is designed for high-throughput rather than high-performance computing. For further details, see the overview of the Condor System.
How does Condor work?
Like other batch/queueing systems (such as OpenPBS, LSF, GridEngine etc.), Condor works in the following way: users submit their jobs, which are placed in a queue; when and where each job runs is decided by a locally defined policy; on completion, the user is informed and the job output is collected. Machines advertise various characteristics (OS, architecture, memory, glibc version), and jobs can request particular criteria (OS, architecture, memory, glibc version, disk space...); jobs and machines are matched up by Condor's negotiator mechanism.
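The matching of jobs to machines described above can be pictured with a toy sketch. This is not Condor's negotiator or its ClassAd language, only an illustration in Python; the machine names and attribute values are invented, and the real negotiator also weighs machine owners' policies and user priorities.

```python
# Toy illustration of matchmaking: NOT Condor's real negotiator.
# Machine names and attribute values below are hypothetical.
machines = [
    {"Name": "node-a", "Arch": "X86_64", "OpSys": "LINUX", "Memory": 2048},
    {"Name": "node-b", "Arch": "INTEL", "OpSys": "LINUX", "Memory": 512},
    {"Name": "node-c", "Arch": "X86_64", "OpSys": "LINUX", "Memory": 256},
]


def job_requirements(machine):
    """Python stand-in for: Arch == "X86_64" && OpSys == "LINUX" && Memory > 500."""
    return (
        machine["Arch"] == "X86_64"
        and machine["OpSys"] == "LINUX"
        and machine["Memory"] > 500
    )


def match(candidates, requirements):
    """Return the names of machines whose advertised attributes satisfy the job."""
    return [m["Name"] for m in candidates if requirements(m)]


matched = match(machines, job_requirements)
print(matched)
```

Only the machine that meets all three criteria is returned; a job with looser requirements would match more of the pool.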
Condor can co-ordinate both dedicated resources (from workstation-type hardware to clusters) and periodically-idle desktops to service jobs.
Condor v6.8 manual online at Wisconsin
Local mirror of Condor manual
What resources are available in Cambridge?
At present, CamGRID is split into two distinct areas, known as PWF Condor/Environment One (N.B. PDF format) and Environment Two. The former is a single large pool of machines comprising the University-wide Public Workstation Facility (PWF), centrally managed by the University Computing Service. The latter is a loose confederation of interested parties from Departments and Institutions around the University, with each Condor pool being administered locally.
PWF Condor is currently (as of May 2007) unavailable. It is not known when the service will resume.
The status of each local Condor pool in Environment Two is periodically updated and published. Local configuration details (probably only of interest to administrators) are also published; these include contact details for each pool administrator.
How do I cite use of CamGRID in my publications?
A simple line in any acknowledgment section (usually towards the end of a paper) would be sufficient, e.g. "We are grateful to the University of Cambridge's computational grid, CamGRID, on whose resources these calculations were performed", or something along those lines. This helps to justify the continued existence of CamGRID and of the eScience Centre, which co-ordinates the resource.
I'd like to use CamGRID - who do I contact?
In the first instance, you should contact your local pool administrator (see http://www.escience.cam.ac.uk/~mcal00/condor/condor_config.html for contact details); if your research group, Department or Institution is not currently involved in CamGRID, then you should encourage them to change this! (see below)
There are two CamGRID mailing lists available: ucam-camgrid-admins@lists.cam.ac.uk and ucam-camgrid-users@lists.cam.ac.uk. As the names suggest, the former is for pool administrators and the latter for users of the CamGRID facilities. Both lists are accessible via the Mailman interface, through which you can subscribe.
What if I need a particular application?
Condor can work either with pre-staged binaries (contact the pool administrator to discuss local installation of your application in advance) or by shipping in the binaries along with the input files. Pre-staging minimises network traffic, which may matter depending on the size of the binary and the associated overhead, but it is subject to any licensing restrictions for the particular application you require. Shipping the binary with each job is likely to be easier to set up initially.
How do I submit a job?
You need an account on a submit host. Contact your local Condor administrator to arrange this.
You can either run your existing binaries via Condor, or you can re-compile the code against the Condor libraries to take advantage of various features such as checkpointing (available in the Condor 'standard universe'). Either way, you need to construct a submission script, telling Condor what sort of job it is, which files (if any) to transfer, etc.
In the simplest case, we already have a binary which we aren't going to re-compile. There is no checkpointing, and the job runs entirely on the remote machine, so its output does not appear on the submit host until the files are transferred back at the end. We specify the binary, input and output/error files in the submit file, and tell Condor how and when to transfer them. Here is an example of such a script:
    universe = Vanilla
    requirements = Arch == "x86_64" && OpSys == "LINUX" && Memory > 500
    executable = /home/ceb45/GULP/gulp3.0.2.0-linux
    input = /home/ceb45/GULP/step_Ca_acute_water_y_fixed.gin
    log = /home/ceb45/GULP/step_Ca_acute_water_y_fixed.log
    output = /home/ceb45/GULP/step_Ca_acute_water_y_fixed.out
    error = /home/ceb45/GULP/step_Ca_acute_water_y_fixed.err
    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT_OR_EVICT
    transfer_input_files = /home/ceb45/GULP/gulp3.0.2.0-linux, /home/ceb45/GULP/step_Ca_acute_water_y_fixed.gin
    Queue
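If we have a number of very similar jobs (e.g. a parameter space sweep) to run, we can take advantage of Condor's $(Process) macro. Condor replaces $(Process) with the job's number within the cluster, counting from 0, so a "Queue 5" statement creates jobs 0 to 4. Here is an example submit file showing this feature: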
    universe = Vanilla
    requirements = Arch == "x86_64" && OpSys == "LINUX" && Memory > 500
    executable = /home/ceb45/GULP/gulp3.0.2.0-linux
    input = /home/ceb45/GULP/step_Ca_acute_water_y_fixed-$(Process).gin
    log = /home/ceb45/GULP/step_Ca_acute_water_y_fixed-$(Process).log
    output = /home/ceb45/GULP/step_Ca_acute_water_y_fixed-$(Process).out
    error = /home/ceb45/GULP/step_Ca_acute_water_y_fixed-$(Process).err
    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT_OR_EVICT
    transfer_input_files = /home/ceb45/GULP/gulp3.0.2.0-linux, /home/ceb45/GULP/step_Ca_acute_water_y_fixed-$(Process).gin
    Queue 5
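To check which input files a sweep like this expects, the macro substitution can be mimicked in a few lines of Python. This is a simplified sketch, not Condor's own macro expansion: the helper name expand_process is made up for illustration, and it handles only the exact spelling $(Process).

```python
def expand_process(template, n_jobs):
    """Substitute $(Process) with 0 .. n_jobs-1, as 'Queue n_jobs' would.

    Simplification: only the exact spelling '$(Process)' is handled.
    """
    return [template.replace("$(Process)", str(i)) for i in range(n_jobs)]


# Path taken from the example submit file above.
input_template = "/home/ceb45/GULP/step_Ca_acute_water_y_fixed-$(Process).gin"

for path in expand_process(input_template, 5):
    print(path)
```

Running it lists the five input file names, ending in -0.gin to -4.gin, that must exist before the jobs are submitted.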
To prefer the fastest machines, use the 'rank = kflops' directive. Use the Arch requirement to say which architectures your job can run on: 'INTEL' is 32-bit and 'X86_64' is 64-bit (usually you can run 32-bit binaries on either). If your job can run on both, specify Requirements = (Arch == "INTEL" || Arch == "X86_64") in the submit file. Other requirements can constrain the operating system (OpSys) or the amount of memory required. Check the manual for all the available options.
We actually submit the job to the 'queue' with the command condor_submit name_of_submit_file:
    ceb45@eoarchean--ch:~/GULP> condor_submit gulp-submit.sh
    Submitting job(s).....
    Logging submit event(s).....
    5 job(s) submitted to cluster 32.
Checking the 'queue':
    ceb45@eoarchean--ch:~/GULP> condor_q

    -- Submitter: eoarchean--ch.grid.private.cam.ac.uk : <172.24.116.84:3127> : eoarchean--ch.grid.private.cam.ac.uk
     ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
      32.0   ceb45           3/21 11:57   0+00:00:00 I  0   9.8  gulp3.0.2.0-linux
      32.1   ceb45           3/21 11:57   0+00:00:00 I  0   9.8  gulp3.0.2.0-linux
      32.2   ceb45           3/21 11:57   0+00:00:00 I  0   9.8  gulp3.0.2.0-linux
      32.3   ceb45           3/21 11:57   0+00:00:00 I  0   9.8  gulp3.0.2.0-linux
      32.4   ceb45           3/21 11:57   0+00:00:00 I  0   9.8  gulp3.0.2.0-linux

    5 jobs; 5 idle, 0 running, 0 held
These are only simple examples designed to illustrate the process of preparing jobs for Condor. If you are also running GULP jobs then you could use the scripts above as a template. Other examples may appear here as they become common/standard. Otherwise you will need to analyse how your specific code runs and generate the submit script accordingly. Please contact your local Condor pool administrator if you need help or advice on how to do this.
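If you do adapt the GULP scripts as a template, one possible approach is to generate the submit file from a small helper. The sketch below is only a suggestion: the function name vanilla_submit is hypothetical, it emits only the directives used in the examples above, and it assumes GULP-style .gin input files.

```python
def vanilla_submit(executable, stem, n_jobs=1):
    """Build a vanilla-universe submit description as a string.

    'stem' is the input path without its .gin extension. When n_jobs > 1,
    '-$(Process)' is appended to the stem so that each job gets its own files.
    """
    base = stem + ("-$(Process)" if n_jobs > 1 else "")
    lines = [
        "universe = Vanilla",
        'requirements = Arch == "x86_64" && OpSys == "LINUX" && Memory > 500',
        f"executable = {executable}",
        f"input = {base}.gin",
        f"log = {base}.log",
        f"output = {base}.out",
        f"error = {base}.err",
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT_OR_EVICT",
        f"transfer_input_files = {executable}, {base}.gin",
        "Queue" if n_jobs == 1 else f"Queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


submit_text = vanilla_submit(
    "/home/ceb45/GULP/gulp3.0.2.0-linux",
    "/home/ceb45/GULP/step_Ca_acute_water_y_fixed",
    n_jobs=5,
)
print(submit_text)
```

The printed text can be saved to a file and passed to condor_submit; adjust the requirements line to suit your own pool and application.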
How can I monitor my jobs?
There is a Vanilla universe file viewer facility, developed by CeSC, which monitors running jobs across all the pools in Environment Two. Contact CeSC in order to get a password to use this facility.
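As shown above, you can check on your jobs using the condor_q command. Jobs (or clusters of jobs) can be deleted with the command condor_rm job_number. For example, condor_rm 32 removes every job in cluster 32, while condor_rm 32.3 removes only that one job.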
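For a quick tally of job states, the condor_q listing can be summarised with a short script. This is a sketch that assumes the column layout shown in the example output earlier (the state code is the sixth whitespace-separated field); other Condor versions may format the listing differently. In the ST column, I means idle, R running and H held.

```python
import re
from collections import Counter

# Sample text copied from the condor_q example above (trimmed to two jobs).
SAMPLE = """\
-- Submitter: eoarchean--ch.grid.private.cam.ac.uk : <172.24.116.84:3127> : eoarchean--ch.grid.private.cam.ac.uk
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  32.0   ceb45           3/21 11:57   0+00:00:00 I  0   9.8  gulp3.0.2.0-linux
  32.1   ceb45           3/21 11:57   0+00:00:00 I  0   9.8  gulp3.0.2.0-linux

2 jobs; 2 idle, 0 running, 0 held
"""

# Job lines start with an ID of the form cluster.process, e.g. 32.0
JOB_ID = re.compile(r"^\d+\.\d+$")


def count_states(condor_q_text):
    """Count jobs per state code, assuming ST is the sixth field of a job line."""
    states = Counter()
    for line in condor_q_text.splitlines():
        fields = line.split()
        if len(fields) >= 6 and JOB_ID.match(fields[0]):
            states[fields[5]] += 1
    return states


print(dict(count_states(SAMPLE)))
```

In practice the text would come from running condor_q on the submit host rather than from a pasted sample.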
What about parallel jobs?
Our Condor pool is not set up to run MPI jobs in the Condor 'parallel universe'. If you need to run jobs of this sort, please contact the local Computer Officers (COs) to find out what local resources are available and suitable for your needs.
Condor-users mailing list archives
The University of Wisconsin-Madison holds an archive of all postings to the condor-users mailing list, which can be searched to see if anyone else has encountered a similar problem in the past. See the archive listing (pre-June-2004 to present).
Other online resources
A description of the two grids within the university.
A description of how Environment Two of CamGRID is structured, including an architectural diagram.
