UCC Condor FAQ

User-centric FAQ for the Condor implementation in CamGRID


What is CamGRID?

CamGRID is a University-wide grid based on the Condor middleware, co-ordinated by the University Computing Service and the Cambridge eScience Centre (CeSC).

What is Condor?

Condor is a workload-management system (cf. queueing system) for compute-intensive jobs that specialises in harvesting spare or idle CPU cycles. It is designed for high-throughput rather than high-performance computing. For further details, see the overview of the Condor System.

How does Condor work?

Like other batch/queueing systems (such as OpenPBS, LSF, GridEngine etc.), Condor works in the following way: users submit their jobs, which are placed into a queue; the time and place of each job's execution are decided by a locally-defined policy; upon completion, the user is informed and the job output collected. Machines advertise various characteristics (OS, architecture, memory, glibc version), and jobs can request particular criteria (OS, architecture, memory, glibc version, disk space...); jobs and machines are matched up by Condor's negotiator mechanism.
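The matching of jobs to machines described above can be pictured with a small Python sketch. This is only a toy illustration, not Condor's negotiator or its ClassAd language: the machine names and values are invented, and real matchmaking also takes the machines' own requirements and the local policy into account.

```python
# Toy illustration of matchmaking; NOT Condor's actual negotiator.
# Attribute names mirror those used in submit files (Arch, OpSys, Memory);
# the machine names and values below are invented for the example.

machines = [
    {"Name": "desk01", "Arch": "INTEL", "OpSys": "LINUX", "Memory": 512},
    {"Name": "node07", "Arch": "X86_64", "OpSys": "LINUX", "Memory": 2048},
    {"Name": "lab12", "Arch": "X86_64", "OpSys": "WINNT51", "Memory": 1024},
]


def requirements(machine):
    """Python equivalent of: Arch == "X86_64" && OpSys == "LINUX" && Memory > 500"""
    return (machine["Arch"] == "X86_64"
            and machine["OpSys"] == "LINUX"
            and machine["Memory"] > 500)


def match(job_requirements, advertised):
    """Return the names of machines whose advertisement satisfies the job."""
    return [m["Name"] for m in advertised if job_requirements(m)]


matched = match(requirements, machines)
print(matched)
```

Here only the hypothetical machine node07 satisfies all three criteria, so it is the only candidate for the job.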

Condor can co-ordinate both dedicated resources (from workstation-type hardware to clusters) and periodically-idle desktops to service jobs.

Condor v6.8 manual online at Wisconsin
Local mirror of Condor manual

What resources are available in Cambridge?

At present, CamGRID is split into two distinct areas, known as PWF Condor/Environment One (N.B. PDF format) and Environment Two. The former is a single large pool of machines comprising the University-wide Public Workstation Facility (PWF), centrally managed by the University Computing Service. The latter is a loose confederation of interested parties from Departments and Institutions around the University, with each Condor pool being administered locally.

PWF Condor is currently (as of May 2007) unavailable. It is not known when the service will resume.

The status of each local Condor pool in Environment Two is periodically updated and published. Local configuration details (probably only of interest to administrators) are also published; these include contact details for each pool administrator.

How do I cite use of CamGRID in my publications?

A simple line in any acknowledgment section (usually towards the end of a paper) would be sufficient, e.g. "We are grateful to the University of Cambridge's computational grid, CamGRID, on whose resources these calculations were performed", or something along those lines. This helps to justify the continued existence of CamGRID and of the eScience Centre, which co-ordinates the resource.

I'd like to use CamGRID - who do I contact?

In the first instance, you should contact your local pool administrator (see http://www.escience.cam.ac.uk/~mcal00/condor/condor_config.html for contact details); if your research group, Department or Institution is not currently involved in CamGRID, then you should encourage them to change this! (see below)

There are two CamGRID mailing lists available: ucam-camgrid-admins@lists.cam.ac.uk and ucam-camgrid-users@lists.cam.ac.uk. As the names suggest, the former is for pool administrators and the latter for users of the CamGRID facilities. Both lists are accessible via the Mailman interface, through which you can subscribe.

What if I need a particular application?

Condor can work either with pre-staged binaries (contact the pool administrator to discuss local installation of your application in advance) or by shipping in the binaries along with the input files. You may wish to pursue the former route to minimise network traffic, depending on the size of the binary and associated overhead; whether this is possible obviously depends on any licensing restrictions for the particular application you require. The latter route is likely to be easier to set up initially.

How do I submit a job?

You need an account on a submit host. Contact your local Condor administrator to arrange this.

You can either run your existing binaries via Condor, or you can re-compile the code against the Condor libraries to take advantage of various features such as checkpointing (available in the Condor 'standard universe'). Either way, you need to construct a submission script, telling Condor what sort of job it is, which files (if any) to transfer, etc.

In the simplest case, we already have a binary which we aren't going to re-compile. There is no checkpointing, and the job will simply appear to be running remotely until it finishes, at which point the output is returned. We specify the binary, input and output/error files in the submit file, and tell Condor how/when to transfer them. Here is an example of such a script:

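# Run an existing binary as-is (not re-linked against the Condor libraries)
universe = Vanilla
# Only match 64-bit Linux machines advertising more than 500 MB of memory
requirements = Arch == "x86_64" && OpSys == "LINUX" && Memory > 500
executable = /home/ceb45/GULP/gulp3.0.2.0-linux
# File fed to the job's standard input
input = /home/ceb45/GULP/step_Ca_acute_water_y_fixed.gin
# Condor's own record of job events (submission, execution, completion)
log = /home/ceb45/GULP/step_Ca_acute_water_y_fixed.log
# Files that capture the job's standard output and standard error
output = /home/ceb45/GULP/step_Ca_acute_water_y_fixed.out
error = /home/ceb45/GULP/step_Ca_acute_water_y_fixed.err
# Copy files to the execute machine; return output on exit or eviction
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
transfer_input_files = /home/ceb45/GULP/gulp3.0.2.0-linux, /home/ceb45/GULP/step_Ca_acute_water_y_fixed.gin
Queue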

If we have a number of very similar jobs (e.g. a parameter space sweep) to run, we can take advantage of Condor's $(Process) macro, which is replaced by the number of each job within the cluster (0 to N-1 for 'Queue N'). Here is an example submit file showing this feature:

universe = Vanilla
requirements = Arch == "x86_64" && OpSys == "LINUX" && Memory > 500
executable = /home/ceb45/GULP/gulp3.0.2.0-linux
# $(Process) is replaced by the job's number within the cluster (0, 1, 2, ...)
input = /home/ceb45/GULP/step_Ca_acute_water_y_fixed-$(Process).gin
log = /home/ceb45/GULP/step_Ca_acute_water_y_fixed-$(Process).log
output = /home/ceb45/GULP/step_Ca_acute_water_y_fixed-$(Process).out
error = /home/ceb45/GULP/step_Ca_acute_water_y_fixed-$(Process).err
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
transfer_input_files = /home/ceb45/GULP/gulp3.0.2.0-linux, /home/ceb45/GULP/step_Ca_acute_water_y_fixed-$(Process).gin
# Queue five jobs, numbered 0 to 4
Queue 5
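To see which input files the submit file above expects, the substitution can be mimicked in a few lines of Python. This is just a sketch of the naming pattern, assuming the process numbers run from 0 to 4 for 'Queue 5'; Condor performs the real substitution itself at submit time, and the helper name expand is ours, not part of Condor.

```python
# Sketch of how $(Process) is filled in for "Queue 5": one job per value 0..4.
# This only mimics the substitution for illustration.

template = "/home/ceb45/GULP/step_Ca_acute_water_y_fixed-$(Process).gin"


def expand(template, queue_count):
    """Return the template with $(Process) replaced by 0 .. queue_count-1."""
    return [template.replace("$(Process)", str(n)) for n in range(queue_count)]


inputs = expand(template, 5)
for path in inputs:
    print(path)
```

Each of the five input files (ending -0.gin to -4.gin) must exist on the submit host before the jobs are submitted.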

You can ask Condor to prefer the fastest machines with the 'rank = kflops' directive. Use the Arch requirement to specify which architecture your program needs: 32-bit (INTEL) or 64-bit (X86_64). 32-bit binaries can usually run on either; if your job can run on both architectures, specify Requirements = (Arch == "INTEL" || Arch == "X86_64") in the submit file. Other requirements may include the operating system (OpSys) or the amount of memory required. Check the manual for all the available options.
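The difference between a requirement and a rank can be illustrated with another toy Python sketch: requirements filter out unsuitable machines, while rank only orders those that remain. The machine names and KFlops figures are invented, and this is not how Condor evaluates these expressions internally.

```python
# Toy illustration: requirements filter machines, rank orders the survivors.
# Machine names and KFlops values are invented for the example.

machines = [
    {"Name": "desk01", "Arch": "INTEL", "KFlops": 450000},
    {"Name": "node07", "Arch": "X86_64", "KFlops": 1200000},
    {"Name": "sparc03", "Arch": "SUN4u", "KFlops": 300000},
]


def requirements(machine):
    """Python equivalent of: (Arch == "INTEL" || Arch == "X86_64")"""
    return machine["Arch"] == "INTEL" or machine["Arch"] == "X86_64"


def rank(machine):
    """Python equivalent of: rank = kflops (higher is preferred)"""
    return machine["KFlops"]


candidates = [m for m in machines if requirements(m)]
order = [m["Name"] for m in sorted(candidates, key=rank, reverse=True)]
print(order)
```

The hypothetical sparc03 is excluded outright, and node07 is preferred over desk01 because it advertises the higher KFlops value; desk01 can still be used if node07 is busy.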

We actually submit the job to the 'queue' with the command condor_submit name_of_submit_file:

ceb45@eoarchean--ch:~/GULP> condor_submit gulp-submit.sh
Submitting job(s).....
Logging submit event(s).....
5 job(s) submitted to cluster 32.

Checking the 'queue':

ceb45@eoarchean--ch:~/GULP> condor_q

-- Submitter: eoarchean--ch.grid.private.cam.ac.uk : <172.24.116.84:3127> : eoarchean--ch.grid.private.cam.ac.uk
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  32.0   ceb45           3/21 11:57   0+00:00:00 I  0   9.8  gulp3.0.2.0-linux
  32.1   ceb45           3/21 11:57   0+00:00:00 I  0   9.8  gulp3.0.2.0-linux
  32.2   ceb45           3/21 11:57   0+00:00:00 I  0   9.8  gulp3.0.2.0-linux
  32.3   ceb45           3/21 11:57   0+00:00:00 I  0   9.8  gulp3.0.2.0-linux
  32.4   ceb45           3/21 11:57   0+00:00:00 I  0   9.8  gulp3.0.2.0-linux

5 jobs; 5 idle, 0 running, 0 held

These are only simple examples designed to illustrate the process of preparing jobs for Condor. If you are also running GULP jobs then you could use the scripts above as a template. Other examples may appear here as they become common/standard. Otherwise you will need to analyse how your specific code runs and generate the submit script accordingly. Please contact your local Condor pool administrator if you need help or advice on how to do this.

How can I monitor my jobs?

There is a Vanilla universe file viewer facility, developed by CeSC, which monitors running jobs across all the pools in Environment Two. Contact CeSC in order to get a password to use this facility.

As shown above, you can check on your jobs using the condor_q command. A single job or a whole cluster of jobs can be removed with condor_rm followed by the ID shown by condor_q, e.g. condor_rm 32.1 for one job or condor_rm 32 for the entire cluster.
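If you have many jobs, it can be handy to summarise the condor_q listing with a short script. The Python sketch below tallies the ST (state) column of the sample listing shown earlier in this FAQ. It assumes the column layout of that listing, which may differ between Condor versions, and the function name count_states is ours.

```python
import re
from collections import Counter

# Sample text copied from the condor_q listing shown earlier in this FAQ.
# In practice you would capture the output of condor_q itself.
sample = """\
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  32.0   ceb45           3/21 11:57   0+00:00:00 I  0   9.8  gulp3.0.2.0-linux
  32.1   ceb45           3/21 11:57   0+00:00:00 I  0   9.8  gulp3.0.2.0-linux
  32.2   ceb45           3/21 11:57   0+00:00:00 I  0   9.8  gulp3.0.2.0-linux
  32.3   ceb45           3/21 11:57   0+00:00:00 I  0   9.8  gulp3.0.2.0-linux
  32.4   ceb45           3/21 11:57   0+00:00:00 I  0   9.8  gulp3.0.2.0-linux
"""


def count_states(text):
    """Tally the ST column (e.g. I = idle, R = running, H = held)."""
    states = Counter()
    for line in text.splitlines():
        fields = line.split()
        # Job lines start with an ID of the form cluster.process, e.g. 32.0;
        # the state is assumed to be the sixth whitespace-separated field.
        if len(fields) > 5 and re.match(r"^\d+\.\d+$", fields[0]):
            states[fields[5]] += 1
    return states


counts = count_states(sample)
print(dict(counts))
```

For the sample listing this reports five jobs in state I, in agreement with the "5 idle" summary line printed by condor_q.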

What about parallel jobs?

Our Condor pool is not set up to run MPI jobs in the Condor 'parallel universe'. If you need to run jobs of this sort, please contact your local Computer Officers (COs) to find out what local resources are available and suitable for your needs.

Condor-users mailing list archives

The University of Wisconsin-Madison holds an archive of all postings to the condor-users mailing list, which can be searched to see whether anyone else has encountered a similar problem in the past. Archive listing (pre-June-2004 to present).

Other online resources

A description of the two grids within the university.

The main page for CamGrid's Environment Two. This describes the requirements for joining the grid and lists useful online resources.

A description of how Env2 of CamGrid is structured, including an architectural diagram.

A set of tools for monitoring the state of the grid and users' vanilla jobs. Also has a simple submit file builder.

An example of the sort of configuration tasks a sysadmin might need to carry out on a machine to get it on CamGrid.

A description of how PBS resources can be added to CamGrid (for sysadmins) and examples of how they can have jobs submitted to them (for users).

A brief tutorial on using Parrot within CamGrid for exporting file systems from a submit host in an unprivileged manner.

Instructions on how to link F95 code with the Condor libraries using the NAG compiler to enable standard universe jobs.

A tutorial on how to use DAGMan, Condor's workflow tool, to perform checkpointing for vanilla jobs at the application layer.