Condor Job Scheduler in 5 Minutes

Dr. Junjun Mao

Research Associate, Levich Institute at City College of New York
Last update: 12/04/2006


What is Condor?

Condor is a distributed job scheduler developed by the Computer Sciences Department at the University of Wisconsin–Madison. It harnesses the computing power of a pool of computers that communicate over a network. When a user submits a job to Condor, Condor finds an available computer in the pool and runs the job on it. Condor can satisfy a variety of special job requirements and assigns jobs in a way that balances the workload.

The Condor Project URL is: https://www.cs.wisc.edu/condor/

Serial job or Parallel Job?

A serial job is a single process that executes its instructions in a fixed order. It may or may not spawn child processes; however, in a serial job a child process talks only to its parent, and only one process actively uses a CPU at any time. Most jobs can be classified as serial jobs.

Some jobs require more than one process to run at the same time, with the processes communicating with each other. Condor's parallel universe is a mechanism that supports a wide variety of parallel programming environments, including most implementations of MPI. Three scheduling policies are unique to such parallel jobs: 1) when a parallel job is scheduled, a number of machines are allocated and the processes start at the same time; 2) parallel processes are not migrated, suspended, or preempted; 3) the first machine selected is treated specially: when the job on it exits, Condor shuts down all the other peer processes, even if they have not finished running yet.

How to submit a Condor job?

To submit a job, pass Condor a submit description file with the command:

 condor_submit descriptionfile 
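
For example, submitting a description file named job.sub (a hypothetical name) typically prints the cluster number Condor assigned, along the lines of:

 condor_submit job.sub
 Submitting job(s).
 1 job(s) submitted to cluster 42.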

Below are sample description files for serial jobs and for an MPI job. All Condor variables and commands in a submit description file are case insensitive. Note that the MPI job is invoked through a script, "mpiscript", and the actual MPI executable is passed to it as the first argument. This is similar to the convention of launching MPI executables with the mpirun script.
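
The contents of /home/condor/bin/mpiscript are site-specific and not reproduced on this page. As a rough illustration only, a minimal wrapper of this kind might look like the sketch below, which assumes an mpirun-style launcher and uses the _CONDOR_PROCNO/_CONDOR_NPROCS environment variables that Condor's parallel universe sets on each allocated node:

 #!/bin/sh
 # Hypothetical sketch of an mpiscript-style wrapper; the real site
 # script also handles machine files, ssh setup, and cleanup.
 EXE=$1; shift                       # first argument: the MPI executable
 if [ "${_CONDOR_PROCNO:-0}" -eq 0 ]; then
     # Node 0 launches mpirun across the machines Condor allocated
     mpirun -np "${_CONDOR_NPROCS:-1}" "$EXE" "$@"
 fi
 # Nodes other than 0 simply stay up so mpirun can start processes on them.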

Condor checks node status at an interval of about 5 minutes. This is necessary because jobs need time to clean up resources, and CPU load is measured as an average. It may therefore take up to 5 minutes for your job to start, and up to another 10 minutes for CPU locks to be released after the job is done. Once a job has started, it runs at full speed.

Serial job
(example description file - submit from any place)

MPI Job
(example description file - submit from the working directory)

#####
# Submit a batch of two serial jobs on crebeo
#####
# Section you do not want to edit
universe = Vanilla
output = run.out
error = condor.err
log = condor.log
getenv = True
#####
# Please customize the following entries for your needs

executable = /home2/jmao/condor_test/prime.py
#arguments =

initialdir = /home2/jmao/condor_test/run_1
queue
initialdir = /home2/jmao/condor_test/run_2
queue
### End of description file ###
#####
# Condor submit description file template for MPI jobs on Benny
#####
# You may want to keep these lines unchanged
universe = parallel
output = run.out
error = condor.err
log = condor.log
should_transfer_files = yes
when_to_transfer_output = on_exit
executable = /home/condor/bin/mpiscript
#####
# You may comment out this line if your parallel jobs do not use semaphores
Requirements = semaphore_clean
# Customize the following lines for your needs

arguments = /home/jmao/condor_test/jsingh_tests/test2/mpi-two_bub_401_2
machine_count = 2
queue

### End of description file ###

Universe:

The Vanilla universe is for serial jobs that are independent of the Condor libraries. Other universes such as "standard" can take advantage of the Condor libraries, so that jobs may "migrate" among nodes for better performance. The parallel universe is for parallel jobs.

Executable:

The path and file name of the executable. The executable should be on a shared file system. If a relative path is used, it should be relative to the directory where condor_submit is called.

Output:

The named file captures what the program would normally write to stdout.

Error:

The named file captures any error messages the program would normally write to stderr.

Log:

This file keeps Condor's log of this job.

Initialdir:

Specifies the initial working directory for the job in the queue. It should be a path to a preexisting directory. If not specified, initialdir defaults to the directory where condor_submit is invoked. Input, output, error, and log files are all kept in initialdir.

Queue:

Sends a job to the Condor queue. To set up runs of the same executable in different directories, you may prepare multiple initialdir/queue pairs; each initialdir/queue pair is a command to submit one job. As long as the executable is the same, a single description file can submit multiple jobs that run in different directories. The jobs submitted with one description file form a job "cluster", and each job within it is called a "process".
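
For example, the serial description file above queues two jobs. If Condor were to assign them cluster number 42 (a hypothetical value), condor_q would list them as 42.0 and 42.1, one process per queue statement (abridged, hypothetical output):

  ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
  42.0    jmao   12/4  10:00  0+00:01:12 R  0   1.2  prime.py
  42.1    jmao   12/4  10:00  0+00:01:12 R  0   1.2  prime.py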

Machine_count:

This entry is for parallel jobs only. It is the number of machines that the parallel job will run on. Note that a large machine_count is not always a good idea, since the Condor scheduler must wait until that many machines are available before starting your job.

Arguments:

This line supplies command-line arguments to the executable.

Job Scheduling - Preemption and Priority

To ensure that each user gets a fair share of the computing resources, Condor constantly recalculates the priority of each user and of all jobs in the queue. This means that even if job 1 is submitted before job 2, job 1 does not necessarily execute first. Currently, preemption (interruption of a running job) is disabled on crebeo and benny, so that a job will not be interrupted: once a program starts, it runs to the end.
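
You can inspect the user priorities Condor has calculated with the condor_userprio command (note that lower numeric values mean better priority; the exact output format varies by Condor version):

 condor_userprio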

Jobs that have been running over 7 days may be preempted, depending on the system load.

User Command Summary

Category                     | Command        | Option/Arguments               | Description
-----------------------------+----------------+--------------------------------+-------------------------------------------------------
Submit job                   | condor_submit  | descriptionfile                | Submit a job
Monitor jobs                 | condor_q       |                                | Display the jobs in the queue
                             | condor_q       | User                           | Display the jobs in the queue submitted by this user
                             | condor_q       | -analyze                       | Inquire about resources for the queued jobs
Monitor Condor pool          | condor_status  |                                | Display machine status in Condor format
                             | condor_benny   |                                | Display machine status in a customized format
Edit queued and running jobs | condor_hold    | User                           | Put jobs of this user into the hold state
                             | condor_hold    | Cluster                        | Put jobs of this cluster into the hold state
                             | condor_hold    | Cluster.process                | Put a specific job into the hold state
                             | condor_release | User                           | Release jobs of the user
                             | condor_release | Cluster                        | Release a job cluster
                             | condor_release | Cluster.process                | Release the specified job
                             | condor_rm      | User                           | Remove queued jobs of the user
                             | condor_rm      | Cluster                        | Remove a queued job cluster
                             | condor_rm      | Cluster.process                | Remove a queued job
Review completed jobs        | condor_history |                                | View the log of Condor jobs completed to date
                             | condor_history | User                           | View completed jobs of this user
                             | condor_history | -l                             | Detailed review mode
Adjust job priority          | condor_prio    | (+|-|-p) value Cluster.process | Increase/decrease/set the priority of a job. Job priority ranges from -20 to +20; the default is 0.
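
For example, to raise the priority of the hypothetical job 42.0 to 10:

 condor_prio -p 10 42.0

Note that job priority only reorders jobs relative to your own other jobs; it does not improve your standing against other users.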


Tips:

To check the working directory of a job in the queue:

condor_q (to get cluster and process number)
condor_q -l Cluster.process | grep Iwd

To check the working directory of a finished job:

condor_history (to get cluster and process number)
condor_history -l Cluster.process | grep Iwd
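
Using the hypothetical cluster 42 from the earlier example, the grep would print the working directory as a ClassAd attribute, for instance:

 Iwd = "/home2/jmao/condor_test/run_1"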

Putting jobs that are in the "running" state into the "hold" state in the Vanilla universe may result in unexpected behavior; sometimes such jobs will restart from the beginning. Holding "idle" jobs is safe.
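
For example, to hold and later release the hypothetical idle job 42.1:

 condor_hold 42.1
 condor_release 42.1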


jmao@ccny.cuny.edu