If you need to run batches of related but independent jobs (ones whose order of execution doesn’t matter) on SHARCNET or other consortia, this seminar is for you. The batches can consist of serial jobs (“serial farming”) or parallel jobs (“MPI farming”, “CUDA farming”, etc.). A typical case is a calculation or simulation that depends on a few poorly constrained parameters. If the number of such parameters is small (say 1-4), you can sample the parameter space by first running a coarse grid of jobs (the first batch), then analyzing the results and running a finer grid that zooms in on the area of interest (the second, third, etc. batches). With a larger number of parameters, one can resort to the Monte Carlo approach, where one or more batches of jobs attempt to explore the whole parameter space in a pseudo-random fashion.
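The two sampling strategies described above can be sketched in a few lines of Python. This is only an illustration of how the cases for one batch might be generated: the function names (`grid`, `zoom`, `monte_carlo`), the `shrink` factor, and the idea of re-centring on a single best point are assumptions made for the example, not part of the scripts presented in the seminar.

```python
import itertools
import random


def grid(bounds, points_per_axis):
    """Return every combination of evenly spaced values inside the bounds.

    bounds is a list of (low, high) pairs, one per parameter;
    points_per_axis must be at least 2.
    """
    axes = []
    for low, high in bounds:
        step = (high - low) / (points_per_axis - 1)
        axes.append([low + i * step for i in range(points_per_axis)])
    return list(itertools.product(*axes))


def zoom(bounds, best, shrink=0.25):
    """Narrow each parameter range around the best point found so far.

    The new range is `shrink` times as wide as the old one and is
    clipped so that it never leaves the old bounds.
    """
    new_bounds = []
    for (low, high), centre in zip(bounds, best):
        half = (high - low) * shrink / 2
        new_bounds.append((max(low, centre - half), min(high, centre + half)))
    return new_bounds


def monte_carlo(bounds, n_cases, seed=0):
    """Draw n_cases pseudo-random points uniformly inside the bounds."""
    rng = random.Random(seed)  # fixed seed so a batch can be regenerated
    return [
        tuple(rng.uniform(low, high) for low, high in bounds)
        for _ in range(n_cases)
    ]
```

Each tuple returned by `grid` or `monte_carlo` would correspond to one job in a batch. After analyzing the first batch, `zoom` gives the bounds for the next, finer grid.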
In this seminar I will briefly touch upon simple ways to do job farming (from one-line shell commands to the array jobs feature of the scheduler), but will spend most of my time describing our fairly sophisticated set of scripts developed to facilitate serial and other kinds of farming. With a bit of customization, these scripts can be used to:
- submit one or more batches of jobs;
- query the job status for any specific batch;
- kill all jobs in a specific batch;
- automatically resubmit all the jobs in a batch which didn’t run (because of the 7-day runtime limit) or failed (due to a crashed node, file system problems, etc.).
I will also specifically cover the scenario where one needs to run a large number (thousands) of very short jobs. Our scripts handle this task very efficiently by bundling many small jobs into fewer, larger jobs and using dynamic workload balancing.
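To make the idea of dynamic workload balancing concrete, here is a minimal single-machine sketch. It is not the implementation used by our scripts: the function `run_farm`, its parameters, and the use of threads with an in-memory queue are assumptions chosen to keep the example self-contained. On a cluster, each worker would instead be a separate scheduler job, and the shared queue would need some cross-job mechanism.

```python
import queue
import threading


def run_farm(cases, run_case, n_workers):
    """Run every case using a small, fixed pool of workers.

    Each worker keeps pulling the next unprocessed case from a shared
    queue until none are left, so a worker that happens to get short
    cases simply processes more of them (dynamic workload balancing).
    Results are returned in the same order as the input cases.
    """
    todo = queue.Queue()
    for index, case in enumerate(cases):
        todo.put((index, case))

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                index, case = todo.get_nowait()
            except queue.Empty:
                return  # nothing left to do, so this worker exits
            outcome = run_case(case)
            with lock:
                results[index] = outcome

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return [results[i] for i in range(len(cases))]
```

For example, `run_farm(list(range(1000)), my_short_job, 8)` would process a thousand short cases with only eight workers, where `my_short_job` is a placeholder for the user's own function. The point of the design is that cases are not divided among the workers in advance, so uneven case run times do not leave some workers idle while others are still busy.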