Note: Some of the information on this page is for our legacy systems only. The page is scheduled for an update to make it applicable to Graham.

What is serial farming, and who needs it?

Serial farming means solving a computational (research, engineering, etc.) problem by running a large number of serial jobs on a cluster. You need serial farming if your problem belongs to HPC (High Performance Computing) and can be split into a large number of independent chunks.

HPC can be defined in different ways, but probably the simplest definition is "something you cannot solve on your desktop".

Serial farming jobs normally have no data dependencies. In other words, they can be executed in any order (and with any degree of concurrency) and still produce the same result. Monte Carlo type problems are a good example.

SHARCNET environment

To run serial farming jobs efficiently on SHARCNET, one has to know certain aspects of the SHARCNET environment.

Scheduler

The basic scheduler commands are sqsub, sqkill, and sqjobs. Keep in mind that serial jobs are much easier for the scheduler to allocate than large parallel jobs.

The maximum number of concurrently running jobs is limited by the user's Certification Level (256 CPUs per cluster for the default Level 1). There is also a cluster-specific limit on the number of queued jobs (5000 for orca).
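As a minimal illustration (the 1-day runtime, the output file name, and ./code are placeholders), a single serial job can be submitted, monitored, and killed like this:

# Submit a serial job with a 1-day runtime limit and a named output file
sqsub -r 1d -o out.log ./code
# List your queued and running jobs
sqjobs
# Kill a particular job by its jobid
sqkill <jobid>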

Global vs. cluster-specific file systems

Some of our file systems (/home, /work) are shared across multiple clusters, while others are local to one cluster (/scratch). This can be an issue if a serial farm spans more than one cluster. If all the jobs from a serial farm run on a single cluster, any file system is fine.

Bash scripting


  • You need scripting for serial farming
    • Tens/hundreds of identical sqsub etc. commands.
    • Large number of data files to process: ICs, post-processing...
  • Any scripting language will do (Perl, Python, ...), but
    • Bash is convenient because you already know many shell commands.
    • Parts of the script can be tested by directly executing them in the shell.
  • Usual shell commands
    • ls, cd, mkdir, rm, cp, mv, grep, cat, cut...
  • Variables
    • Environment variables ($PATH etc.)
    • name=value to initialize locally.
    • $name to get the value.
    • $1, $2, ... (or $*) - command-line arguments.
  • Other commands
    • echo, exit, lockfile, shift
    • if test ...; then ...; fi
    • for name in ...; do ... ; done
    • for ((i=0; i<10; i++)); do ...; done
    • Integer arithmetic: e.g. i=$(($i + 1))
    • Inline functions: function name () {...}.
  • Piping
    • |, >, 2>, &>, 2>&1, >>, <,...
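To tie these elements together, here is a short illustrative script (not one of the example files mentioned below; all names are placeholders) that uses command-line arguments, tests, an inline function, a loop, and integer arithmetic:

#!/bin/bash
# Illustrative only: check which data files data0 ... data(N-1) exist.
if test $# -lt 1; then
  echo "Usage: $0 N"
  exit 1
fi
N=$1

# An inline function
function report () {
  echo "Found file $1"
}

count=0
for ((i=0; i<$N; i++)); do
  if test -f data$i; then
    report data$i
    count=$(($count + 1))
  fi
done
echo "$count of $N data files are present" >> report.log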

Examples

  • Submit a series of serial jobs, with command-line arguments (IC files) data0, data1, ..., dataN:
for ((i=0; i<=$N; i++))
  do 
  sqsub -r1d ./code ./data$i
  done
  • Same, but storing JOBIDs in a file:
for ((i=0; i<=$N; i++))
  do
  echo "Submitting job no. $i"
  JOBID=`sqsub -r 7d -o out$i code data$i 2>&1 |grep "submitted as jobid" | cut -d" " -f4`
  echo $JOBID >> jobid.txt
  done
  • Killing all jobs listed in jobid.txt:
sqkill `cat jobid.txt`

GSL (GNU Scientific Library)

Overview

Random number generation

  • The whole state of the generator can be saved (and read) as a binary file.
    • Not portable between 32 and 64 bit CPUs.
  • Can be initialized in different ways:
    • Reading a state file.
    • Reading a seed from stdin/file.
    • Reading a seed from environment.
    • Generator itself can be specified through environment.
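For the last two points, GSL's gsl_rng_env_setup() reads the generator type and the seed from GSL's standard environment variables. A minimal sketch (./code and ./data0 are placeholders, and it is assumed the code calls gsl_rng_env_setup()):

# Specify the generator itself through the environment ...
export GSL_RNG_TYPE=mt19937
# ... and the seed as well; gsl_rng_env_setup() in the code picks both up.
export GSL_RNG_SEED=12345
./code ./data0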

Signal handling

  • Not absolutely necessary, but highly desirable.
  • Bash script signal handling:
    • To properly interrupt a partially executed job submission script.
    • The trap command: trap 'handler' n1 n2 ... (a sketch follows this list).
  • Code signal handling:
    • To save the final random generator state when the job crashes or is killed.
    • The C library function signal(n, handler).
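A minimal sketch of the bash side (the signal list and the cleanup action are illustrative, and sleep stands in for the actual submission command):

#!/bin/bash
# Stop a partially executed submission loop cleanly on Ctrl-C or kill.
function cleanup () {
  echo "Interrupted - stopping job submission"
  exit 1
}
# Install the handler for SIGINT and SIGTERM
trap cleanup INT TERM

for ((i=0; i<10; i++)); do
  echo "Submitting job no. $i"
  sleep 1
done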

(Almost) perfect MC setup

Main points

  • Generate a set of random number generator state files once, and place them in a shared file system.
  • Every time a serial code runs, it reads an available state file; at the end, it replaces it with the final generator state.
  • Complications: file locking is needed so that two jobs never grab the same state file, and killed jobs require signal handling (see the sketch under Implementation below).
  • Can also be used in parallel (MPI) codes.

(Figure: Farming2.png)

Implementation

  • A code for generating the random state files.
    • Creates many binary state files in a given shared directory, with names like seed339.avail .
  • Using the lockfile command to guarantee that each job uses a unique state file (a sketch follows this list). Be aware that lockfile doesn't work on parallel file systems like our global work and scratch; it does work on NFS file systems like /home . lockfile is not installed system-wide, but you can get a copy here: ~syam/bin/lockfile . Remember: only use it with /home !
  • Checkpointing: the code writes the state file out regularly.
  • If the job is killed: signal handling saves the final state.
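A minimal sketch of claiming a unique state file with lockfile (the seed directory, the .busy suffix, and ./code are placeholders; the real scripts are listed under Example files below):

#!/bin/bash
# Illustrative only: atomically claim one available generator state file.
SEED_DIR=$HOME/Monte_Carlo/seeds   # placeholder; must live on /home for lockfile

# Serialize access to the seed directory
lockfile $SEED_DIR/dir.lock
STATE=`ls $SEED_DIR/seed*.avail 2>/dev/null | head -1`
if test -z "$STATE"; then
  rm -f $SEED_DIR/dir.lock
  echo "No available state files"
  exit 1
fi
BUSY=${STATE%.avail}.busy
mv $STATE $BUSY
rm -f $SEED_DIR/dir.lock

# Run the code with the claimed state file; it overwrites the file with the
# final generator state, and the script then marks it available again.
./code $BUSY
mv $BUSY $STATE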

Example files

  • Can be found in
/home/syam/Monte_Carlo
  • Program genseed.c for initial state file generation.
  • Code example: random.c
  • Bash scripts:
    • multi_run.sh: submitting the jobs;
    • multi_jobs.sh: job stats;
    • multi_kill.sh: killing jobs;
    • release_seeds.sh: cleanup after major crashes.

Conclusions

  • Serial farming is an efficient way of doing HPC.
  • It is very important to know/use a scripting language (like bash).
  • GSL is a great tool, in particular for Monte Carlo simulations.
  • There are setups (one example was presented above) that achieve “perfect” random number generation for serial farming or parallel codes.