Revision as of 20:51, 5 December 2018
LSDYNA |
---|
Description: Suite of programs for transient dynamic finite element analysis |
SHARCNET Package information: see LSDYNA software page in web portal |
Full list of SHARCNET supported software |
Introduction
Before a research group can use LSDYNA on sharcnet, a license must be purchased directly from LSTC for the sharcnet license server. Alternatively, if a research group resides at an institution that has sharcnet computing hardware (mac, uwo, guelph, waterloo), it may be possible to use a pre-existing site license hosted on an accessible institutional license server. To access and use this software you must open a ticket and request to be added to the sharcnet lsdyna group.
Version Selection
Graham Cluster
License
Create and customize this file for your license server, where XXXXX should be replaced with your port number and Y with your server number:
[roberpj@gra1:~] cat .licenses/ls-dyna.lic
#LICENSE_TYPE: network
#LICENSE_SERVER: XXXXX@licenseY.uwo.sharcnet
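The license file can also be generated from the shell. The following is a minimal sketch only: the helper name write_lsdyna_license and its arguments are hypothetical, and the port 12345 and server number 1 are placeholder values, not a real license. The demonstration writes to a scratch directory; omit the third argument to write to ~/.licenses instead.

```shell
#!/bin/bash
# Hypothetical helper: write an ls-dyna.lic file for a given port and server number.
# Usage: write_lsdyna_license PORT SERVER_NUMBER [TARGET_DIR]
write_lsdyna_license() {
    local port="$1" server="$2" dir="${3:-$HOME/.licenses}"
    mkdir -p "$dir" || return 1
    cat > "$dir/ls-dyna.lic" <<EOF
#LICENSE_TYPE: network
#LICENSE_SERVER: ${port}@license${server}.uwo.sharcnet
EOF
}

# Example with placeholder values, written to a scratch directory
demo_dir="$(mktemp -d)"
write_lsdyna_license 12345 1 "$demo_dir"
cat "$demo_dir/ls-dyna.lic"
```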
Module
The modules are loaded in the sample submission scripts shown in the Example Job section below:
For single node smp jobs the available versions can be found by doing:
[roberpj@gra-login2:~] module spider ls-dyna
ls-dyna/7.1.2
ls-dyna/7.1.3
ls-dyna/8.1
ls-dyna/9.1
ls-dyna/9.2
ls-dyna/10.0
To load any version do for example:
module load ls-dyna/9.1
For multi node mpp jobs the available versions can be found by doing:
[roberpj@gra-login2:~] module spider ls-dyna-mpi
ls-dyna-mpi/7.1.2
ls-dyna-mpi/7.1.3
ls-dyna-mpi/8.1
ls-dyna-mpi/9.1
ls-dyna-mpi/9.2
ls-dyna-mpi/10.0
To load the 7.1.X modules do the following:
module load openmpi/1.6.5 ls-dyna-mpi/7.1.3
To load the 8.X or 9.X modules do:
module load openmpi/1.8.8 ls-dyna-mpi/9.2
To load the 10.X modules do:
module load openmpi/1.10.7 ls-dyna-mpi/10.0
Legacy Clusters
License
Once a license configuration is established, the research group will be given a 5 digit port number. The value should then be inserted into the appropriate departmental export statement before loading the module file as follows:
o UofT Mechanical Engineering Dept
export LSTC_LICENSE_SERVER=Port@license1.uwo.sharcnet
o McGill Mechanical Engineering Department
export LSTC_LICENSE_SERVER=Port@license2.uwo.sharcnet
o UW Mechanical and Mechatronics Engineering Dept
export LSTC_LICENSE_SERVER=Port@license3.uwo.sharcnet
o Laurentian University Bharti School of Engineering
export LSTC_LICENSE_SERVER=Port@license4.uwo.sharcnet
o Fictitious Example:
export LSTC_LICENSE_SERVER=12345@license1.uwo.sharcnet
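A common slip is to leave the literal word Port in the export statement. The following sketch checks that the value has the form of a 5 digit port, an @ sign, and a hostname before a job is submitted. The function name check_license_server is hypothetical, and the check only validates the format; it does not contact the license server.

```shell
#!/bin/bash
# Hypothetical sanity check: expects a 5 digit port, "@", then a hostname.
check_license_server() {
    [[ "$1" =~ ^[0-9]{5}@[A-Za-z0-9.-]+$ ]]
}

export LSTC_LICENSE_SERVER=12345@license1.uwo.sharcnet   # fictitious value from above
if check_license_server "$LSTC_LICENSE_SERVER"; then
    echo "LSTC_LICENSE_SERVER looks well formed"
else
    echo "LSTC_LICENSE_SERVER is not of the form port@host" >&2
fi
```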
Module
The next step is to load the desired sharcnet lsdyna module version. Check which modules are available by running the module avail command, then load one of the modules as shown:
[roberpj@orc-login2:~] module avail lsdyna
------------------------------------ /opt/sharcnet/modules ------------------------------------
lsdyna/hyb/r711.88920      lsdyna/mpp/r711.88920      lsdyna/smp/r611.79036
lsdyna/mpp/ls971.r85718    lsdyna/mpp/r712.95028      lsdyna/smp/r711.88920
lsdyna/mpp/ls980B1.011113  lsdyna/mpp/r800.95359      lsdyna/smp/r712.95028
lsdyna/mpp/r611.79036      lsdyna/mpp/r901.109912     lsdyna/smp/r800.95359
lsdyna/mpp/r611.80542      lsdyna/smp/ls980B1.78258   lsdyna/smp/r901.109912
r9.0.1 versions
o For serial or threaded jobs:
module load lsdyna/smp/r901.109912
o For mpi jobs:
module unload intel openmpi lsdyna mkl
module load intel/15.0.3 openmpi/intel1503-std/1.8.7 lsdyna/mpp/r901.109912
r8.0.0 versions
o For serial or threaded jobs:
module load lsdyna/smp/r800.95359
o For mpi jobs:
module unload intel openmpi lsdyna mkl
module load intel/15.0.3 openmpi/intel1503-std/1.8.7 lsdyna/mpp/r800.95359
r7.1.2 versions
o For serial or threaded jobs:
module load lsdyna/smp/r712.95028
o For mpi jobs:
module unload intel openmpi lsdyna
module load intel/12.1.3 openmpi/intel/1.6.5 lsdyna/mpp/r712.95028
r7.1.1 versions
o For serial or threaded jobs:
module load lsdyna/smp/r711.88920
o For mpi jobs:
module load lsdyna/mpp/r711.88920
r6.1.1 versions
o For serial or threaded jobs:
module load lsdyna/smp/r611.79036
o For mpi jobs:
module unload intel openmpi lsdyna
module load intel/11.1.069 openmpi/intel/1.6.4 lsdyna/mpp/r611.79036
or
module load intel/11.1.069 openmpi/intel/1.6.4 lsdyna/mpp/r611.80542
where r611.80542 provides lsdyna_s only.
ls980 versions
o For serial or threaded jobs:
module load lsdyna/smp/ls980B1.78258
o For mpi jobs:
module unload intel openmpi lsdyna
module load intel/12.1.3 openmpi/intel/1.4.5 lsdyna/mpp/ls980B1.011113
or
module load intel/12.1.3 openmpi/intel/1.6.2 lsdyna/smp/ls980B1.78258
Note1) restart capability for ls980smpB1 and ls980mppB1 is not supported
Note2) the module using legacy openmpi/intel/1.4.5 will run extremely slowly
Job Submission
Graham Cluster
Sample submission scripts for the airbag problem are shown in the Example Job section below, for both graham and orca; myompjob.sh and mympijob.sh are used here as stand-in names. When submitting to a def account, specify your own account in the script (not roberpj). Alternatively, you could specify your resource allocation account.
Sample threaded job submit script:
sbatch myompjob.sh
Sample mpi job submit script:
sbatch mympijob.sh
Legacy Clusters
When loading an lsdyna sharcnet legacy module on the new orca or a sharcnet legacy system, the single or double precision solvers are specified with lsdyna_s or lsdyna_d respectively, as shown in the following sqsub commands:
1cpu SERIAL Job
sqsub -r 1h -q serial -o ofile.%J --mpp=2G lsdyna_d i=airbag.deploy.k ncpu=1
4cpu SMP Job
sqsub -r 1h -q threaded -n 4 -o ofile.%J --mpp=1G lsdyna_s i=airbag.deploy.k ncpu=4
If using an explicit solver, one can specify a conservative initial memory setting on the command line as follows:
export LSTC_MEMORY=auto
sqsub -r 1h -q threaded -n 4 -o ofile.%J --mpp=2G lsdyna_d i=airbag.deploy.k ncpu=4 memory=754414
where memory is the minimum number of 8 byte words shared by all processors in double precision, or, 4 byte words in single precision.
The initial value can be determined by starting a simulation interactively on the command line and finding the output resembling: Memory required to begin solution : 754414. The number of words can be specified as memory=260M instead of memory=260000000. For further details see https://www.d3view.com/2006/10/a-few-words-on-memory-settings-in-ls-dyna/.
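To relate a word count to the --mpp request, it helps to convert words to megabytes. The following is a minimal sketch under stated assumptions: the function names expand_words and words_to_mb are hypothetical, M is taken as 10^6 words as in the memory=260M example above, and MB is taken as 10^6 bytes.

```shell
#!/bin/bash
# Expand an LS-DYNA word count such as 260M to a plain integer (M = 10^6 words).
expand_words() {
    case "$1" in
        *M) echo "$(( ${1%M} * 1000000 ))" ;;
        *)  echo "$1" ;;
    esac
}

# Size in MB (10^6 bytes) of a word count: 4 bytes/word single (s), 8 double (d).
words_to_mb() {
    local words bytes_per_word=8
    words="$(expand_words "$1")"
    [ "$2" = "s" ] && bytes_per_word=4
    awk -v w="$words" -v b="$bytes_per_word" 'BEGIN { printf "%.2f\n", w * b / 1e6 }'
}

expand_words 260M        # prints 260000000
words_to_mb 754414 d     # prints 6.04
words_to_mb 260M s       # prints 1040.00
```

For the airbag example, the 754414 words reported at startup amount to roughly 6 MB in double precision, far below the 2G per process requested from the queue.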
8cpu MPI Job
sqsub -r 1h -q mpi -o ofile.%J -n 8 --mpp=2G lsdyna_d i=airbag.deploy.k
The initial memory can also be specified for mpi jobs on the sqsub command line with "memory=" for the first master processor to decompose the problem and "memory2=" used on all processors, including the master, to solve the decomposed problem, where the values are specified as 4 bytes per word in single precision and 8 bytes per word in double precision. The number of words can be specified as memory=260M instead of memory=260000000 OR memory2=260M instead of memory2=260000000. For further details see https://www.d3view.com/2006/10/a-few-words-on-memory-settings-in-ls-dyna/. The initial values can be found by running the simulation interactively on an orca compute node and checking the output for a line such as: Memory required to begin solution (memory= 464898 memory2= 158794 ) which could then be implemented for a job run in the queue by doing:
export LSTC_MEMORY=auto
sqsub -r 1h -q mpi -o ofile.%J -n 8 --mpp=2G lsdyna_d i=airbag.deploy.k memory=464898 memory2=158794
The specification of memory2 on sharcnet or compute canada clusters is not beneficial since the queue reserves the same memory per core for all nodes, such that the decomposition master node process cannot be allocated larger system memory. Therefore it is sufficient to specify a single memory parameter by doing:
export LSTC_MEMORY=auto
sqsub -r 1h -q mpi -o ofile.%J -n 8 --mpp=2G lsdyna_d i=airbag.deploy.k memory=464898
where the LSTC_MEMORY variable will only allow the memory to grow for explicit simulations. The following slurm examples demonstrate how to prescribe the memory parameters exactly.
Example Job
Copy the airbag example to your account:
rsync -av orca.computecanada.ca:/opt/sharcnet/lsdyna/r901.109912/examples r901.109912_examples
cd r901.109912_examples/examples/misc/airbag
gunzip airbag.deploy.k.gz
Graham (default modules)
Please note that graham does not have the sharcnet legacy modules installed on it.
Threaded Job
Sample submission script mysmpjob1.sh for 4 core single precision smp job using compute canada default cvmfs modules:
#!/bin/bash
#SBATCH --account=def-roberpj-ab
#SBATCH --nodes=1                           # use one node
#SBATCH --cpus-per-task=4                   # number threads
#SBATCH --mem=4000M                         # total memory
#SBATCH --time=00:30:00                     # hrs:min:sec
#SBATCH --job-name=testsmp1                 # %x = jobname
#SBATCH --output=slurm-%x-%N-%j.out
echo "SLURM_CPUS_PER_TASK= "$SLURM_CPUS_PER_TASK
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# ----------------------------------------------------------
# IMPORTANT: to set license server do either:
# 1) export LSTC_LICENSE_SERVER=port@some.license.server
# 2) create ~/.licenses/ls-dyna.lic as shown above
# ----------------------------------------------------------
module load ls-dyna/9.1
ls-dyna_s i=airbag.deploy.k ncpu=4 memory=1000M
where memory = 4000M / 4 (bytes/word) = 1000M words
Mpi Job
Sample submission script mympijob2.sh for 4 core double precision mpp job using compute canada default cvmfs modules:
#!/bin/bash
#SBATCH --account=def-roberpj-ab
#SBATCH --ntasks=4                          # number of ranks
#SBATCH --mem-per-cpu=4000M                 # memory per rank
#SBATCH --time=00:30:00                     # hrs:min:sec
#SBATCH --job-name=testmpi2                 # %x = jobname
#SBATCH --output=slurm-%x-%N-%j.out
# ----------------------------------------------------------
# IMPORTANT: to set license server do either:
# 1) export LSTC_LICENSE_SERVER=port@some.license.server
# 2) create ~/.licenses/ls-dyna.lic as shown above
# ----------------------------------------------------------
module load openmpi/1.8.8 ls-dyna-mpi/9.1
srun ls-dyna_d i=airbag.deploy.k memory=500M memory2=500M
where memory = 4000M / 8 (bytes/word) = 500M words
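Rather than hard-coding the word count, a job script could derive it from the memory request. The following is a sketch, not a tested production script: it assumes Slurm exports SLURM_MEM_PER_CPU (in MB) when --mem-per-cpu is requested, and it follows the same arithmetic as above. The 4000 fallback and the variable names mem_mb, mwords and dyna_args are illustrative only, so that the fragment also runs outside a job.

```shell
#!/bin/bash
# SLURM_MEM_PER_CPU is set by Slurm (in MB) when --mem-per-cpu is requested;
# the 4000 fallback is only so this sketch runs outside a job.
mem_mb="${SLURM_MEM_PER_CPU:-4000}"
bytes_per_word=8            # 8 for ls-dyna_d, 4 for ls-dyna_s
mwords=$(( mem_mb / bytes_per_word ))

# Build the solver arguments; echo the command instead of launching it here.
dyna_args="i=airbag.deploy.k memory=${mwords}M memory2=${mwords}M"
echo "srun ls-dyna_d ${dyna_args}"
```

In a real submission script the echo would be replaced by the srun command itself. Leaving some headroom below the full request may be prudent, since the solver is not the only consumer of memory in the job.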
New Orca (legacy modules)
Please note, while new orca has most compute canada cvmfs modules available by default, it does not have the ls-dyna or ls-dyna-mpi modules found on graham installed. Therefore, currently the only way to run lsdyna is by using the legacy sharcnet lsdyna modules as shown in the following two smp and mpi examples.
Threaded Job
Submission script mysmpjob3.sh to run 16 core, single precision, single node, smp job with sharcnet legacy modules:
#!/bin/bash
#SBATCH --account=def-roberpj-ab
#SBATCH --nodes=1                           # use one node
#SBATCH --cpus-per-task=16                  # number threads
#SBATCH --mem=4000M                         # total memory
#SBATCH --time=00:30:00                     # hrs:min:sec
#SBATCH --job-name=mysmpjob3-1node-16core   # jobname
#SBATCH --output=slurm-%x-%N-%j.out         # %x = jobname
echo "SLURM_CPUS_PER_TASK= "$SLURM_CPUS_PER_TASK
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# ----------------------------------------------------------------
# IMPORTANT: to set license server for legacy modules:
export LSTC_LICENSE_SERVER=31047@license3.sharcnet.ca
# since ~/.licenses/ls-dyna.lic does NOT work with legacy modules
# ----------------------------------------------------------------
module purge --force
export MODULEPATH=/opt/sharcnet/modules
module load lsdyna/smp/r712.95028
lsdyna_s i=airbag.deploy.k ncpu=$SLURM_CPUS_PER_TASK memory=500M
where memory = 2000M / 4 (bytes/word) = 500M words, which is half of the 4000M requested (single precision uses 4 bytes per word)
MPI Job
Submission script mympijob4.sh running 16 core single precision mpi job on 1 node with sharcnet legacy modules:
#!/bin/bash
#SBATCH --account=def-roberpj-ab
#SBATCH --ntasks=16
#SBATCH --mem-per-cpu=2000M
#SBATCH --time=00:30:00
#SBATCH --job-name=mympijob4-1node-16core   # jobname
#SBATCH --output=slurm-%x-%N-%j.out         # %x = jobname
# ----------------------------------------------------------
# IMPORTANT: to set license server must do:
export LSTC_LICENSE_SERVER=port@some.license.server
# ~/.licenses/ls-dyna.lic does not work with legacy modules
# ----------------------------------------------------------
module purge --force
export MODULEPATH=/opt/sharcnet/modules
module load intel/12.1.3 openmpi/intel/1.6.5 lsdyna/mpp/r712.95028
machinefile=`srun hostname -s | tr '\n' ':' | sed 's/:/,/g' | sed 's/.$//'`
mpirun -H $machinefile lsdyna_s i=airbag.deploy.k memory=250M memory2=250M
where memory = 1000M / 4 (bytes/word) = 250M words, which is half of the 2000M requested per core
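The machinefile line in the script turns the output of srun hostname -s (one short hostname per task) into the comma separated host list that mpirun -H expects. The following sketch reproduces that transformation on made-up input so it can be tried outside a job; the function fake_srun_hostname and the node names are purely illustrative. It also shows an equivalent shorter form using paste.

```shell
#!/bin/bash
# Stand-in for `srun hostname -s`, which prints one short hostname per task.
# The node names below are made up for illustration.
fake_srun_hostname() {
    printf '%s\n' orc101 orc101 orc102 orc102
}

# Same transformation as in the script above: newlines become commas,
# then the trailing separator is dropped.
hosts_original="$(fake_srun_hostname | tr '\n' ':' | sed 's/:/,/g' | sed 's/.$//')"

# Equivalent shorter form using paste.
hosts_paste="$(fake_srun_hostname | paste -sd, -)"

echo "$hosts_original"   # prints orc101,orc101,orc102,orc102
echo "$hosts_paste"      # prints orc101,orc101,orc102,orc102
```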
Legacy Clusters
STEP1) The following shows sqsub submission of the airbag example to the mpi queue. It is recommended to first edit airbag.deploy.k and change endtim to 3.000E-00 so the job runs long enough to perform the restart in steps 2 and 3 below:
cp -a /opt/sharcnet/lsdyna/r611.79036/examples /scratch/$USER/test-lsdyna
cd /scratch/$USER/test-lsdyna/misc/airbag
gunzip airbag.deploy.k.gz
cp airbag.deploy.k airbag.deploy.restart.k
nano airbag.deploy.restart.k (reduce plot file creation frequency)
*DATABASE_BINARY_D3PLOT
$ dt lcdt
1.000E-01 <--- change from 5.000E-04
export LSTC_LICENSE_SERVER=#####@license#.uwo.sharcnet
module unload intel openmpi lsdyna
module load intel/11.1.069 openmpi/intel/1.6.4 lsdyna/mpp/r611.79036
sqsub -r 10m -q mpi --mpp=2G -n 8 -o ofile.%J lsdyna_d i=airbag.deploy.k
STEP2) With the job still running, use the echo command as follows to create a file called "D3KIL" that will trigger generation of restart files at which point the file D3KIL itself will be erased. Do this once a day if data loss is critical to you OR once or twice just before the sqsub -r time limit is reached. Further information can be found here http://www.dynasupport.com/tutorial/ls-dyna-users-guide/sense-switch-control:
echo "sw3" > D3KIL
sqkill job#
STEP3) Before the job can be restarted, the following two lines must be added to the airbag.deploy.restart.k file:
*CHANGE_CURVE_DEFINITION
1
STEP4) Now resubmit the job as follows using "r=" to specify the restart file:
sqsub -r 1h -q mpi --mpp=2G -n 8 -o ofile.%J lsdyna_d i=airbag.deploy.restart.k r=d3dump01
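STEP2 relies on LS-DYNA erasing the D3KIL file once the restart files have been written, so it can be useful to wait for the file to disappear before killing the job. The following is a minimal sketch of that idea; the function name request_restart_dump and the one second polling interval are assumptions made for illustration, not part of LS-DYNA or the sharcnet tools.

```shell
#!/bin/bash
# Hypothetical helper: request a restart dump and wait until LS-DYNA has
# consumed the D3KIL file (it is erased once the restart files are written).
# Usage: request_restart_dump RUN_DIR [TIMEOUT_SECONDS]
request_restart_dump() {
    local run_dir="$1" timeout="${2:-600}" waited=0
    echo "sw3" > "$run_dir/D3KIL" || return 1
    while [ -e "$run_dir/D3KIL" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "timed out waiting for D3KIL to be consumed" >&2
            return 1
        fi
        sleep 1
        waited=$(( waited + 1 ))
    done
}
```

Run from a login node, it would be called with the job's working directory, for example request_restart_dump /scratch/$USER/test-lsdyna/misc/airbag, followed by sqkill for the job once it returns successfully.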
General Notes
Graham Cluster
Command Line Use
Parallel lsdyna jobs can be run interactively on the command line (outside the queue) for short run testing purposes as follows:
o Orca Development Node MPP (mpi) Example
ssh orca.sharcnet.ca
ssh orc-dev1 (or dev2,3,4)
module unload intel openmpi lsdyna
module load intel/12.1.3 openmpi/intel/1.6.2 lsdyna/mpp/r711.88920
mpirun -np 8 lsdyna_d i=airbag.deploy.k
o Orca Development Node SMP (threaded) Example
ssh orca.sharcnet.ca
ssh orc-dev1 (or dev2,3,4)
module unload intel openmpi lsdyna
module load lsdyna/smp/r711.88920
lsdyna_d i=airbag.deploy.k ncpu=8
Memory Issues
A minimum of mpp=2G is recommended, although the memory requirement of a job may suggest much less is required. For instance, setting mpp=1G for the airbag test job above will result in the following error when running a job in the queue:
--------------------------------------------------------------------------
An attempt to set processor affinity has failed - please check to
ensure that your system supports such functionality. If so, then
this is probably something that should be reported to the OMPI developers.
--------------------------------------------------------------------------
Another error message which can occur after a job runs for a while if mpp is chosen too small is:
>>>>> Process 45 <<<<<
>>>>> Signal 11 : Segmentation Violation <<<<<
To get an idea of the amount of memory a job used, run grep on the output file:
[roberpj@orc-login1:~/samples/lsdyna/airbag] cat ofile.2063302.orc-admin2.orca.sharcnet | grep Memory
| Distributed Memory Parallel |
Memory size from command line: 500000, 500000
Memory for the head node
Memory installed (MB) : 32237
Memory free (MB) : 11259
Memory required (MB) : 0
Memory required to process keyword : 458120
Memory required for decomposition : 458120
Memory required to begin solution (memory= 458120 memory2= 230721)
Max. Memory reqd for implicit sol: max used 0
Max. Memory reqd for implicit sol: incore 0
Max. Memory reqd for implicit sol: oocore 0
Memory required to complete solution (memory= 458120 memory2= 230721)
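The reported values can be turned directly into command line arguments for a later run. The following sketch extracts them from the "Memory required to begin solution" line; the function name extract_memory_args is hypothetical, and the sample file simply repeats lines from the output above so the fragment can be tried without a real job output file.

```shell
#!/bin/bash
# Hypothetical helper: print "memory=N memory2=M" taken from the last
# "Memory required to begin solution" line of an LS-DYNA output file.
extract_memory_args() {
    sed -n 's/.*Memory required to begin solution *(memory= *\([0-9][0-9]*\) *memory2= *\([0-9][0-9]*\) *).*/memory=\1 memory2=\2/p' "$1" | tail -n 1
}

# Sample input copied from the grep output shown above
sample_ofile="$(mktemp)"
cat > "$sample_ofile" <<'EOF'
 Memory required to process keyword : 458120
 Memory required for decomposition : 458120
 Memory required to begin solution (memory= 458120 memory2= 230721)
EOF

extract_memory_args "$sample_ofile"   # prints memory=458120 memory2=230721
```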
Version and Revision
Again run grep on the output file to extract the major and minor revision:
[roberpj@orc-login1:~/samples/lsdyna/airbag] cat ofile.2063302.orc-admin2.orca.sharcnet | grep 'Revision\|Version'
| Version : mpp s R6.1.1 Date: 01/02/2013 |
| Revision: 78769 Time: 07:43:30 |
| SVN Version: 80542 |
Check License Status
Running and Queued Programs
To get a summary of running and/or queued jobs, use the lstc_qrun command as follows. Here queued means the job has started running according to the sqsub command but is actually sitting waiting for license resources to become available. Once these are acquired, the job will start running on the cluster and appear as a Running Program according to lstc_qrun, for example:
[roberpj@hnd19:~] export LSTC_LICENSE_SERVER=#####@license#.uwo.sharcnet
[roberpj@hnd19:~] lstc_qrun
Defaulting to server 1 specified by LSTC_LICENSE_SERVER variable
                     Running Programs
User     Host                    Program     Started          # procs
-----------------------------------------------------------------------------
dgierczy 20277@saw61.saw.sharcn  LS-DYNA_971 Mon Apr 8 17:44  1
dgierczy 8570@saw32.saw.sharcn   LS-DYNA_971 Mon Apr 8 17:44  1
dscronin 25486@hnd6              MPPDYNA_971 Tue Apr 9 20:19  6
dscronin 14897@hnd18             MPPDYNA_971 Tue Apr 9 21:48  6
dscronin 14971@hnd18             MPPDYNA_971 Tue Apr 9 21:48  6
dscronin 15046@hnd18             MPPDYNA_971 Tue Apr 9 21:48  6
dscronin 31237@hnd16             MPPDYNA_971 Tue Apr 9 21:53  6
dscronin 31313@hnd16             MPPDYNA_971 Tue Apr 9 21:54  6
dscronin 6396@hnd15              MPPDYNA_971 Tue Apr 9 21:54  6
csharson 28890@saw175.saw.sharc  MPPDYNA_971 Wed Apr 10 16:48 6
No programs queued
To show license file expiration details, append "-R" to the command:
[roberpj@hnd19:~] export LSTC_LICENSE_SERVER=#####@license#.uwo.sharcnet
[roberpj@hnd19:~] lstc_qrun -R
Defaulting to server 1 specified by LSTC_LICENSE_SERVER variable
**** LICENSE INFORMATION ****
PROGRAM          EXPIRATION CPUS  USED   FREE    MAX | QUEUE
---------------- ----------      ----- ------ ------ | -----
LS-DYNA_971      12/31/2013          -    966   1024 |     0
 dgierczy 20277@saw61.saw.sharcnet 1
 dgierczy 8570@saw32.saw.sharcnet 1
MPPDYNA_971      12/31/2013          -    966   1024 |     0
 dscronin 25486@hnd6 6
 dscronin 14897@hnd18 6
 dscronin 14971@hnd18 6
 dscronin 15046@hnd18 6
 dscronin 31237@hnd16 6
 dscronin 31313@hnd16 6
 dscronin 6396@hnd15 6
 csharson 28890@saw175.saw.sharcnet 6
                 LICENSE GROUP      58    966   1024 |     0
PROGRAM          EXPIRATION CPUS  USED   FREE    MAX | QUEUE
---------------- ----------      ----- ------ ------ | -----
LS-OPT           12/31/2013          0   1024   1024 |     0
                 LICENSE GROUP       0   1024   1024 |     0
Killing A Program
The queue normally kills lsdyna jobs cleanly. However, it is possible that licenses for a job (which is no longer running in the queue and therefore no longer has any processes running on the cluster) will continue to be tied up according to the lstc_qrun command. To kill such a program, determine the pid@hostname from the lstc_qrun command, then run the lstc_qkill command. The following example demonstrates the procedure for username roberpj:
[roberpj@orc-login2:~] lstc_qrun -R | grep roberpj
roberpj 45312@orc329.orca.shar MPPDYNA Wed Dec 5 15:50 12
[roberpj@orc-login2:~] lstc_qkill 45312@orc329.orca.sharcnet.ca
Defaulting to server 1 specified by LSTC_LICENSE_SERVER variable
Program queued for termination
[roberpj@orc-login2:~] lstc_qrun -R | grep roberpj
[roberpj@orc-login2:~]
Legacy Instructions
The following binaries remain available on saw and orca for backward compatibility testing:
[roberpj@orc-login2:~] cd /opt/sharcnet/local/lsdyna
[roberpj@orc129:/opt/sharcnet/local/lsdyna] ls ls971*
ls971_d_R3_1 ls971_d_R4_2_1 ls971_s_R3_1 ls971_s_R4_2_1 ls971_s_R5_1_1
ls971_d_R4_2_0 ls971_d_R5_0 ls971_s_R4_2_0 ls971_s_R5_0
There are currently no sharcnet modules for these versions, hence jobs should be submitted as follows:
module load lsdyna
export PATH=/opt/sharcnet/local/lsdyna:$PATH
export LSTC_LICENSE_SERVER=XXXXX@license3.uwo.sharcnet
cp /opt/sharcnet/local/lsdyna/examples/airbag.deploy.k airbag.deploy.k
SERIAL JOB: sqsub -q serial -r 1d -o ofile.%J ls971_d_R4_2_1 i=airbag.deploy.k
THREADED JOB: sqsub -q threaded -n 4 -r 1d -o ofile.%J ls971_d_R4_2_1 ncpu=4 para=2 i=airbag.deploy.k
Note: the ls971_s_R3_1 and ls971_d_R3_1 binaries do not work; a fix is being sought.
References
o LSTC LS-DYNA Homepage
http://www.lstc.com/products/ls-dyna
o LSTC LS-DYNA Support (Tutorials, HowTos, Faq, Manuals, Release Notes, News, Links)
http://www.dynasupport.com/
o LS-DYNA Release Notes
http://www.dynasupport.com/release-notes
o LS-PrePost Online Documentation FAQ
http://www.lstc.com/lspp/content/faq.shtml
o LSTC Download/Install Overview Page
Provides links to binaries for LS-DYNA SMP/MPP, LS-OPT, LS-PREPOST, LS-TASC.
http://www.lstc.com/download
Memory Links
o Implicit: Memory notes
http://www.dynasupport.com/howtos/implicit/implicit-memory-notes
o LS-DYNA Support Environment variables
http://www.dynasupport.com/howtos/general/environment-variables
o LS-DYNA and d3VIEW Blog (a few words on memory settings)
http://blog2.d3view.com/a-few-words-on-memory-settings-in-ls-dyna/
o Convert Words to GB ie) memory=500MW (3.73GB)
http://deviceanalytics.com/memcalc.php