
Latest revision as of 14:46, 25 July 2017

Note: Some of the information on this page is for our legacy systems only. The page is scheduled for an update to make it applicable to Graham.

The NVIDIA Deep Learning GPU Training System (DIGITS) puts the power of deep learning in the hands of data scientists and researchers.

DIGITS v3 with nv-caffe v0.14, CUDA 7.5, cuDNN v4 (only works on Copper and Mosaic clusters)

The DIGITS server can be submitted as a job to a compute node, and you can then access it with a web browser from a login node.

First copy DIGITS to your /work folder. All data will be generated under the DIGITS folder.

rsync -arl /opt/sharcnet/testing/NVDIGITS/DIGITS-digits-3.0 /work/yourusername/

Then go into the DIGITS-digits-3.0 folder and set up the environment:

source /opt/sharcnet/testing/NVDIGITS/setdigits.sh

Then you can submit "digits-server" as a job:

On Mosaic (e.g. requesting a whole node with a 4-hour runtime limit):
sqsub -q gpu -f threaded -n 20 --gpp=1 --mpp=200g -o output.txt -r 4h ./digits-server

On Copper (e.g. requesting a whole node (8 GPUs) with a 4-hour runtime limit):
sqsub -q gpu -f threaded -n 16 --gpp=8 --mpp=90g -o output.txt -r 4h ./digits-server
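
The two submissions above differ only in the -n, --gpp and --mpp values. As a rough sketch, a small shell helper could keep both variants in one place. The name digits_sqsub_cmd and the optional runtime argument are assumptions made for illustration, not part of any SHARCNET tooling, and the function only prints the command so it can be checked before being run.

```shell
# Hypothetical helper (not a SHARCNET tool): print the sqsub command
# for a whole-node DIGITS job on the given cluster.
# Usage: digits_sqsub_cmd <mosaic|copper> [runtime, default 4h]
digits_sqsub_cmd() {
    cluster=$1
    runtime=${2:-4h}
    case "$cluster" in
        # Per-cluster values copied from the commands shown on this page.
        mosaic) opts="-n 20 --gpp=1 --mpp=200g" ;;
        copper) opts="-n 16 --gpp=8 --mpp=90g" ;;
        *) echo "unknown cluster: $cluster" >&2; return 1 ;;
    esac
    echo "sqsub -q gpu -f threaded $opts -o output.txt -r $runtime ./digits-server"
}
```

Calling digits_sqsub_cmd copper prints the Copper command shown above; the printed line can then be pasted into the shell from inside the DIGITS-digits-3.0 folder.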

If running on Copper or mos1, set up the environment again after logging into a GPU node (run: source /opt/sharcnet/testing/NVDIGITS/setdigits.sh), and then run "./digits-server" directly.

To use DIGITS, log in to Mosaic or Copper in another session with X11 forwarding enabled by adding -Y to the ssh command (e.g. ssh -Y yourname@mosaic.sharcnet.ca). You will also need a web browser on the SHARCNET machine: either download Firefox from https://www.mozilla.org/en-US/firefox/all/ (choose the Linux 64-bit version) or copy /work/feimao/software_installs/firefox to your own folder. Then go to the firefox folder, run ./firefox, and open the page:

http://mos1:34448/ (replace mos1 with the name of the compute node running your job, as shown by the sqjobs command)

If someone else is already running a server on the same port, the output will look like this:

2016-03-17 14:13:52 [74190] [ERROR] Connection in use: ('0.0.0.0', 34448)
2016-03-17 14:13:52 [74190] [ERROR] Retrying in 1 second.
2016-03-17 14:13:53 [74190] [ERROR] Connection in use: ('0.0.0.0', 34448)
2016-03-17 14:13:53 [74190] [ERROR] Retrying in 1 second.
2016-03-17 14:13:54 [74190] [ERROR] Connection in use: ('0.0.0.0', 34448)
2016-03-17 14:13:54 [74190] [ERROR] Retrying in 1 second.
2016-03-17 14:13:55 [74190] [ERROR] Connection in use: ('0.0.0.0', 34448)
2016-03-17 14:13:55 [74190] [ERROR] Retrying in 1 second.
2016-03-17 14:13:56 [74190] [ERROR] Connection in use: ('0.0.0.0', 34448)
2016-03-17 14:13:56 [74190] [ERROR] Retrying in 1 second.

In that case, start the server on a different port, trying ports until you find a free one. The port number must be greater than 1024, as ports below 1024 require root access. To change the port, pass the bind address and the new port to digits-server with the -b option:

./digits-server -b 0.0.0.0:34449
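
The retry procedure above could be sketched in shell roughly as follows. Both function names (busy_port and bind_address) are hypothetical helpers introduced here for illustration. The sketch assumes the server output was written to a file such as output.txt in the log format shown above.

```shell
# Hypothetical helper: print the port reported as busy in a server log,
# or nothing if the log holds no "Connection in use" line.
busy_port() {
    # Matches lines like: [ERROR] Connection in use: ('0.0.0.0', 34448)
    sed -n "s/.*Connection in use: ('[^']*', \([0-9][0-9]*\)).*/\1/p" "$1" | head -n 1
}

# Hypothetical helper: check that a candidate port is usable without root
# and print the address string expected by the -b option.
bind_address() {
    port=$1
    case "$port" in
        # Reject empty input, non-digits, and anything longer than 5 digits.
        ''|*[!0-9]*|??????*) echo "invalid port: $port" >&2; return 1 ;;
    esac
    # This page asks for ports above 1024; 65535 is the largest TCP port.
    if [ "$port" -le 1024 ] || [ "$port" -gt 65535 ]; then
        echo "port must be between 1025 and 65535" >&2
        return 1
    fi
    echo "0.0.0.0:${port}"
}
```

For example, if busy_port output.txt prints 34448, the next attempt could be ./digits-server -b "$(bind_address 34449)". Note that bind_address only validates the number; it does not test whether another user already holds the port, so the server log still has to be checked after each attempt.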

DIGITS with nv-caffe v0.13, CUDA 7.5, cuDNN v3 (only works on Copper and Mosaic clusters)

To run DIGITS, please copy "/work/feimao/software_installs/DIGITS" and "/work/feimao/software_installs/caffe-new/caffe-libs" to your own folder (e.g. /work/yourname/).

  • Go to mos1 on Mosaic or cop1 on Copper and change the modules:
module unload intel/12.1.3 mkl/10.3.9 openmpi/intel/1.6.2
module load intel/15.0.3
module load hdf/serial/5.1.8.11 
module unload cuda
module load cuda/7.5.18 
module load python/intel/2.7.10
  • Export the paths to nvcaffe and all python dependencies:
export PATH=/work/feimao/software_installs/caffe-new/caffe-libs/bin:/work/feimao/software_installs/caffe-new/caffe-libs/include:/work/feimao/software_installs/DIGITS/libs/bin:$PATH
export LD_LIBRARY_PATH=/work/feimao/software_installs/caffe-new/caffe-libs/lib:$LD_LIBRARY_PATH
export PYTHONPATH=/work/feimao/software_installs/DIGITS/libs/lib/python2.7/site-packages/:$PYTHONPATH
  • Set the proper nv-caffe path (nv-caffe should be compiled in the same way as Caffe; please refer to the Caffe page for instructions).

Edit the digits.cfg file at /work/yourplace/DIGITS/digits/digits/digits.cfg so that caffe_root points to your nv-caffe build:

[DIGITS]
caffe_root = /work/yourplace/DIGITS/nv-caffe-0.13
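
If you prefer not to edit the file by hand, the change could be scripted along these lines. This is only a sketch: set_caffe_root is a hypothetical helper name, and it assumes that digits.cfg already contains a caffe_root line and that the new path contains no "|" or "&" characters (both are special in the sed expression used here).

```shell
# Hypothetical helper: point the caffe_root entry of a digits.cfg file
# at a new directory, leaving every other line unchanged.
# Usage: set_caffe_root <path-to-digits.cfg> <nv-caffe-directory>
set_caffe_root() {
    cfg=$1
    root=$2
    # Assumption: an existing caffe_root line is present to be replaced.
    if ! grep -q '^caffe_root *=' "$cfg"; then
        echo "no caffe_root entry found in $cfg" >&2
        return 1
    fi
    # Write to a temporary file first so a failed sed cannot truncate the config.
    sed "s|^caffe_root *=.*|caffe_root = ${root}|" "$cfg" > "${cfg}.tmp" &&
        mv "${cfg}.tmp" "$cfg"
}
```

A call such as set_caffe_root /work/yourplace/DIGITS/digits/digits/digits.cfg /work/yourplace/DIGITS/nv-caffe-0.13 would then produce the entry shown above.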


To start the server

After logging in to mos1 or cop1 and running all the commands above, go to the DIGITS/digits folder and run:

./digits-server

To connect to the server by a web browser

To use DIGITS, log in to Mosaic or Copper in another session with X11 forwarding enabled by adding -Y to the ssh command (e.g. ssh -Y yourname@mosaic.sharcnet.ca). You will also need a web browser on the SHARCNET machine: either download Firefox from https://www.mozilla.org/en-US/firefox/all/ (choose the Linux 64-bit version) or copy /work/feimao/software_installs/firefox to your own folder. Then go to the firefox folder, run ./firefox, and open the page:

http://mos1:34448/