Revision as of 14:45, 20 January 2016

Caffe is a deep learning framework developed with cleanliness, readability, and speed in mind.

Caffe (master branch Dec 3, 2015) with cuDNN v3 (CUDA 7.5) on Copper and Mosaic

All Copper and Mosaic nodes have the latest CUDA 7.5.

The Caffe dependency libraries are under /work/feimao/software_installs/caffe-new/caffe-libs and /work/feimao/software_installs/cudnn3. You can use these folders directly or copy them somewhere under your own folder (e.g. /work/yourname/caffe-libs and /work/yourname/cudnn3). If you don't need to compile your own Caffe, you can simply copy the prebuilt one from /work/feimao/software_installs/caffe-new/caffe-master-1203-75-python2710, which was built on Dec 3, 2015.

How to run a precompiled Caffe

To run the program, you should put caffe-libs, cudnn3 and caffe-master-1203-75-python2710 under your folder, and then load the modules and export the environment paths:

module unload intel/12.1.3 mkl/10.3.9 openmpi/intel/1.6.2
module load intel/15.0.3
module load hdf/serial/5.1.8.11 
module unload cuda
module load cuda/7.5.18 
module load python/intel/2.7.10
export PATH=/work/yourname/caffe-libs/bin:/work/yourname/caffe-libs/include:/work/yourname/cudnn3:$PATH
export LD_LIBRARY_PATH=/work/yourname/caffe-libs/lib:/work/yourname/cudnn3:$LD_LIBRARY_PATH
export PYTHONPATH=/work/yourname/caffe-libs/lib/python2.7/site-packages:/work/yourname/caffe-master-1203-75-python2710/python:$PYTHONPATH
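The three exports above can be wrapped in a small helper so that the base directory is set in one place. This is only a sketch: the function name caffe_env and its argument are hypothetical and not part of the cluster setup, and the module commands above still have to be run separately.

```shell
# Hypothetical helper (not provided by SHARCNET): export the Caffe paths
# relative to a single base directory, e.g.  caffe_env /work/yourname
caffe_env() {
    _base="$1"
    _caffe="$_base/caffe-master-1203-75-python2710"
    # ${VAR:+:$VAR} appends the old value only if it is non-empty,
    # which avoids leaving a trailing ':' in the search path.
    export PATH="$_base/caffe-libs/bin:$_base/caffe-libs/include:$_base/cudnn3${PATH:+:$PATH}"
    export LD_LIBRARY_PATH="$_base/caffe-libs/lib:$_base/cudnn3${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export PYTHONPATH="$_base/caffe-libs/lib/python2.7/site-packages:$_caffe/python${PYTHONPATH:+:$PYTHONPATH}"
}
```

Calling caffe_env /work/yourname after loading the modules should give the same search paths as the three export lines above.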

How to build Caffe

If you'd like to build your own Caffe, you can get the newest Caffe from GitHub (note that you must stay on a login node to have internet access):

[feimao@mos-login test-install-caffe]$ git clone https://github.com/BVLC/caffe.git

You should also prepare all of Caffe's dependency libraries. Just copy the folders /work/feimao/software_installs/caffe-new/caffe-libs/ and /work/feimao/software_installs/cudnn3 to somewhere under your folder (e.g. /work/yourname/caffe-libs and /work/yourname/cudnn3). Then go to mos1 (or cop1) and load the modules:

module unload intel/12.1.3 mkl/10.3.9 openmpi/intel/1.6.2
module load intel/15.0.3
module load hdf/serial/5.1.8.11 
module unload cuda
module load cuda/7.5.18 
module load python/intel/2.7.10

And export the environment paths to your caffe-libs folder:

export PATH=/work/yourname/caffe-libs/bin:/work/yourname/caffe-libs/include:/work/yourname/cudnn3:$PATH
export LD_LIBRARY_PATH=/work/yourname/caffe-libs/lib:/work/yourname/cudnn3:$LD_LIBRARY_PATH
export PYTHONPATH=/work/yourname/caffe-libs/lib/python2.7/site-packages:$PYTHONPATH

Then go to the Caffe folder and copy Makefile.config.example to Makefile.config:

[feimao@mos1 caffe]$ cp Makefile.config.example Makefile.config

Then modify the Makefile.config:

  • 1, uncomment
USE_CUDNN := 1
  • 2, uncomment
CUSTOM_CXX := g++
  • 3, change CUDA_DIR to
CUDA_DIR := /opt/sharcnet/cuda/7.5.18
  • 4, If on Copper, add a line "-gencode arch=compute_37,code=sm_37" to CUDA_ARCH
  • 5, change BLAS to
BLAS := mkl
  • 6, uncomment BLAS_INCLUDE, and change to
BLAS_INCLUDE := /opt/sharcnet/intel/15.0.3/mkl/include
  • 7, uncomment BLAS_LIB, and change to
BLAS_LIB := /opt/sharcnet/intel/15.0.3/mkl/lib/intel64
  • 8, change PYTHON_INCLUDE to
PYTHON_INCLUDE := /opt/sharcnet/python/2.7.10/intel/include \
                /opt/sharcnet/python/2.7.10/intel/include/python2.7 \
                /opt/sharcnet/python/2.7.10/intel/lib/python2.7/site-packages/numpy/core/
  • 9, change PYTHON_LIB to
PYTHON_LIB := /opt/sharcnet/python/2.7.10/intel/lib
  • 10, change INCLUDE_DIRS to
INCLUDE_DIRS := $(PYTHON_INCLUDE) /opt/sharcnet/hdf/5.1.8.11/serial/include /work/yourname/caffe-libs/include /work/yourname/cudnn3 /usr/local/include
  • 11, change LIBRARY_DIRS to
LIBRARY_DIRS := $(PYTHON_LIB) /opt/sharcnet/hdf/5.1.8.11/serial/lib /work/yourname/caffe-libs/lib /work/yourname/cudnn3 /usr/local/lib /usr/lib
  • 12, run the following command to build Caffe and pycaffe:
make all -j8 && make test -j8 && make pycaffe
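The single-line edits in steps 1, 2, 3, 5, 6 and 7 can also be applied with sed instead of by hand. The sketch below assumes the stock layout of Makefile.config.example (commented lines starting with "# ", and uncommented CUDA_DIR and BLAS lines); the function name patch_makefile_config is hypothetical. If the file contains more than one commented BLAS_INCLUDE or BLAS_LIB line, each of them is rewritten to the same value. The remaining steps (CUDA_ARCH, the Python paths, INCLUDE_DIRS and LIBRARY_DIRS) still need to be edited manually.

```shell
# Hypothetical helper: apply the single-line Makefile.config edits listed above.
# Usage: patch_makefile_config Makefile.config
# A backup of the original file is kept next to it with a .bak suffix.
patch_makefile_config() {
    sed -i.bak \
        -e 's|^# *USE_CUDNN := 1|USE_CUDNN := 1|' \
        -e 's|^# *CUSTOM_CXX := g++|CUSTOM_CXX := g++|' \
        -e 's|^CUDA_DIR := .*|CUDA_DIR := /opt/sharcnet/cuda/7.5.18|' \
        -e 's|^BLAS := .*|BLAS := mkl|' \
        -e 's|^# *BLAS_INCLUDE := .*|BLAS_INCLUDE := /opt/sharcnet/intel/15.0.3/mkl/include|' \
        -e 's|^# *BLAS_LIB := .*|BLAS_LIB := /opt/sharcnet/intel/15.0.3/mkl/lib/intel64|' \
        "$1"
}
```

It is worth comparing Makefile.config with Makefile.config.bak afterwards to confirm that every intended line was changed.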

How to test the Caffe build

When "make all" and "make test" finish, you can run "make runtest" to check that Caffe was built properly. You may get errors related to "test_gradient_based_solver.cpp" or other numerical inconsistencies between CPU and GPU. These errors are caused by MKL's Conditional Numerical Reproducibility setting. To avoid them, set MKL_CBWR:

export MKL_CBWR=AUTO

If you run your calculations on the GPU, you don't have to set this variable for the calculation itself.
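If you would rather not leave MKL_CBWR set in your whole session, the variable can be set for a single command only. The wrapper below is a sketch; the function name with_cbwr is hypothetical.

```shell
# Hypothetical wrapper: run one command with MKL_CBWR=AUTO in its environment,
# without changing the environment of the calling shell.
with_cbwr() {
    env MKL_CBWR=AUTO "$@"
}

# Example (run inside the Caffe folder):
#   with_cbwr make runtest
```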

Mosaic cluster installation instructions for Caffe with cudnn v2

You can copy dependencies from

/work/feimao/software_installs/caffe-mosaic/caffe-libs

to somewhere under your folder, or use these files directly from their original path. You can download cuDNN v2, or copy it from /work/feimao/software_installs/cudnn2.

Copy the compiled Caffe folder for Mosaic (with cuDNN v2 support) from:

/work/feimao/software_installs/caffe-mosaic/caffe-master-0717-mosaic-65-python

To run the program, load the following modules:

module unload intel mkl openmpi
module load gcc/4.9.2
module load python/gcc/2.7.8
module switch cuda/6.0.37 cuda/6.5.14

You should also export the library path (it can point to your own folder, including wherever you put cuDNN v2) using the command below:

export LD_LIBRARY_PATH=/work/feimao/software_installs/cudnn2:/work/feimao/software_installs/caffe-mosaic/caffe-libs:/opt/sharcnet/cuda/6.5.14/toolkit/lib64/:$LD_LIBRARY_PATH

Export the Python path as well:

export PYTHONPATH=/work/feimao/software_installs/caffe-mosaic/caffe-libs/python_packages/lib/python2.7/site-packages:/work/feimao/software_installs/caffe-mosaic/caffe-master-0717-mosaic-65-python/python:$PYTHONPATH

If you want to compile Caffe yourself, you should also add the path above to the PATH variable (export PATH=/work/feimao/software_installs/caffe-mosaic/caffe-libs:$PATH), and modify the Makefile and Makefile.config to match the ones in the "caffe-master-0717-mosaic-65-python" folder. If you need help with installing or using Caffe, please email help@sharcnet.ca or submit a ticket.

SHARCNET installation instructions for old version of Caffe

Building Caffe and its dependencies on SHARCNET is not an easy job. If you do not need to modify Caffe's source files, you can copy Caffe and its dependencies from the /work/feimao/software_installs/ directories to your own.

The paths to all dependencies:

Protobuf: /work/feimao/software_installs/protobuf
LevelDB: /work/feimao/software_installs/leveldb
Snappy: /work/feimao/software_installs/snappy
OpenCV (for CUDA 6 on Monk): /work/feimao/software_installs/opencv
OpenCV (for CUDA 6.5 on Angel): /work/feimao/software_installs/opencv65
Boost (with Python 2.7.8 support): /work/feimao/software_installs/boost/boost_157_2
Gflags: /work/feimao/software_installs/gflags
Glog: /work/feimao/software_installs/glog
LMDB: /work/feimao/software_installs/lmdb
cuDNN: /work/feimao/software_installs/cudnn

There are two versions of Caffe: one for Monk and one for Angel (which has the newer GTX 750Ti GPUs).

The Monk version is built with CUDA 6 and Python support (no cuDNN):

/work/feimao/software_installs/caffe/caffe-monk-cuda6-python

The Angel version is built with CUDA 6.5 and cuDNN support:

/work/feimao/software_installs/caffe/caffe-angel-cuda65-cudnn

Setting up the environment for Caffe

To run caffe, you should load the gcc module first:

module unload intel/12.1.3 mkl/10.3.9 openmpi/intel/1.6.2 
module load gcc/4.8.2

Then export the PATH and LD_LIBRARY_PATH.

For Monk:

export PATH=/work/feimao/software_installs/cudnn:/work/feimao/software_installs/opencv/include:/work/feimao/software_installs/opencv/bin:/work/feimao/software_installs/boost/boost_157_2/include:/work/feimao/software_installs/lmdb/include/:/work/feimao/software_installs/lmdb/bin/:/work/feimao/software_installs/leveldb/leveldb-1.15.0:/work/feimao/software_installs/lmdb/mdb-mdb/libraries/liblmdb/:/work/feimao/software_installs/lmdb/bin:/work/feimao/software_installs/lmdb/include:/work/feimao/software_installs/glog/include/:/work/feimao/software_installs/gflags/include/:/work/feimao/software_installs/protobuf/bin/:/work/feimao/software_installs/snappy/include/:$PATH

export LD_LIBRARY_PATH=/work/feimao/software_installs/cudnn:/work/feimao/software_installs/opencv/lib:/work/feimao/software_installs/boost/boost_157_2/lib:/work/feimao/software_installs/lmdb/mdb-mdb/libraries/liblmdb/:/work/feimao/software_installs/lmdb/lib/:/work/feimao/software_installs/leveldb/leveldb-1.15.0:/work/feimao/software_installs/protobuf/lib/:/work/feimao/software_installs/gflags/lib:/work/feimao/software_installs/glog/lib/:/work/feimao/software_installs/snappy/lib/:$LD_LIBRARY_PATH

For Angel:

export PATH=/work/feimao/software_installs/cudnn:/work/feimao/software_installs/opencv65/include:/work/feimao/software_installs/opencv65/bin:/work/feimao/software_installs/boost/boost_157_2/include:/work/feimao/software_installs/lmdb/include/:/work/feimao/software_installs/lmdb/bin/:/work/feimao/software_installs/leveldb/leveldb-1.15.0:/work/feimao/software_installs/lmdb/mdb-mdb/libraries/liblmdb/:/work/feimao/software_installs/lmdb/bin:/work/feimao/software_installs/lmdb/include:/work/feimao/software_installs/glog/include/:/work/feimao/software_installs/gflags/include/:/work/feimao/software_installs/protobuf/bin/:/work/feimao/software_installs/snappy/include/:$PATH

export LD_LIBRARY_PATH=/work/feimao/software_installs/cudnn:/work/feimao/software_installs/opencv65/lib:/work/feimao/software_installs/boost/boost_157_2/lib:/work/feimao/software_installs/lmdb/mdb-mdb/libraries/liblmdb/:/work/feimao/software_installs/lmdb/lib/:/work/feimao/software_installs/leveldb/leveldb-1.15.0:/work/feimao/software_installs/protobuf/lib/:/work/feimao/software_installs/gflags/lib:/work/feimao/software_installs/glog/lib/:/work/feimao/software_installs/snappy/lib/:$LD_LIBRARY_PATH

ImageNet example on Angel and Monk

To train AlexNet on the ImageNet dataset, first download the images and put them into the "data" folder under the Caffe root. Then go to "examples/imagenet/" and modify create_imagenet.sh by adding "--backend=leveldb" and changing the database name from lmdb to leveldb for both the "train" and "val" data, like this:

GLOG_logtostderr=1 $TOOLS/convert_imageset \
    --resize_height=$RESIZE_HEIGHT \
    --resize_width=$RESIZE_WIDTH \
    --shuffle \
    --backend=leveldb \
    $TRAIN_DATA_ROOT \
    $DATA/train.txt \
    $EXAMPLE/ilsvrc12_train_leveldb

Then you should also modify the "train_val.prototxt" under "models/bvlc_alexnet". You have to change "LMDB" to "LEVELDB" and fix the path to the leveldb database.

data_param {
  source: "examples/imagenet/ilsvrc12_train_leveldb"
  backend: LEVELDB
  batch_size: 256
}
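The same change to train_val.prototxt can be made with sed. This sketch assumes the stock file uses "backend: LMDB" and source paths ending in "_lmdb"; the function name use_leveldb is hypothetical.

```shell
# Hypothetical helper: switch the data layers of a prototxt from LMDB to LevelDB.
# Usage: use_leveldb models/bvlc_alexnet/train_val.prototxt
# A backup of the original file is kept next to it with a .bak suffix.
use_leveldb() {
    sed -i.bak \
        -e 's|backend: LMDB|backend: LEVELDB|' \
        -e 's|_lmdb"|_leveldb"|' \
        "$1"
}
```

Check afterwards that both the train and the val data layers point to the databases that create_imagenet.sh actually produced.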

If running on Angel, you should set a smaller batch_size if you get an "out of memory" error. It is caused by the small amount of memory on the GTX 750Ti (2GB).

After changing the parameters in solver.prototxt, you should be able to run Caffe now.

To submit a job, you should request an appropriate amount of memory, large enough for your network:

sqsub -q gpu --gpp=1 --mpp=24g -r 24h -o out.txt ./build/tools/caffe train --solver=models/bvlc_alexnet/solver.prototxt

Bug on Monk

The GPUs on Monk are relatively old and support only SM 2.0. For large networks, a kernel launch can run out of blocks per grid dimension (SM 2.0 allows only 65535 blocks per dimension, while SM 3.0 raised the limit to 2^31-1). This happens when using AlexNet. There is no official fix for this problem yet. Based on this webpage[1], I modified the code in "include/caffe/util/device_alternate.hpp" to:

// CUDA: number of blocks for threads.
inline int CAFFE_GET_BLOCKS(const int N) {
  // Original: return (N + CAFFE_CUDA_NUM_THREADS - 1) / CAFFE_CUDA_NUM_THREADS;
  int num_blocks = (N + CAFFE_CUDA_NUM_THREADS - 1) / CAFFE_CUDA_NUM_THREADS;
  // SM 2.0 devices allow at most 65535 blocks per grid dimension.
  if (num_blocks > 65535) {
    num_blocks = 65535;
  }
  return num_blocks;
}

}  // namespace caffe
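Clamping the block count is only safe because Caffe's kernels iterate with the CUDA_KERNEL_LOOP macro, a grid-stride loop, so each thread keeps processing further elements when there are fewer threads than elements. To see which layer sizes hit the limit, the arithmetic can be reproduced in the shell. The function below is a hypothetical stand-in for the patched CAFFE_GET_BLOCKS, and the value 512 for CAFFE_CUDA_NUM_THREADS is an assumption used for illustration only; check the value in your copy of device_alternate.hpp.

```shell
# Hypothetical shell mirror of the patched CAFFE_GET_BLOCKS.
# $1 = number of elements N
# $2 = threads per block (stand-in for CAFFE_CUDA_NUM_THREADS)
caffe_get_blocks() {
    _blocks=$(( ($1 + $2 - 1) / $2 ))   # round up N / threads
    if [ "$_blocks" -gt 65535 ]; then   # SM 2.0 limit per grid dimension
        _blocks=65535
    fi
    echo "$_blocks"
}

caffe_get_blocks 1000 512        # prints 2 (no clamping needed)
caffe_get_blocks 100000000 512   # prints 65535 (clamped from 195313)
```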

References