
Caffe is a deep learning framework developed with cleanliness, readability, and speed in mind.

Caffe2 on Graham

Caffe2 is available on the Graham cluster. Please see the page Caffe2 on Graham: https://docs.computecanada.ca/wiki/Caffe2

  • If you get an "ImportError: No module named google.protobuf.internal" error, a temporary fix is to prepend the scipy-stack site-packages directory to PYTHONPATH on the python command line:
PYTHONPATH=/cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc5.4/python27-scipy-stack/2017a/lib/python2.7/site-packages/:$PYTHONPATH python code.py ...

Caffe on IBM Minsky deep learning server

Caffe is included in the IBM PowerAI toolset. Please see the page Caffe on Minsky: https://www.sharcnet.ca/help/index.php/Minsky#Caffe

Running Caffe (master branch Jun 28, 2016) on Copper and Mosaic

SHARCNET doesn't maintain Caffe as a module, but all of its dependencies and a precompiled Caffe are provided as an example.

All of the files needed to build and run Caffe are under:

/opt/sharcnet/testing/caffe

Set up modules and environment variables

There is a script (caffe-set-env-cudnn4.sh) that sets up all the required modules and environment variables:

module unload intel gcc mkl openmpi hdf python cuda
module load intel/15.0.3
module load hdf/serial/5.1.8.11
module load cuda/7.5.18
module load python/intel/2.7.10
export MKL_CBWR=AUTO
export CAFFE_ROOT=/opt/sharcnet/testing/caffe/caffe-master-160628
export PATH=/opt/sharcnet/testing/caffe/caffe-libs/bin:/opt/sharcnet/testing/caffe/caffe-libs/include:/opt/sharcnet/testing/cudnn/cudnn4:$PATH
export LD_LIBRARY_PATH=/opt/sharcnet/testing/caffe/caffe-libs/lib:/opt/sharcnet/testing/cudnn/cudnn4:$LD_LIBRARY_PATH
export PYTHONPATH=/opt/sharcnet/testing/caffe/caffe-libs/lib/python2.7/site-packages:/opt/sharcnet/testing/caffe/caffe-master-160628/python:$PYTHONPATH

In this script, CAFFE_ROOT points to a precompiled Caffe that was built on Jun 28, 2016. If you want to use this version of Caffe, you don't have to run the commands above one by one; you can simply do:

source /opt/sharcnet/testing/caffe/caffe-set-env-cudnn4.sh
  • You should run this "source" command every time you log in to a SHARCNET system, or put it into your .bashrc if you are sure you understand what these commands do.
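For example, a minimal way to make this automatic (if you do want the environment on every login) is to append the source command to your ~/.bashrc:

# Hedged example: run this once to add the Caffe environment setup to ~/.bashrc
echo 'source /opt/sharcnet/testing/caffe/caffe-set-env-cudnn4.sh' >> ~/.bashrc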

Prepare the data set (e.g. ImageNet)

Although Caffe can take images directly as input, an LMDB database is strongly recommended for better I/O performance. Caffe provides a tool and a script to build an LMDB database from images. Copy the script to your own directory and modify it to meet your needs. The script is

$CAFFE_ROOT/examples/imagenet/create_imagenet.sh

In this file, there are some variables you should modify:

  • EXAMPLE should be the folder where you want to store the LMDB database files.
EXAMPLE=/work/yourname/imagenet/lmdb
  • DATA should be the folder where you keep the label files, train.txt and val.txt.
DATA=/work/yourname/imagenet/labels
  • TOOLS should be the tools folder in Caffe; here we use the precompiled Caffe as an example.
TOOLS=$CAFFE_ROOT/build/tools
  • TRAIN_DATA_ROOT and VAL_DATA_ROOT should be the folders that contain your images (e.g. .jpg files).
TRAIN_DATA_ROOT=/work/yourname/imagenet/train/
VAL_DATA_ROOT=/work/yourname/imagenet/val/
  • RESIZE should be set to true if you haven't already resized the images to 256x256.
  • In train.txt and val.txt, list each image's file name (with its relative location) and label. For example:
n01440764/n01440764_10254.JPEG 0
n01440764/n01440764_10281.JPEG 0

Each line contains one image location and its label. "n01440764" is the name of a subfolder under $TRAIN_DATA_ROOT, so the full path to the first image is /work/yourname/imagenet/train/n01440764/n01440764_10254.JPEG.

  • You can use the data/ilsvrc12/get_ilsvrc_aux.sh script in Caffe to download the labels for ImageNet.

After building the LMDB database, you should also compute an image mean file (imagenet_mean.binaryproto) from your data. The script is $CAFFE_ROOT/examples/imagenet/make_imagenet_mean.sh; modify the variables inside it in the same way as in create_imagenet.sh. If you have already run get_ilsvrc_aux.sh, an imagenet_mean.binaryproto computed on the ImageNet dataset will have been downloaded.
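Putting these steps together, a minimal sketch of the data-preparation workflow (assuming the example paths used above, such as /work/yourname/imagenet) is:

# Hedged sketch: copy the Caffe example scripts somewhere writable, then edit the
# variables (EXAMPLE, DATA, TOOLS, TRAIN_DATA_ROOT, VAL_DATA_ROOT) as described above.
cd /work/yourname/imagenet
cp $CAFFE_ROOT/examples/imagenet/create_imagenet.sh .
cp $CAFFE_ROOT/examples/imagenet/make_imagenet_mean.sh .
# after editing the variables in both scripts:
./create_imagenet.sh          # builds the train/val LMDB databases
./make_imagenet_mean.sh       # writes imagenet_mean.binaryproto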

Define models (AlexNet as an example)

Caffe provides an AlexNet model in $CAFFE_ROOT/models/bvlc_alexnet. Copy this folder to your /home or /work directory so that you can modify it. First, modify the "train_val.prototxt" file: change "mean_file" to where you keep the image mean, and "source" to your training and validation databases.

name: "AlexNet"
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    mirror: true
    crop_size: 227
    mean_file: "/work/feimao/imagenet/imagenet_mean.binaryproto"
  }
  data_param {
    source: "/work/feimao/imagenet/ilsvrc12_train_lmdb"
    batch_size: 256
    backend: LMDB
  }
}
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TEST
  }
  transform_param {
    mirror: false
    crop_size: 227
    mean_file: "/work/feimao/imagenet/imagenet_mean.binaryproto"
  }
  data_param {
    source: "/work/feimao/imagenet/ilsvrc12_val_lmdb"
    batch_size: 50
    backend: LMDB
  }
}

Then change solver.prototxt so that it points to the train_val.prototxt file and sets the output snapshot prefix.

net: "/work/feimao/imagenet/bvlc_alexnet/train_val.prototxt"
test_iter: 1000
test_interval: 1000
base_lr: 0.01
lr_policy: "step"
gamma: 0.1
stepsize: 100000
display: 20
max_iter: 450000
momentum: 0.9
weight_decay: 0.0005
snapshot: 10000
snapshot_prefix: "/work/feimao/imagenet/bvlc_alexnet/caffe_alexnet_train"
solver_mode: GPU

Submit jobs

  • On Copper and Mosaic, export MKL_CBWR=AUTO before submitting jobs. This is already set if you run "source /opt/sharcnet/testing/caffe/caffe-set-env-cudnn4.sh".

Once the solver.prototxt and train_val.prototxt files are ready, we can submit a job using sqsub. The command is

sqsub -q gpu -f threaded -n 2 --gpp=1 --mpp=32g -r 4h -o test-submit-caffe.out $CAFFE_ROOT/build/tools/caffe.bin train \
    --solver=/work/feimao/imagenet/bvlc_alexnet/solver.prototxt

Caffe can use two CPU cores: one for controlling the GPU and one for loading data. The amount of memory (mpp) depends on the model size; 32 GB is recommended as the smallest size to try. "gpp" should always be 1 because Caffe (master branch) supports only a single GPU per process.
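Because snapshot and snapshot_prefix are set in solver.prototxt, a job that reaches its run-time limit can be resumed from the latest solver state using Caffe's --snapshot option. A hedged example, assuming a snapshot was written at iteration 10000:

sqsub -q gpu -f threaded -n 2 --gpp=1 --mpp=32g -r 4h -o resume-caffe.out $CAFFE_ROOT/build/tools/caffe.bin train \
    --solver=/work/feimao/imagenet/bvlc_alexnet/solver.prototxt \
    --snapshot=/work/feimao/imagenet/bvlc_alexnet/caffe_alexnet_train_iter_10000.solverstate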

Building Caffe with cuDNN v4 (CUDA 7.5) on Copper and Mosaic

How to build Caffe

If you'd like to build your own Caffe, you can get the newest Caffe from GitHub (note that you must be on a login node to have internet access):

[feimao@mos-login test-install-caffe]$ git clone https://github.com/BVLC/caffe.git

Then go to mos1 (or cop1) and load the modules:

module unload intel mkl openmpi cuda
module load intel/15.0.3
module load hdf/serial/5.1.8.11 
module load cuda/7.5.18 
module load python/intel/2.7.10

Then export the environment paths to the cuDNN and caffe-libs folders:

export PATH=/opt/sharcnet/testing/caffe/caffe-libs/bin:/opt/sharcnet/testing/caffe/caffe-libs/include:/opt/sharcnet/testing/cudnn/cudnn4:$PATH
export LD_LIBRARY_PATH=/opt/sharcnet/testing/caffe/caffe-libs/lib:/opt/sharcnet/testing/cudnn/cudnn4:$LD_LIBRARY_PATH
export PYTHONPATH=/opt/sharcnet/testing/caffe/caffe-libs/lib/python2.7/site-packages:$PYTHONPATH

Go to the Caffe folder and copy Makefile.config.example to Makefile.config:

[feimao@mos1 caffe]$ cp Makefile.config.example Makefile.config

Then modify the Makefile.config:

  • 1, uncomment
USE_CUDNN := 1
  • 2, uncomment
CUSTOM_CXX := g++
  • 3, change CUDA_DIR to
CUDA_DIR := /opt/sharcnet/cuda/7.5.18
  • 4, If on Copper, add a line "-gencode arch=compute_37,code=sm_37" to CUDA_ARCH
  • 5, change BLAS to
BLAS := mkl
  • 6, uncomment BLAS_INCLUDE, and change to
BLAS_INCLUDE := /opt/sharcnet/intel/15.0.3/mkl/include
  • 7, uncomment BLAS_LIB, and change to
BLAS_LIB := /opt/sharcnet/intel/15.0.3/mkl/lib/intel64
  • 8, change PYTHON_INCLUDE to
PYTHON_INCLUDE := /opt/sharcnet/python/2.7.10/intel/include \
                /opt/sharcnet/python/2.7.10/intel/include/python2.7 \
               /opt/sharcnet/python/2.7.10/intel/lib/python2.7/site-packages/numpy/core/
  • 9, change PYTHON_LIB to
PYTHON_LIB := /opt/sharcnet/python/2.7.10/intel/lib
  • 10, change INCLUDE_DIRS to
INCLUDE_DIRS := $(PYTHON_INCLUDE) /opt/sharcnet/hdf/5.1.8.11/serial/include /opt/sharcnet/testing/caffe/caffe-libs/include /opt/sharcnet/testing/cudnn/cudnn4 /usr/local/include
  • 11, change LIBRARY_DIRS to
LIBRARY_DIRS := $(PYTHON_LIB) /opt/sharcnet/hdf/5.1.8.11/serial/lib /opt/sharcnet/testing/caffe/caffe-libs/lib /opt/sharcnet/testing/cudnn/cudnn4 /usr/local/lib /usr/lib
  • 12, uncomment ALLOW_LMDB_NOLOCK if you need to run multiple jobs on the same data set at the same time.
ALLOW_LMDB_NOLOCK := 1
  • 13, run the command to build Caffe and pycaffe:
make all -j16 && make test -j16 && make pycaffe

How to test the Caffe build

When the "make all" and "make test" finish, you can run "make runtest" to test if you build Caffe properly. You may get errors related to "test_gradient_based_solver.cpp" or other numerical inconsistency between CPU and GPU. Those errors was caused by MKL's Conditional Numerical Reproducibility setting. To avoid these errors, you should set MKL_CBWR:

export MKL_CBWR=AUTO
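For example, from the Caffe source directory on a development node (with the build modules and paths above still loaded), you can then run:

make runtest    # runs the Caffe unit tests; with MKL_CBWR=AUTO the CPU/GPU mismatch errors should disappear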