
TensorFlow is an open source software library for machine intelligence.

Using TensorFlow on the Graham cluster

Please refer to the TensorFlow CC Doc (Compute Canada documentation).

Installing and running precompiled TensorFlow v1.5 with Python 2.7.10 on Copper/Mosaic

To install TensorFlow, stay on the login node and source the "tensorflow-1.5-cp27-active" script, which sets all required modules and environment paths:

source /opt/sharcnet/testing/tensorflow/tensorflow-1.5-cp27-active

Alternatively, set everything up manually, one step at a time:

module unload cuda intel mkl openmpi hdf python
module load cuda/7.5.18
module load intel/15.0.3 
module load openmpi/intel1503-std/1.8.7
module load hdf/serial/5.1.8.11 
module load python/intel/2.7.10
module unload cuda
module load cuda/8.0.61
export LD_LIBRARY_PATH=/opt/sharcnet/testing/cudnn/cudnn7:$LD_LIBRARY_PATH
export MKL_CBWR=AUTO
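
Before installing, the environment can be sanity-checked from Python (the version and cuDNN path below are the ones set up above):

import os
import sys

# The python/intel/2.7.10 module should be active.
print(sys.version)  # should report 2.7.10

# The cuDNN directory exported above should be on LD_LIBRARY_PATH.
print('/opt/sharcnet/testing/cudnn/cudnn7' in
      os.environ.get('LD_LIBRARY_PATH', ''))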

Then use pip to install TensorFlow:

For the GPU build of TensorFlow (with cuDNN support):
pip install /opt/sharcnet/testing/tensorflow/gpu/tensorflow-1.5.0-cp27-cp27m-linux_x86_64.whl --user

For the CPU-only build of TensorFlow (with Intel MKL-DNN support):
pip install /opt/sharcnet/testing/tensorflow/cpu/tensorflow-1.5.0-cp27-cp27m-linux_x86_64.whl --user
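
Either way, the install can be verified with a minimal TensorFlow 1.x session; it should print the version of the wheel and run a trivial graph (on a GPU node, adding log_device_placement=True to the session config also shows the device mapping):

import tensorflow as tf

# Should report 1.5.0 for the wheels above.
print(tf.__version__)

# Build and run a trivial graph to confirm the runtime loads.
hello = tf.constant('Hello from TensorFlow')
with tf.Session() as sess:
    print(sess.run(hello))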

Note that the development node and GPU compute nodes do not have internet access, so some TensorFlow examples will fail with a "network unreachable" error. To work around this, download the data in advance and point the example at the data folder manually. For example:

python cifar10_train.py --train_dir=/somewhere/cifar10_train --data_dir=/somewhere/cifar10_data
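
Since only the login node has internet access, data can be fetched there ahead of time. A minimal sketch using the Python 2 standard library (the URL is the standard CIFAR-10 binary archive; the target directory is a hypothetical example):

import os
import urllib

DATA_DIR = os.path.expanduser('~/cifar10_data')  # hypothetical location
URL = 'https://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz'

# Download once on the login node; compute nodes then read the archive
# from the shared filesystem via --data_dir.
if not os.path.isdir(DATA_DIR):
    os.makedirs(DATA_DIR)
filename = os.path.join(DATA_DIR, URL.split('/')[-1])
if not os.path.exists(filename):
    urllib.urlretrieve(URL, filename)  # urllib.request.urlretrieve in Python 3
print('Dataset archive at %s' % filename)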

Submitting Jobs

  • Run "source /opt/sharcnet/testing/tensorflow/tensorflow-1.5-cp27-active" before submitting jobs.

TensorFlow can take advantage of multiple CPU cores and multiple GPUs in a node (if the code supports it). Decide how many CPU cores and GPUs a job needs before submitting; a sketch showing how to match TensorFlow's thread pools to the request follows the examples below.

  • For CPU-only jobs, request a whole node. Copper has 24 cores per CPU node while Mosaic has 20:
sqsub -q threaded -n 24 --mpp=62g ... (Copper)
sqsub -q threaded -n 20 --mpp=250g ... (Mosaic)
  • On Mosaic there is only one GPU per node, so request 4 CPU cores with 32 GB of memory (increase if needed):
sqsub -q gpu -f threaded -n 4 --gpp=1 --mpp=32g -r <run_time>  -o output.txt python code.py
  • On Copper, which has 8 GPUs per node, we recommend requesting CPU cores and memory in proportion to the number of GPUs needed:
sqsub -q gpu -f threaded -n 16 --gpp=8 --mpp=92g ... (for 8 GPUs)
sqsub -q gpu -f threaded -n 8 --gpp=4 --mpp=46g  ... (for 4 GPUs)
sqsub -q gpu -f threaded -n 4 --gpp=2 --mpp=23g  ... (for 2 GPUs)
sqsub -q gpu -f threaded -n 2 --gpp=1 --mpp=11.5g ...  (for 1 GPU)
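
As noted above, TensorFlow does not know how many cores the scheduler actually granted, so it is worth pinning its thread pools to the allocation. A minimal sketch using the TensorFlow 1.x session config (NUM_CORES is a placeholder; set it to the -n value passed to sqsub):

import tensorflow as tf

NUM_CORES = 4  # placeholder; match the sqsub -n request

# Limit both thread pools to the cores granted by the scheduler.
config = tf.ConfigProto(
    intra_op_parallelism_threads=NUM_CORES,  # threads within a single op
    inter_op_parallelism_threads=NUM_CORES,  # threads across independent ops
)
with tf.Session(config=config) as sess:
    a = tf.random_normal([1000, 1000])
    b = tf.random_normal([1000, 1000])
    print(sess.run(tf.reduce_sum(tf.matmul(a, b))))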

Building TensorFlow (v1.5) from source on Copper/Mosaic (CentOS 6)

  • We do not recommend building TensorFlow from source unless you are doing TensorFlow development work. If you only need to run TensorFlow, follow the instructions above.

System/software requirements

TensorFlow must be built with internet access, so stay on the Copper/Mosaic login node. Ask the system administrators (by submitting a ticket or emailing help@sharcnet.ca, indicating that you are building TensorFlow) to raise the virtual memory limit for running Java on the login nodes.

  • SHARCNET module settings:
module purge
module load intel/15.0.3
module load cuda/7.5.18
module load openmpi/intel1503-std/1.8.7
module load hdf/serial/5.1.8.11
module load python/intel/2.7.10
module unload intel/15.0.3
module load gcc/4.9.2
module load binutils/2.25.1
module unload cuda
module load cuda/8.0.61
  • Python >= 2.7 with NumPy, wheel, and pip

The wheel package is not included in SHARCNET's python/intel/2.7.10 module and must be installed separately.

To install wheel:

pip install wheel --user


  • Java >= 1.8

The system default Java is 1.7. To set up Java 1.8, run:

/home/edward/public/scripts/java8setup.bash

Then check the Java folder name under ~/package and export it as JAVA_HOME:

export JAVA_HOME=~/package/jdk1.8.*_*

You may need to prepend the new Java location to your PATH:

export PATH=$JAVA_HOME/bin:$PATH

Then verify the java location and version:

[feimao@cop-login ~]$ which java
~/package/jdk1.8.0_66/bin/java
[feimao@cop-login ~]$ java -version
java version "1.8.0_66"
Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
  • CUDA >= 8.0

The cuda/8.0.61 module should already be loaded (see the module settings above).

  • cuDNN >= v6

These instructions use cuDNN v7.0.5.

  • Binutils

The binutils/2.25.1 module should already be loaded.

  • Bazel

Bazel is installed under

/opt/sharcnet/testing/bazel-0.9.0

To build a new Bazel yourself, run "export LD_LIBRARY_PATH=/opt/sharcnet/gcc/4.9.2/lib64:$LD_LIBRARY_PATH" and remove every occurrence of "-B/usr/bin" from the "tools/cpp/unix_cc_configure.bzl" file before building. If "-B/usr/bin" remains, gcc is forced to use the outdated system assembler, and the TensorFlow build later fails with "Error: no such instruction: `shrx %rdx,%rax,%rax'".

Building TensorFlow v1.5

  • Get TensorFlow from GitHub:
git clone --recurse-submodules https://github.com/tensorflow/tensorflow
cd tensorflow
git checkout r1.5
  • Set the PATH/LD_LIBRARY_PATH environment variables:
export PATH=~/.local/bin/:/opt/sharcnet/testing/bazel-0.9.0:$PATH
export LD_LIBRARY_PATH=/opt/sharcnet/gcc/4.9.2/lib64:$LD_LIBRARY_PATH
  • Modify "third_party/gpus/crosstool/CROSSTOOL_nvcc.tpl"

Locate the section containing toolchain_identifier: "local_linux" and change it FROM:

 tool_path { name: "ar" path: "/usr/bin/ar" }
  tool_path { name: "compat-ld" path: "/usr/bin/ld" }
  tool_path { name: "cpp" path: "/usr/bin/cpp" }
  tool_path { name: "dwp" path: "/usr/bin/dwp" }
  # As part of the TensorFlow release, we place some cuda-related compilation
  # files in @local_config_cuda//crosstool/clang/bin, and this relative
  # path, combined with the rest of our Bazel configuration causes our
  # compilation to use those files.
  tool_path { name: "gcc" path: "clang/bin/crosstool_wrapper_driver_is_not_gcc" }
  # Use "-std=c++11" for nvcc. For consistency, force both the host compiler
  # and the device compiler to use "-std=c++11".
  cxx_flag: "-std=c++11"
  linker_flag: "-Wl,-no-as-needed"
  linker_flag: "-lstdc++"
  linker_flag: "-B/usr/bin/"

%{host_compiler_includes}
  tool_path { name: "gcov" path: "/usr/bin/gcov" }

  # C(++) compiles invoke the compiler (as that is the one knowing where
  # to find libraries), but we provide LD so other rules can invoke the linker.
  tool_path { name: "ld" path: "/usr/bin/ld" }

  tool_path { name: "nm" path: "/usr/bin/nm" }
  tool_path { name: "objcopy" path: "/usr/bin/objcopy" }
  objcopy_embed_flag: "-I"
  objcopy_embed_flag: "binary"
  tool_path { name: "objdump" path: "/usr/bin/objdump" }
  tool_path { name: "strip" path: "/usr/bin/strip" }

TO

tool_path { name: "ar" path: "/opt/sharcnet/binutils/2.25.1/bin/ar" }
  tool_path { name: "compat-ld" path: "/opt/sharcnet/binutils/2.25.1/bin/ld" }
  tool_path { name: "cpp" path: "/opt/sharcnet/gcc/4.9.2/bin/cpp" }
  tool_path { name: "dwp" path: "/usr/bin/dwp" }
  # As part of the TensorFlow release, we place some cuda-related compilation
  # files in @local_config_cuda//crosstool/clang/bin, and this relative
  # path, combined with the rest of our Bazel configuration causes our
  # compilation to use those files.
  tool_path { name: "gcc" path: "clang/bin/crosstool_wrapper_driver_is_not_gcc" }
  # Use "-std=c++11" for nvcc. For consistency, force both the host compiler
  # and the device compiler to use "-std=c++11".
  cxx_flag: "-std=c++11"
  linker_flag: "-Wl,-no-as-needed"
  linker_flag: "-lstdc++"
  linker_flag: "-B/opt/sharcnet/binutils/2.25.1/bin/"
  linker_flag: "-B/opt/sharcnet/gcc/4.9.2/bin/"
  linker_flag: "-Wl,-rpath=/opt/sharcnet/gcc/4.9.2/lib64"
  linker_flag: "-Wl,-rpath=/opt/sharcnet/cuda/8.0.61/lib64"
  linker_flag: "-Wl,-rpath=/opt/sharcnet/testing/cudnn/cudnn7"

%{host_compiler_includes}
  cxx_builtin_include_directory: "/usr/lib/gcc/"
  cxx_builtin_include_directory: "/usr/local/include"
  cxx_builtin_include_directory: "/usr/include"
  cxx_builtin_include_directory: "/opt/sharcnet/gcc/4.9.2/include/c++/4.9.2"
  cxx_builtin_include_directory: "/opt/sharcnet/gcc/4.9.2/lib/gcc/x86_64-unknown-linux-gnu/4.9.2/include"
  cxx_builtin_include_directory: "/opt/sharcnet/gcc/4.9.2/lib/gcc/x86_64-unknown-linux-gnu/4.9.2/include-fixed/"
  tool_path { name: "gcov" path: "/opt/sharcnet/gcc/4.9.2/bin/gcov" }

  # C(++) compiles invoke the compiler (as that is the one knowing where
  # to find libraries), but we provide LD so other rules can invoke the linker.
  tool_path { name: "ld" path: "/opt/sharcnet/binutils/2.25.1/bin/ld" }

  tool_path { name: "nm" path: "/opt/sharcnet/binutils/2.25.1/bin/nm" }
  tool_path { name: "objcopy" path: "/opt/sharcnet/binutils/2.25.1/bin/objcopy" }
  objcopy_embed_flag: "-I"
  objcopy_embed_flag: "binary"
  tool_path { name: "objdump" path: "/opt/sharcnet/binutils/2.25.1/bin/objdump" }
  tool_path { name: "strip" path: "/opt/sharcnet/binutils/2.25.1/bin/strip" }
  • Modify "third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc.tpl"

FROM:

# Template values set by cuda_autoconf.
CPU_COMPILER = ('%{cpu_compiler}')
GCC_HOST_COMPILER_PATH = ('%{gcc_host_compiler_path}')

NVCC_PATH = '%{nvcc_path}'
PREFIX_DIR = os.path.dirname(GCC_HOST_COMPILER_PATH)
NVCC_VERSION = '%{cuda_version}'

TO:

# Template values set by cuda_autoconf.
CPU_COMPILER = ('/opt/sharcnet/gcc/4.9.2/bin/gcc')
GCC_HOST_COMPILER_PATH = ('/opt/sharcnet/gcc/4.9.2/bin/gcc')

NVCC_PATH = '%{nvcc_path}'
PREFIX_DIR = os.path.dirname(GCC_HOST_COMPILER_PATH)
NVCC_VERSION = '%{cuda_version}'
LLVM_HOST_COMPILER_PATH = ('/opt/sharcnet/gcc/4.9.2/bin/gcc')
AS_PATH = ('/opt/sharcnet/binutils/2.25.1/bin/as')
PREFIX_DIR = os.path.dirname(AS_PATH)

Here we define the path to "as" so that gcc can find it. Otherwise, if "as" is not found alongside GCC_HOST_COMPILER_PATH, the build fails with "gcc: error trying to exec 'as': execvp: No such file or directory".
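
For illustration only (this is not the wrapper's exact code): gcc resolves "as" through the PATH it is launched with, so deriving PREFIX_DIR from AS_PATH and prepending it is what makes the newer binutils assembler visible. A minimal sketch of the mechanism, assuming a trivial hello.c exists in the current directory:

import os

# Hypothetical sketch: gcc looks up 'as' on PATH, so prepending the
# directory of the desired assembler controls which one gets used.
AS_PATH = '/opt/sharcnet/binutils/2.25.1/bin/as'
PREFIX_DIR = os.path.dirname(AS_PATH)

cmd = '/opt/sharcnet/gcc/4.9.2/bin/gcc -c hello.c -o hello.o'
os.system('PATH=' + PREFIX_DIR + ':$PATH ' + cmd)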

  • Modify "tensorflow/tensorflow.bzl" using patch below:
--- tensorflow/tensorflow.bzl	2018-01-25 22:22:10.000000000 +0000
+++ tensorflow/tensorflow.bzl.new	2018-02-02 21:51:15.039443161 +0000
@@ -264,7 +264,7 @@
     name,
     srcs=[],
     deps=[],
-    linkopts=[],
+    linkopts=['-lrt'],
     framework_so=tf_binary_additional_srcs(),
     **kwargs):
   native.cc_binary(
@@ -1264,7 +1264,7 @@
 )
 
 def tf_extension_linkopts():
-  return []  # No extension link opts
+  return ["-lrt"]  # No extension link opts
 
 def tf_extension_copts():
   return []  # No extension c opts
  • Also add "use_default_shell_env=True," under "mnemonic="PythonSwig"," (around line 483).

Run ./configure inside the tensorflow folder:


GPU build configuration

[feimao@cop-login tensorflow]$ ./configure
INFO: $TEST_TMPDIR defined: output root default is '/dev/shm/feimao/bazel' and max_idle_secs default is '15'.
Extracting Bazel installation...
You have bazel 0.9.0- (@non-git) installed.
Please specify the location of python. [Default is /opt/sharcnet/python/2.7.10/intel/bin/python]: 


Found possible Python library paths:
  /opt/sharcnet/python/2.7.10/intel/lib
  /opt/sharcnet/python/2.7.10/intel/lib/python2.7/site-packages
Please input the desired Python library path to use.  Default is [/opt/sharcnet/python/2.7.10/intel/lib]

Do you wish to build TensorFlow with jemalloc as malloc support? [Y/n]: n
No jemalloc as malloc support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Google Cloud Platform support? [Y/n]: n
No Google Cloud Platform support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Hadoop File System support? [Y/n]: n
No Hadoop File System support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Amazon S3 File System support? [Y/n]: n
No Amazon S3 File System support will be enabled for TensorFlow.

Do you wish to build TensorFlow with XLA JIT support? [y/N]: n
No XLA JIT support will be enabled for TensorFlow.

Do you wish to build TensorFlow with GDR support? [y/N]: n
No GDR support will be enabled for TensorFlow.

Do you wish to build TensorFlow with VERBS support? [y/N]: n
No VERBS support will be enabled for TensorFlow.

Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: n
No OpenCL SYCL support will be enabled for TensorFlow.

Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.

Please specify the CUDA SDK version you want to use, e.g. 7.0. [Leave empty to default to CUDA 9.0]: 8.0.61


Please specify the location where CUDA 8.0.61 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: /opt/sharcnet/cuda/8.0.61


Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7.0]: 7.0.5


Please specify the location where cuDNN 7.0.5 library is installed. Refer to README.md for more details. [Default is /opt/sharcnet/cuda/8.0.61]:/opt/sharcnet/testing/cudnn/cudnn7


Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 3.5,5.2]3.5,3.7


Do you want to use clang as CUDA compiler? [y/N]: n
nvcc will be used as CUDA compiler.

Please specify which gcc should be used by nvcc as the host compiler. [Default is /opt/sharcnet/gcc/4.9.2/bin/gcc]: 


Do you wish to build TensorFlow with MPI support? [y/N]: n
No MPI support will be enabled for TensorFlow.

Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]: 


Add "--config=mkl" to your bazel command to build with MKL support.
Please note that MKL on MacOS or windows is still not supported.
If you would like to use a local MKL instead of downloading, please set the environment variable "TF_MKL_ROOT" every time before build.

Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]: n
Not configuring the WORKSPACE for Android builds.

Configuration finished
  • Build TensorFlow with Bazel (too many parallel jobs can hit the process limit on the login node, so keep --jobs below 8):
bazel build --jobs=2 --config=opt --config=cuda --verbose_failures //tensorflow/tools/pip_package:build_pip_package
  • Build a wheel installation package:
bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/tensorflow_pkg
  • Use pip to install the .whl:
pip install ~/tensorflow_pkg/tensorflow-1.5.0-cp27-cp27m-linux_x86_64.whl --user
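
The GPU build can then be verified on a GPU node (the login node has no usable GPU), for example by listing the devices TensorFlow sees:

# Run on a GPU node; a working CUDA build lists at least one
# device with device_type 'GPU'.
from tensorflow.python.client import device_lib

for dev in device_lib.list_local_devices():
    print('%s  %s' % (dev.name, dev.device_type))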

CPU build configuration

[feimao@cop32 tensorflow]$ ./configure 
WARNING: ignoring http_proxy in environment.
WARNING: Output base '/home/feimao/.cache/bazel/_bazel_feimao/f08e0fafb8438193232e9a68f6687d09' is on NFS. This may lead to surprising failures and undetermined behavior.
You have bazel 0.9.0- (@non-git) installed.
Please specify the location of python. [Default is /opt/sharcnet/python/2.7.10/intel/bin/python]: 


Found possible Python library paths:
  /opt/sharcnet/python/2.7.10/intel/lib
  /opt/sharcnet/python/2.7.10/intel/lib/python2.7/site-packages
Please input the desired Python library path to use.  Default is [/opt/sharcnet/python/2.7.10/intel/lib]

Do you wish to build TensorFlow with jemalloc as malloc support? [Y/n]: n
No jemalloc as malloc support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Google Cloud Platform support? [Y/n]: n
No Google Cloud Platform support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Hadoop File System support? [Y/n]: n
No Hadoop File System support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Amazon S3 File System support? [Y/n]: n
No Amazon S3 File System support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Apache Kafka Platform support? [y/N]: n
No Apache Kafka Platform support will be enabled for TensorFlow.

Do you wish to build TensorFlow with XLA JIT support? [y/N]: n
No XLA JIT support will be enabled for TensorFlow.

Do you wish to build TensorFlow with GDR support? [y/N]: n
No GDR support will be enabled for TensorFlow.

Do you wish to build TensorFlow with VERBS support? [y/N]: n
No VERBS support will be enabled for TensorFlow.

Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: n
No OpenCL SYCL support will be enabled for TensorFlow.

Do you wish to build TensorFlow with CUDA support? [y/N]: n
No CUDA support will be enabled for TensorFlow.

Do you wish to build TensorFlow with MPI support? [y/N]: n
No MPI support will be enabled for TensorFlow.

Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]:


Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]: n
Not configuring the WORKSPACE for Android builds.

Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See tools/bazel.rc for more details.
	--config=mkl         	# Build with MKL support.
	--config=monolithic  	# Config for mostly static monolithic build.
Configuration finished
  • Build TensorFlow with Bazel (too many parallel jobs can hit the process limit on the login node, so keep --jobs below 8):
bazel build --jobs=2 --config=opt --config=mkl --verbose_failures //tensorflow/tools/pip_package:build_pip_package
  • Build a wheel installation package:
bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/tensorflow_pkg
  • Use pip to install the .whl:
pip install ~/tensorflow_pkg/tensorflow-1.5.0-cp27-cp27m-linux_x86_64.whl --user
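
As a rough sanity check that the MKL build imports and performs reasonably, a large matmul can be timed (this is only a sketch; absolute numbers depend on the node and core count):

import time
import tensorflow as tf

# Rough sanity benchmark for the CPU/MKL build: time a large matmul.
a = tf.random_normal([2000, 2000])
b = tf.random_normal([2000, 2000])
c = tf.matmul(a, b)
with tf.Session() as sess:
    sess.run(c.op)  # warm-up run (graph setup, thread pools)
    start = time.time()
    for _ in range(10):
        sess.run(c.op)
    print('average matmul time: %.3f s' % ((time.time() - start) / 10))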