
Minsky is an IBM S822LC server with two POWER8+ (POWER8 with NVLink) processors, 10 cores per socket and 8-way SMT (Simultaneous MultiThreading) per core, for 160 hardware threads in total. Four NVIDIA Tesla P100 (Pascal) GPUs are connected via NVLink. A local SSD mounted as /scratch provides 700 GB of usable space.

Hardware Information

$ lscpu
Architecture:          ppc64le
Byte Order:            Little Endian
CPU(s):                160
On-line CPU(s) list:   0-159
Thread(s) per core:    8
Core(s) per socket:    10
Socket(s):             2
NUMA node(s):          2
Model:                 1.0 (pvr 004c 0100)
Model name:            POWER8NVL (raw), altivec supported
CPU max MHz:           4023.0000
CPU min MHz:           2061.0000
L1d cache:             64K
L1i cache:             32K
L2 cache:              512K
L3 cache:              8192K
NUMA node0 CPU(s):     0-79
NUMA node1 CPU(s):     80-159

$ nvidia-smi topo -m
	 GPU0 GPU1 GPU2 GPU3 mlx5_0 mlx5_1 mlx5_2 mlx5_3 CPU Affinity
GPU0	 X 	NV2	SOC	SOC	SOC	SOC	SOC	SOC	0-79
GPU1	NV2	 X 	SOC	SOC	SOC	SOC	SOC	SOC	0-79
GPU2	SOC	SOC	 X 	NV2	SOC	SOC	SOC	SOC	80-159
GPU3	SOC	SOC	NV2	 X 	SOC	SOC	SOC	SOC	80-159
mlx5_0	SOC	SOC	SOC	SOC	 X 	PIX	SOC	SOC	
mlx5_1	SOC	SOC	SOC	SOC	PIX	 X 	SOC	SOC	
mlx5_2	SOC	SOC	SOC	SOC	SOC	SOC	 X 	PIX	
mlx5_3	SOC	SOC	SOC	SOC	SOC	SOC	PIX	 X 	

Legend:

  X   = Self
  SOC  = Connection traversing PCIe as well as the SMP link between CPU sockets(e.g. QPI)
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks
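The table shows that GPU0 and GPU1 are bonded by two NVLinks (NV2) and attached to the socket with CPUs 0-79 (NUMA node 0), while GPU2 and GPU3 form the same kind of pair on the socket with CPUs 80-159 (NUMA node 1). A minimal sketch of keeping a job's CPU threads and memory on the socket local to its GPUs, assuming `numactl` is installed on Minsky; the helper name `gpu_numa_node` is hypothetical and the mapping is hard-coded from the table above:

```shell
# Hypothetical helper: map a Minsky GPU index to its local NUMA node,
# hard-coded from the CPU Affinity column of "nvidia-smi topo -m" above.
gpu_numa_node() {
    case "$1" in
        0|1) echo 0 ;;   # GPU0, GPU1 -> CPUs 0-79   (NUMA node 0)
        2|3) echo 1 ;;   # GPU2, GPU3 -> CPUs 80-159 (NUMA node 1)
        *)   echo "gpu_numa_node: unknown GPU index '$1'" >&2; return 1 ;;
    esac
}

# Example (assumes numactl is installed): run on the NVLink pair GPU2+GPU3
# with CPU threads and memory kept on the socket those GPUs are attached to.
#   node=$(gpu_numa_node 2)
#   CUDA_VISIBLE_DEVICES=2,3 numactl --cpunodebind="$node" --membind="$node" program ...
```

For multi-GPU jobs, choosing an NVLink-bonded pair (0,1 or 2,3) keeps GPU-to-GPU traffic off the SOC path between the two sockets.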

Software

IBM Advance Toolchain

The IBM Advance Toolchain for Linux on Power is a set of open source development tools (compiler, debugger and profiling tools) and runtime libraries that let users take advantage of the latest IBM POWER hardware features on Linux. For more information, visit http://ibm.co/AdvanceToolchain.

  • An update release of the IBM Advance Toolchain 10.0 series is installed under /opt/at10.0.
  • This release provides many package updates, including:
GCC 6.2
Glibc 2.24
Binutils 2.27
GDB 7.11
Support for Ubuntu 16.04.
GCC fixes for complex IEEE 128-bit floating point, and support for IEEE 128-bit floating point built-ins.
GCC generates binaries with -mcpu=power8 -mtune=power8 by default on ppc64le.
A Valgrind fix for the missing support of the W bit field in the mtfsfi instruction.
A new Valgrind itrace option to start instruction tracing only when a given function starts.
Signed cross-compiler packages.
OpenSSL fixes for 6 security advisories.
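A minimal sketch of putting the Advance Toolchain ahead of the system compilers in the current shell, assuming its binaries live in /opt/at10.0/bin (the usual Advance Toolchain layout; check on the system). The function `use_advance_toolchain` and the `AT_DIR` override are hypothetical names:

```shell
# Hypothetical helper: prefer the Advance Toolchain 10.0 compilers when present.
# Override AT_DIR if the toolchain is installed somewhere else.
AT_DIR="${AT_DIR:-/opt/at10.0}"

use_advance_toolchain() {
    if [ -x "$AT_DIR/bin/gcc" ]; then
        # Put the AT binaries first so gcc, g++ and gdb resolve to AT 10.0.
        export PATH="$AT_DIR/bin:$PATH"
        return 0
    fi
    echo "use_advance_toolchain: no gcc found under $AT_DIR/bin" >&2
    return 1
}

# Example: AT's GCC already defaults to POWER8 code generation on ppc64le,
# so no extra -mcpu flag is needed.
#   use_advance_toolchain && gcc -O3 -o myprog myprog.c
```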

IBM PowerAI

  • PowerAI release 4.0 is installed and provides software packages for several Deep Learning frameworks, supporting libraries, and tools:
Bazel
Caffe - BVLC, IBM, and NVIDIA variants
Chainer
DIGITS
NCCL
OpenBLAS
TensorFlow
Theano
Torch

All deep learning frameworks are installed in the folder /opt/DL:

caffe-bvlc - Berkeley Vision and Learning Center (BVLC) upstream Caffe, v1.0.0
caffe-ibm - IBM Optimized version of BVLC Caffe, v1.0.0
caffe-nv - NVIDIA fork of Caffe, v0.15.14
chainer - Chainer, v1.23.0
digits - DIGITS, v5.0.0
tensorflow - Google TensorFlow, v1.1.0
ddl-tensorflow - Distributed Deep Learning custom operator for TensorFlow
theano - Theano, v0.9.0
torch - Torch, v7

Logging in to the system

Minsky is currently being incorporated into the cloud. Updated access instructions will be posted here when that is done.

Getting started with IBM PowerAI MLDL Frameworks

Before running any GPU job, check which GPUs are available with:

nvidia-smi

Then choose an idle GPU and restrict your program to it by setting CUDA_VISIBLE_DEVICES in front of the command. For example:

CUDA_VISIBLE_DEVICES=<gpu_ids> program ...
CUDA_VISIBLE_DEVICES=3 program ...      (single-GPU job)
CUDA_VISIBLE_DEVICES=0,1 program ...    (multi-GPU job)
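Instead of reading the full nvidia-smi table by eye, its query mode can print per-GPU usage as CSV. A minimal sketch that picks the first idle GPU, assuming "idle" means 0% utilization and less than 100 MiB of memory in use; both the threshold and the helper name `pick_idle_gpu` are hypothetical choices, and the result is only a snapshot, since another user may start a job right after the check:

```shell
# Hypothetical helper: read "index, memory.used, utilization.gpu" CSV lines
# (nvidia-smi with --format=csv,noheader,nounits) and print the first idle GPU.
pick_idle_gpu() {
    local max_mem_mib="${1:-100}"    # memory use below this (MiB) counts as idle
    local idx mem util
    while IFS=', ' read -r idx mem util; do
        if [ "$util" -eq 0 ] && [ "$mem" -lt "$max_mem_mib" ]; then
            echo "$idx"
            return 0
        fi
    done
    return 1                         # no idle GPU found
}
```

Usage: `gpu=$(nvidia-smi --query-gpu=index,memory.used,utilization.gpu --format=csv,noheader,nounits | pick_idle_gpu)`, then `CUDA_VISIBLE_DEVICES=$gpu program ...`.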
  • For the best I/O performance, copy your input data to the local SSD at /scratch, but keep writing outputs to /home or /work.
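A minimal sketch of that workflow, assuming a per-user directory under /scratch; the helper `stage_to_scratch`, the directory `/scratch/$USER`, and the `--input`/`--output` program options are hypothetical:

```shell
# Hypothetical helper: copy an input directory onto the local SSD and print
# the path of the staged copy. The default target /scratch/$USER is an assumption.
stage_to_scratch() {
    local src="${1%/}"                       # drop one trailing slash, if any
    local scratch_root="${2:-/scratch/$USER}"
    mkdir -p "$scratch_root" || return 1
    # -a preserves permissions and timestamps; the copy lands in scratch_root/<name>.
    cp -a "$src" "$scratch_root/" || return 1
    echo "$scratch_root/$(basename "$src")"
}

# Example: read from the SSD copy, write results to /work.
#   data=$(stage_to_scratch "$HOME/datasets/mydata")
#   CUDA_VISIBLE_DEVICES=0 program --input "$data" --output "/work/$USER/results"
```

Because /scratch is local to Minsky, remove the staged copy when it is no longer needed so the 700 GB stays available to other users.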

General setup

Each framework package provides a shell script to simplify environment setup. We recommend that users update their shell rc file (e.g. .bashrc) to source the desired setup scripts. For example:

source /opt/DL/<framework>/bin/<framework>-activate
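A minimal sketch of a guarded ~/.bashrc entry that follows the path pattern above but does not break the login if a package is missing; `dl_activate` and the `DL_ROOT` override are hypothetical names. The script name is a parameter because the Caffe packages use caffe-activate (e.g. /opt/DL/caffe-bvlc/bin/caffe-activate) rather than <package>-activate:

```shell
# Hypothetical ~/.bashrc helper: source a PowerAI activate script only if it exists.
DL_ROOT="${DL_ROOT:-/opt/DL}"

dl_activate() {
    local pkg="$1"
    local name="${2:-$1-activate}"       # script name defaults to <package>-activate
    local script="$DL_ROOT/$pkg/bin/$name"
    if [ ! -r "$script" ]; then
        echo "dl_activate: $script not found" >&2
        return 1
    fi
    . "$script"                          # same as: source "$script"
}

# Example ~/.bashrc lines:
#   dl_activate tensorflow
#   dl_activate caffe-bvlc caffe-activate
```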

Caffe

Packages are provided for upstream BVLC Caffe (/opt/DL/caffe-bvlc), IBM optimized BVLC Caffe (/opt/DL/caffe-ibm), and NVIDIA's Caffe (/opt/DL/caffe-nv). The system default Caffe (/opt/DL/caffe) is the IBM optimized variant. To activate the system default Caffe:

source /opt/DL/caffe/bin/caffe-activate

Or, to activate a specific variant (BVLC Caffe in this example):

source /opt/DL/caffe-bvlc/bin/caffe-activate
  • Attempting to activate multiple Caffe packages in a single login session will cause unpredictable behavior.
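A minimal sketch of a guard against that situation, assuming every variant's script is at /opt/DL/<variant>/bin/caffe-activate as in the examples above. `activate_caffe` and the `CAFFE_VARIANT` variable are hypothetical; because only this helper sets `CAFFE_VARIANT`, it cannot detect a Caffe that was activated by sourcing caffe-activate directly:

```shell
# Hypothetical guard: allow only one Caffe variant per login session.
activate_caffe() {
    local variant="${1:-caffe}"          # caffe, caffe-bvlc, caffe-ibm or caffe-nv
    local script="${DL_ROOT:-/opt/DL}/$variant/bin/caffe-activate"
    if [ -n "${CAFFE_VARIANT:-}" ]; then
        [ "$CAFFE_VARIANT" = "$variant" ] && return 0    # already active
        echo "activate_caffe: $CAFFE_VARIANT is already active; log in again to use $variant" >&2
        return 1
    fi
    if [ ! -r "$script" ]; then
        echo "activate_caffe: $script not found" >&2
        return 1
    fi
    . "$script"
    export CAFFE_VARIANT="$variant"
}
```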

Once Caffe is activated, you can run it directly:

CUDA_VISIBLE_DEVICES=<gpu_id> caffe train --solver=...
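Inside the process, the GPUs listed in CUDA_VISIBLE_DEVICES are renumbered from 0, so Caffe's `--gpu` option must use the renumbered IDs, not the physical ones (Caffe also accepts `--gpu=all` for every visible device). A minimal sketch, with the helper `visible_gpu_ids` and the file name `solver.prototxt` as hypothetical placeholders:

```shell
# Hypothetical helper: turn a physical GPU list such as "2,3" into the
# renumbered IDs the process sees ("0,1").
visible_gpu_ids() {
    local ids="" i=0 old_ifs="$IFS"
    IFS=','
    set -- $1                 # split "2,3" into positional parameters "2" "3"
    IFS="$old_ifs"
    while [ "$i" -lt "$#" ]; do
        ids="${ids:+$ids,}$i"
        i=$((i + 1))
    done
    echo "$ids"
}

# Example: train on the NVLink pair GPU2+GPU3.
#   gpus=2,3
#   CUDA_VISIBLE_DEVICES=$gpus caffe train --solver=solver.prototxt --gpu="$(visible_gpu_ids "$gpus")"
```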

TensorFlow

To activate TensorFlow, run:

source /opt/DL/tensorflow/bin/tensorflow-activate

Then run your TensorFlow Python script:

CUDA_VISIBLE_DEVICES=<gpu_ids> python code.py

Torch

To activate Torch, run:

source /opt/DL/torch/bin/torch-activate

Then run your Lua code with th:

CUDA_VISIBLE_DEVICES=<gpu_ids> th code.lua

Theano

To activate Theano, run:

source /opt/DL/theano/bin/theano-activate

Then run your Theano Python script:

CUDA_VISIBLE_DEVICES=<gpu_ids> python code.py

DIGITS

To activate DIGITS, run:

source /opt/DL/digits/bin/digits-activate-sn

digits-activate-sn is a SHARCNET modification of the activate script: it stores DIGITS jobs under /work instead of /home, because /work has much more space.

To start the DIGITS server on the default port (5000):

CUDA_VISIBLE_DEVICES=<gpu_ids> digits-devserver

To start the DIGITS server on a specific port:

CUDA_VISIBLE_DEVICES=<gpu_ids> digits-devserver -p <port_num>
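On a shared machine, port 5000 may already be taken by another user's DIGITS server. A minimal sketch of picking the first free port at or above a starting value, assuming `ss` (from iproute2) is available on Minsky and prints the local address:port in its fourth column; `first_free_port` is a hypothetical helper that reads the `ss -ltn` listing on stdin, and the check is only a snapshot, so another process could still claim the port before DIGITS starts:

```shell
# Hypothetical helper: print the first TCP port >= $1 (default 5000) that does
# not appear as a listening port in "ss -ltn" output read from stdin.
first_free_port() {
    local port="${1:-5000}" used
    # Column 4 is "Local Address:Port"; keep the text after the last ':'.
    used=$(awk 'NR > 1 { n = split($4, a, ":"); print a[n] }')
    while printf '%s\n' "$used" | grep -qx "$port"; do
        port=$((port + 1))
    done
    echo "$port"
}

# Example:
#   port=$(ss -ltn | first_free_port 5000)
#   CUDA_VISIBLE_DEVICES=<gpu_ids> digits-devserver -p "$port"
```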

To use DIGITS, log in to any SHARCNET cluster in a separate session with X11 forwarding enabled (add -Y to the ssh command, e.g. ssh -Y username@copper.sharcnet.ca). You also need a web browser on the SHARCNET machine: either download Firefox from https://www.mozilla.org/en-US/firefox/all/ (choose the Linux 64-bit version), or copy /work/feimao/software_installs_old/firefox to your own folder. Then go to the firefox folder, run ./firefox, and open:

http://minsky.uwo.sharcnet:5000 (or the port you specified)