Revision as of 14:36, 17 May 2019

This page is mostly outdated, as newer and more powerful Pascal GPUs are now available across Compute Canada (e.g., on Graham and Cedar).

Availability

SHARCNET

Mosaic, 20 nodes, 1 K20 GPU per node

Copper, 8 nodes, 4 K80 cards (8 GPU devices) per node

Calcul Quebec

The Guillimin cluster (http://www.calculquebec.ca/en/resources/compute-servers/guillimin) has 100 K20 GPUs (and also 100 Xeon Phi cards).

Benchmarking

The following K20c timings are expressed in monk (M2075 GPU) timing units.

Code name                     | Precision [1] | K20c, 1 thread | K20c, 8 threads [2] | Hyper-Q speedup [3] | K20c/serial, 1 thread [4] | K20c/serial, 8 threads [5]
Random Lens Design, low res   | DP            | 0.60           | 1.79                | 3.00                | 58                        | 21.5
Random Lens Design, high res  | DP            | 0.89           | 1.48                | 1.66                | 272                       | 57
  1. SP/DP/MP: single/double/mixed precision.
  2. Per-thread timing: wall-clock time divided by the number of threads.
  3. Hyper-Q speedup: the ratio of the K20c 8-thread timing to the K20c 1-thread timing.
  4. Speedup of 1 CPU thread + K20c vs. the serial code run on 1 CPU thread; measured on arc09. Be aware that an arc09 CPU core is 1.75x slower than an orca core, so to compare the K20c against orca, divide this number by 1.75.
  5. Speedup of 8 CPU threads + K20c (GPU farm) vs. the serial code run on 8 CPU threads (serial farm); measured on arc09. As above, divide this number by 1.75 to compare against orca.
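The ratios in footnotes 3-5 can be checked directly from the table. The sketch below (illustrative only) reproduces them for the low-resolution row; the 1.75 factor is the arc09-to-orca per-core slowdown stated above:

```python
# Derived quantities for the low-resolution Random Lens Design row.

k20c_1t = 0.60   # K20c timing, 1 CPU thread (monk M2075 timing units)
k20c_8t = 1.79   # K20c per-thread timing, 8 CPU threads sharing the GPU

# Hyper-Q speedup (footnote 3): ratio of the 8-thread timing
# to the 1-thread timing.
hyperq_speedup = k20c_8t / k20c_1t
print(f"Hyper-Q speedup: {hyperq_speedup:.2f}")   # 2.98 (3.00 in the table, from unrounded inputs)

# Footnote 4 gives a 58x speedup over serial code on arc09; an arc09
# core is 1.75x slower than an orca core, so relative to orca:
speedup_vs_orca = 58 / 1.75
print(f"Speedup vs. serial on orca: {speedup_vs_orca:.1f}")   # 33.1
```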

Benchmarked codes

  • Random Lens Design. Written by Sergey Mashchenko. Discovery of new complex lens designs by means of global optimization (a search for the global minimum of the merit function in a 40-100 dimensional space). Monte Carlo-type search (well suited for serial/GPU farming). Purely double precision (needed for the smoothness of the merit function). Both serial and CUDA (compute capability 2.x) versions exist. The merit function is computed from the results of ray tracing through the optical system; each ray is handled by a separate CUDA thread. Low-resolution runs (~10,000 rays/threads) are used for the initial search; high-resolution runs (~100,000 rays/threads) are used to fine-tune the best candidates. The CUDA code has more than 10 kernels and a few device functions. Device memory consumption is ~200 MB at low resolution and ~400 MB at high resolution. At low resolution, the occupancy is fairly low (~0.20).
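The overall structure described above (random candidate designs in an N-dimensional space, with a merit function accumulated over many independent rays) can be sketched schematically. This is NOT the author's CUDA code; it is a toy Python illustration in which the per-ray loop stands in for the one-CUDA-thread-per-ray kernel, and the merit function is an invented placeholder:

```python
# Schematic sketch of a Monte Carlo global-optimization search whose
# merit function is accumulated over many independent "rays".
# All names and the toy merit function are illustrative assumptions.
import random

NDIM = 40        # dimensionality of the design space (40-100 in the text)
NRAYS = 1_000    # scaled down from the ~10,000 rays of a low-resolution run

def trace_ray(design, ray_id):
    """Toy stand-in for tracing one ray through the lens system.
    In the CUDA version, each ray is handled by a separate thread."""
    rng = random.Random(ray_id)
    return sum(x * rng.uniform(-1.0, 1.0) for x in design) ** 2

def merit(design):
    """Merit function to be minimized: an aggregate over all rays."""
    return sum(trace_ray(design, i) for i in range(NRAYS)) / NRAYS

# Monte Carlo search: evaluate independent random candidates and keep
# the best one (this independence is what makes serial/GPU farming easy).
best_design, best_merit = None, float("inf")
for trial in range(20):
    design = [random.uniform(-1.0, 1.0) for _ in range(NDIM)]
    m = merit(design)
    if m < best_merit:
        best_design, best_merit = design, m
```

In the real code, the best low-resolution candidates would then be re-evaluated at high resolution (~100,000 rays) for fine-tuning.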