Revision as of 14:36, 17 May 2019 by Ppomorsk (Talk | contribs) (SciNet: obsolete)

This page is mostly outdated, as newer and more powerful Pascal GPUs are now available across Compute Canada (on Graham, Cedar, etc.).

Availability

SHARCNET

Mosaic, 20 nodes, 1 K20 GPU per node

Copper, 8 nodes, 4 K80 cards (8 GPU devices) per node

Calcul Quebec

The Guillimin cluster (http://www.calculquebec.ca/en/resources/compute-servers/guillimin) has 100 K20 GPUs and 100 Xeon Phi cards.

Benchmarking

The K20c timings below are expressed in units of the corresponding timings on monk (M2075 GPU).

| Code Name | Precision[1] | K20c, 1 thread | K20c, 8 threads[2] | Hyper-Q speedup[3] | K20c/serial, 1 thread[4] | K20c/serial, 8 threads[5] |
| --- | --- | --- | --- | --- | --- | --- |
| Random Lens Design, low res | DP | 0.60 | 1.79 | 3.00 | 58 | 21.5 |
| Random Lens Design, high res | DP | 0.89 | 1.48 | 1.66 | 272 | 57 |
  1. SP/DP/MP: single/double/mixed precision.
  2. Per-thread timing: wall-clock time divided by the number of threads.
  3. Speedup due to Hyper-Q: the ratio of the K20c/8 threads timing to the K20c/1 thread timing.
  4. Speedup of 1 CPU thread + K20c vs. serial code run on 1 CPU thread, measured on arc09. Note that an arc09 CPU core is 1.75x slower than an orca core, so to compare the K20c to orca, divide this number by 1.75.
  5. Speedup of 8 CPU threads + K20c (GPU farm) vs. serial code run on 8 CPU threads (serial farm), measured on arc09. Note that an arc09 CPU core is 1.75x slower than an orca core, so to compare the K20c to orca, divide this number by 1.75.
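
The arithmetic in the notes above can be sketched in a few lines of Python. This is only an illustration: the numbers are copied from the table, while the function and variable names are made up for this example, and the 80-second wall-clock time in the last line is a hypothetical value, not a measurement. Because the table entries are rounded, the recomputed Hyper-Q ratio may differ slightly from the tabulated one.

```python
# Values copied from the benchmark table; the names are illustrative only.
ARC09_TO_ORCA = 1.75  # an arc09 CPU core is 1.75x slower than an orca core

rows = {
    "low res": {"k20c_1": 0.60, "k20c_8": 1.79, "vs_serial_1": 58.0, "vs_serial_8": 21.5},
    "high res": {"k20c_1": 0.89, "k20c_8": 1.48, "vs_serial_1": 272.0, "vs_serial_8": 57.0},
}


def per_thread_timing(wall_clock, n_threads):
    """Note 2: wall-clock time divided by the number of threads."""
    return wall_clock / n_threads


def hyperq_speedup(row):
    """Note 3: ratio of the 8-thread K20c timing to the 1-thread K20c timing."""
    return row["k20c_8"] / row["k20c_1"]


def orca_speedup(arc09_speedup):
    """Notes 4 and 5: rescale a speedup measured on arc09 to orca CPU cores."""
    return arc09_speedup / ARC09_TO_ORCA


for name, row in rows.items():
    print(
        f"{name}: Hyper-Q ratio {hyperq_speedup(row):.2f}, "
        f"vs. orca serial (1 thread) {orca_speedup(row['vs_serial_1']):.1f}, "
        f"vs. orca serial farm (8 threads) {orca_speedup(row['vs_serial_8']):.1f}"
    )

# Hypothetical example: 8 threads finishing in 80 s of wall-clock time.
print(per_thread_timing(80.0, 8))
```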

Benchmarked codes

  • Random Lens Design. Written by Sergey Mashchenko. Discovers new complex lens designs by means of global optimization (a search for the global minimum of the merit function in a 40-100 dimensional space). The search is of Monte Carlo type, which makes it well suited to serial/GPU farming. Purely double precision (needed for the smoothness of the merit function). Both serial and CUDA (capability 2.x) versions exist. The merit function is computed from the results of ray tracing through the system; each ray is handled by a separate CUDA thread. Low resolution runs (~10,000 rays/threads) are used for the initial search; high resolution runs (~100,000 rays/threads) are used to fine-tune the best candidates. The CUDA code has more than 10 kernels and a few device functions. The device memory consumption is ~200 MB for low res and ~400 MB for high res. At low resolution, the occupancy number is fairly low (~0.20).
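
The two-stage search described above (a cheap low resolution pass over many random candidates, then a high resolution pass on the best few) can be sketched in plain Python. This is not the actual Random Lens Design code: `toy_merit` is a hypothetical stand-in for the ray-traced merit function, the simple random-perturbation `refine` step is an assumed placeholder for whatever fine-tuning the real code performs, and all parameter names and values are chosen only for illustration. In the real code, the inner loop over rays is what runs in parallel, one CUDA thread per ray.

```python
import random


def toy_merit(x, n_rays):
    """Hypothetical stand-in for the ray-traced merit function (lower is better)."""
    total = 0.0
    for r in range(n_rays):  # in the CUDA version each ray is a separate thread
        h = (r + 0.5) / n_rays  # normalized ray height in (0, 1)
        total += sum((xi - 0.1 * h) ** 2 for xi in x)
    return total / n_rays


def refine(x, n_rays, n_steps, step, rng):
    """Assumed fine-tuning step: keep random perturbations that lower the merit."""
    best_x, best_m = x, toy_merit(x, n_rays)
    for _ in range(n_steps):
        trial = [xi + rng.gauss(0.0, step) for xi in best_x]
        m = toy_merit(trial, n_rays)
        if m < best_m:
            best_x, best_m = trial, m
    return best_m, best_x


def monte_carlo_search(dim, n_trials, n_rays_low, n_rays_high,
                       n_keep, n_steps, step, seed):
    """Low resolution random search followed by high resolution refinement."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_trials):
        x = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        candidates.append((toy_merit(x, n_rays_low), x))
    candidates.sort(key=lambda c: c[0])  # best low resolution candidates first
    refined = [refine(x, n_rays_high, n_steps, step, rng)
               for _, x in candidates[:n_keep]]
    return min(refined, key=lambda c: c[0])


if __name__ == "__main__":
    merit, design = monte_carlo_search(dim=8, n_trials=200, n_rays_low=10,
                                       n_rays_high=100, n_keep=5,
                                       n_steps=50, step=0.05, seed=0)
    print(f"best merit found: {merit:.4f}")
```

Since every random trial is independent of the others, many copies of such a search can run side by side with different seeds, which is what makes this kind of code a natural fit for serial or GPU farming.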