Note: Some of the information on this page is for our legacy systems only. The page is scheduled for an update to make it applicable to Graham.

This page gives a brief introduction to using Nsight for profiling.

Nsight Eclipse Edition

NVIDIA® Nsight™ Eclipse Edition is a full-featured IDE powered by the Eclipse platform that provides an all-in-one integrated environment to edit, build, debug and profile CUDA-C applications. Nsight Eclipse Edition is part of the CUDA Toolkit Installer for Linux and Mac. [1]

Interactive job on SHARCNET

Nsight is a graphical application. You can log in to the development node on Monk with the -Y flag in your ssh command to enable X11 forwarding. Alternatively, you can submit an interactive job on Angel and then log in to the node on which your job is running.

To get an interactive session, submit a "sleep <seconds>" command to the gpu queue with the output discarded (-o /dev/null). The sleep duration should match the requested run time; in the example below, -r 10h corresponds to 36000 seconds.

[feimao@ang241 ~]$ sqsub -q gpu -f threaded -n 8 --gpp=2 --mpp=8g -r 10h -o /dev/null sleep 36000

Once the job starts running, use sqjobs to check which node it is running on:

[feimao@ang241 ~]$ sqjobs
jobid queue state ncpus nodes  time command
----- ----- ----- ----- ----- ----- -------
82767   gpu     D     8 ang22 21.5h sleep 36000
82771   gpu     R     8 ang22  362s sleep 36000

Then use ssh -Y to log in to that node. Make sure you also used -Y when you logged in to Angel's login node, so that X11 forwarding is carried through to the compute node.

Once you are logged in to the compute node (ang22 in this example), you can use "nvidia-smi" to check the GPU status.

[feimao@ang241 ~]$ ssh -Y ang22
Last login: Thu Sep  3 14:04:25 2015 from ang241.angel.sharcnet
[feimao@ang22 ~]$ nvidia-smi
Fri Sep  4 13:28:15 2015       
| NVIDIA-SMI 340.32     Driver Version: 340.32         |                       
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  GeForce GTX 750 Ti  Off  | 0000:10:00.0     N/A |                  N/A |
| 40%   15C    P0    N/A /  N/A |      7MiB /  2047MiB |     N/A      Default |
|   1  GeForce GTX 750 Ti  Off  | 0000:12:00.0     N/A |                  N/A |
| 40%   18C    P0    N/A /  N/A |      7MiB /  2047MiB |     N/A      Default |
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|    0            Not Supported                                               |
|    1            Not Supported                                               |

After loading the required modules, you can run "nsight" to start the graphical application. I add "&" after nsight to make it run in the background.

[feimao@ang22 ~]$ nsight &

Profiling kernels

Once your code runs correctly, click the profile button and Nsight will switch to NVVP (NVIDIA Visual Profiler) mode. Here you can see the timeline of each kernel as well as the host API calls.


To profile a kernel, select it on the timeline, scroll down to the "Analysis" tab in the bottom-left corner, select "Switch to unguided analysis", and then choose "Analyze All":

Analysis label.png
Analysis all.png

Kernel Performance Limiter

The performance limiter shows the utilization of both compute and memory resources, and tells you whether the kernel is compute bound or memory bound. The example I use here is a "reduction" kernel from the Thrust library; it shows high memory utilization, so it is memory bound.

Kernel performance limiter.png

A typical compute-bound code (matrix multiplication) will instead show high function unit utilization and relatively lower memory utilization.

Kernel performance limiter 2.png
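
As a rough cross-check of the two examples above, one can compare a kernel's arithmetic intensity (floating-point operations per byte of global memory traffic) with the ratio of the GPU's peak compute rate to its peak memory bandwidth. The sketch below is only a back-of-envelope estimate and is not how the profiler itself classifies a kernel. The function names and the peak figures are illustrative assumptions, not values measured on SHARCNET hardware.

```python
# Back-of-envelope classification of a kernel as memory or compute bound.
# All names and peak numbers here are illustrative assumptions.

def arithmetic_intensity(flops, bytes_moved):
    """Floating-point operations per byte of global memory traffic."""
    return flops / bytes_moved

def likely_limiter(intensity, peak_gflops, peak_gbs):
    """Compare kernel intensity with the machine balance (FLOP per byte)."""
    machine_balance = peak_gflops / peak_gbs
    return "compute" if intensity >= machine_balance else "memory"

n = 1 << 20   # number of floats in a sum reduction
m = 1024      # dimension of a square float matrix multiplication

# Sum reduction: about n additions, n * 4 bytes read once.
reduction_ai = arithmetic_intensity(n, 4 * n)

# Matrix multiplication: 2 * m^3 operations; the ideal traffic is three
# m x m float matrices (two read, one written), assuming perfect reuse.
matmul_ai = arithmetic_intensity(2 * m**3, 3 * 4 * m**2)

PEAK_GFLOPS = 1300.0  # assumed single-precision peak, for illustration only
PEAK_GBS = 86.0       # assumed peak memory bandwidth, for illustration only

print("reduction:", reduction_ai, likely_limiter(reduction_ai, PEAK_GFLOPS, PEAK_GBS))
print("matmul:   ", matmul_ai, likely_limiter(matmul_ai, PEAK_GFLOPS, PEAK_GBS))
```

The matrix multiplication estimate assumes each matrix crosses the memory bus only once. A naive kernel without tiling moves far more data and can still appear memory bound in the profiler.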

Kernel Latency

Instruction stall reasons

Instruction stall reasons indicate the conditions that prevent warps from executing on any given cycle.

Instruction stall reasons.png


Occupancy

Occupancy is defined as the ratio of active warps on an SM to the maximum number of active warps supported by the SM. Occupancy varies over time as warps begin and end, and can be different for each SM. Low occupancy results in poor instruction issue efficiency, because there are not enough eligible warps to hide latency between dependent instructions. When occupancy is at a sufficient level to hide latency, increasing it further may degrade performance due to the reduction in resources per thread. An early step of kernel performance analysis should be to check occupancy and observe the effects on kernel execution time when running at different occupancy levels. See [2] for more details about occupancy.
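
The definition above can be turned into a simplified estimate of theoretical occupancy from a kernel's launch configuration. This is a minimal sketch, not the algorithm used by the profiler or by NVIDIA's occupancy calculator: it ignores register and shared memory allocation granularity, and the default per-SM limits are assumptions that should be checked against the CUDA documentation for your GPU. The function name and parameters are hypothetical.

```python
# Simplified theoretical occupancy estimate.
# Default per-SM limits are assumptions; look up the real values for
# your GPU's compute capability before relying on the result.

def theoretical_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                          max_warps_per_sm=64, max_blocks_per_sm=32,
                          regs_per_sm=65536, smem_per_sm=65536, warp_size=32):
    """Return active warps / maximum warps per SM, between 0.0 and 1.0."""
    # Warps needed by one block (rounded up).
    warps_per_block = -(-threads_per_block // warp_size)

    # How many blocks fit on one SM under each resource limit.
    by_warps = max_warps_per_sm // warps_per_block
    by_regs = max_blocks_per_sm
    if regs_per_thread > 0:
        by_regs = regs_per_sm // (regs_per_thread * warps_per_block * warp_size)
    by_smem = max_blocks_per_sm
    if smem_per_block > 0:
        by_smem = smem_per_sm // smem_per_block

    # The tightest limit decides how many blocks are resident.
    blocks = min(max_blocks_per_sm, by_warps, by_regs, by_smem)
    return blocks * warps_per_block / max_warps_per_sm

# 256 threads per block with different register and shared memory usage.
print(theoretical_occupancy(256, 32, 0))
print(theoretical_occupancy(256, 64, 0))
print(theoretical_occupancy(256, 32, 48 * 1024))
```

Comparing such an estimate with the occupancy reported by the profiler can help identify which resource (registers, shared memory, or block size) is limiting the number of resident warps.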


Multiprocessor Utilization

This graph shows the utilization of each SM.

SM utilization.png

Kernel Compute

Function Unit Utilization

This graph shows the utilization level of the different function units. It also indicates the utilization of the four major logical pipelines (Load/Store, Texture, Control Flow, Arithmetic) of the SMs during the execution of the kernel. This is useful for investigating whether a pipeline is oversubscribed and is therefore limiting the kernel's performance. It also helps to estimate whether adding more work will scale well or whether a pipeline limit will be hit. See [3] for more details on each pipeline.

Function unit utilization.png

Kernel Memory

This chart shows the usage of the memory system. A well-optimized memory-bound code should show high global memory bandwidth.[4]

Kernel memory.png
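
To judge whether the bandwidth reported in the chart is "high", it can help to compute the effective bandwidth of a kernel by hand: the bytes read plus the bytes written, divided by the kernel time. The sketch below assumes a hypothetical kernel that reads and writes one float array; the elapsed time and the peak bandwidth are made-up illustrative numbers, not measurements.

```python
# Effective bandwidth of a kernel from its data volume and run time.
# The timing and peak bandwidth below are hypothetical values.

def effective_bandwidth_gbs(bytes_read, bytes_written, seconds):
    """Effective bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return (bytes_read + bytes_written) / seconds / 1e9

n = 1 << 24          # number of floats read and written by the kernel
elapsed = 2.0e-3     # assumed kernel time in seconds (e.g. from the timeline)
PEAK_GBS = 86.0      # assumed peak memory bandwidth of the GPU

bw = effective_bandwidth_gbs(4 * n, 4 * n, elapsed)
print("effective bandwidth: %.1f GB/s" % bw)
print("fraction of assumed peak: %.2f" % (bw / PEAK_GBS))
```

The kernel duration can be read from the timeline view, and the result compared with the global memory bandwidth shown in the chart.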