This page gives a brief introduction to using Nsight for profiling.
Nsight Eclipse Edition
NVIDIA® Nsight™ Eclipse Edition is a full-featured IDE powered by the Eclipse platform that provides an all-in-one integrated environment to edit, build, debug and profile CUDA-C applications. Nsight Eclipse Edition is part of the CUDA Toolkit Installer for Linux and Mac. [1]
Interactive job on SHARCNET
Nsight is a graphical application. You can log in to the development node on Monk with the -Y flag in your ssh command to enable X11 forwarding. You can also submit an interactive job on Angel and then log in to the node on which your job is running.
To get an interactive session, submit a "sleep <seconds>" job to the gpu queue with no output file.
[feimao@ang241 ~]$ sqsub -q gpu -f threaded -n 8 --gpp=2 --mpp=8g -r 10h -o /dev/null sleep 36000
Once the job starts running, use sqjobs to check which node it is running on:
[feimao@ang241 ~]$ sqjobs
jobid queue state ncpus nodes  time command
----- ----- ----- ----- ----- ----- -------
82767   gpu     D     8 ang22 21.5h sleep 36000
82771   gpu     R     8 ang22  362s sleep 36000
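If you want to script the node lookup, the listing can be parsed with a few lines of Python. This is only a sketch: the function name `running_node` is hypothetical, the sample text is copied from the sqjobs listing above, and it assumes that state "R" marks a running job, as that listing suggests.

```python
from typing import Optional

# Sample text copied from the sqjobs listing above.
SAMPLE = """\
jobid queue state ncpus nodes time command
----- ----- ----- ----- ----- ----- -------
82767 gpu D 8 ang22 21.5h sleep 36000
82771 gpu R 8 ang22 362s sleep 36000
"""


def running_node(sqjobs_text: str, queue: str = "gpu") -> Optional[str]:
    """Return the node of the first job in `queue` whose state is 'R'.

    Assumption: 'R' means running. Returns None if no such job is listed.
    """
    for line in sqjobs_text.splitlines():
        fields = line.split()
        # Data rows start with a numeric job id; this skips the
        # header line and the row of dashes.
        if len(fields) < 6 or not fields[0].isdigit():
            continue
        job_queue, state, node = fields[1], fields[2], fields[4]
        if job_queue == queue and state == "R":
            return node
    return None


print(running_node(SAMPLE))  # prints: ang22
```

The returned name is the host to pass to ssh -Y in the next step.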
Then use ssh -Y to log in to that node. Make sure you also passed -Y when you logged in to Angel's login node, otherwise X11 forwarding will not reach the compute node.
Once you are logged in to the compute node (ang22), you can run "nvidia-smi" to check the GPU status.
[feimao@ang241 ~]$ ssh -Y ang22
Last login: Thu Sep 3 14:04:25 2015 from ang241.angel.sharcnet
[feimao@ang22 ~]$ nvidia-smi
Fri Sep 4 13:28:15 2015
+------------------------------------------------------+
| NVIDIA-SMI 340.32     Driver Version: 340.32         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 750 Ti  Off  | 0000:10:00.0     N/A |                  N/A |
| 40%   15C    P0    N/A /  N/A |      7MiB /  2047MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 750 Ti  Off  | 0000:12:00.0     N/A |                  N/A |
| 40%   18C    P0    N/A /  N/A |      7MiB /  2047MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0            Not Supported                                               |
|    1            Not Supported                                               |
+-----------------------------------------------------------------------------+
After loading the required modules, you can run "nsight" to start the graphical application. Appending "&" runs it in the background so the terminal stays usable.
[feimao@ang22 ~]$ nsight &
Profiling kernels
Once your code runs correctly, click the Profile button and Nsight will switch to NVVP (NVIDIA Visual Profiler) mode. Here you can see a timeline of each kernel as well as the host API calls.
To profile a kernel, select it on the timeline, scroll down in the "Analysis" tab in the bottom-left corner, select "Switch to unguided analysis", and then choose "Analyze All".
Kernel Performance Limiter
The performance limiter shows the utilization of both the compute and memory resources, and indicates whether your code is compute bound or memory bound. The example used here is a "reduction" kernel from the Thrust library, so it shows high memory utilization and is memory bound.
A typical compute-bound code (such as matrix multiplication) will instead show high function unit utilization and relatively lower memory utilization.
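The difference between the two cases can be estimated by hand with a simple roofline-style test: compare the kernel's arithmetic intensity (FLOP per byte moved) with the device's ratio of peak compute to peak bandwidth. The sketch below is not how the profiler derives its chart; the function name `limiter` and the device limits are hypothetical values chosen only for illustration, and the matrix multiplication count assumes ideal data reuse.

```python
def limiter(flops: float, bytes_moved: float,
            peak_flops: float, peak_bandwidth: float) -> str:
    """Classify a kernel with a simple roofline-style comparison."""
    intensity = flops / bytes_moved        # FLOP the kernel does per byte moved
    balance = peak_flops / peak_bandwidth  # FLOP per byte the device can sustain
    return "compute bound" if intensity > balance else "memory bound"


# Hypothetical device limits, for illustration only.
PEAK_FLOPS = 1.0e12      # 1 TFLOP/s
PEAK_BANDWIDTH = 1.0e11  # 100 GB/s

# Sum reduction of n floats: about n additions while reading 4*n bytes.
n = 1 << 24
print(limiter(n, 4 * n, PEAK_FLOPS, PEAK_BANDWIDTH))  # prints: memory bound

# m x m float matrix multiplication: 2*m^3 FLOP. Assuming ideal reuse,
# each of the three matrices crosses the memory bus once (3*m*m*4 bytes).
m = 2048
print(limiter(2 * m**3, 3 * m * m * 4, PEAK_FLOPS, PEAK_BANDWIDTH))  # prints: compute bound
```

The reduction does about 0.25 FLOP per byte, far below the hypothetical device balance of 10 FLOP per byte, while the matrix multiplication does several hundred, which matches the two cases described above.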
Kernel Latency
Instruction stall reasons
Instruction stall reasons indicate the conditions that prevent warps from executing on any given cycle.
Occupancy
Occupancy is defined as the ratio of active warps on an SM to the maximum number of active warps supported by the SM. Occupancy varies over time as warps begin and end, and can be different for each SM. Low occupancy results in poor instruction issue efficiency, because there are not enough eligible warps to hide latency between dependent instructions. When occupancy is at a sufficient level to hide latency, increasing it further may degrade performance due to the reduction in resources per thread. An early step of kernel performance analysis should be to check occupancy and observe the effects on kernel execution time when running at different occupancy levels. See [2] for more details about occupancy.
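The definition can be turned into a small worked calculation. The sketch below is a simplification under stated assumptions: the helper names are hypothetical, the limit of 64 warps per SM is an example value rather than a property of any particular GPU, and `blocks_allowed` stands for the number of blocks per SM that register and shared memory limits would permit, which you would take from the profiler or an occupancy calculator.

```python
def occupancy(active_warps: int, max_warps: int) -> float:
    """Ratio of active warps on an SM to the maximum the SM supports."""
    return active_warps / max_warps


def theoretical_active_warps(threads_per_block: int, blocks_allowed: int,
                             max_warps: int, warp_size: int = 32) -> int:
    """Upper bound on active warps per SM for a given launch configuration."""
    # A partially filled warp still occupies a whole warp slot (ceiling division).
    warps_per_block = -(-threads_per_block // warp_size)
    # Blocks are resident as a whole, so count only blocks that fit completely.
    resident_blocks = min(blocks_allowed, max_warps // warps_per_block)
    return resident_blocks * warps_per_block


MAX_WARPS = 64  # hypothetical per-SM limit

# 256 threads per block is 8 warps; with 4 resident blocks that is 32 warps.
warps = theoretical_active_warps(256, 4, MAX_WARPS)
print(warps, occupancy(warps, MAX_WARPS))  # prints: 32 0.5
```

If resources allowed 8 or more such blocks per SM, the same calculation would give 64 active warps and an occupancy of 1.0.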
Multiprocessor Utilization
This graph shows the utilization of each SM.
Kernel Compute
Function Unit Utilization
This graph shows the utilization level of the different function units. It also indicates the utilization of the four major logical pipelines (Load/Store, Texture, Control Flow, Arithmetic) of the SMs during the execution of the kernel. This is useful for investigating whether a pipeline is oversubscribed and is therefore limiting the kernel's performance. It also helps to estimate whether adding more work will scale well or whether a pipeline limit will be hit. See [3] for more details about each pipeline.
Kernel Memory
This chart shows the usage of the memory system. An optimized memory-bound code can show high global memory bandwidth.[4]
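To judge whether a measured bandwidth is high, you can compute the effective bandwidth of a kernel by hand as (bytes read + bytes written) divided by the kernel time, and compare it with the peak bandwidth of the device. The sketch below uses hypothetical numbers: the kernel time and the peak bandwidth are made-up example values, and the function name is not part of any NVIDIA tool.

```python
def effective_bandwidth_gbs(bytes_read: int, bytes_written: int,
                            seconds: float) -> float:
    """Effective bandwidth in GB/s: total bytes moved divided by elapsed time."""
    return (bytes_read + bytes_written) / 1e9 / seconds


# Example: a kernel that reads and writes one 2048 x 2048 array of floats.
n_bytes = 2048 * 2048 * 4
kernel_time = 0.5e-3         # hypothetical kernel time: 0.5 ms
peak_bandwidth_gbs = 100.0   # hypothetical device peak

achieved = effective_bandwidth_gbs(n_bytes, n_bytes, kernel_time)
print(f"{achieved:.1f} GB/s, {achieved / peak_bandwidth_gbs:.0%} of peak")
```

With these example values the kernel moves about 33.5 MB in 0.5 ms, which is roughly two thirds of the assumed peak.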