Abstract
NVIDIA Nsight Systems is a system-wide performance analysis tool that visualizes CPU and GPU activity on a unified timeline, providing low-overhead profiling across the full NVIDIA hardware range, from DGX servers and RTX workstations to DRIVE automotive and Jetson edge devices. It exposes CPU parallelization and core utilization, GPU streaming-multiprocessor (SM) utilization, CUDA/cuBLAS/cuDNN/TensorRT library traces, network communications, OS interactions, and hardware-level metrics including NVLink throughput, DRAM activity, Tensor Core activity, and warp occupancy. Multi-node profiling automatically diagnoses performance limiters across data center–scale clusters. The tool is part of the NVIDIA Nsight Developer Tools suite alongside Nsight Compute (kernel-level CUDA profiler), Nsight Graphics (frame debugger), and Nsight Aftermath SDK (GPU crash reporter). Reported outcomes include Tracxpoint achieving >90% GPU utilization and reducing training time from 600 to 90 minutes, and Deepset achieving a 3.9× speedup with a 12.8× cost reduction for NLP training.
Key Concepts
- Unified timeline view: Nsight Systems displays CPU and GPU activity side by side, correlating events to expose dependencies, bottlenecks, and resource contention across the full system.
- GPU Metrics Sampling: Opt-in feature that plots low-level I/O metrics (PCIe throughput, NVLink bandwidth, DRAM activity) alongside SM utilization, Tensor Core activity, instruction throughput, and warp occupancy.
- NVTX (NVIDIA Tools Extension): Custom instrumentation API allowing developers to insert named markers and ranges into application code, visible as annotations in the Nsight Systems timeline. Plugins can emit NVTX events that appear in the same timeline as the main application.
- Multi-node analysis: Automatic diagnosis of performance limiters across many nodes simultaneously; includes network metrics alongside Python backtrace sampling for complete multi-GPU, multi-CPU, DPU, and inter-node visibility.
- Python/Jupyter integration: Backtraces and automatic call stack sampling for deep learning Python applications; Jupyter Lab plugin enables profiling directly in notebooks.
- CUDA profiler runtime API: On/off toggle mechanism used by TensorRT-LLM’s `TLLM_PROFILE_START_STOP` feature to capture only specific iteration ranges (see Performance Analysis — TensorRT LLM).
- Nsight Compute: Companion interactive kernel profiler for CUDA applications; operates at a deeper level than Nsight Systems with detailed per-kernel performance metrics and API debugging.
- Nsight Graphics: Standalone tool for debugging, profiling, and exporting frames built with Direct3D, Vulkan, OpenGL, OpenVR, and Oculus SDK; includes ray-tracing support.
- Nsight Aftermath SDK: Library integrating into D3D12 or Vulkan crash reporters to generate GPU “mini-dumps” (pipeline information) on TDR exceptions.
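As a rough illustration of the NVTX entry above, the sketch below wraps phases of a loop in named ranges using the `nvtx` Python package (`nvtx.annotate`). The helper `profiled_range` and the functions `load_batch`, `train_step`, and `run` are hypothetical stand-ins, not part of any NVIDIA API. The fallback branch is an assumption added so the code still runs when the package is not installed.

```python
import contextlib

try:
    import nvtx  # NVTX Python bindings; optional for this sketch
    _HAVE_NVTX = True
except ImportError:
    nvtx = None
    _HAVE_NVTX = False


@contextlib.contextmanager
def profiled_range(name, color="blue"):
    """Open a named NVTX range if the bindings are available, else do nothing."""
    if _HAVE_NVTX:
        with nvtx.annotate(name, color=color):
            yield
    else:
        yield


def load_batch(n):
    # Hypothetical stand-in for CPU-side data loading.
    return list(range(n))


def train_step(batch):
    # Hypothetical stand-in for a GPU training step.
    return sum(x * x for x in batch)


def run(steps=3, batch_size=4):
    results = []
    for step in range(steps):
        # Nested ranges would show up as nested annotations on the timeline.
        with profiled_range(f"step_{step}"):
            with profiled_range("load_batch", color="green"):
                batch = load_batch(batch_size)
            with profiled_range("train_step", color="red"):
                results.append(train_step(batch))
    return results
```

Running a script like this under Nsight Systems with NVTX tracing enabled should surface the range names as annotations; without the profiler attached the ranges are effectively inert.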
Key Claims and Findings
- Tracxpoint: Nsight Systems revealed GPU starvation on a Quadro P6000; after optimization, GPU utilization exceeded 90% and training time dropped from 600 to 90 minutes.
- Deepset: Working with AWS and NVIDIA, achieved 3.9× speedup and 12.8× cost reduction for NLP model training.
- Microsoft Azure HPC+AI: Nsight Systems provided event-level visibility across CPUs, GPUs, NICs, and the OS, enabling identification of top time-consuming functions and cold spots.
- Adobe: Nsight Graphics and Nsight Systems described as “invaluable” for profiling Vulkan ray-tracing applications in Substance 3D products.
- Nsight Systems 2026.2.1 is the current release as of the article’s creation date.
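For comparison with the Deepset figure, the Tracxpoint training-time numbers can be converted into a speedup factor. The short sketch below only performs that arithmetic on the reported 600 and 90 minutes; the function names are illustrative.

```python
def speedup(before_minutes, after_minutes):
    """Speedup factor: how many times faster the optimized run is."""
    return before_minutes / after_minutes


def reduction_percent(before_minutes, after_minutes):
    """Percentage of the original run time that was eliminated."""
    return 100 * (before_minutes - after_minutes) / before_minutes


# Tracxpoint figures reported above: 600 minutes before, 90 minutes after.
tracxpoint_speedup = speedup(600, 90)              # about 6.67x
tracxpoint_reduction = reduction_percent(600, 90)  # 85.0 percent
```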
The Nsight Developer Tools Suite
| Tool | Scope | Primary Use |
|---|---|---|
| Nsight Systems | System-wide (CPU + GPU + network) | Timeline analysis, bottleneck identification, multi-node |
| Nsight Compute | Single CUDA kernel | Deep kernel metrics, warp efficiency, memory throughput |
| Nsight Graphics | Graphics API frames | D3D/Vulkan/GL debugging, ray-tracing profiling |
| Nsight Aftermath SDK | GPU crash state | Post-crash mini-dump for D3D12/Vulkan applications |
Terminology
- SM (Streaming Multiprocessor): Basic processing unit of NVIDIA GPUs; SM utilization reflects how effectively GPU compute is used.
- Warp occupancy: Ratio of active warps to the maximum number of warps an SM can hold; low occupancy can limit the SM's ability to hide memory latency, because fewer warps are available to run while others wait on memory.
- TDR (Timeout Detection and Recovery): Windows OS mechanism that resets the GPU on hang; Nsight Aftermath captures GPU state before the reset.
- DPU (Data Processing Unit): NVIDIA BlueField; Nsight Systems can include DPU metrics in multi-node profiles.
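The warp occupancy ratio defined above can be worked through numerically. The sketch below is a simplified theoretical estimate: it assumes the number of resident blocks per SM is already known and ignores register and shared-memory limits. The value 64 for maximum warps per SM in the example is illustrative, since the real limit depends on the GPU architecture.

```python
import math

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def warps_per_block(threads_per_block):
    """A partially filled warp still occupies a full warp slot."""
    return math.ceil(threads_per_block / WARP_SIZE)


def occupancy(threads_per_block, resident_blocks_per_sm, max_warps_per_sm):
    """Active warps divided by the maximum warps the SM can hold."""
    active = warps_per_block(threads_per_block) * resident_blocks_per_sm
    active = min(active, max_warps_per_sm)  # cannot exceed the hardware limit
    return active / max_warps_per_sm


# 256 threads per block is 8 warps; 4 resident blocks gives 32 active warps.
# With an assumed limit of 64 warps per SM, occupancy is 0.5.
example = occupancy(threads_per_block=256, resident_blocks_per_sm=4, max_warps_per_sm=64)
```

Measured occupancy in a profiler can differ from this estimate because warps finish at different times and blocks may not be evenly distributed across SMs.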
Connections to Existing Wiki Pages
- Performance Analysis — TensorRT LLM: detailed guide on using Nsight Systems specifically with TensorRT-LLM workloads, including `TLLM_PROFILE_START_STOP` and `ENABLE_PERFECT_ROUTER`.
- Optimization — NVIDIA Triton Inference Server: Triton perf_analyzer and Model Analyzer provide application-level benchmarking; Nsight Systems provides deeper profiling underneath.
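The iteration-range capture associated with `TLLM_PROFILE_START_STOP` can be sketched as a small gate around a profiler on/off toggle. This is not TensorRT-LLM's implementation: the `A-B` range format, the `IterationProfiler` class, and the `parse_ranges` helper are assumptions for illustration. In a real setup the two callbacks might be `torch.cuda.profiler.start` and `torch.cuda.profiler.stop`, with Nsight Systems launched using `--capture-range=cudaProfilerApi`; here they are plain callables so the logic runs without a GPU.

```python
def parse_ranges(spec):
    """Parse 'A-B' or 'A-B,C-D' into inclusive (start, stop) pairs (assumed format)."""
    ranges = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start, _, stop = part.partition("-")
        start_i = int(start)
        stop_i = int(stop) if stop else start_i
        if stop_i < start_i:
            raise ValueError(f"invalid range: {part!r}")
        ranges.append((start_i, stop_i))
    return ranges


class IterationProfiler:
    """Toggle a profiler on only for the configured iteration ranges."""

    def __init__(self, spec, start_fn, stop_fn):
        self.ranges = parse_ranges(spec)
        self.start_fn = start_fn
        self.stop_fn = stop_fn
        self.active = False

    def step(self, iteration):
        """Call at the start of each iteration, before any work is issued."""
        wanted = any(a <= iteration <= b for a, b in self.ranges)
        if wanted and not self.active:
            self.start_fn()
            self.active = True
        elif not wanted and self.active:
            self.stop_fn()
            self.active = False

    def finish(self):
        """Stop the profiler if the loop ended inside a capture range."""
        if self.active:
            self.stop_fn()
            self.active = False
```

Because the toggle is evaluated at the start of each iteration, the stop call lands at the start of the first iteration after the range, so the last requested iteration is captured in full.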
Cross-Section Pages
- NVIDIA Nsight Systems (NVIDIA Platform) — angle: Nsight Systems as an NVIDIA developer platform tool; relationship to the TensorRT-LLM and Triton profiling ecosystem.