Cross-section page — full summary at Performance Analysis — TensorRT LLM.
NVIDIA Platform Angle
TensorRT-LLM has first-class Nsight Systems integration built into its CLI tooling (trtllm-bench, trtllm-serve), making profiling a native part of the NVIDIA inference platform rather than a separate instrumentation step. This page covers TensorRT-LLM’s profiling capabilities as components of the NVIDIA platform stack.
Nsight Systems Integration Points in TensorRT-LLM
TensorRT-LLM exposes three profiling mechanisms:
- CUDA profiler API gating (`TLLM_PROFILE_START_STOP`): Rather than profiling entire multi-hour inference runs, users specify iteration ranges. Nsight Systems' `-c cudaProfilerApi` mode activates only when TensorRT-LLM turns the CUDA profiler on. This tightly scopes the profile to the region of interest.
- NVTX markers: The PyTorch workflow provides NVTX markers by default (`TLLM_LLMAPI_ENABLE_NVTX=1`). The C++/TensorRT workflow requires building with `--nvtx`. GC NVTX markers (`TLLM_PROFILE_RECORD_GC=1`) are also available for GIL/garbage-collection diagnostics. These markers appear in the Nsight Systems timeline, labeling inference passes, batches, and GC events.
- PyTorch profiler export (`TLLM_TORCH_PROFILE_TRACE`): Produces a `.json` trace alongside the `.nsys-rep` Nsight file, enabling dual-tool analysis for the PyTorch workflow.
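A typical invocation combining these mechanisms might look like the following sketch. The environment variables come from the list above; the benchmark arguments, iteration range, and output names are illustrative placeholders, not prescribed defaults:

```shell
# Hedged sketch: gate Nsight Systems capture to a fixed iteration range of a
# trtllm-bench run. Model/dataset arguments are illustrative placeholders.
export TLLM_PROFILE_START_STOP="100-150"    # iteration range to profile
export TLLM_LLMAPI_ENABLE_NVTX=1            # NVTX markers (PyTorch workflow)
export TLLM_TORCH_PROFILE_TRACE=trace.json  # also emit a PyTorch .json trace

# -c cudaProfilerApi: capture starts/stops when TensorRT-LLM toggles the
# CUDA profiler, so only the requested iterations land in the report.
nsys profile -c cudaProfilerApi -o trtllm_profile \
    trtllm-bench --model <model> throughput --dataset <dataset>
```

The result is a `trtllm_profile.nsys-rep` file covering only the gated iterations, plus the optional PyTorch trace for side-by-side analysis.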
ENABLE_PERFECT_ROUTER — MoE Platform Feature
For Mixture-of-Experts models (DeepSeek-V3/R1, GPT-OSS), TensorRT-LLM exposes `ENABLE_PERFECT_ROUTER=1` to isolate expert load imbalance from kernel performance. The feature is model-specific: it must be integrated into each MoE model implementation. Currently supported on the NVIDIA platform:
- GPT-OSS: uses `RenormalizeMoeRoutingMethod` (TopK + Softmax)
- DeepSeek-V3 / DeepSeek-R1: uses `DeepSeekV3MoeRoutingMethod`
The MoE backends it works with (CUTLASS, TRTLLM, TRITON) are all NVIDIA-platform backends.
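In practice, isolating imbalance means running the same benchmark twice, with and without perfect routing, and comparing the profiles. The sketch below assumes the `trtllm-bench` invocation style shown elsewhere on this page; the model and dataset arguments are placeholders:

```shell
# Hedged sketch: compare a normal MoE run against a perfectly balanced one.
# Any throughput delta between the two runs isolates expert load imbalance
# from raw kernel performance. Arguments are illustrative placeholders.

# Baseline: real learned routing (imbalance included)
trtllm-bench --model <moe-model> throughput --dataset <dataset>

# Balanced: router output replaced with an even expert assignment
ENABLE_PERFECT_ROUTER=1 \
    trtllm-bench --model <moe-model> throughput --dataset <dataset>
```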
Profiling Workflow for NVIDIA Inference Stack
```
trtllm-bench / trtllm-serve   ← application-level throughput/latency
        ↓  (TLLM_PROFILE_START_STOP + nsys)
Nsight Systems                ← system-wide timeline (CPU/GPU/NVTX)
        ↓  (targeted kernel)
Nsight Compute                ← per-kernel deep metrics
```
This mirrors the general NVIDIA profiling hierarchy described in NVIDIA Nsight Systems (NVIDIA Platform).
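Following that hierarchy, a hot kernel identified in the Nsight Systems timeline can be re-profiled in isolation with Nsight Compute. In this sketch the kernel-name regex, launch count, and benchmark arguments are assumptions to adapt, not fixed values:

```shell
# Hedged sketch: after spotting a hot kernel in the nsys timeline, collect
# the full per-kernel metric set for it with Nsight Compute (ncu).
# The kernel regex and launch count are placeholders.
ncu --set full \
    --kernel-name "regex:<kernel_name>" \
    --launch-count 3 \
    -o kernel_report \
    trtllm-bench --model <model> throughput --dataset <dataset>
```

Limiting the launch count keeps the replay overhead of Nsight Compute's full metric set manageable on long-running inference workloads.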
Connections to NVIDIA Platform Stack
- NVIDIA Nsight Systems (NVIDIA Platform) — the profiler tool that TensorRT-LLM’s environment variables control.
- Scaling LLMs with Triton and TensorRT-LLM (NVIDIA Platform) — TensorRT-LLM engine building and Triton deployment; the operational context in which this profiling workflow is applied.