This NVIDIA developer guide documents how to integrate Nsight Systems profiling into TensorRT-LLM workloads for precise, low-overhead performance analysis. The core mechanism is toggling the CUDA profiler runtime API on and off via the TLLM_PROFILE_START_STOP=A-B environment variable combined with Nsight's -c cudaProfilerApi flag, which limits capture to a specific iteration range and produces smaller, more interpretable profile files. A complementary feature, ENABLE_PERFECT_ROUTER, supports Mixture-of-Experts (MoE) expert load-balance analysis by replacing the learned router's output with pre-computed, perfectly balanced logits, isolating routing bottlenecks from kernel performance while preserving the timing realism of other operations.
Key Concepts
TLLM_PROFILE_START_STOP=A-B: Environment variable that limits Nsight profiling to iterations A through B. Produces smaller files and makes the profiled region unambiguous. Use with -c cudaProfilerApi in the Nsight invocation.
-c cudaProfilerApi: Nsight Systems flag that starts capture at cudaProfilerStart() and stops it at cudaProfilerStop(); TensorRT-LLM issues these calls at the iteration boundaries set by TLLM_PROFILE_START_STOP.
TLLM_LLMAPI_ENABLE_NVTX=1: Enables NVTX markers in the PyTorch workflow (the default workflow); in the C++/TensorRT workflow, NVTX support requires building with the --nvtx flag appended to the scripts/build_wheel.py command.
TLLM_PROFILE_RECORD_GC=1: Enables garbage collection NVTX markers for diagnosing Python GC pauses in the Nsight Systems timeline.
TLLM_TORCH_PROFILE_TRACE=<file>.json: Exports a PyTorch profiler trace (.json) alongside the Nsight Systems report (.nsys-rep) for dual-tool analysis.
ENABLE_PERFECT_ROUTER: Environment variable (=1) that bypasses the learned MoE router and replaces its output with pre-computed, perfectly load-balanced routing logits. The gate computation still runs (maintaining realistic timing), but its output is discarded. For performance analysis only; it produces incorrect model outputs.
MoE Expert Load Balance Analysis: Comparison workflow: (1) benchmark with normal routing, (2) benchmark with ENABLE_PERFECT_ROUTER=1, (3) compare throughput. A >10% improvement with the perfect router indicates that routing imbalance is a significant bottleneck.
RenormalizeMoeRoutingMethod: TopK-first-then-Softmax routing method for which the perfect router logits are designed. Used by GPT-OSS.
DeepSeekV3MoeRoutingMethod: DeepSeek-V3/R1’s custom routing method; perfect router logit generation logic is adapted for this method in the supported implementation.
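The perfect-router idea above can be illustrated without any framework code. The sketch below is a hypothetical helper (not TensorRT-LLM's implementation): round-robin top-k expert assignment gives every expert an equal share of tokens, which is the load distribution the pre-computed logits are designed to induce under TopK-then-Softmax routing.

```python
from collections import Counter


def perfect_router_assignments(num_tokens: int, num_experts: int, top_k: int) -> list[list[int]]:
    """Hypothetical sketch: assign each token top_k experts round-robin so that
    expert load is perfectly balanced (exactly equal whenever
    num_tokens * top_k is divisible by num_experts)."""
    assignments = []
    cursor = 0
    for _ in range(num_tokens):
        experts = [(cursor + i) % num_experts for i in range(top_k)]
        assignments.append(experts)
        cursor = (cursor + top_k) % num_experts
    return assignments


def expert_load(assignments: list[list[int]], num_experts: int) -> list[int]:
    """Count how many token slots each expert receives."""
    counts = Counter(e for experts in assignments for e in experts)
    return [counts.get(e, 0) for e in range(num_experts)]
```

Turning such assignments into logits is then trivial (e.g. a large value on each selected expert, a small one elsewhere), so a TopK-first router recovers exactly this balanced schedule.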
Key Claims and Findings
ENABLE_PERFECT_ROUTER works with all MoE backends: CUTLASS, TRTLLM, and TRITON.
Currently supported models for perfect routing: GPT-OSS (RenormalizeMoeRoutingMethod), DeepSeek-V3 / DeepSeek-R1 (DeepSeekV3MoeRoutingMethod).
Adding perfect routing support to a new model requires adding integration following the pattern used in existing implementations; it is model-specific plumbing.
The perfect router caches pre-computed logits for common batch sizes to minimize overhead.
Profiling output: Nsight Systems report saved to <name>.nsys-rep; PyTorch profiler trace to <name>.json.
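The batch-size caching claim above can be sketched as a simple memo keyed on token count; the class and parameter names here are illustrative, not TensorRT-LLM's API. Repeated benchmark iterations at a common batch size then pay the logit-generation cost only once.

```python
from typing import Callable


class PerfectLogitCache:
    """Illustrative sketch: cache pre-computed balanced routing logits per
    batch size so repeated iterations incur no regeneration overhead."""

    def __init__(self, builder: Callable[[int], object]):
        self.builder = builder          # callable(num_tokens) -> logits
        self._cache: dict[int, object] = {}
        self.builds = 0                 # how many times logits were generated

    def get(self, num_tokens: int):
        """Return cached logits for this batch size, building them on first use."""
        if num_tokens not in self._cache:
            self._cache[num_tokens] = self.builder(num_tokens)
            self.builds += 1
        return self._cache[num_tokens]
```

A real implementation would pre-warm the cache with common batch sizes before the profiled iteration window so logit generation never appears inside the capture range.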