Abstract

This NVIDIA developer guide documents how to integrate NVIDIA Nsight Systems profiling into TensorRT-LLM workloads for precise, low-overhead performance analysis. The core mechanism is toggling the CUDA profiler runtime API on and off via the TLLM_PROFILE_START_STOP=A-B environment variable combined with Nsight’s -c cudaProfilerApi flag, which limits capture to a specific iteration range and produces smaller, more interpretable profile files. A complementary feature, ENABLE_PERFECT_ROUTER, supports Mixture-of-Experts (MoE) expert load-balance analysis by replacing the learned router's output with pre-computed, perfectly balanced logits, isolating routing imbalance from kernel performance without affecting the timing realism of other operations.

Key Concepts

  • TLLM_PROFILE_START_STOP=A-B: Environment variable that limits Nsight profiling to iterations A through B. Produces smaller files and makes the profiled region unambiguous. Use with -c cudaProfilerApi in the Nsight invocation.
  • -c cudaProfilerApi: Nsight flag that activates/deactivates the profiler at CUDA profiler API boundaries set by TLLM_PROFILE_START_STOP.
  • TLLM_LLMAPI_ENABLE_NVTX=1: Enables NVTX markers. The PyTorch workflow supports this by default; the C++/TensorRT workflow additionally requires building with the --nvtx flag passed to scripts/build_wheel.py.
  • TLLM_PROFILE_RECORD_GC=1: Enables garbage collection NVTX markers for diagnosing Python GC pauses in the Nsight Systems timeline.
  • TLLM_TORCH_PROFILE_TRACE=<file>.json: Exports a PyTorch profiler trace (.json) alongside the Nsight Systems report (.nsys-rep) for dual-tool analysis.
  • ENABLE_PERFECT_ROUTER: Environment variable (=1) that bypasses the learned MoE router and replaces its output with pre-computed, perfectly load-balanced routing logits. The gate computation still runs (maintaining realistic timing), but its output is discarded. For performance analysis only — produces incorrect model outputs.
  • MoE Expert Load Balance Analysis: Comparison workflow: (1) benchmark with normal routing, (2) benchmark with ENABLE_PERFECT_ROUTER=1, (3) compare throughput. A >10% improvement with the perfect router indicates that routing imbalance is a significant bottleneck.
  • RenormalizeMoeRoutingMethod: TopK-first-then-Softmax routing method for which the perfect router logits are designed. Used by GPT-OSS.
  • DeepSeekV3MoeRoutingMethod: DeepSeek-V3/R1’s custom routing method; perfect router logit generation logic is adapted for this method in the supported implementation.
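
The iteration-gated capture described above can be sketched as follows. This is a hypothetical illustration of the mechanism, not TensorRT-LLM's actual implementation; the helper names (parse_profile_range, IterationGate) are invented for the example. The key idea is that cudaProfilerStart/cudaProfilerStop runtime calls at the range boundaries are exactly what `nsys -c cudaProfilerApi` waits for.

```python
import os

def parse_profile_range(value):
    """Parse an 'A-B' string (TLLM_PROFILE_START_STOP format) into (A, B)."""
    a, b = value.split("-")
    return int(a), int(b)

class IterationGate:
    """Hypothetical sketch: fire start/stop callbacks at the iteration bounds.

    In real code the callbacks would wrap the CUDA profiler runtime API,
    e.g. torch.cuda.profiler.start() / torch.cuda.profiler.stop(), which
    `nsys -c cudaProfilerApi` uses to begin and end capture.
    """

    def __init__(self, start, stop, on_start, on_stop):
        self.start, self.stop = start, stop
        self.on_start, self.on_stop = on_start, on_stop
        self.active = False

    def step(self, iteration):
        if iteration == self.start and not self.active:
            self.on_start()          # -> cudaProfilerStart(): capture begins
            self.active = True
        elif iteration > self.stop and self.active:
            self.on_stop()           # -> cudaProfilerStop(): capture ends
            self.active = False
        return self.active

# Example: profile iterations 100 through 150 only (inclusive).
start, stop = parse_profile_range(os.environ.get("TLLM_PROFILE_START_STOP", "100-150"))
events = []
gate = IterationGate(start, stop,
                     on_start=lambda: events.append("start"),
                     on_stop=lambda: events.append("stop"))
captured = [it for it in range(200) if gate.step(it)]
```

Because only iterations 100-150 run between the start/stop calls, the resulting .nsys-rep covers 51 iterations instead of the whole run, which is what keeps the profile small and the region unambiguous.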

Key Claims and Findings

  • ENABLE_PERFECT_ROUTER works with all MoE backends: CUTLASS, TRTLLM, and TRITON.
  • Currently supported models for perfect routing: GPT-OSS (RenormalizeMoeRoutingMethod), DeepSeek-V3 / DeepSeek-R1 (DeepSeekV3MoeRoutingMethod).
  • Adding perfect routing support to a new model is model-specific plumbing: the integration must follow the pattern used in the existing implementations.
  • The perfect router caches pre-computed logits for common batch sizes to minimize overhead.
  • Profiling output: Nsight Systems report saved to <name>.nsys-rep; PyTorch profiler trace to <name>.json.
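
The perfectly balanced logits can be illustrated with a small sketch. This is a hypothetical reconstruction of the idea for a TopK-then-Softmax (Renormalize-style) router, not the library's actual code; perfect_router_logits is an invented name. Token top-k slots are assigned round-robin so every expert receives an equal share.

```python
def perfect_router_logits(num_tokens, num_experts, top_k):
    """Hypothetical sketch: emit logits whose top-k picks rotate round-robin
    across experts, so each expert serves num_tokens * top_k / num_experts
    token slots (perfect load balance)."""
    assert top_k <= num_experts
    logits = [[0.0] * num_experts for _ in range(num_tokens)]
    slot = 0
    for t in range(num_tokens):
        for rank in range(top_k):
            expert = slot % num_experts
            # Descending positive scores mark this token's top-k experts.
            logits[t][expert] = float(top_k - rank)
            slot += 1
    return logits

# With 8 tokens, 4 experts, top_k=2, each expert is picked 8*2/4 = 4 times.
logits = perfect_router_logits(num_tokens=8, num_experts=4, top_k=2)
loads = [0] * 4
for row in logits:
    for expert in sorted(range(4), key=row.__getitem__, reverse=True)[:2]:
        loads[expert] += 1
```

In the real feature these logits replace the learned gate's output (the gate still runs for timing realism), and the guide notes they are cached per common batch size; the sketch omits caching.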

Environment Variables Reference

| Variable | Value | Effect |
|---|---|---|
| TLLM_PROFILE_START_STOP | A-B (e.g. 100-150) | Profile only iterations A through B |
| TLLM_LLMAPI_ENABLE_NVTX | 1 | Enable NVTX markers (PyTorch workflow) |
| TLLM_PROFILE_RECORD_GC | 1 | Add GC NVTX markers |
| TLLM_TORCH_PROFILE_TRACE | <filename>.json | Export PyTorch profiler trace |
| ENABLE_PERFECT_ROUTER | 1 | Bypass learned MoE router with balanced logits |
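
The non-range variables are forwarded to the profiled process through nsys's -e option as a comma-separated KEY=VALUE list. A minimal sketch of assembling that string (the dict contents mirror the trtllm-bench example invocation in this guide):

```python
# Sketch: build the comma-separated KEY=VALUE string passed to `nsys -e`.
# Values mirror the trtllm-bench example invocation in this guide.
profiling_env = {
    "TLLM_PROFILE_RECORD_GC": "1",
    "TLLM_LLMAPI_ENABLE_NVTX": "1",
    "TLLM_TORCH_PROFILE_TRACE": "trace.json",
}
env_arg = ",".join(f"{key}={value}" for key, value in profiling_env.items())
```

Note that TLLM_PROFILE_START_STOP is set on the nsys command itself (it must be visible before the workload launches), while the rest ride along via -e.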

Example: Profiling trtllm-bench (Iterations 100–150)

TLLM_PROFILE_START_STOP=100-150 nsys profile \
  -o trace -f true \
  -t 'cuda,nvtx,python-gil' -c cudaProfilerApi \
  --cuda-graph-trace node \
  -e TLLM_PROFILE_RECORD_GC=1,TLLM_LLMAPI_ENABLE_NVTX=1,TLLM_TORCH_PROFILE_TRACE=trace.json \
  --trace-fork-before-exec=true \
  trtllm-bench \
    --model deepseek-ai/DeepSeek-V3 \
    --model_path ${MODEL_PATH} \
    throughput \
    --dataset /tmp/dataset.txt --warmup 0 \
    --backend pytorch \
    --streaming

Interpreting MoE Router Results

| Scenario | Interpretation |
|---|---|
| Similar performance with/without ENABLE_PERFECT_ROUTER | Router load balancing is not a bottleneck; optimize elsewhere |
| >10% improvement with ENABLE_PERFECT_ROUTER | Learned router is causing load imbalance; consider router optimization or load balancing strategies |
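
The decision rule in the table above is simple enough to express directly. A minimal sketch (routing_bottleneck is an invented helper name; the 10% threshold comes from the comparison workflow described earlier):

```python
def routing_bottleneck(baseline_tps, perfect_tps, threshold=0.10):
    """Apply the >10% rule: compare throughput with the learned router
    (baseline) vs. ENABLE_PERFECT_ROUTER=1. Returns (is_bottleneck, gain)."""
    gain = (perfect_tps - baseline_tps) / baseline_tps
    return gain > threshold, gain

# 1200 tok/s with the perfect router vs. 1000 baseline: a 20% gain,
# so routing imbalance would be flagged as a significant bottleneck.
flagged, gain = routing_bottleneck(1000.0, 1200.0)
```

Remember that ENABLE_PERFECT_ROUTER produces incorrect model outputs, so both runs are performance-only benchmarks, not accuracy checks.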

Connections to Existing Wiki Pages

Cross-Section Pages