Abstract

This NVIDIA developer guide documents how to integrate NVIDIA Nsight Systems profiling into TensorRT-LLM workloads for precise, low-overhead performance analysis. The core mechanism is toggling the CUDA profiler runtime API on and off via the TLLM_PROFILE_START_STOP=A-B environment variable combined with Nsight’s -c cudaProfilerApi flag, which limits capture to a specific iteration range and produces smaller, more interpretable profile files. A complementary feature, ENABLE_PERFECT_ROUTER, supports Mixture-of-Experts (MoE) expert load-balance analysis by replacing the learned router's output with pre-computed, perfectly balanced logits, isolating routing imbalance from kernel performance without affecting the timing realism of other operations.

Key Concepts

  • TLLM_PROFILE_START_STOP=A-B: Environment variable that limits Nsight profiling to iterations A through B. Produces smaller files and makes the profiled region unambiguous. Use with -c cudaProfilerApi in the Nsight invocation.
  • -c cudaProfilerApi: Nsight flag that activates/deactivates the profiler at CUDA profiler API boundaries set by TLLM_PROFILE_START_STOP.
  • TLLM_LLMAPI_ENABLE_NVTX=1: Enables NVTX markers. The PyTorch workflow supports this by default; the C++/TensorRT workflow additionally requires building with the --nvtx flag passed to scripts/build_wheel.py.
  • TLLM_PROFILE_RECORD_GC=1: Enables garbage collection NVTX markers for diagnosing Python GC pauses in the Nsight Systems timeline.
  • TLLM_TORCH_PROFILE_TRACE=<file>.json: Exports a PyTorch profiler trace (.json) alongside the Nsight Systems report (.nsys-rep) for dual-tool analysis.
  • ENABLE_PERFECT_ROUTER: Environment variable (=1) that bypasses the learned MoE router and replaces its output with pre-computed, perfectly load-balanced routing logits. The gate computation still runs (maintaining realistic timing), but its output is discarded. For performance analysis only — produces incorrect model outputs.
  • MoE Expert Load Balance Analysis: Comparison workflow: (1) benchmark with normal routing, (2) benchmark with ENABLE_PERFECT_ROUTER=1, (3) compare throughput. A >10% improvement with the perfect router indicates that routing imbalance is a significant bottleneck.
  • RenormalizeMoeRoutingMethod: TopK-first-then-Softmax routing method for which the perfect router logits are designed. Used by GPT-OSS.
  • DeepSeekV3MoeRoutingMethod: DeepSeek-V3/R1’s custom routing method; perfect router logit generation logic is adapted for this method in the supported implementation.
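
The iteration-gated capture described above can be sketched as follows. This is a hypothetical illustration of the mechanism, not TensorRT-LLM's actual implementation; the helper names (parse_profile_range, IterationGate) are invented for the example. The key idea is that cudaProfilerStart/cudaProfilerStop runtime calls at the range boundaries are exactly what `nsys -c cudaProfilerApi` waits for.

```python
import os

def parse_profile_range(value):
    """Parse an 'A-B' string (TLLM_PROFILE_START_STOP format) into (A, B)."""
    a, b = value.split("-")
    return int(a), int(b)

class IterationGate:
    """Hypothetical sketch: fire start/stop callbacks at the iteration bounds.

    In real code the callbacks would wrap the CUDA profiler runtime API,
    e.g. torch.cuda.profiler.start() / torch.cuda.profiler.stop(), which
    `nsys -c cudaProfilerApi` uses to begin and end capture.
    """

    def __init__(self, start, stop, on_start, on_stop):
        self.start, self.stop = start, stop
        self.on_start, self.on_stop = on_start, on_stop
        self.active = False

    def step(self, iteration):
        if iteration == self.start and not self.active:
            self.on_start()          # -> cudaProfilerStart(): capture begins
            self.active = True
        elif iteration > self.stop and self.active:
            self.on_stop()           # -> cudaProfilerStop(): capture ends
            self.active = False
        return self.active

# Example: profile iterations 100 through 150 only (inclusive).
start, stop = parse_profile_range(os.environ.get("TLLM_PROFILE_START_STOP", "100-150"))
events = []
gate = IterationGate(start, stop,
                     on_start=lambda: events.append("start"),
                     on_stop=lambda: events.append("stop"))
captured = [it for it in range(200) if gate.step(it)]
```

Because only iterations 100-150 run between the start/stop calls, the resulting .nsys-rep covers 51 iterations instead of the whole run, which is what keeps the profile small and the region unambiguous.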

Key Claims and Findings

  • ENABLE_PERFECT_ROUTER works with all MoE backends: CUTLASS, TRTLLM, and TRITON.
  • Currently supported models for perfect routing: GPT-OSS (RenormalizeMoeRoutingMethod), DeepSeek-V3 / DeepSeek-R1 (DeepSeekV3MoeRoutingMethod).
  • Adding perfect routing support to a new model is model-specific plumbing: the integration must follow the pattern used in the existing implementations.
  • The perfect router caches pre-computed logits for common batch sizes to minimize overhead.
  • Profiling output: Nsight Systems report saved to <name>.nsys-rep; PyTorch profiler trace to <name>.json.
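
The perfectly balanced logits can be illustrated with a small sketch. This is a hypothetical reconstruction of the idea for a TopK-then-Softmax (Renormalize-style) router, not the library's actual code; perfect_router_logits is an invented name. Token top-k slots are assigned round-robin so every expert receives an equal share.

```python
def perfect_router_logits(num_tokens, num_experts, top_k):
    """Hypothetical sketch: emit logits whose top-k picks rotate round-robin
    across experts, so each expert serves num_tokens * top_k / num_experts
    token slots (perfect load balance)."""
    assert top_k <= num_experts
    logits = [[0.0] * num_experts for _ in range(num_tokens)]
    slot = 0
    for t in range(num_tokens):
        for rank in range(top_k):
            expert = slot % num_experts
            # Descending positive scores mark this token's top-k experts.
            logits[t][expert] = float(top_k - rank)
            slot += 1
    return logits

# With 8 tokens, 4 experts, top_k=2, each expert is picked 8*2/4 = 4 times.
logits = perfect_router_logits(num_tokens=8, num_experts=4, top_k=2)
loads = [0] * 4
for row in logits:
    for expert in sorted(range(4), key=row.__getitem__, reverse=True)[:2]:
        loads[expert] += 1
```

In the real feature these logits replace the learned gate's output (the gate still runs for timing realism), and the guide notes they are cached per common batch size; the sketch omits caching.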

Environment Variables Reference

| Variable | Value | Effect |
|---|---|---|
| TLLM_PROFILE_START_STOP | A-B (e.g. 100-150) | Profile only iterations A through B |
| TLLM_LLMAPI_ENABLE_NVTX | 1 | Enable NVTX markers (PyTorch workflow) |
| TLLM_PROFILE_RECORD_GC | 1 | Add GC NVTX markers |
| TLLM_TORCH_PROFILE_TRACE | <filename>.json | Export PyTorch profiler trace |
| ENABLE_PERFECT_ROUTER | 1 | Bypass learned MoE router with balanced logits |
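
The non-range variables are forwarded to the profiled process through nsys's -e option as a comma-separated KEY=VALUE list. A minimal sketch of assembling that string (the dict contents mirror the trtllm-bench example invocation in this guide):

```python
# Sketch: build the comma-separated KEY=VALUE string passed to `nsys -e`.
# Values mirror the trtllm-bench example invocation in this guide.
profiling_env = {
    "TLLM_PROFILE_RECORD_GC": "1",
    "TLLM_LLMAPI_ENABLE_NVTX": "1",
    "TLLM_TORCH_PROFILE_TRACE": "trace.json",
}
env_arg = ",".join(f"{key}={value}" for key, value in profiling_env.items())
```

Note that TLLM_PROFILE_START_STOP is set on the nsys command itself (it must be visible before the workload launches), while the rest ride along via -e.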

Example: Profiling trtllm-bench (Iterations 100–150)

TLLM_PROFILE_START_STOP=100-150 nsys profile \
  -o trace -f true \
  -t 'cuda,nvtx,python-gil' -c cudaProfilerApi \
  --cuda-graph-trace node \
  -e TLLM_PROFILE_RECORD_GC=1,TLLM_LLMAPI_ENABLE_NVTX=1,TLLM_TORCH_PROFILE_TRACE=trace.json \
  --trace-fork-before-exec=true \
  trtllm-bench \
    --model deepseek-ai/DeepSeek-V3 \
    --model_path ${MODEL_PATH} \
    throughput \
    --dataset /tmp/dataset.txt --warmup 0 \
    --backend pytorch \
    --streaming

Interpreting MoE Router Results

| Scenario | Interpretation |
|---|---|
| Similar performance with/without ENABLE_PERFECT_ROUTER | Router load balancing is not a bottleneck; optimize elsewhere |
| >10% improvement with ENABLE_PERFECT_ROUTER | Learned router is causing load imbalance; consider router optimization or load balancing strategies |
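
The decision rule in the table above is simple enough to express directly. A minimal sketch (routing_bottleneck is an invented helper name; the 10% threshold comes from the comparison workflow described earlier):

```python
def routing_bottleneck(baseline_tps, perfect_tps, threshold=0.10):
    """Apply the >10% rule: compare throughput with the learned router
    (baseline) vs. ENABLE_PERFECT_ROUTER=1. Returns (is_bottleneck, gain)."""
    gain = (perfect_tps - baseline_tps) / baseline_tps
    return gain > threshold, gain

# 1200 tok/s with the perfect router vs. 1000 baseline: a 20% gain,
# so routing imbalance would be flagged as a significant bottleneck.
flagged, gain = routing_bottleneck(1000.0, 1200.0)
```

Remember that ENABLE_PERFECT_ROUTER produces incorrect model outputs, so both runs are performance-only benchmarks, not accuracy checks.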

Connections to Existing Wiki Pages

Cross-Section Pages