Cross-section page — full summary at Performance Analysis — TensorRT LLM.
NVIDIA Platform Angle
TensorRT-LLM has first-class Nsight Systems integration built into its CLI tooling (trtllm-bench, trtllm-serve), making profiling a native part of the NVIDIA inference platform rather than a separate instrumentation step. This page covers TensorRT-LLM’s profiling capabilities as components of the NVIDIA platform stack.
Nsight Systems Integration Points in TensorRT-LLM
TensorRT-LLM exposes three profiling mechanisms:
- CUDA profiler API gating (`TLLM_PROFILE_START_STOP`): Rather than profiling entire multi-hour inference runs, users specify iteration ranges. Nsight Systems' `-c cudaProfilerApi` mode activates only when TensorRT-LLM turns the CUDA profiler on. This tightly scopes the profile to the region of interest.
- NVTX markers: The PyTorch workflow provides NVTX markers by default (`TLLM_LLMAPI_ENABLE_NVTX=1`). The C++/TensorRT workflow requires building with `--nvtx`. GC NVTX markers (`TLLM_PROFILE_RECORD_GC=1`) are also available for GIL/garbage-collection diagnostics. These markers appear in the Nsight Systems timeline, labeling inference passes, batches, and GC events.
- PyTorch profiler export (`TLLM_TORCH_PROFILE_TRACE`): Produces a `.json` trace alongside the `.nsys-rep` Nsight file, enabling dual-tool analysis for the PyTorch workflow.
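A typical invocation combining these mechanisms might look like the following sketch. The environment variables come from the list above; the benchmark arguments, iteration range, and output names are illustrative placeholders, not prescribed defaults:

```shell
# Hedged sketch: gate Nsight Systems capture to a fixed iteration range of a
# trtllm-bench run. Model/dataset arguments are illustrative placeholders.
export TLLM_PROFILE_START_STOP="100-150"    # iteration range to profile
export TLLM_LLMAPI_ENABLE_NVTX=1            # NVTX markers (PyTorch workflow)
export TLLM_TORCH_PROFILE_TRACE=trace.json  # also emit a PyTorch .json trace

# -c cudaProfilerApi: capture starts/stops when TensorRT-LLM toggles the
# CUDA profiler, so only the requested iterations land in the report.
nsys profile -c cudaProfilerApi -o trtllm_profile \
    trtllm-bench --model <model> throughput --dataset <dataset>
```

The result is a `trtllm_profile.nsys-rep` file covering only the gated iterations, plus the optional PyTorch trace for side-by-side analysis.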
ENABLE_PERFECT_ROUTER — MoE Platform Feature
For Mixture-of-Experts models (DeepSeek-V3/R1, GPT-OSS), TensorRT-LLM exposes `ENABLE_PERFECT_ROUTER=1` to isolate expert load imbalance from kernel performance. The feature is model-specific: it must be integrated into each MoE model implementation. Currently supported on the NVIDIA platform:
- GPT-OSS: uses `RenormalizeMoeRoutingMethod` (TopK + Softmax)
- DeepSeek-V3 / DeepSeek-R1: uses `DeepSeekV3MoeRoutingMethod`
The MoE backends it works with (CUTLASS, TRTLLM, TRITON) are all NVIDIA-platform backends.
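In practice, isolating imbalance means running the same benchmark twice, with and without perfect routing, and comparing the profiles. The sketch below assumes the `trtllm-bench` invocation style shown elsewhere on this page; the model and dataset arguments are placeholders:

```shell
# Hedged sketch: compare a normal MoE run against a perfectly balanced one.
# Any throughput delta between the two runs isolates expert load imbalance
# from raw kernel performance. Arguments are illustrative placeholders.

# Baseline: real learned routing (imbalance included)
trtllm-bench --model <moe-model> throughput --dataset <dataset>

# Balanced: router output replaced with an even expert assignment
ENABLE_PERFECT_ROUTER=1 \
    trtllm-bench --model <moe-model> throughput --dataset <dataset>
```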
Profiling Workflow for NVIDIA Inference Stack
```
trtllm-bench / trtllm-serve   ← application-level throughput/latency
        ↓  (TLLM_PROFILE_START_STOP + nsys)
Nsight Systems                ← system-wide timeline (CPU/GPU/NVTX)
        ↓  (targeted kernel)
Nsight Compute                ← per-kernel deep metrics
```
This mirrors the general NVIDIA profiling hierarchy described in NVIDIA Nsight Systems (NVIDIA Platform).
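Following that hierarchy, a hot kernel identified in the Nsight Systems timeline can be re-profiled in isolation with Nsight Compute. In this sketch the kernel-name regex, launch count, and benchmark arguments are assumptions to adapt, not fixed values:

```shell
# Hedged sketch: after spotting a hot kernel in the nsys timeline, collect
# the full per-kernel metric set for it with Nsight Compute (ncu).
# The kernel regex and launch count are placeholders.
ncu --set full \
    --kernel-name "regex:<kernel_name>" \
    --launch-count 3 \
    -o kernel_report \
    trtllm-bench --model <model> throughput --dataset <dataset>
```

Limiting the launch count keeps the replay overhead of Nsight Compute's full metric set manageable on long-running inference workloads.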
Connections to NVIDIA Platform Stack
- NVIDIA Nsight Systems (NVIDIA Platform) — the profiler tool that TensorRT-LLM’s environment variables control.
- Scaling LLMs with Triton and TensorRT-LLM (NVIDIA Platform) — TensorRT-LLM engine building and Triton deployment; the operational context in which this profiling workflow is applied.