Troubleshooting — TensorRT-LLM
Abstract
The official NVIDIA TensorRT-LLM troubleshooting reference covering debugging workflows for model execution errors, input/output shape mismatches, plugin failures, and MPI/Slurm environment interference. Two complementary debugging pathways are described: unit-test-level debugging using register_network_output() to expose named intermediate tensors as inspectable network outputs, and end-to-end model debugging using --enable_debug_output at engine build time combined with --debug_mode --use_py_session at runtime. Execution errors divide into three categories: plugin-related failures (diagnosed by setting CUDA_LAUNCH_BLOCKING=1 for synchronous error reporting), input tensor shape mismatches between build-time optimization profiles and runtime inputs (diagnosed via TLLM_LOG_LEVEL=TRACE, which surfaces full tensor shape tables before each forward pass), and MPI/Slurm environment conflicts (resolved by prefixing commands with mpirun -n 1). Memory-related build failures can be addressed by reducing maximum batch size or input/output lengths, or by enabling plugins. The guide also documents multi-GPU NCCL requirements (--shm-size=1g --ulimit memlock=-1) and the semantics of mpirun -n 1 under Slurm: the flag initializes the MPI environment for TensorRT-LLM's internal process management and never specifies the number of GPU processes.
Key Concepts
- `register_network_output(name, tensor)` API: Module method called in `forward()` to register an intermediate tensor as a named debug output; the registered name appears as a key in `debug_buffer` at runtime; enables layer-level output inspection without modifying inference logic
- `--enable_debug_output`: `trtllm-build` flag that compiles the engine with debug output capability; required for intermediate tensor access at inference time
- `TLLM_LOG_LEVEL=TRACE`: environment variable enabling verbose logging; outputs full engine I/O tensor shape tables (name, type, shape) and optimization profile bounds (min/opt/max per input) before the first forward pass, and actual runtime shapes before each subsequent forward pass; the primary tool for diagnosing shape mismatch errors
- `CUDA_LAUNCH_BLOCKING=1`: environment variable forcing synchronous CUDA kernel execution; makes plugin errors surface immediately at the point of failure rather than being deferred to the next CPU-GPU synchronization
- Optimization Profile: build-time specification of minimum, optimal, and maximum input tensor shapes; runtime inputs whose shapes fall outside the profile bounds cause shape assertion errors
- MPI/Slurm Conflict: TensorRT-LLM uses `mpi4py` internally; running under Slurm `srun` without correct PMI configuration causes `PMI2_Init failed` errors or hangs; resolved by prefixing the command with `mpirun -n 1` regardless of GPU count
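The registration-and-inspection pattern behind the first two concepts can be sketched with a minimal mock; `MockModule` and `MockMLP` are illustrative stand-ins written for this sketch, not classes from the tensorrt_llm library:

```python
class MockModule:
    """Minimal stand-in for a tensorrt_llm Module (illustrative, not the
    library class): records tensors passed to register_network_output()
    so their names can later key a debug buffer."""

    def __init__(self):
        self._network_outputs = {}

    def register_network_output(self, name, tensor):
        # In TensorRT-LLM this marks `tensor` as a named debug output;
        # here we simply remember it under `name`.
        self._network_outputs[name] = tensor

    def named_network_outputs(self):
        return self._network_outputs.items()


class MockMLP(MockModule):
    def forward(self, x):
        inter = [v * 2 for v in x]                    # intermediate activation
        self.register_network_output('inter', inter)  # expose it for debugging
        return [v + 1 for v in inter]


m = MockMLP()
out = m.forward([1, 2, 3])
print(dict(m.named_network_outputs()))  # {'inter': [2, 4, 6]}
```

The point of the pattern is that inference logic is untouched: the forward pass still returns its normal result, while the registered name becomes a lookup key for later inspection.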
Key Claims and Findings
- "Could not find any supported formats consistent with input/output data types" build errors are memory-related; solutions are reducing max batch size/sequence lengths or enabling plugins (e.g. `--gpt_attention_plugin`)
- Shape mismatch errors ("Expected (-1,-1,-1), got [8,16]") indicate a build/runtime configuration mismatch; `TLLM_LOG_LEVEL=TRACE` exposes the full optimization profile table to identify the violated bound
- On Slurm, `mpirun -n 1` is always correct regardless of GPU count; TensorRT-LLM manages GPU parallelism internally, and `mpirun` here is purely for PMI environment initialization
- `--shm-size=1g --ulimit memlock=-1` are required Docker flags to prevent NCCL errors in multi-GPU inference; omitting them causes silent or cryptic failures
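The two environment claims above can be sketched as shell invocations; the container image placeholder and the `docker run` shape are illustrative, while the flags and the `mpirun -n 1` prefix come from this guide:

```shell
# Multi-GPU inference: give the container enough shared and locked memory
# so NCCL can allocate its buffers (flags per the guide; <tensorrt-llm-image>
# is a placeholder for whatever image you actually use).
docker run --gpus all --shm-size=1g --ulimit memlock=-1 \
    <tensorrt-llm-image> bash

# Under Slurm, always a single MPI rank: TensorRT-LLM manages GPU
# parallelism internally, so -n 1 is correct even with multiple GPUs.
mpirun -n 1 python3 run.py --engine_dir gpt2/trt_engines/fp16/1-gpu
```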
Debugging Workflows
Unit Test Debugging
```python
# 1. Register intermediate tensor in forward()
self.register_network_output('inter', inter)

# 2. Mark as network output
for k, v in gm.named_network_outputs():
    net._mark_output(v, k, dtype)

# 3. Print at runtime
print(outputs['inter'])
```
End-to-End Model Debugging
```shell
# Build with debug output enabled
trtllm-build \
    --checkpoint_dir gpt2/trt_ckpt/fp16/1-gpu \
    --enable_debug_output \
    --output_dir gpt2/trt_engines/fp16/1-gpu

# Run with debug mode
python3 run.py \
    --engine_dir gpt2/trt_engines/fp16/1-gpu \
    --debug_mode \
    --use_py_session
```
At runtime, `self.debug_buffer['transformer.layers.N.mlp_output']` contains the intermediate tensor for each registered layer.
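The `debug_buffer` lookup can be sketched against a mocked buffer; the `transformer.layers.N.<name>` key pattern is from this guide, while the nested-list "tensors" and the `summarize` helper are illustrative stand-ins:

```python
# Hypothetical stand-in for self.debug_buffer after one forward step;
# keys follow the 'transformer.layers.N.<name>' pattern described above.
debug_buffer = {
    f"transformer.layers.{i}.mlp_output": [[0.0] * 16 for _ in range(8)]
    for i in range(2)
}


def summarize(buffer):
    """Return (name, rows, cols) for every registered debug tensor."""
    return [
        (name, len(t), len(t[0]))
        for name, t in sorted(buffer.items())
    ]


for name, rows, cols in summarize(debug_buffer):
    print(f"{name}: shape=({rows}, {cols})")
```

Dumping names and shapes like this is often the quickest way to spot which layer first produces an unexpected shape or value.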
Error Diagnostics Reference
| Error Pattern | Likely Cause | Solution |
|---|---|---|
| "Could not find any supported formats" | Memory exhaustion during build | Reduce max batch size / sequence length; enable `--gpt_attention_plugin` |
| "Tensor 'X' has invalid shape" | Build/runtime shape mismatch | Check optimization profile with `TLLM_LOG_LEVEL=TRACE` |
| "Sizes of tensors must match except in dimension 0" | Mismatched configuration between build and run | Verify `max_batch_size`, `max_input_len`, `max_output_len` consistency |
| `PMI2_Init failed` / program hangs on Slurm | MPI/Slurm PMI conflict | Prefix command with `mpirun -n 1` |
| NCCL errors in multi-GPU | Insufficient shared / locked memory | Add `--shm-size=1g --ulimit memlock=-1` to the `docker run` command |
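A typical re-run of a failing engine with the two diagnostic environment variables from the table might look like this; the engine path reuses the example from the build/run workflow above:

```shell
# Surface plugin errors synchronously (CUDA_LAUNCH_BLOCKING=1) and dump
# tensor shape tables before each forward pass (TLLM_LOG_LEVEL=TRACE).
CUDA_LAUNCH_BLOCKING=1 TLLM_LOG_LEVEL=TRACE \
    python3 run.py \
    --engine_dir gpt2/trt_engines/fp16/1-gpu \
    --debug_mode \
    --use_py_session
```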
Terminology
- `trtllm-build`: CLI for compiling TensorRT-LLM checkpoint directories into optimized TensorRT engine files
- `debug_buffer`: runtime dictionary keyed by registered output names containing intermediate tensor values after each forward step
- `--use_py_session`: runtime flag selecting the Python session implementation; required for debug mode (the C++ session does not support it)
- NCCL: NVIDIA Collective Communications Library; handles multi-GPU tensor communication; requires sufficient shared memory (`--shm-size`) and unlimited locked memory (`--ulimit memlock=-1`)
- `TLLM_LOG_LEVEL`: TensorRT-LLM logging verbosity environment variable; levels include INFO, WARNING, ERROR, VERBOSE, TRACE
Connections to Existing Wiki Pages
- Performance Analysis — TensorRT-LLM: the debug output mechanism (`register_network_output`, `debug_buffer`) is the same instrumentation pathway used for performance profiling; troubleshooting identifies which layers produce incorrect values, performance analysis identifies which layers are slow
- Performance Tuning Guide — Megatron-Bridge LLM Training: that guide addresses build-time configuration choices (batch size, sequence length, plugin selection) that directly determine whether the build-time memory errors described here occur