Troubleshooting — TensorRT-LLM

Abstract

The official NVIDIA TensorRT-LLM troubleshooting reference, covering debugging workflows for model execution errors, input/output shape mismatches, plugin failures, and MPI/Slurm environment interference. Two complementary debugging pathways are described: unit-test-level debugging using register_network_output() to expose named intermediate tensors as inspectable network outputs, and end-to-end model debugging using --enable_debug_output at engine build time combined with --debug_mode --use_py_session at runtime. Execution errors divide into three categories: plugin-related failures (diagnosed by setting CUDA_LAUNCH_BLOCKING=1 for synchronous error reporting), input tensor shape mismatches between build-time optimization profiles and runtime inputs (diagnosed via TLLM_LOG_LEVEL=TRACE, which surfaces full tensor shape tables before each forward pass), and MPI/Slurm environment conflicts (resolved by prefixing commands with mpirun -n 1). Memory-related build failures can be addressed by reducing the maximum batch size or input/output lengths, or by enabling plugins. The guide also documents the Docker flags required for multi-GPU NCCL communication (--shm-size=1g --ulimit memlock=-1) and the semantics of mpirun -n 1 under Slurm: it initializes the MPI environment for TensorRT-LLM's internal process management and does not determine the number of GPU processes.


Key Concepts

  • register_network_output(name, tensor) API: Module method called in forward() to register an intermediate tensor as a named debug output; the registered name appears as a key in debug_buffer at runtime; enables layer-level output inspection without modifying inference logic
  • --enable_debug_output: trtllm-build flag that compiles the engine with debug output capability; required for intermediate tensor access at inference time
  • TLLM_LOG_LEVEL=TRACE: Environment variable enabling verbose logging; outputs full engine I/O tensor shape tables (name, type, shape) and optimization profile bounds (min/opt/max per input) before the first forward pass, and actual runtime shapes before each subsequent forward pass — the primary tool for diagnosing shape mismatch errors
  • CUDA_LAUNCH_BLOCKING=1: Environment variable forcing synchronous CUDA kernel execution; makes plugin errors surface immediately at the point of failure rather than being deferred to the next CPU-GPU synchronization (see the usage examples after this list)
  • Optimization Profile: Build-time specification of minimum, optimal, and maximum input tensor shapes; runtime inputs whose shapes fall outside the profile bounds cause shape assertion errors
  • MPI/Slurm Conflict: TensorRT-LLM uses mpi4py internally; running under Slurm srun without correct PMI configuration causes PMI2_Init failed errors or hangs; resolved by mpirun -n 1 regardless of GPU count
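
Both diagnostic environment variables are typically set inline for a single run. For example, reusing the engine path from the workflow below:

# Surface plugin errors at their call site instead of at the
# next CPU-GPU synchronization point
CUDA_LAUNCH_BLOCKING=1 python3 run.py --engine_dir gpt2/trt_engines/fp16/1-gpu

# Dump engine I/O shape tables and optimization profile bounds
TLLM_LOG_LEVEL=TRACE python3 run.py --engine_dir gpt2/trt_engines/fp16/1-gpu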

Key Claims and Findings

  • “Could not find any supported formats consistent with input/output data types” build errors are memory-related; solutions are reducing max batch size/sequence lengths or enabling plugins (e.g. --gpt_attention_plugin)
  • Shape mismatch errors (“Expected (-1,-1,-1), got [8,16]”) indicate build/runtime configuration mismatch — TLLM_LOG_LEVEL=TRACE exposes the full optimization profile table to identify the violated bound
  • On Slurm, mpirun -n 1 is always correct regardless of GPU count — TensorRT-LLM manages GPU parallelism internally; mpirun here is purely for PMI environment initialization
  • --shm-size=1g --ulimit memlock=-1 are required Docker flags to prevent NCCL errors in multi-GPU inference; omitting them causes silent or cryptic failures (a sketch docker run command follows this list)
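
A sketch of a multi-GPU container launch with those flags; the image name and command are placeholders, not from the source:

docker run --rm --gpus all \
    --shm-size=1g \
    --ulimit memlock=-1 \
    <tensorrt_llm_image> <command>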

Debugging Workflows

Unit Test Debugging

# 1. In the module's forward(), register the intermediate tensor
#    under a debug name
self.register_network_output('inter', inter)

# 2. In the test harness, mark every registered tensor as a network
#    output (gm is the module instance, net the network being built)
for k, v in gm.named_network_outputs():
    net._mark_output(v, k, dtype)

# 3. After running the session, read the tensor back by name
print(outputs['inter'])
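
For context, a minimal sketch of how step 1 fits into a module definition, assuming the tensorrt_llm Python API (Module, layers.Linear, functional.relu); the SimpleMLP class and the tensor name 'inter' are illustrative, not from the source:

from tensorrt_llm import Module
from tensorrt_llm.functional import relu
from tensorrt_llm.layers import Linear

class SimpleMLP(Module):  # hypothetical module for illustration
    def __init__(self, hidden_size):
        super().__init__()
        self.fc1 = Linear(hidden_size, 4 * hidden_size)
        self.fc2 = Linear(4 * hidden_size, hidden_size)

    def forward(self, x):
        inter = relu(self.fc1(x))
        # Expose the hidden activation under the name 'inter';
        # after step 2 above it becomes an inspectable engine output
        self.register_network_output('inter', inter)
        return self.fc2(inter)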

End-to-End Model Debugging

# Build with debug output enabled
trtllm-build \
    --checkpoint_dir gpt2/trt_ckpt/fp16/1-gpu \
    --enable_debug_output \
    --output_dir gpt2/trt_engines/fp16/1-gpu
 
# Run with debug mode
python3 run.py \
    --engine_dir gpt2/trt_engines/fp16/1-gpu \
    --debug_mode \
    --use_py_session

At runtime, self.debug_buffer['transformer.layers.N.mlp_output'] (with N the layer index) contains the intermediate tensor for each registered layer.
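
A quick way to survey everything that was registered is to iterate the buffer after a forward step; a sketch, assuming the Python session exposes debug_buffer as a dict of torch tensors as described above:

# List every registered debug tensor with its shape and dtype
for name, tensor in self.debug_buffer.items():
    print(f'{name}: shape={tuple(tensor.shape)} dtype={tensor.dtype}')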


Error Diagnostics Reference

Error Pattern | Likely Cause | Solution
"could not find any supported formats" | Memory exhaustion during build | Reduce max batch size / sequence length; enable --gpt_attention_plugin
"Tensor 'X' has invalid shape" | Build/runtime shape mismatch | Check optimization profile with TLLM_LOG_LEVEL=TRACE
"Sizes of tensors must match except in dimension 0" | Mismatched configuration between build and run | Verify max_batch_size, max_input_len, max_output_len consistency
PMI2_Init failed / program hangs on Slurm | MPI/Slurm PMI conflict | Prefix command with mpirun -n 1
NCCL errors in multi-GPU | Insufficient shared / locked memory | Add --shm-size=1g --ulimit memlock=-1 to the docker run command
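
For the first row, a rebuild with the attention plugin enabled might look like the following sketch; the float16 dtype argument is illustrative, and lowering --max_batch_size or --max_input_len is the alternative fix:

trtllm-build \
    --checkpoint_dir gpt2/trt_ckpt/fp16/1-gpu \
    --gpt_attention_plugin float16 \
    --output_dir gpt2/trt_engines/fp16/1-gpu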

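For the Slurm row, the same single-process prefix applies no matter how many GPUs the engine spans:

# mpirun -n 1 only initializes the PMI/MPI environment;
# TensorRT-LLM manages its per-GPU processes internally
mpirun -n 1 python3 run.py --engine_dir gpt2/trt_engines/fp16/1-gpu
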
Terminology

  • trtllm-build: CLI for compiling TensorRT-LLM checkpoint directories into optimized TensorRT engine files
  • debug_buffer: Runtime dictionary keyed by registered output names containing intermediate tensor values after each forward step
  • --use_py_session: Runtime flag selecting the Python session implementation; required for --debug_mode (the C++ session does not support it)
  • NCCL: NVIDIA Collective Communications Library; handles multi-GPU tensor communication; requires sufficient shared memory (--shm-size) and an unlimited locked-memory limit (--ulimit memlock=-1)
  • TLLM_LOG_LEVEL: TensorRT-LLM logging verbosity environment variable; levels include, in increasing verbosity, ERROR, WARNING, INFO, VERBOSE, TRACE

Connections to Existing Wiki Pages

  • Performance Analysis — TensorRT-LLM — the debug output mechanism (register_network_output, debug_buffer) is the same instrumentation pathway used for performance profiling; troubleshooting identifies which layers produce incorrect values, performance analysis identifies which layers are slow
  • Performance Tuning Guide — Megatron-Bridge LLM Training — that guide addresses build-time configuration choices (batch size, sequence length, plugin selection) that directly determine whether the build-time memory errors described here occur