Troubleshooting — TensorRT-LLM

Abstract

The official NVIDIA TensorRT-LLM troubleshooting reference, covering debugging workflows for model execution errors, input/output shape mismatches, plugin failures, and MPI/Slurm environment interference. Two complementary debugging pathways are described: unit-test-level debugging using register_network_output() to expose named intermediate tensors as inspectable network outputs, and end-to-end model debugging using --enable_debug_output at engine build time combined with --debug_mode --use_py_session at runtime. Execution errors divide into three categories: plugin-related failures (diagnosed by setting CUDA_LAUNCH_BLOCKING=1 for synchronous error reporting), input tensor shape mismatches between build-time optimization profiles and runtime inputs (diagnosed via TLLM_LOG_LEVEL=TRACE, which surfaces full tensor shape tables before each forward pass), and MPI/Slurm environment conflicts (resolved by prefixing commands with mpirun -n 1). Memory-related build failures can be addressed by reducing the maximum batch size or input/output lengths, or by enabling plugins. The guide also documents the Docker flags required for multi-GPU NCCL communication (--shm-size=1g --ulimit memlock=-1) and the semantics of mpirun -n 1 under Slurm: it initializes the MPI environment for TensorRT-LLM's internal process management and does not determine the number of GPU processes.


Key Concepts

  • register_network_output(name, tensor) API: Module method called in forward() to register an intermediate tensor as a named debug output; the registered name appears as a key in debug_buffer at runtime; enables layer-level output inspection without modifying inference logic
  • --enable_debug_output: trtllm-build flag that compiles the engine with debug output capability; required for intermediate tensor access at inference time
  • TLLM_LOG_LEVEL=TRACE: Environment variable enabling verbose logging; outputs full engine I/O tensor shape tables (name, type, shape) and optimization profile bounds (min/opt/max per input) before the first forward pass, and actual runtime shapes before each subsequent forward pass — the primary tool for diagnosing shape mismatch errors
  • CUDA_LAUNCH_BLOCKING=1: Environment variable forcing synchronous CUDA kernel execution; makes plugin errors surface immediately at the point of failure rather than being deferred to the next CPU-GPU synchronization (see the usage examples after this list)
  • Optimization Profile: Build-time specification of minimum, optimal, and maximum input tensor shapes; runtime inputs whose shapes fall outside the profile bounds cause shape assertion errors
  • MPI/Slurm Conflict: TensorRT-LLM uses mpi4py internally; running under Slurm srun without correct PMI configuration causes PMI2_Init failed errors or hangs; resolved by mpirun -n 1 regardless of GPU count
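
Both diagnostic environment variables are typically set inline for a single run. For example, reusing the engine path from the workflow below:

# Surface plugin errors at their call site instead of at the
# next CPU-GPU synchronization point
CUDA_LAUNCH_BLOCKING=1 python3 run.py --engine_dir gpt2/trt_engines/fp16/1-gpu

# Dump engine I/O shape tables and optimization profile bounds
TLLM_LOG_LEVEL=TRACE python3 run.py --engine_dir gpt2/trt_engines/fp16/1-gpu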

Key Claims and Findings

  • “Could not find any supported formats consistent with input/output data types” build errors are memory-related; solutions are reducing max batch size/sequence lengths or enabling plugins (e.g. --gpt_attention_plugin)
  • Shape mismatch errors (“Expected (-1,-1,-1), got [8,16]”) indicate build/runtime configuration mismatch — TLLM_LOG_LEVEL=TRACE exposes the full optimization profile table to identify the violated bound
  • On Slurm, mpirun -n 1 is always correct regardless of GPU count — TensorRT-LLM manages GPU parallelism internally; mpirun here is purely for PMI environment initialization
  • --shm-size=1g --ulimit memlock=-1 are required Docker flags to prevent NCCL errors in multi-GPU inference; omitting them causes silent or cryptic failures (a sketch docker run command follows this list)
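
A sketch of a multi-GPU container launch with those flags; the image name and command are placeholders, not from the source:

docker run --rm --gpus all \
    --shm-size=1g \
    --ulimit memlock=-1 \
    <tensorrt_llm_image> <command>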

Debugging Workflows

Unit Test Debugging

# 1. In the module's forward(), register the intermediate tensor
#    under a debug name
self.register_network_output('inter', inter)

# 2. In the test harness, mark every registered tensor as a network
#    output (gm is the module instance, net the network being built)
for k, v in gm.named_network_outputs():
    net._mark_output(v, k, dtype)

# 3. After running the session, read the tensor back by name
print(outputs['inter'])
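
For context, a minimal sketch of how step 1 fits into a module definition, assuming the tensorrt_llm Python API (Module, layers.Linear, functional.relu); the SimpleMLP class and the tensor name 'inter' are illustrative, not from the source:

from tensorrt_llm import Module
from tensorrt_llm.functional import relu
from tensorrt_llm.layers import Linear

class SimpleMLP(Module):  # hypothetical module for illustration
    def __init__(self, hidden_size):
        super().__init__()
        self.fc1 = Linear(hidden_size, 4 * hidden_size)
        self.fc2 = Linear(4 * hidden_size, hidden_size)

    def forward(self, x):
        inter = relu(self.fc1(x))
        # Expose the hidden activation under the name 'inter';
        # after step 2 above it becomes an inspectable engine output
        self.register_network_output('inter', inter)
        return self.fc2(inter)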

End-to-End Model Debugging

# Build with debug output enabled
trtllm-build \
    --checkpoint_dir gpt2/trt_ckpt/fp16/1-gpu \
    --enable_debug_output \
    --output_dir gpt2/trt_engines/fp16/1-gpu
 
# Run with debug mode
python3 run.py \
    --engine_dir gpt2/trt_engines/fp16/1-gpu \
    --debug_mode \
    --use_py_session

At runtime, self.debug_buffer['transformer.layers.N.mlp_output'] (with N the layer index) contains the intermediate tensor for each registered layer.
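
A quick way to survey everything that was registered is to iterate the buffer after a forward step; a sketch, assuming the Python session exposes debug_buffer as a dict of torch tensors as described above:

# List every registered debug tensor with its shape and dtype
for name, tensor in self.debug_buffer.items():
    print(f'{name}: shape={tuple(tensor.shape)} dtype={tensor.dtype}')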


Error Diagnostics Reference

Error Pattern | Likely Cause | Solution
"could not find any supported formats" | Memory exhaustion during build | Reduce max batch size / sequence length; enable --gpt_attention_plugin
"Tensor 'X' has invalid shape" | Build/runtime shape mismatch | Check optimization profile with TLLM_LOG_LEVEL=TRACE
"Sizes of tensors must match except in dimension 0" | Mismatched configuration between build and run | Verify max_batch_size, max_input_len, max_output_len consistency
PMI2_Init failed / program hangs on Slurm | MPI/Slurm PMI conflict | Prefix command with mpirun -n 1
NCCL errors in multi-GPU | Insufficient shared / locked memory | Add --shm-size=1g --ulimit memlock=-1 to the docker run command
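
For the first row, a rebuild with the attention plugin enabled might look like the following sketch; the float16 dtype argument is illustrative, and lowering --max_batch_size or --max_input_len is the alternative fix:

trtllm-build \
    --checkpoint_dir gpt2/trt_ckpt/fp16/1-gpu \
    --gpt_attention_plugin float16 \
    --output_dir gpt2/trt_engines/fp16/1-gpu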

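For the Slurm row, the same single-process prefix applies no matter how many GPUs the engine spans:

# mpirun -n 1 only initializes the PMI/MPI environment;
# TensorRT-LLM manages its per-GPU processes internally
mpirun -n 1 python3 run.py --engine_dir gpt2/trt_engines/fp16/1-gpu
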
Terminology

  • trtllm-build: CLI for compiling TensorRT-LLM checkpoint directories into optimized TensorRT engine files
  • debug_buffer: Runtime dictionary keyed by registered output names containing intermediate tensor values after each forward step
  • --use_py_session: Runtime flag selecting the Python session implementation; required for --debug_mode (the C++ session does not support it)
  • NCCL: NVIDIA Collective Communications Library; handles multi-GPU tensor communication; requires sufficient shared memory (--shm-size) and an unlimited locked-memory limit (--ulimit memlock=-1)
  • TLLM_LOG_LEVEL: TensorRT-LLM logging verbosity environment variable; levels include, in increasing verbosity, ERROR, WARNING, INFO, VERBOSE, TRACE

Connections to Existing Wiki Pages

  • Performance Analysis — TensorRT-LLM — the debug output mechanism (register_network_output, debug_buffer) is the same instrumentation pathway used for performance profiling; troubleshooting identifies which layers produce incorrect values, performance analysis identifies which layers are slow
  • Performance Tuning Guide — Megatron-Bridge LLM Training — that guide addresses build-time configuration choices (batch size, sequence length, plugin selection) that directly determine whether the build-time memory errors described here occur