Abstract
This document details the architectural design and deployment characteristics of the Jamba 1.5 family of large language models developed by AI21 Labs. The models use a hybrid decoder architecture that interleaves standard Transformer attention layers, Mamba state-space layers, and Mixture of Experts (MoE) routing into unified Jamba blocks. This configuration balances computational efficiency, inference latency, and reasoning accuracy while natively supporting a 256K-token context window and JSON-structured function calling. Designed to fit on a single accelerator GPU and to integrate seamlessly with retrieval-augmented generation, Jamba 1.5 targets enterprise applications that require precise long-context processing and reliable downstream tool execution.
Key Concepts
- Hybrid Transformer-Mamba-MoE decoder architecture
- Sparse Mixture of Experts routing (2 out of 16 experts active per token)
- Fixed 256K token context window (~800 pages of text)
- Attention-to-Mamba layer ratio of 1:7 within Jamba blocks (see the layer-schedule sketch after this list)
- JSON-enabled function calling for structured tool integration
- Long-context RAG compatibility that eliminates the need for continual document chunking
- Single-H100 80 GB GPU deployment footprint
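To make the 1:7 ratio concrete, the sketch below enumerates the mixer type at each of the eight layer positions in a hypothetical Jamba block. This is a minimal illustration in Python: the position of the attention layer within the block and the placement of MoE modules are assumptions for illustration, not the published layout.

```python
# Minimal sketch (assumed layout, not AI21's published one): enumerate the
# mixer type at each of the eight layer positions in one Jamba block,
# honoring the stated 1:7 attention-to-Mamba ratio.

ATTN_PER_BLOCK = 1
MAMBA_PER_BLOCK = 7
BLOCK_DEPTH = ATTN_PER_BLOCK + MAMBA_PER_BLOCK  # eight layers per block

def jamba_block_schedule(attention_index: int = 0) -> list[str]:
    """Return the mixer used at each layer; attention_index is assumed."""
    schedule = ["mamba"] * BLOCK_DEPTH
    schedule[attention_index] = "attention"
    return schedule

for position, mixer in enumerate(jamba_block_schedule()):
    print(f"layer {position}: {mixer}")
```

Stacked blocks thus contain one attention layer for every seven Mamba layers, so the linear-complexity state-space layers carry most of the sequence processing.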
Key Equations and Algorithms
None
Key Claims and Findings
- Interweaving Transformer and Mamba layers balances the high accuracy of attention mechanisms with the linear-complexity, low-latency processing of state-space models.
- Each Jamba block contains eight layers and is engineered to reside entirely within a single NVIDIA H100 80 GB GPU, simplifying distributed training and inference infrastructure.
- The 256K token context window enables direct processing of extensive documents without sliding-window fragmentation, improving RAG retrieval accuracy and relevance.
- MoE routing activates only two experts per token across a pool of sixteen, increasing total parameter capacity without proportionally increasing active compute or memory access.
- Native JSON output and function calling capabilities allow real-time, high-precision execution of complex enterprise workflows such as financial document analysis and retail assistance (see the schema sketch after this list).
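As a concrete illustration of the JSON function-calling claim, the snippet below sketches a hypothetical tool schema and the kind of structured call a model might emit. The schema style follows common JSON-schema conventions; the tool name and fields are invented for illustration and are not AI21's API surface.

```python
import json

# Hypothetical tool definition in a common JSON-schema style; the exact
# schema format AI21's API expects may differ.
get_invoice_total = {
    "name": "get_invoice_total",
    "description": "Return the total amount due for an invoice.",
    "parameters": {
        "type": "object",
        "properties": {
            "invoice_id": {"type": "string"},
            "currency": {"type": "string", "enum": ["USD", "EUR"]},
        },
        "required": ["invoice_id"],
    },
}

# A structured call the model might emit; downstream code can parse it
# directly because the output is valid JSON.
model_output = '{"name": "get_invoice_total", "arguments": {"invoice_id": "INV-1042", "currency": "USD"}}'
call = json.loads(model_output)
assert call["name"] == get_invoice_total["name"]
print(call["arguments"])
```

Because the output is guaranteed-parseable JSON, the orchestration layer can dispatch the named tool directly instead of scraping free-form text.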
Terminology
- Jamba block: A cohesive decoder unit comprising eight interleaved attention, Mamba, and MoE modules that operate as a single processing stage to balance compute and capacity.
- Attention-to-Mamba ratio: The architectural specification dictating the proportion of standard self-attention layers to Mamba sequence modeling layers in a Jamba block (configured as 1:7).
- Sparse expert activation: An MoE inference mechanism wherein a gating network selects exactly two of the sixteen available experts to process each token (see the routing sketch after this list).
- Fixed context window: A static maximum sequence capacity (256K tokens) supported natively by the model architecture, removing the need for external context management during inference.
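For readers who want to see sparse expert activation in code, here is a minimal PyTorch sketch of top-2-of-16 routing under the common softmax-gate assumption; it mirrors widely used MoE implementations rather than AI21's internal code, and the layer shapes are placeholders.

```python
import torch

# Minimal sketch of sparse top-2-of-16 expert routing with a softmax gate;
# this follows common MoE implementations, not AI21's exact code.
NUM_EXPERTS, TOP_K, D_MODEL = 16, 2, 64

gate = torch.nn.Linear(D_MODEL, NUM_EXPERTS, bias=False)
experts = torch.nn.ModuleList(
    torch.nn.Linear(D_MODEL, D_MODEL) for _ in range(NUM_EXPERTS)
)

def moe_forward(x: torch.Tensor) -> torch.Tensor:
    """x: (tokens, d_model). Each token is processed by exactly two experts."""
    logits = gate(x)                                   # (tokens, 16)
    weights, idx = torch.topk(logits, TOP_K, dim=-1)   # pick 2 of 16 per token
    weights = torch.softmax(weights, dim=-1)           # renormalize over the 2
    out = torch.zeros_like(x)
    for k in range(TOP_K):
        for e in range(NUM_EXPERTS):
            mask = idx[:, k] == e                      # tokens routed to expert e
            if mask.any():
                out[mask] += weights[mask, k : k + 1] * experts[e](x[mask])
    return out

tokens = torch.randn(5, D_MODEL)
print(moe_forward(tokens).shape)  # torch.Size([5, 64])
```

Only the two selected experts run per token, which is how total parameter capacity grows without a proportional rise in per-token compute.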
Connections to Existing Wiki Pages
- [[ai-ml/Building_Agentic_AI_Applications_with_LLMs/sec-09-tooling-your-llms]] (Function calling and structured tool integration)
- [[ai-ml/nvidia-certs/Generative AI LLM Exam Study Guide/sec-08-rag]] (Retrieval-augmented generation and context management)
- [[ai-ml/Training-Compute-Optimal-Large-Language-Models]] (MoE scaling and compute-efficient architecture principles)
- [[entities/nvidia]] (Inference microservices hosting and accelerated hardware deployment)
- [[ai-ml/Building_Agentic_AI_Applications_with_LLMs/sec-08-structuring-outputs]] (JSON formatting and structured response generation)