Abstract

This document details the architectural design and deployment characteristics of the Jamba 1.5 family of large language models developed by AI21 Labs. The models utilize a hybrid decoder architecture that interweaves standard Transformer attention layers, Mamba state-space layers, and Mixture of Experts (MoE) routing into unified Jamba blocks. This configuration optimizes the trade-off between computational efficiency, inference latency, and reasoning accuracy while natively supporting a 256K token context window and JSON-structured function calling. Designed to fit on a single accelerated GPU and to integrate seamlessly with retrieval-augmented generation, Jamba 1.5 targets enterprise applications requiring precise long-context processing and reliable downstream tool execution.

Key Concepts

  • Hybrid Transformer-Mamba-MoE decoder architecture
  • Sparse Mixture of Experts routing (2 out of 16 experts active per token)
  • Fixed 256K token context window (~800 pages of text)
  • Attention-to-Mamba layer ratio of 1:7 within Jamba blocks (see the layer-layout sketch after this list)
  • JSON-enabled function calling for structured tool integration
  • Long-context RAG compatibility that removes the need to split source documents into small chunks
  • Single-H100 80 GB GPU deployment footprint
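
Below is a minimal sketch of how a single eight-layer Jamba block could interleave these components, assuming the stated 1:7 attention-to-Mamba ratio and an MoE feed-forward on alternating layers; the position of the attention layer and the class names are illustrative assumptions, not AI21's implementation.

```python
from dataclasses import dataclass

@dataclass
class LayerSpec:
    mixer: str  # sequence-mixing component: "attention" or "mamba"
    ffn: str    # feed-forward component: "moe" or "mlp"

def build_jamba_block(layers_per_block: int = 8,
                      attention_positions: tuple = (4,),
                      moe_every: int = 2) -> list[LayerSpec]:
    """One Jamba block: 1 attention layer + 7 Mamba layers (the 1:7 ratio),
    with a sparse MoE feed-forward on every other layer and a dense MLP on
    the rest. Exact attention/MoE placement is assumed for illustration."""
    block = []
    for i in range(layers_per_block):
        mixer = "attention" if i in attention_positions else "mamba"
        ffn = "moe" if i % moe_every == 1 else "mlp"
        block.append(LayerSpec(mixer=mixer, ffn=ffn))
    return block

if __name__ == "__main__":
    for i, spec in enumerate(build_jamba_block()):
        print(f"layer {i}: {spec.mixer:9s} + {spec.ffn}")
```

Because only one layer in eight carries a KV cache, stacking such blocks keeps attention memory growth modest even at very long sequence lengths.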

Key Equations and Algorithms

None

Key Claims and Findings

  • Interweaving Transformer and Mamba layers balances the high accuracy of attention mechanisms with the linear-complexity, low-latency processing of state-space models.
  • Each Jamba block contains eight layers, and the block design is engineered so that the model fits within a single NVIDIA H100 80 GB GPU, simplifying training and inference infrastructure.
  • The 256K token context window enables direct processing of extensive documents without sliding-window fragmentation, improving RAG retrieval accuracy and relevance (a prompt-assembly sketch follows this list).
  • MoE routing activates only two experts per token across a pool of sixteen, increasing total parameter capacity without proportionally increasing active compute or memory access.
  • Native JSON output and function calling capabilities allow real-time, high-precision execution of complex enterprise workflows such as financial document analysis and retail assistance.
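
As referenced in the claims list, the sketch below illustrates how retrieved documents can be placed into the prompt whole rather than fragmented, given a 256K-token budget. The token counter and prompt layout are hypothetical stand-ins; a real system would use the model's own tokenizer.

```python
MAX_CONTEXT_TOKENS = 256_000   # Jamba 1.5's native context window
RESERVED_TOKENS = 4_000        # illustrative headroom for instructions and the answer

def count_tokens(text: str) -> int:
    """Hypothetical token counter; substitute the model's tokenizer in practice."""
    return len(text.split())

def build_rag_prompt(question: str, retrieved_docs: list[str]) -> str:
    """Concatenate whole retrieved documents into one prompt, stopping only
    when the 256K-token budget would be exceeded; no sliding windows or
    per-document chunking is required."""
    budget = MAX_CONTEXT_TOKENS - RESERVED_TOKENS - count_tokens(question)
    included = []
    for doc in retrieved_docs:
        cost = count_tokens(doc)
        if cost > budget:
            break                      # skip rather than fragment the document
        included.append(doc)
        budget -= cost
    context = "\n\n---\n\n".join(included)
    return f"Context:\n{context}\n\nQuestion: {question}"
```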

Terminology

  • Jamba block: A cohesive decoder unit comprising eight interleaved attention, Mamba, and MoE layers that operate as a single processing stage to balance compute and capacity.
  • Attention-to-Mamba ratio: The architectural specification dictating the proportion of standard self-attention layers to Mamba sequence modeling layers in a Jamba block (configured as 1:7).
  • Sparse expert activation: An MoE inference mechanism wherein a gating network selects exactly two of the sixteen available experts to process each token (see the routing sketch after this list).
  • Fixed context window: A static maximum sequence capacity (256K tokens) supported natively by the model architecture, removing the need for external context management during inference.
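
The sparse expert activation entry can be illustrated with a short routing sketch. The gating below uses a learned projection with softmax-renormalized top-2 weights, a common MoE formulation assumed here for illustration rather than taken from AI21's implementation.

```python
import numpy as np

NUM_EXPERTS = 16   # total expert pool in the MoE layer
TOP_K = 2          # experts activated per token

def route_token(hidden_state, gate_weights, experts):
    """Top-2-of-16 routing for a single token.

    hidden_state: (d_model,) token representation entering the MoE layer
    gate_weights: (d_model, NUM_EXPERTS) learned gating projection
    experts:      NUM_EXPERTS callables, each mapping (d_model,) -> (d_model,)
    """
    logits = hidden_state @ gate_weights            # (NUM_EXPERTS,) gate scores
    top_idx = np.argsort(logits)[-TOP_K:]           # indices of the two chosen experts
    top_logits = logits[top_idx]
    probs = np.exp(top_logits - top_logits.max())
    probs /= probs.sum()                            # renormalize over the top-2 only
    # Only the two selected experts run; the other fourteen are skipped, so total
    # parameter count grows without a proportional rise in per-token compute.
    return sum(p * experts[i](hidden_state) for p, i in zip(probs, top_idx))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model = 64
    experts = [(lambda x, W=rng.standard_normal((d_model, d_model)) / d_model**0.5: x @ W)
               for _ in range(NUM_EXPERTS)]
    x = rng.standard_normal(d_model)
    gate = rng.standard_normal((d_model, NUM_EXPERTS))
    print(route_token(x, gate, experts).shape)      # (64,)
```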

Connections to Existing Wiki Pages

  • [[ai-ml/Building_Agentic_AI_Applications_with_LLMs/sec-09-tooling-your-llms]] (Function calling and structured tool integration)
  • [[ai-ml/nvidia-certs/Generative AI LLM Exam Study Guide/sec-08-rag]] (Retrieval-augmented generation and context management)
  • [[ai-ml/Training-Compute-Optimal-Large-Language-Models]] (MoE scaling and compute-efficient architecture principles)
  • [[entities/nvidia]] (Inference microservices hosting and accelerated hardware deployment)
  • [[ai-ml/Building_Agentic_AI_Applications_with_LLMs/sec-08-structuring-outputs]] (JSON formatting and structured response generation; a hypothetical tool-definition sketch follows below)
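
To make the function-calling and structured-output connections concrete, the snippet below shows a hypothetical tool definition and the JSON-structured call a model could emit for a retail-assistance request. The schema style follows common function-calling conventions and is not AI21's exact API surface.

```python
import json

# Hypothetical tool definition for a retail-assistance workflow.
check_inventory_tool = {
    "type": "function",
    "function": {
        "name": "check_inventory",
        "description": "Look up current stock for a product SKU.",
        "parameters": {
            "type": "object",
            "properties": {
                "sku": {"type": "string", "description": "Product SKU to look up"},
                "store_id": {"type": "string", "description": "Store identifier"},
            },
            "required": ["sku"],
        },
    },
}

# A JSON-structured call the model might return when asked
# "Is SKU A-123 in stock at store 42?"
example_model_call = {
    "name": "check_inventory",
    "arguments": {"sku": "A-123", "store_id": "42"},
}

print(json.dumps(example_model_call, indent=2))
```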