Chat With Your Enterprise Data Through Open-Source AI-Q NVIDIA Blueprint

By Nicola Sessions — NVIDIA Developer Blog, 2025-06-11

Abstract

The AI-Q NVIDIA Blueprint is an open-source reference implementation for building enterprise AI agents that connect to, reason across, and deliver insights from diverse organizational data sources at petabyte scale. Built on three core building blocks — NVIDIA NeMo Retriever microservices, NIM inference microservices, and the NeMo Agent Toolkit — it demonstrates a full multimodal agentic pipeline covering data ingestion, semantic retrieval, advanced reasoning, and production observability. The blueprint is framework-agnostic and integrates with CrewAI, LangChain, LlamaIndex, Agno, and others through the Agent Toolkit’s plugin system. Real-world deployments show measurable impact: Therapyside saves clinicians 22 minutes per patient per clinical day, and Pangaea Data achieves 98% accuracy in rare-disease patient data retrieval.

Key Concepts

  • AI-Q Blueprint: open-source NVIDIA Blueprint providing a developer-friendly workflow for building Artificial General Agents (AGA) that query and reason over enterprise data
  • NeMo Retriever: NVIDIA microservices for extraction, embedding, and reranking of multimodal enterprise data — processes PDFs, images, tables, and databases up to 15× faster than non-accelerated pipelines
  • cuVS: CUDA-accelerated vector search library; the blueprint stores embeddings in a cuVS-accelerated vector database, deployed via Docker Compose
  • Llama Nemotron: reasoning model with a unique toggleable reasoning mode — enables dynamic performance-cost trade-off and achieves up to 5× faster inference when reasoning is disabled
  • NeMo Agent Toolkit: orchestration layer managing multi-agent REST APIs, observability telemetry, and workflow profiling; acts as the unifying layer across diverse agentic frameworks
  • Multimodal PDF extraction: NeMo Retriever extraction microservices ingest structured, semi-structured, and unstructured content from diverse source formats
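The embedding-and-retrieval idea behind these concepts can be illustrated with a minimal, pure-Python sketch: brute-force cosine-similarity search over toy vectors. This is a stand-in for what NeMo Retriever embedding and cuVS-accelerated indexing do at scale — all names and the 3-dimensional "embeddings" below are illustrative, not the blueprint's actual API:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec, index, k=3):
    """Brute-force nearest-neighbor scan over an in-memory index.
    cuVS replaces this O(n) CPU loop with GPU-accelerated search."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy embeddings standing in for real model outputs.
index = {
    "q2_report.pdf":    [0.9, 0.1, 0.0],
    "hr_policy.docx":   [0.1, 0.8, 0.2],
    "arch_diagram.png": [0.0, 0.2, 0.9],
}
print(top_k([0.85, 0.15, 0.05], index, k=2))
```

In the real pipeline the index is continuously refreshed as source documents change, which is what keeps semantic search "always current" over private data.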

Key Claims and Findings

  • 68% of available enterprise data goes unused (Gartner); AI-Q addresses this by providing always-current semantic search over private data
  • NeMo Retriever processes multimodal enterprise data at petabyte scale, up to 15× faster than non-accelerated pipelines
  • Llama Nemotron’s toggleable reasoning provides up to 5× faster inference while maintaining high-quality outputs on complex tasks
  • Therapyside’s Maia agent (built on Agent Toolkit + NeMo Retriever) saves clinicians 22 minutes per patient per day
  • Pangaea Data achieved 98% accuracy in clinical data retrieval for rare-disease detection, reducing configuration time from weeks to days

Architecture

The AI-Q pipeline follows five stages:

  1. Multimodal ingestion — NeMo Retriever extraction microservices process PDFs, images, tables, and databases
  2. Embedding and indexing — data is continuously embedded and stored in a cuVS-accelerated vector database
  3. Retrieval — RAG with NeMo Retriever Reranking surfaces the most relevant context for each query
  4. Reasoning — Llama Nemotron models decompose problems, plan iteratively, and reflect to produce nuanced answers
  5. Observability — the Agent Toolkit profiler exports OpenTelemetry-compatible telemetry covering token usage, latency, and cost per agent/tool
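The observability stage (step 5) can be sketched with stdlib Python: a context manager that records latency and arbitrary attributes (such as token counts) per agent/tool call, accumulating OpenTelemetry-style span records. This is a toy analogue of what the Agent Toolkit profiler exports, not its real interface:

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field

@dataclass
class Span:
    """A minimal OpenTelemetry-style span for one agent/tool call."""
    name: str
    start: float
    duration_ms: float = 0.0
    attributes: dict = field(default_factory=dict)

class Profiler:
    """Collects per-call telemetry the way an agent profiler might."""
    def __init__(self):
        self.spans = []

    @contextmanager
    def trace(self, name, **attributes):
        span = Span(name=name, start=time.monotonic(), attributes=attributes)
        try:
            yield span
        finally:
            span.duration_ms = (time.monotonic() - span.start) * 1000
            self.spans.append(span)

profiler = Profiler()
with profiler.trace("retriever.rerank", tokens=512, model="rerank-v1") as span:
    time.sleep(0.01)  # stand-in for a real reranking call
    span.attributes["results"] = 5

for s in profiler.spans:
    print(f"{s.name}: {s.duration_ms:.1f} ms {s.attributes}")
```

Aggregating spans like these per agent and per tool is what makes token usage, latency, and cost attributable to individual pipeline stages.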

Terminology

  • AGA (Artificial General Agent): agent capable of autonomous reasoning and action across heterogeneous data sources
  • OpenTelemetry: CNCF standard for distributed observability; supported natively by the NeMo Agent Toolkit for integration with monitoring tools
  • Data-grounded response: answer generated using retrieved, privately held enterprise data rather than parametric model knowledge
  • Framework-agnostic: the Agent Toolkit integrates with any popular agentic framework via modular plugin packages without requiring teams to replatform

Connections to Existing Wiki Pages