Three Building Blocks for Creating AI Virtual Assistants for Customer Service with an NVIDIA AI Blueprint
Abstract
This NVIDIA developer blog post by Isabel Hulseman describes the NVIDIA AI Blueprint for AI virtual assistants in customer service, decomposing it into three functional components: a RAG-based data ingestion and retrieval pipeline, a LangGraph-based AI agent powered by Llama 3.1 70B Instruct NIM, and an operations pipeline for call analytics and feedback. The blueprint uses NVIDIA NIM microservices — specifically the NeMo Retriever Embedding NIM and NeMo Retriever Reranking NIM — to retrieve from both structured data (customer profiles, order history) and unstructured data (product manuals, FAQs) in a unified pipeline. Short-term and long-term memory enable multi-turn conversation continuity. The post also surveys nine NVIDIA consulting partners (Accenture, Deloitte, Wipro, Tech Mahindra, Infosys, TCS, Quantiphi, SoftServe, EXL) who have productized variants of the blueprint.
Key Concepts
- Three Functional Components:
- Data Ingestion and Retrieval Pipeline — administrators load structured data (customer profiles, order history, order status) and unstructured data (product manuals, catalog, FAQs) into databases; a RAG pipeline makes this accessible at inference time
- AI Agent — implemented in LangGraph; plans and recursively solves complex customer queries; uses tool calling on Llama 3.1 70B Instruct NIM to retrieve from both data source types; manages short-term and long-term conversation memory; summarizes and stores conversation history with sentiment at session end
- Operations Pipeline — provides administrators with chat history, user feedback, sentiment analysis, call summaries, and analytics (average call time, time to resolution, customer satisfaction); analytics also feed back into LLM retraining
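The operations-pipeline analytics listed above (average call time, time to resolution, customer satisfaction) amount to simple aggregations over per-call records. A minimal sketch, with illustrative field names that are not the blueprint's actual schema:

```python
# Toy aggregation over per-call records, standing in for the blueprint's
# operations-pipeline analytics. Field names (duration_s, resolved_s, csat)
# are assumptions for illustration only.

def call_metrics(calls):
    """calls: list of dicts with duration_s, resolved_s, and csat (1-5)."""
    n = len(calls)
    return {
        "avg_call_time_s": sum(c["duration_s"] for c in calls) / n,
        "avg_time_to_resolution_s": sum(c["resolved_s"] for c in calls) / n,
        "avg_csat": sum(c["csat"] for c in calls) / n,
    }
```

In the blueprint these aggregates are surfaced to administrators and also feed the retraining loop described under Key Claims.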
- NIM Microservices Stack:
- Llama 3.1 70B Instruct NIM — powers the core LLM for reasoning, planning, tool calling, and text generation
- NeMo Retriever Embedding NIM — generates high-quality embeddings for RAG retrieval over unstructured data
- NeMo Retriever Reranking NIM — reranks retrieved passages before passing context to the LLM, improving answer relevance
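All three NIM microservices are called over HTTP. A hedged sketch of the request payloads, using model IDs as they appear in NVIDIA's hosted API catalog; a self-hosted deployment may use different IDs and endpoint paths, so treat these as assumptions:

```python
# Hedged sketch: request bodies for the three NIM endpoints in the stack above.
# Model IDs follow NVIDIA's hosted API catalog conventions and may differ
# in a self-hosted deployment.

def chat_payload(messages):
    """Chat-completion request for the Llama 3.1 70B Instruct NIM
    (OpenAI-compatible schema)."""
    return {
        "model": "meta/llama-3.1-70b-instruct",
        "messages": messages,
        "temperature": 0.2,
    }

def embedding_payload(texts):
    """Embedding request for the NeMo Retriever Embedding NIM."""
    return {
        "model": "nvidia/nv-embedqa-e5-v5",
        "input": texts,
        "input_type": "query",
    }

def rerank_payload(query, passages):
    """Reranking request for the NeMo Retriever Reranking NIM."""
    return {
        "model": "nvidia/nv-rerankqa-mistral-4b-v3",
        "query": {"text": query},
        "passages": [{"text": p} for p in passages],
    }
```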
- Short-Term and Long-Term Memory: Active conversation queries and responses are embedded and stored for retrieval later in the same conversation (short-term). At session end, the AI agent stores summarized conversation history in a structured database for retrieval in future sessions (long-term). This eliminates the need for customers to repeat context across interactions.
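The two memory tiers above can be sketched with in-memory stand-ins; a real deployment would back short-term recall with a vector store and long-term summaries with a structured database, and the summary itself would come from an LLM call:

```python
# Minimal sketch of the two memory tiers, using in-memory stand-ins.
# Keyword matching substitutes for embedding similarity search, and a
# placeholder string substitutes for the LLM-generated session summary.

class SessionMemory:
    """Short-term: turns from the active conversation, searchable mid-session."""
    def __init__(self):
        self.turns = []

    def add_turn(self, query, response):
        self.turns.append({"query": query, "response": response})

    def recall(self, keyword):
        # Stand-in for embedding-based retrieval over earlier turns.
        return [t for t in self.turns if keyword.lower() in t["query"].lower()]

class CustomerHistory:
    """Long-term: per-customer summaries persisted at session end."""
    def __init__(self):
        self.records = {}

    def close_session(self, customer_id, memory, sentiment):
        # At session end the agent summarizes the conversation and stores
        # the summary alongside its sentiment for future sessions.
        summary = f"{len(memory.turns)} turns"  # placeholder for an LLM summary
        self.records.setdefault(customer_id, []).append(
            {"summary": summary, "sentiment": sentiment}
        )
```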
- Sentiment Determination: At conversation end, the agent classifies conversation sentiment and stores it alongside the call summary; administrators use this signal to assess agent effectiveness and guide retraining.
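The blueprint has the LLM perform this classification; a naive keyword stand-in conveys the interface (transcript in, one of three labels out):

```python
# Naive keyword stand-in for session-close sentiment classification.
# The blueprint uses the LLM for this step; only the label set
# (positive / neutral / negative) is taken from the source.
NEG = {"angry", "refund", "terrible", "broken"}
POS = {"thanks", "great", "resolved", "perfect"}

def classify_sentiment(transcript):
    words = set(transcript.lower().split())
    score = len(words & POS) - len(words & NEG)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"
```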
- Blueprint Portability: NVIDIA NIM microservices run where the data resides — on-premises, cloud, or hybrid — enabling data sovereignty and compliance with privacy regulations without moving sensitive data.
Key Algorithms
- Retrieve-Rerank-Generate (RAG Pipeline): The retrieval pipeline uses a two-stage approach: the Embedding NIM converts queries and documents to vectors for similarity search; the Reranking NIM then scores the retrieved candidates for relevance before passing the top passages as context to the LLM. This two-stage pattern is more accurate than embedding-based retrieval on its own.
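The control flow of the two-stage pattern can be sketched with toy lexical scorers standing in for the Embedding and Reranking NIMs; only the retrieve-then-rerank structure mirrors the pipeline, not the actual models:

```python
# Sketch of retrieve-then-rerank. Toy lexical scorers stand in for the
# Embedding NIM (stage 1) and Reranking NIM (stage 2).

def embed_score(query, doc):
    """Stage 1 stand-in: cheap, recall-oriented score (word overlap)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def rerank_score(query, doc):
    """Stage 2 stand-in: a more precise (and more expensive) relevance score."""
    q, d = query.lower().split(), doc.lower().split()
    # Reward query terms appearing in order, crudely mimicking a cross-encoder.
    return sum(1 for i, w in enumerate(q) if w in d[i:]) / max(len(q), 1)

def retrieve(query, corpus, k_retrieve=10, k_context=3):
    # Stage 1: cast a wide net with embedding similarity.
    candidates = sorted(corpus, key=lambda d: embed_score(query, d), reverse=True)
    candidates = candidates[:k_retrieve]
    # Stage 2: rerank only those candidates, keep the top passages as context.
    reranked = sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)
    return reranked[:k_context]
```

The key design point survives the simplification: the expensive scorer only ever sees the top-K candidates from the cheap one.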
- LangGraph Recursive Planning: The LangGraph agent decomposes complex customer queries into sub-tasks, recursively calling tools (structured DB queries, unstructured RAG retrieval) and synthesizing results — in contrast to single-turn chatbot approaches, which cannot reason over multi-step workflows.
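A dependency-free sketch of this decompose-route-synthesize loop; the planner, router, and tools are hypothetical stand-ins (the real blueprint models this as a LangGraph directed graph with conditional edges, and the plan and synthesis steps are LLM calls):

```python
# Dependency-free sketch of the agent's control loop. Everything here is a
# stand-in: plan() and the join in run_agent() would be LLM calls, and the
# routing heuristic would be LLM tool selection.

def plan(query):
    """Stand-in planner: split a compound query into sub-tasks."""
    return [part.strip() for part in query.split(" and ")]

def route(sub_task, tools):
    """Pick the structured-DB tool or the RAG tool for a sub-task."""
    name = "structured" if "order" in sub_task else "unstructured"
    return tools[name](sub_task)

def run_agent(query, tools):
    results = [route(t, tools) for t in plan(query)]
    return " | ".join(results)  # synthesis step (an LLM call in practice)
```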
- Tool Calling via NIM: The agent uses the tool-calling feature of Llama 3.1 70B Instruct NIM to dynamically select between structured and unstructured retrieval tools, choosing the appropriate data source based on query type.
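Because the NIM's chat endpoint is OpenAI-compatible, the two retrieval tools would be advertised to the model in the standard function-calling schema; tool names and parameters below are illustrative, not the blueprint's actual definitions:

```python
# Illustrative tool definitions in the OpenAI-compatible function-calling
# schema accepted by the Llama 3.1 70B Instruct NIM. Names and parameters
# are assumptions, not the blueprint's actual schema.

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "query_structured_data",
            "description": "Look up customer profile, order history, or order status.",
            "parameters": {
                "type": "object",
                "properties": {
                    "customer_id": {"type": "string"},
                    "field": {"type": "string"},
                },
                "required": ["customer_id", "field"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "search_unstructured_docs",
            "description": "RAG search over product manuals, catalog, and FAQs.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
]
```

The model's response then contains a `tool_calls` entry naming whichever tool fits the query, which the agent executes before generating the final answer.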
Key Claims and Findings
- Legacy static-script and manual customer service approaches cannot deliver the personalized, real-time responses customers expect; RAG-backed AI agents address this by grounding responses in up-to-date enterprise data.
- The two-stage retrieve-rerank pipeline provides meaningfully better retrieval quality than embedding similarity alone; the Reranking NIM is positioned as essential for production-quality RAG in customer service.
- Multi-turn memory (short-term session + long-term history) is a structural requirement for natural customer service interactions, not an optional feature; without it, customers must repeat information across turns and sessions.
- The blueprint is a starting point, not a fixed product — operators can swap in domain-specific NIM microservices (e.g. Nemotron 4 Hindi for local-language support) or connect to digital human pipelines.
- The operations pipeline’s analytics and sentiment data create a feedback loop: conversation outcomes improve future LLM performance through targeted retraining signals.
Terminology
| Term | Definition |
|---|---|
| NIM (NVIDIA Inference Microservice) | Containerized, optimized model serving endpoint; runs on-premises or cloud; exposes an OpenAI-compatible API |
| NeMo Retriever | NVIDIA’s collection of retrieval-focused NIM microservices (embedding and reranking) for RAG pipelines |
| LangGraph | A graph-based agentic LLM framework from LangChain; models agent workflows as directed graphs with conditional branching |
| Reranker | A second-stage model that scores retrieved documents for relevance to the query; more accurate than embedding similarity but more expensive, so applied only to top-K candidates |
| Sentiment Determination | Per-conversation classification of customer sentiment (positive, neutral, negative) produced by the agent at session close |
Connections
- Building Autonomous AI with NVIDIA Agentic NeMo — NeMo Guardrails, RAG pipelines, and the Triton/TensorRT-LLM serving layer described there are the production-depth counterparts to the NIM microservices used in this blueprint
- Agent Architecture Study Note — the LangGraph recursive planner maps to the plan-and-execute control loop pattern; short/long-term memory maps to the stateful agent design discussed in the note
- Catch Me If You Can: Multi-Agent Fraud Detection — complementary case study showing a six-agent pipeline for fraud; both articles illustrate domain-specific multi-agent instantiations of the perceive-reason-act loop
- What are Multi-Agent Systems? — foundational NVIDIA glossary definition of multi-agent systems that contextualizes why the blueprint’s agent-plus-tools architecture was chosen