Abstract

This document details the architecture and implementation workflow for securing generative AI deployments by integrating NVIDIA NIM microservices with NVIDIA NeMo Guardrails. It describes a pattern in which GPU-accelerated inference endpoints for large language models and embedding models are connected to a guardrails runtime to enforce trustworthiness, safety, and controlled dialog. The work provides a technical walkthrough of a deployment use case in which topical rails intercept and block unauthorized queries about personal data, leveraging the NeMo Retriever embedding NIM for efficient vector-based semantic matching against policy definitions. It also specifies the configuration required to instantiate these security controls within an application, including Colang flow definitions and YAML engine settings.

Key Concepts

  • NVIDIA NIM Microservices: Containerized microservices that provide industry-standard APIs for the secure, reliable, and high-performance deployment of pre-trained and customized AI models across various infrastructure types.
  • NeMo Guardrails: A framework that enables the development of programmable guardrails to ensure LLM applications adhere to safety, security, and compliance principles.
  • Topical Rails: Policy enforcement mechanisms that restrict user interactions and model responses based on semantic similarity to defined topics, particularly sensitive or prohibited subjects.
  • Embedding-Assisted Policy Evaluation: A process where the NeMo Retriever embedding NIM converts user queries into embedding vectors, enabling the guardrails runtime to perform efficient semantic searches against stored policy definitions.
  • Colang Configuration: The domain-specific language used to define dialog flows, user intents, and bot responses within the guardrails system, typically stored in files with the .co extension.
  • NIM Engine Integration: The configuration pattern where the guardrails runtime connects to inference endpoints by setting the engine type to nvidia_ai_endpoints and specifying the respective NIM base URLs in the configuration.
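The engine integration pattern described above can be sketched as a `config.yml` fragment. This is a minimal illustration, not a verbatim configuration from the source: the `base_url` values are placeholders for locally deployed NIM endpoints, and the model identifiers follow the models named later in this summary (Meta Llama 3.1 70B Instruct and NVIDIA Embed QA E5 v5).

```yaml
models:
  # LLM NIM gated by the guardrails runtime
  - type: main
    engine: nvidia_ai_endpoints
    model: meta/llama-3.1-70b-instruct
    parameters:
      base_url: http://localhost:8000/v1   # placeholder LLM NIM endpoint

  # NeMo Retriever embedding NIM used for semantic policy matching
  - type: embeddings
    engine: nvidia_ai_endpoints
    model: nvidia/nv-embedqa-e5-v5
    parameters:
      base_url: http://localhost:8080/v1   # placeholder embedding NIM endpoint
```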

Key Equations and Algorithms

None

Key Claims and Findings

  • Integrating NeMo Guardrails with NIM microservices accelerates safety filtering and dialog management in conversational AI applications.
  • The NeMo Retriever embedding NIM plays a critical role in the integration by transforming incoming queries into embedding vectors, which allows for rapid comparison against guardrails policies to determine if a query matches prohibited or out-of-scope topics.
  • Topical rails can effectively intercept and block unauthorized requests involving sensitive personal data, such as instructions to hack into accounts or access private information, by triggering refusal responses.
  • Successful implementation requires NeMo Guardrails version 0.9.1.1 or later; earlier versions are not compatible with the features demonstrated in the tutorial.
  • The integration architecture relies on defining specific models for inference, such as Meta Llama 3.1 70B Instruct for the LLM NIM and NVIDIA Embed QA E5 v5 for the embedding NIM, to support the guardrailing workflow.
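The topical rail behavior claimed above can be sketched in Colang (the dialog-flow language described in the Terminology section). This is an illustrative fragment, not the tutorial's exact `flows.co`; the intent examples and the refusal wording are assumptions.

```colang
define user ask about personal data
  "How do I hack into someone's account?"
  "Give me access to another user's private information"

define bot refuse personal data request
  "I'm sorry, I can't help with requests involving personal data or unauthorized access."

define flow block personal data requests
  user ask about personal data
  bot refuse personal data request
```

At runtime, the embedding NIM vectorizes the incoming query and compares it against the embedded intent examples; a sufficiently close match routes the conversation into the refusal flow instead of the LLM NIM.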

Terminology

  • Colang: The configuration language specific to NVIDIA NeMo Guardrails used to define conversation logic, intents, and actions (e.g., defining flows in flows.co).
  • flows.co: A Colang configuration file (using the .co extension) within the NeMo Guardrails directory structure that contains the executable guardrail logic.
  • nvidia_ai_endpoints: The specific engine type value required in config.yml to configure the guardrails runtime to communicate with NVIDIA NIM inference endpoints.
  • LLM NIM: The microservice instance hosting the Large Language Model (e.g., Llama 3.1) responsible for generating text responses, gated by the guardrails runtime.
  • Retriever embedding NIM: The microservice instance hosting the embedding model (e.g., Embed QA E5 v5) used to vectorize inputs for semantic search and policy matching.
  • Topical rails: Guardrail type that enforces restrictions based on the topic of conversation, often used to prevent the disclosure of sensitive data or unauthorized actions.
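The embedding-assisted policy matching described above reduces to a nearest-neighbor check in vector space. The following Python sketch illustrates the idea with toy vectors standing in for NIM-produced embeddings; the function names and the 0.8 threshold are illustrative assumptions, not part of the NeMo Guardrails API.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def matches_policy(query_vec: np.ndarray,
                   policy_vecs: list[np.ndarray],
                   threshold: float = 0.8) -> bool:
    """True if the query embedding is close to any embedded policy example."""
    return any(cosine_similarity(query_vec, p) >= threshold for p in policy_vecs)

# Toy embeddings for two prohibited-topic examples (stand-ins for
# vectors produced by the Retriever embedding NIM).
policy = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]

blocked = matches_policy(np.array([0.9, 0.1, 0.0]), policy)  # near a policy example
allowed = matches_policy(np.array([0.0, 0.0, 1.0]), policy)  # unrelated topic
```

A query whose embedding lands near a policy example triggers the rail's refusal flow; otherwise the query passes through to the LLM NIM.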

Connections to Existing Wiki Pages