Abstract
This document outlines a structured evaluation framework for agentic AI systems that utilize tool-use workflows, focusing on quantitative metrics to validate agent behavior. It introduces four core evaluation functions—TopicAdherence, ToolCallAccuracy, ToolCallF1, and AgentGoalAccuracy—and defines their underlying mathematical formulations for precision, recall, F1 scoring, and binary goal verification. The work establishes practical guidelines for metric selection based on sequential vs. parallel tool dependencies, provides implementation patterns via the ragas library, and documents the migration from deprecated legacy APIs to modern collection-based evaluators.
Key Concepts
- Topic adherence evaluation: Quantifying domain confinement by comparing answered/refused queries against predefined reference topics using precision, recall, and F1.
- Strict vs. flexible tool sequencing: Differentiating evaluation modes where tool execution order is enforced sequentially versus allowed concurrently for parallel operations.
- Tool call accuracy scoring: Composite metric combining argument parameter matching with binary sequence alignment checks to produce a strict correctness score.
- F1-based tool evaluation: Unordered precision/recall calculation that penalizes missing or extraneous tool calls while ignoring execution order, suitable for iterative development.
- Reference-based vs. reference-less goal verification: Binary outcome assessment that either compares the agent’s final state against ground truth or infers intent vs. result via LLM analysis.
- Metric selection taxonomy: Decision framework mapping evaluation objectives (exact tool matching, functional correctness, or high-level goal achievement) to appropriate `ragas` metrics.
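The topic-adherence computation described above can be sketched in plain Python. This is an illustrative reconstruction of the precision/recall/F1 logic, not the ragas implementation; the `(on_topic, answered)` encoding of queries is an assumption made for the sketch.

```python
# Illustrative sketch of topic-adherence scoring; not the ragas
# implementation. Each query is an (on_topic, answered) pair, where
# on_topic means the query falls within the reference topics and
# answered means the agent responded rather than refused.

def topic_adherence(queries):
    answered_on_topic = sum(1 for on, ans in queries if on and ans)
    answered_total = sum(1 for _, ans in queries if ans)
    # Recall's denominator counts every on-topic query, so refusing a
    # query that should have been answered lowers recall.
    on_topic_total = sum(1 for on, _ in queries if on)
    precision = answered_on_topic / answered_total if answered_total else 0.0
    recall = answered_on_topic / on_topic_total if on_topic_total else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Answering an off-topic query lowers precision, while refusing an on-topic query lowers recall, which is exactly the asymmetry the metric is meant to capture.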
Key Equations and Algorithms
- Precision: $\text{Precision} = \frac{TP}{TP + FP}$, the fraction of the agent's outputs (answered queries or emitted tool calls) that match the reference.
- Recall: $\text{Recall} = \frac{TP}{TP + FN}$, the fraction of reference items the agent actually produced.
- F1 Score: $F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
- Tool Call Accuracy: $\text{Accuracy} = \text{AlignScore} \times \frac{1}{N} \sum_{i=1}^{N} \text{ArgMatch}_i$, where $\text{AlignScore} \in \{0, 1\}$ is the binary sequence-alignment check over tool names and $\text{ArgMatch}_i$ is the fraction of reference arguments matched by the $i$-th tool call.
- Agent Goal Accuracy: Binary classification metric outputting 1 for successfully met user intent and 0 for unachieved goals, evaluated via LLM comparison against references or inferred conversation context.
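The composite Tool Call Accuracy formula can be made concrete with a short sketch. This is a hedged reconstruction from the description above, not the ragas source; the `strict_order` flag mirrors the strict vs. flexible modes, and tool calls are assumed to be dicts with `name` and `args` keys.

```python
# Hedged sketch of the composite Tool Call Accuracy metric; not the
# ragas implementation. A tool call is a dict: {"name": str, "args": dict}.

def tool_call_accuracy(predicted, reference, strict_order=True):
    pred_names = [c["name"] for c in predicted]
    ref_names = [c["name"] for c in reference]
    # Binary sequence-alignment term: names must match either in exact
    # order (strict) or as an order-insensitive multiset (flexible).
    aligned = (pred_names == ref_names if strict_order
               else sorted(pred_names) == sorted(ref_names))
    if not aligned:
        return 0.0
    if not reference:
        return 1.0
    # Pair calls positionally (strict) or by tool name (flexible).
    by_name = lambda c: c["name"]
    pairs = (zip(predicted, reference) if strict_order
             else zip(sorted(predicted, key=by_name),
                      sorted(reference, key=by_name)))
    # Argument-match term: fraction of reference args reproduced
    # exactly, averaged over all reference tool calls.
    scores = []
    for p, r in pairs:
        args = r["args"]
        hit = sum(1 for k, v in args.items() if p["args"].get(k) == v)
        scores.append(hit / len(args) if args else 1.0)
    return sum(scores) / len(scores)
```

Note the all-or-nothing behavior: a single out-of-order call in strict mode zeroes the score, while a wrong argument value only reduces the averaged argument-match term.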
Key Claims and Findings
- Tool call evaluation must explicitly account for execution dependency; strict order enforcement is necessary for sequential workflows, while flexible modes accommodate parallel tool invocation.
- `ToolCallF1` provides a granular, tolerance-based evaluation for agent iteration by quantifying partial overlaps in tool name and parameter matching, unlike the binary strictness of `ToolCallAccuracy`.
- Topic adherence metrics effectively constrain conversational scope but require careful accounting, in recall calculations, of refused queries that should have been answered.
- Goal accuracy can be reliably assessed without explicit references by leveraging LLM inference to extract intended objectives and compare them against the agent’s operational end-state.
- The modern `ragas.metrics.collections` API supersedes legacy `ragas.metrics` classes, replacing `MultiTurnSample` requirements with direct `user_input` and `reference_tool_calls` arguments for streamlined evaluation pipelines.
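The contrast between `ToolCallF1` and the binary accuracy metric can be sketched as an order-insensitive F1 over tool calls. This is an illustrative reconstruction, not the ragas implementation, and it assumes flat, hashable argument values.

```python
from collections import Counter

# Illustrative order-insensitive ToolCallF1; not the ragas
# implementation. Extra calls lower precision, missing calls lower
# recall, and execution order is ignored entirely.
# Assumes tool-call args are flat dicts with hashable values.

def tool_call_f1(predicted, reference):
    def key(call):
        return (call["name"], tuple(sorted(call["args"].items())))
    pred = Counter(key(c) for c in predicted)
    ref = Counter(key(c) for c in reference)
    tp = sum((pred & ref).values())  # calls matching on name and args
    precision = tp / sum(pred.values()) if pred else 0.0
    recall = tp / sum(ref.values()) if ref else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```

Where strict `ToolCallAccuracy` would score 0.0 for a reversed-but-correct pair of calls, this metric scores 1.0, which is why F1 is the better fit during iterative development on workflows with parallel tool dependencies.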
Terminology
- Topic Adherence: Metric assessing whether conversational turns remain within a predefined set of domain topics, computed via precision, recall, and F1 against reference topics.
- Strict Order Mode: Evaluation configuration where tool calls must exactly match a predefined execution sequence to achieve a non-zero accuracy score.
- Flexible Order Mode: Evaluation configuration that decouples sequence evaluation, allowing tool calls to be matched regardless of invocation order.
- Agent Goal Accuracy: Binary metric determining whether an agent’s final operational state satisfies the user’s initial request, evaluated with or without ground-truth references.
- ToolCallF1: Unordered evaluation metric that computes harmonic mean of precision and recall for tool invocations, penalizing both missing and redundant tool calls.
- Reference Tool Calls: Ground-truth list of expected tool names and argument dictionaries used as the baseline for comparing agent-executed tool workflows.
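The Reference Tool Calls structure can be made concrete with a minimal example. The `name`/`args` field convention follows the terminology above; the specific tools and arguments are hypothetical.

```python
# Ground-truth tool calls: an ordered list of expected tool names and
# argument dicts, used as the baseline when scoring the agent's
# actually executed calls. Tool names and arguments are hypothetical.
reference_tool_calls = [
    {"name": "get_weather", "args": {"city": "Paris"}},
    {"name": "book_hotel", "args": {"city": "Paris", "nights": 2}},
]
```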
Connections to Existing Wiki Pages
- [[ai-ml/Building_Agentic_AI_Applications_with_LLMs/sec-07-control-structure-and-tooling]]
- [[ai-ml/Building_Agentic_AI_Applications_with_LLMs/sec-09-tooling-your-llms]]
- [[ai-ml/Building_Agentic_AI_Applications_with_LLMs/index]]
- [[ai-ml/nvidia-certs/ncp-aai/evaluation-and-tuning/ai-agent-evaluation-summary]]
- [[ai-ml/nvidia-certs/ncp-aai/agent-development/index]]