Abstract
This document outlines a structured evaluation framework for agentic AI systems that utilize tool-use workflows, focusing on quantitative metrics to validate agent behavior. It introduces four core evaluation functions—TopicAdherence, ToolCallAccuracy, ToolCallF1, and AgentGoalAccuracy—and defines their underlying mathematical formulations for precision, recall, F1 scoring, and binary goal verification. The work establishes practical guidelines for metric selection based on sequential vs. parallel tool dependencies, provides implementation patterns via the ragas library, and documents the migration from deprecated legacy APIs to modern collection-based evaluators.
Key Concepts
- Topic adherence evaluation: Quantifying domain confinement by comparing answered/refused queries against predefined reference topics using precision, recall, and F1.
- Strict vs. flexible tool sequencing: Differentiating evaluation modes where tool execution order is enforced sequentially versus allowed concurrently for parallel operations.
- Tool call accuracy scoring: Composite metric combining argument parameter matching with binary sequence alignment checks to produce a strict correctness score.
- F1-based tool evaluation: Unordered precision/recall calculation that penalizes missing or extraneous tool calls while ignoring execution order, suitable for iterative development.
- Reference-based vs. reference-less goal verification: Binary outcome assessment that either compares the agent’s final state against ground truth or infers intent vs. result via LLM analysis.
- Metric selection taxonomy: Decision framework mapping evaluation objectives (exact tool matching, functional correctness, or high-level goal achievement) to appropriate `ragas` metrics.
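The topic-adherence computation described above can be sketched in plain Python. This is an illustrative reconstruction of the precision/recall/F1 logic, not the ragas implementation; the `(on_topic, answered)` encoding of queries is an assumption made for the sketch.

```python
# Illustrative sketch of topic-adherence scoring; not the ragas
# implementation. Each query is an (on_topic, answered) pair, where
# on_topic means the query falls within the reference topics and
# answered means the agent responded rather than refused.

def topic_adherence(queries):
    answered_on_topic = sum(1 for on, ans in queries if on and ans)
    answered_total = sum(1 for _, ans in queries if ans)
    # Recall's denominator counts every on-topic query, so refusing a
    # query that should have been answered lowers recall.
    on_topic_total = sum(1 for on, _ in queries if on)
    precision = answered_on_topic / answered_total if answered_total else 0.0
    recall = answered_on_topic / on_topic_total if on_topic_total else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Answering an off-topic query lowers precision, while refusing an on-topic query lowers recall, which is exactly the asymmetry the metric is meant to capture.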
Key Equations and Algorithms
- Precision: $\text{Precision} = \frac{TP}{TP + FP}$, the fraction of the agent's outputs (answered queries or emitted tool calls) that match the reference.
- Recall: $\text{Recall} = \frac{TP}{TP + FN}$, the fraction of reference items the agent actually produced.
- F1 Score: $F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
- Tool Call Accuracy: $\text{Accuracy} = \text{AlignScore} \times \frac{1}{N} \sum_{i=1}^{N} \text{ArgMatch}_i$, where $\text{AlignScore} \in \{0, 1\}$ is the binary sequence-alignment check over tool names and $\text{ArgMatch}_i$ is the fraction of reference arguments matched by the $i$-th tool call.
- Agent Goal Accuracy: Binary classification metric outputting 1 for successfully met user intent and 0 for unachieved goals, evaluated via LLM comparison against references or inferred conversation context.
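The composite Tool Call Accuracy formula can be made concrete with a short sketch. This is a hedged reconstruction from the description above, not the ragas source; the `strict_order` flag mirrors the strict vs. flexible modes, and tool calls are assumed to be dicts with `name` and `args` keys.

```python
# Hedged sketch of the composite Tool Call Accuracy metric; not the
# ragas implementation. A tool call is a dict: {"name": str, "args": dict}.

def tool_call_accuracy(predicted, reference, strict_order=True):
    pred_names = [c["name"] for c in predicted]
    ref_names = [c["name"] for c in reference]
    # Binary sequence-alignment term: names must match either in exact
    # order (strict) or as an order-insensitive multiset (flexible).
    aligned = (pred_names == ref_names if strict_order
               else sorted(pred_names) == sorted(ref_names))
    if not aligned:
        return 0.0
    if not reference:
        return 1.0
    # Pair calls positionally (strict) or by tool name (flexible).
    by_name = lambda c: c["name"]
    pairs = (zip(predicted, reference) if strict_order
             else zip(sorted(predicted, key=by_name),
                      sorted(reference, key=by_name)))
    # Argument-match term: fraction of reference args reproduced
    # exactly, averaged over all reference tool calls.
    scores = []
    for p, r in pairs:
        args = r["args"]
        hit = sum(1 for k, v in args.items() if p["args"].get(k) == v)
        scores.append(hit / len(args) if args else 1.0)
    return sum(scores) / len(scores)
```

Note the all-or-nothing behavior: a single out-of-order call in strict mode zeroes the score, while a wrong argument value only reduces the averaged argument-match term.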
Key Claims and Findings
- Tool call evaluation must explicitly account for execution dependency; strict order enforcement is necessary for sequential workflows, while flexible modes accommodate parallel tool invocation.
- `ToolCallF1` provides a granular, tolerance-based evaluation for agent iteration by quantifying partial overlaps in tool name and parameter matching, unlike the binary strictness of `ToolCallAccuracy`.
- Topic adherence metrics effectively constrain conversational scope but require careful accounting, in recall calculations, of refused queries that should have been answered.
- Goal accuracy can be reliably assessed without explicit references by leveraging LLM inference to extract intended objectives and compare them against the agent’s operational end-state.
- The modern `ragas.metrics.collections` API supersedes legacy `ragas.metrics` classes, replacing `MultiTurnSample` requirements with direct `user_input` and `reference_tool_calls` arguments for streamlined evaluation pipelines.
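The contrast between `ToolCallF1` and the binary accuracy metric can be sketched as an order-insensitive F1 over tool calls. This is an illustrative reconstruction, not the ragas implementation, and it assumes flat, hashable argument values.

```python
from collections import Counter

# Illustrative order-insensitive ToolCallF1; not the ragas
# implementation. Extra calls lower precision, missing calls lower
# recall, and execution order is ignored entirely.
# Assumes tool-call args are flat dicts with hashable values.

def tool_call_f1(predicted, reference):
    def key(call):
        return (call["name"], tuple(sorted(call["args"].items())))
    pred = Counter(key(c) for c in predicted)
    ref = Counter(key(c) for c in reference)
    tp = sum((pred & ref).values())  # calls matching on name and args
    precision = tp / sum(pred.values()) if pred else 0.0
    recall = tp / sum(ref.values()) if ref else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```

Where strict `ToolCallAccuracy` would score 0.0 for a reversed-but-correct pair of calls, this metric scores 1.0, which is why F1 is the better fit during iterative development on workflows with parallel tool dependencies.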
Terminology
- Topic Adherence: Metric assessing whether conversational turns remain within a predefined set of domain topics, computed via precision, recall, and F1 against reference topics.
- Strict Order Mode: Evaluation configuration where tool calls must exactly match a predefined execution sequence to achieve a non-zero accuracy score.
- Flexible Order Mode: Evaluation configuration that decouples sequence evaluation, allowing tool calls to be matched regardless of invocation order.
- Agent Goal Accuracy: Binary metric determining whether an agent’s final operational state satisfies the user’s initial request, evaluated with or without ground-truth references.
- ToolCallF1: Unordered evaluation metric that computes harmonic mean of precision and recall for tool invocations, penalizing both missing and redundant tool calls.
- Reference Tool Calls: Ground-truth list of expected tool names and argument dictionaries used as the baseline for comparing agent-executed tool workflows.
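The Reference Tool Calls structure can be made concrete with a minimal example. The `name`/`args` field convention follows the terminology above; the specific tools and arguments are hypothetical.

```python
# Ground-truth tool calls: an ordered list of expected tool names and
# argument dicts, used as the baseline when scoring the agent's
# actually executed calls. Tool names and arguments are hypothetical.
reference_tool_calls = [
    {"name": "get_weather", "args": {"city": "Paris"}},
    {"name": "book_hotel", "args": {"city": "Paris", "nights": 2}},
]
```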
Connections to Existing Wiki Pages
- [[ai-ml/Building_Agentic_AI_Applications_with_LLMs/sec-07-control-structure-and-tooling]]
- [[ai-ml/Building_Agentic_AI_Applications_with_LLMs/sec-09-tooling-your-llms]]
- [[ai-ml/Building_Agentic_AI_Applications_with_LLMs/index]]
- [[ai-ml/nvidia-certs/ncp-aai/evaluation-and-tuning/ai-agent-evaluation-summary]]
- [[ai-ml/nvidia-certs/ncp-aai/agent-development/index]]