Evaluating LLM agents you can defend in a board meeting

Most LLM agent demos die the moment a real user asks a question outside the happy path. Evaluating an agent for production is a different discipline from prompt-tuning a notebook.

This is the harness we run on every grounded-LLM system we ship.

The five axes

Task success: does the agent achieve the user's goal? Scored against a rubric, not vibes.
Groundedness: every factual claim must be traceable to a cited source in the retrieval set.
Cost per resolution: (tokens × model price) + tool-call cost, summed across the full session.
Latency budget: p50, p95, p99. p99 is the one users remember.
Refusal quality: when the agent says "I don't know," does it say so for the right reasons?
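
Three of these axes reduce to simple arithmetic over logged sessions. The sketch below is one way to compute them, not the harness itself: the `Session` and `Claim` record shapes, the per-token prices, and the choice to divide total spend by the number of resolved sessions are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import quantiles
from typing import List, Optional, Set

# Hypothetical per-token prices in USD; substitute your provider's real rates.
PROMPT_PRICE_USD = 3.0 / 1_000_000
COMPLETION_PRICE_USD = 15.0 / 1_000_000


@dataclass
class Claim:
    text: str
    cited_source_id: Optional[str]  # None means the claim carries no citation


@dataclass
class Session:
    prompt_tokens: int
    completion_tokens: int
    tool_call_cost_usd: float
    latency_ms: float
    resolved: bool
    claims: List[Claim]
    retrieved_source_ids: Set[str]


def session_cost(s: Session) -> float:
    """Token spend plus tool-call spend for one session."""
    return (s.prompt_tokens * PROMPT_PRICE_USD
            + s.completion_tokens * COMPLETION_PRICE_USD
            + s.tool_call_cost_usd)


def cost_per_resolution(sessions: List[Session]) -> float:
    """Total spend divided by the number of resolved sessions (an assumption)."""
    resolved = sum(1 for s in sessions if s.resolved)
    if resolved == 0:
        return float("inf")
    return sum(session_cost(s) for s in sessions) / resolved


def groundedness(s: Session) -> float:
    """Fraction of claims whose citation points into the retrieval set."""
    if not s.claims:
        return 1.0  # no claims made, so nothing is ungrounded
    grounded = sum(1 for c in s.claims
                   if c.cited_source_id in s.retrieved_source_ids)
    return grounded / len(s.claims)


def latency_percentiles(sessions: List[Session]) -> dict:
    """p50, p95, and p99 latency; needs at least two sessions."""
    cuts = quantiles([s.latency_ms for s in sessions], n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Task success and refusal quality are left out on purpose: both depend on a rubric, which this sketch does not try to invent.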

The harness

We maintain three suites:

Golden set: about 200 hand-curated examples that reflect the user spec. Runs on every commit.
Adversarial set: prompts engineered to break grounding, misroute tool calls, or extract restricted data. Runs nightly.
Production replay: sampled real traffic, replayed against the current build and diffed against the prior version. Runs on every release candidate.
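
As a rough illustration of how the three suites and their schedules might be wired up, here is a minimal sketch. The `Trigger` enum, the `Suite` record, and the `replay_diff` helper are hypothetical names, and the diff assumes each replayed example has already been reduced to a pass/fail result per build.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Trigger(Enum):
    COMMIT = "commit"
    NIGHTLY = "nightly"
    RELEASE_CANDIDATE = "release_candidate"


@dataclass(frozen=True)
class Suite:
    name: str
    trigger: Trigger


# One entry per suite described above, keyed to when it runs.
SUITES = [
    Suite("golden", Trigger.COMMIT),
    Suite("adversarial", Trigger.NIGHTLY),
    Suite("production_replay", Trigger.RELEASE_CANDIDATE),
]


def suites_for(trigger: Trigger) -> List[str]:
    """Names of the suites that should run for a given trigger."""
    return [s.name for s in SUITES if s.trigger == trigger]


def replay_diff(prior: Dict[str, bool],
                current: Dict[str, bool]) -> Dict[str, List[str]]:
    """Compare per-example pass/fail results between two builds.

    Only examples present in both runs are compared.
    """
    shared = prior.keys() & current.keys()
    return {
        "regressions": sorted(k for k in shared if prior[k] and not current[k]),
        "fixes": sorted(k for k in shared if not prior[k] and current[k]),
    }
```

Reporting regressions separately from fixes keeps a release candidate from hiding newly broken cases behind an unchanged aggregate score.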

Each suite emits a single dashboard panel with green/yellow/red thresholds. Any red panel blocks the release.
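
A ship gate of this kind can be sketched in a few lines. The threshold values, the `Threshold` record, and the function names below are illustrative assumptions, not the thresholds used in the actual harness.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class Status(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"


@dataclass(frozen=True)
class Threshold:
    green: float                   # value must meet this bound to be green
    red: float                     # value worse than this bound is red
    higher_is_better: bool = True  # False for metrics such as latency or cost


def classify(value: float, t: Threshold) -> Status:
    """Map a metric value to a panel color."""
    if t.higher_is_better:
        if value >= t.green:
            return Status.GREEN
        if value < t.red:
            return Status.RED
    else:
        if value <= t.green:
            return Status.GREEN
        if value > t.red:
            return Status.RED
    return Status.YELLOW


def ship_blocked(statuses: Iterable[Status]) -> bool:
    """Any red panel blocks the release; yellow alone does not."""
    return any(s is Status.RED for s in statuses)


# Hypothetical thresholds for two of the axes.
GROUNDEDNESS = Threshold(green=0.95, red=0.90)
P99_LATENCY_MS = Threshold(green=2000, red=4000, higher_is_better=False)
```

For example, `ship_blocked([classify(0.97, GROUNDEDNESS), classify(5000, P99_LATENCY_MS)])` evaluates to `True`, because the latency panel is red even though groundedness is green.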

What you measure determines what you ship

Teams that only measure accuracy ship agents that fabricate confidently. Teams that measure groundedness ship agents that say "I don't know" when the sources cannot support an answer, which is what regulated buyers actually want.

If you want to talk through how we set this up for an engagement, send us a brief.

