AiSPRY Engineering
Evaluating LLM agents you can defend in a board meeting
Most LLM agent demos die the moment a real user asks a question outside the happy path. Evaluating an agent for production is a different discipline from prompt-tuning a notebook.
This is the harness we run on every grounded-LLM system we ship.
The five axes
- Task success — does the agent achieve the user's goal, scored by a rubric, not vibes?
- Groundedness — every factual claim must be traceable to a cited source in the retrieval set.
- Cost per resolution — (tokens × model price) + tool-call cost, summed across the full session.
- Latency budget — p50, p95, p99. p99 is the one users remember.
- Refusal quality — when the agent says "I don't know," does it say so for the right reasons?
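To make the two numeric axes concrete, here is a minimal Python sketch of cost per resolution and the latency percentiles. The Turn record, the per-1k-token prices, and the function names are illustrative assumptions, not part of our harness. It also assumes that unresolved sessions still count toward spend, which is one reasonable reading of "cost per resolution" rather than the only one.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Turn:
    """One model call in a session. Hypothetical record shape for illustration."""
    input_tokens: int
    output_tokens: int
    tool_cost_usd: float  # total billed cost of tool calls made during this turn
    latency_ms: float


def session_cost_usd(turns, usd_per_1k_input, usd_per_1k_output):
    """Token spend plus tool spend, summed over every turn in one session."""
    total = 0.0
    for t in turns:
        total += t.input_tokens / 1000 * usd_per_1k_input
        total += t.output_tokens / 1000 * usd_per_1k_output
        total += t.tool_cost_usd
    return total


def cost_per_resolution(sessions, resolved, usd_per_1k_input, usd_per_1k_output):
    """Spend across all sessions divided by the number of resolved sessions.

    Assumption: failed sessions still cost money, so their spend is included
    in the numerator. Returns infinity when nothing was resolved.
    """
    spend = sum(
        session_cost_usd(s, usd_per_1k_input, usd_per_1k_output) for s in sessions
    )
    wins = sum(1 for r in resolved if r)
    return spend / wins if wins else float("inf")


def latency_percentiles(latencies_ms):
    """Return p50, p95 and p99 from raw latencies (needs at least two samples)."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

With few samples, p99 is mostly interpolation between the two slowest requests, so a sketch like this is only meaningful on a suite large enough to have a real tail.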
The harness
We maintain three suites:
- Golden set — ~200 hand-curated examples reflecting the user spec. Runs on every commit.
- Adversarial set — prompts engineered to break grounding, route through tools incorrectly, or extract restricted data. Runs nightly.
- Production replay — sampled real traffic, replayed against the current build with diffs to the prior version. Runs on every release candidate.
Each suite emits a single dashboard panel with green/yellow/red thresholds. Anything red blocks the release.
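The gating rule can be sketched in a few lines of Python. The Threshold shape, the metric names, and the choice to treat a missing metric as red are assumptions made for this example. The numbers in the usage note are placeholders, not the thresholds we actually run.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"


@dataclass(frozen=True)
class Threshold:
    """Hypothetical threshold pair for one metric."""
    yellow: float
    red: float
    higher_is_better: bool = True  # False for cost and latency


# Ordering used to pick the worst status in a suite.
_SEVERITY = {Status.GREEN: 0, Status.YELLOW: 1, Status.RED: 2}


def classify(value, th):
    """Map one metric value to green, yellow or red."""
    if th.higher_is_better:
        if value < th.red:
            return Status.RED
        if value < th.yellow:
            return Status.YELLOW
        return Status.GREEN
    if value > th.red:
        return Status.RED
    if value > th.yellow:
        return Status.YELLOW
    return Status.GREEN


def suite_status(metrics, thresholds):
    """A suite is as bad as its worst metric; a missing metric counts as red."""
    statuses = [
        classify(metrics[name], th) if name in metrics else Status.RED
        for name, th in thresholds.items()
    ]
    return max(statuses, key=_SEVERITY.__getitem__, default=Status.GREEN)


def ship_allowed(suite_statuses):
    """Anything red blocks the release; yellow is allowed through."""
    return all(s is not Status.RED for s in suite_statuses.values())
```

For example, with thresholds of Threshold(yellow=0.90, red=0.80) on task success and Threshold(yellow=4000, red=8000, higher_is_better=False) on p99 latency in milliseconds, a run scoring 0.93 and 5000 comes out yellow and still ships, while a run at 9000 ms goes red and blocks.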
What you measure determines what you ship
Teams that only measure accuracy ship agents that fabricate confidently. Teams that measure groundedness ship agents that say "I don't know" — which is what regulated buyers actually want.
If you want to talk through how we set this up for an engagement, send us a brief.