The Evaluation Gap
As agentic AI moves from demos into production, the hard question shifts from "can the model answer?" to "did the agent do the right thing, step by step, across a whole trajectory?" Classical metrics such as accuracy, BLEU, and exact match only reach single-turn outputs. They say nothing about whether a tool was called correctly, whether the agent gave up when it should have retried, or which specific step caused a failure three turns later.
Human review works but does not scale. Hand-labelling every trajectory is too slow for CI, too expensive for regression tests, and too inconsistent across reviewers to produce reliable trend data. This is the gap LLM-as-judge is meant to close: use a strong LLM to apply a fixed rubric at scale, with human calibration on a small labelled subset to keep the judge honest.
The answer offered here is a framework-agnostic, open-source, MIT-licensed Claude Code skill that walks the user through designing a rigorous evaluation for an agent or LLM application and emits a runnable Python harness. It is the workflow I want every AI team to have by default, codified as a reusable tool.
Why LLM-as-Judge, and Where It Breaks
The underlying idea is simple: write an explicit rubric, give it to a judge model along with the system's input, output, and trajectory, and ask for a structured verdict. Done well, it produces reproducible quality signals at a fraction of human review cost. Done carelessly, it produces confident but biased numbers that drift over time.
The failure modes are well-documented. Judges prefer whichever candidate appears first in a pairwise comparison. They rate longer outputs higher even when the extra content is wrong. They favour outputs from the same model family as themselves. They collapse to middle scores when uncertain rather than reading the trajectory carefully. None of these are model bugs; they are structural properties of using an LLM to score another LLM, and the only defence is a judge prompt explicitly designed around them.
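Position bias, at least, has a mechanical defence beyond prompt wording: run the comparison in both orders and only keep a verdict that survives the swap. A minimal sketch, assuming a pairwise `judge_fn` you supply; the tie fallback on disagreement is an illustrative choice, not something the skill prescribes:

```python
from typing import Callable, Literal

Verdict = Literal["A", "B", "tie"]

def position_debiased_verdict(
    candidate_a: str,
    candidate_b: str,
    judge_fn: Callable[[str, str], Verdict],
) -> Verdict:
    """Score both orderings; keep the verdict only if it survives the swap."""
    first = judge_fn(candidate_a, candidate_b)             # A shown first
    second = judge_fn(candidate_b, candidate_a)            # B shown first
    # Map the second verdict back into the original labelling before comparing.
    second_mapped: Verdict = {"A": "B", "B": "A", "tie": "tie"}[second]
    return first if first == second_mapped else "tie"
```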
Six Phases, One Skill
The skill is organised as a six-phase workflow. Each phase produces a concrete artefact in the user's project directory; nothing is left as advice.
1. Scope
Clarify what system is being evaluated, what "good" looks like, what ground truth is available, and the deployment context (research, production, regulated).
2. Rubric Design
For each criterion: operational definition, step-level or trajectory-level, binary / 3-point / 5-point / numeric scale, failure modes caught, bias risks flagged.
3. Judge Prompts
One prompt per criterion. Role, definition, scale anchors, structured JSON output, rationale-before-score ordering. Bias controls baked in.
4. Harness Generation
Runnable Python harness on the Anthropic SDK. It runs the judges against the dataset and writes structured JSON you can diff across releases (a minimal judge-call sketch follows this list).
5. Attribution & Automation
Per-step influence weights for each criterion. Content-addressed cache so new examples run incrementally; a polling watcher re-triggers on change.
6. Calibration
20–50 human-labelled examples, Cohen's kappa (categorical) or Spearman (ordinal). Judge agreement below threshold sends you back to phase 3.
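To make phases 3 and 4 concrete, here is a minimal sketch of one per-criterion judge call on the Anthropic SDK. The criterion wording, scale anchors, and model name are illustrative assumptions; the harness the skill generates is more elaborate (caching, attribution, a report), but the shape of the call is the same:

```python
import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

JUDGE_SYSTEM = """You are an evaluation judge for a single criterion.
Criterion: tool-call correctness - every tool call uses a real tool name and
arguments that satisfy its schema and the user's intent.
Scale anchors: 1 = multiple invalid calls, 3 = minor argument errors, 5 = all calls valid and necessary.
Write your rationale BEFORE the score.
Respond with JSON only: {"rationale": str, "score": int, "ambiguous": bool, "evidence": [str]}."""

def judge_one(example: dict) -> dict:
    """Send one example's input, output, and trajectory to the judge; parse the strict-JSON verdict."""
    response = client.messages.create(
        model="claude-sonnet-4-5",   # placeholder judge model; use whichever model you calibrate
        max_tokens=1024,
        system=JUDGE_SYSTEM,
        messages=[{"role": "user", "content": json.dumps(example)}],
    )
    return json.loads(response.content[0].text)
```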
Step Attribution: Finding the Critical Step
Scoring steps in isolation is not enough. For an agent trajectory that ultimately fails, the deciding moment is often a single step (a typo in a tool argument, a premature give-up, a hallucinated tool name), and the remaining steps are either setup or consequence. The skill supports two attribution modes:
Blame mode runs one extra judge call per trajectory per criterion. The judge reads the full trajectory and the final verdict, then assigns each step a weight in [-1.0, +1.0] and identifies a critical_step. Cheap, fast, soft: it is correlational, and the judge is guessing at causation.
Counterfactual mode is rigorous. For each step N, the agent is re-run from step N with that step masked or modified, the trajectory is re-scored, and the change in final score is the causal influence of step N. This is ablation analysis applied to agent trajectories. Expensive, opt-in, and only possible if the agent supports midpoint resume; that is a strong design constraint, but one worth paying for on high-stakes criteria like task success and safety.
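In outline, counterfactual mode is a loop over steps. A minimal sketch, assuming your agent exposes its resume interface behind a `rerun_without_step` callable and the harness exposes a `score_trajectory` callable; both names are placeholders, not the skill's API:

```python
from typing import Callable, Sequence

def counterfactual_influence(
    trajectory: Sequence[dict],
    score_trajectory: Callable[[Sequence[dict]], float],
    rerun_without_step: Callable[[Sequence[dict], int], Sequence[dict]],
) -> list[float]:
    """For each step n: mask it, resume the agent from n, re-score, report the score change."""
    baseline = score_trajectory(trajectory)
    influences = []
    for n in range(len(trajectory)):
        ablated = rerun_without_step(trajectory, n)     # agent-specific midpoint resume
        influences.append(score_trajectory(ablated) - baseline)
    return influences   # the largest-magnitude delta marks the causally critical step
```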
Incremental and Automatic Re-Evaluation
A harness that re-runs every example on every invocation wastes compute and makes iteration painful. This one caches verdicts by content hash:
| Hash | Covers | When it changes |
|---|---|---|
| example_hash | id + input + reference answer | An example is added to or edited in the dataset |
| rubric_hash | judge model + all judge prompts + attribution prompt | The judge model is swapped or any judge/attribution prompt is edited |
Adding a new example to the dataset re-runs only that example. Editing a judge prompt bumps the rubric hash and invalidates every cached verdict against it, so stale scores cannot ship. A companion watch.py polls the dataset and the judge directory and re-invokes the eval when anything changes, without any external dependency. Drop a new example into the file, walk away, and come back to updated results.
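The two keys are plain content hashes. A minimal sketch, assuming examples are dicts with id/input/reference fields and judge prompts live as Markdown files in one directory; both are assumptions about layout, not the skill's fixed schema:

```python
import hashlib
import json
from pathlib import Path

def _sha(payload: str) -> str:
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def example_hash(example: dict) -> str:
    # id + input + reference answer: editing any of these re-runs only this example
    key = {k: example.get(k) for k in ("id", "input", "reference")}
    return _sha(json.dumps(key, sort_keys=True))

def rubric_hash(judge_model: str, prompt_dir: Path) -> str:
    # judge model + every judge/attribution prompt: any edit invalidates all cached verdicts
    prompts = "".join(p.read_text() for p in sorted(prompt_dir.glob("*.md")))
    return _sha(judge_model + prompts)

def needs_rerun(cache: set[str], example: dict, judge_model: str, prompt_dir: Path) -> bool:
    """Only (example, rubric) pairs not seen before are sent back to the judges."""
    return f"{example_hash(example)}:{rubric_hash(judge_model, prompt_dir)}" not in cache
```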
Pipeline: example (trajectory) → judges → attribution → JSON report.
Design Principles
Rubric first, prompt second. Most eval failures are underspecified criteria, not bad prompts. The skill refuses to generate a judge before the rubric is explicit and measurable from the trajectory alone.
Rationale before score. The judge is forced to write its reasoning before committing to a verdict. This small ordering change, borrowed from the G-Eval paper, turns the score from a gut reaction into a conclusion.
Structured output, no free text. Every judge returns strict JSON with rationale, score, ambiguous, and evidence fields. A parseable verdict is the minimum requirement for a signal you can track over time.
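Enforcing that contract is a few lines. A minimal sketch, where the choice to raise rather than retry on a malformed verdict is an assumption about error handling, not the skill's stated policy:

```python
import json

REQUIRED_FIELDS = {"rationale", "score", "ambiguous", "evidence"}

def parse_verdict(raw: str) -> dict:
    """Accept only strict JSON containing every required field."""
    verdict = json.loads(raw)                  # raises on free text or malformed JSON
    missing = REQUIRED_FIELDS - verdict.keys()
    if missing:
        raise ValueError(f"judge verdict missing fields: {sorted(missing)}")
    return verdict
```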
Calibration is mandatory. A judge that does not agree with human reviewers on a small labelled subset is not a judge; it is a random number generator with an API bill. The skill ships a calibration script that computes per-criterion agreement and flags criteria below threshold.
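The agreement computation itself is small. A minimal sketch, where the 0.6 threshold and the label layout are illustrative assumptions rather than the skill's defaults:

```python
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

def agreement(human: list, judge: list, categorical: bool) -> float:
    """Cohen's kappa for categorical criteria, Spearman's rho for ordinal ones."""
    if categorical:
        return cohen_kappa_score(human, judge)
    return spearmanr(human, judge).correlation

def weak_criteria(labels: dict[str, tuple[list, list, bool]], threshold: float = 0.6) -> list[str]:
    """Names of criteria whose judge falls below the agreement threshold."""
    return [name for name, (human, judge, categorical) in labels.items()
            if agreement(human, judge, categorical) < threshold]
```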
Humans stay in the loop for safety. LLM judges are good enough to gate quality signals and catch regressions. They are not good enough to be the sole gate on a production release. The skill says this explicitly, in the documentation and in the generated report.
Honest Limits
- Does not fine-tune a specialised judge model; it relies on a strong base LLM with a good prompt.
- Does not replace human evaluation; calibration against human labels is required for any production use.
- Does not commit to a vendor framework; it emits a plain Python harness, with notes on when to graduate to DeepEval, Inspect AI, or promptfoo.
- Does not do automatic midpoint replay for every agent; counterfactual attribution requires the agent to support resume from an arbitrary step, which is agent-specific.
Why It Matters
Evaluation is the quiet bottleneck of agentic AI deployment. A bank, hospital, or regulated institution cannot ship an agent it cannot measure, and it cannot measure what it cannot define. Turning ad-hoc "looks good to me" review into a repeatable pipeline of rubric, judges, calibration, attribution, and regression tracking is the engineering work that separates an impressive demo from a system that can be trusted in production.
This skill is the smallest, most transparent version of that pipeline I could build. It runs on one SDK, fits in a few hundred lines of Python, and is readable end to end. The aim is not to replace heavier frameworks but to give every team a clear starting point, and to make it cheap enough that designing an evaluation is no longer the reason an AI project stalls.