AI · May 14, 2026 · 11 min read

The eval harness you’ll wish you wrote first.

An LLM workflow without an eval suite is a prototype. Here is the smallest one we ship in production — a Postgres table, a CI job, a dashboard — and why it usually catches regressions before any human notices.

The most common bug in a production LLM workflow is the one that nobody notices. Output quality drifts by a percentage point a week, the model provider rolls out a silent update, a prompt gets edited in a hurry, and three months later you discover the support team has been quietly cleaning up the mess that AI used to clean up for them. The fix is the same fix as for every other piece of production software: tests, run in CI, on every change. We call ours the eval harness.

This post is the smallest version of the harness we ship. Three components, all boring. A scoring table in Postgres. A CI job that runs prompts against a fixed test set and writes results into that table. A dashboard that shows the last fifty runs side by side. The whole thing fits in a weekend.

The scoring table

Every evaluation run produces rows. Each row is a single (prompt × test-case × model-version) triple, with the input, the output, the expected output, a score between 0 and 1, the latency, and the token spend. We keep every row forever. Postgres is happy at one row per execution at our scale; we have never had a customer where this got close to expensive.

-- The table we never regret writing first.
CREATE TABLE eval_run (
  id            uuid          PRIMARY KEY DEFAULT gen_random_uuid(),
  ts            timestamptz   NOT NULL DEFAULT now(),
  prompt_id     text          NOT NULL,
  case_id       text          NOT NULL,
  model         text          NOT NULL,
  input         jsonb         NOT NULL,
  output        jsonb         NOT NULL,
  expected      jsonb,
  score         numeric(4,3),
  latency_ms    integer,
  tokens_in     integer,
  tokens_out    integer,
  notes         text
);
CREATE INDEX ON eval_run (prompt_id, ts);

The schema is deliberately wider than you’d think. We have been bitten too many times by an eval table that turned out to be missing the one field that would have let us answer the actual question. Postgres columns are cheap; storage is cheap; replaying eight months of evals after adding a column is not.
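To make the row shape concrete, here is a minimal Python sketch of a record that mirrors the table and turns itself into a parameterized INSERT. The names `EvalRow` and `insert_statement` are hypothetical, and the `%s` placeholder style assumes a psycopg-like driver; nothing here is taken from the harness itself.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional, Tuple

# Column order mirrors the eval_run table; id and ts fall back to their defaults.
COLUMNS = (
    "prompt_id", "case_id", "model", "input", "output", "expected",
    "score", "latency_ms", "tokens_in", "tokens_out", "notes",
)
JSONB_COLUMNS = {"input", "output", "expected"}


@dataclass(frozen=True)
class EvalRow:
    """One (prompt x test-case x model-version) execution. Hypothetical helper."""
    prompt_id: str
    case_id: str
    model: str
    input: Any
    output: Any
    expected: Any = None
    score: Optional[float] = None
    latency_ms: Optional[int] = None
    tokens_in: Optional[int] = None
    tokens_out: Optional[int] = None
    notes: Optional[str] = None

    def __post_init__(self) -> None:
        # The post says scores live between 0 and 1; reject anything else early.
        if self.score is not None and not 0.0 <= self.score <= 1.0:
            raise ValueError(f"score must be in [0, 1], got {self.score}")


def _encode(column: str, value: Any) -> Any:
    # jsonb columns are sent as JSON text and cast server-side.
    if column in JSONB_COLUMNS and value is not None:
        return json.dumps(value)
    return value


def insert_statement(row: EvalRow) -> Tuple[str, Tuple[Any, ...]]:
    """Return (sql, params); assumes a driver that uses %s placeholders."""
    placeholders = ", ".join(
        "%s::jsonb" if column in JSONB_COLUMNS else "%s" for column in COLUMNS
    )
    sql = f"INSERT INTO eval_run ({', '.join(COLUMNS)}) VALUES ({placeholders})"
    params = tuple(_encode(column, getattr(row, column)) for column in COLUMNS)
    return sql, params
```

Keeping the record and the table in the same column order means a new column is a two-line change, which is the property the wide schema is meant to protect.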

The CI job

The harness runs on every pull request that touches a prompt, a tool, or a model gateway config, and nightly on main. It pulls the latest version of every prompt, runs each one against the test set, writes a row per execution into the scoring table, and posts a summary back to the PR. If the score on any prompt drops by more than 5%, the check fails.
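The pass/fail rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the harness's actual gate: it reads "5%" as a relative drop in a prompt's mean score against a baseline run, and the names `mean_by_prompt` and `regressions` are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

# Assumption: "5%" is a relative drop against the baseline mean, not 5 points.
MAX_RELATIVE_DROP = 0.05

ScoreRow = Tuple[str, float]  # (prompt_id, score)


def mean_by_prompt(rows: Iterable[ScoreRow]) -> Dict[str, float]:
    """Average the per-case scores for each prompt_id."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for prompt_id, score in rows:
        buckets[prompt_id].append(score)
    return {prompt_id: mean(scores) for prompt_id, scores in buckets.items()}


def regressions(
    baseline: Iterable[ScoreRow],
    candidate: Iterable[ScoreRow],
    max_drop: float = MAX_RELATIVE_DROP,
) -> Dict[str, Tuple[float, float]]:
    """Map each regressed prompt_id to (baseline_mean, candidate_mean)."""
    base = mean_by_prompt(baseline)
    cand = mean_by_prompt(candidate)
    failed: Dict[str, Tuple[float, float]] = {}
    for prompt_id, base_score in base.items():
        # Skip prompts missing from the candidate run and zero baselines,
        # where a relative drop is undefined.
        if prompt_id not in cand or base_score <= 0:
            continue
        drop = (base_score - cand[prompt_id]) / base_score
        if drop > max_drop:
            failed[prompt_id] = (base_score, cand[prompt_id])
    return failed
```

In a CI step, a non-empty result from `regressions(...)` would translate into a non-zero exit code, and the returned pairs are enough to build the summary that goes back to the PR.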

The test set

The test set is the hardest part of an eval harness, and the only part you can’t copy from this post. It is a curated list of input cases, each tagged with an expected output and a scoring function. We start each project with ten cases, hand-written by the team. By the time we ship, the set is in the low hundreds, grown one regression at a time.
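One way to keep such a set under version control is a JSON Lines file, one case per line. The sketch below assumes that layout; the `EvalCase` fields, the `origin` note, and the `load_cases` helper are all hypothetical names chosen for illustration.

```python
import json
from dataclasses import dataclass
from typing import Any, List


@dataclass(frozen=True)
class EvalCase:
    """A single curated test case. Field names are illustrative."""
    case_id: str
    prompt_id: str
    input: Any
    expected: Any
    scorer: str                   # name of the scoring function to apply
    origin: str = "hand-written"  # or a note on the regression behind the case


def load_cases(jsonl_text: str) -> List[EvalCase]:
    """Parse one JSON object per line, skipping blanks and rejecting duplicate ids."""
    cases: List[EvalCase] = []
    seen = set()
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        case = EvalCase(**json.loads(line))
        if case.case_id in seen:
            raise ValueError(f"line {line_no}: duplicate case_id {case.case_id!r}")
        seen.add(case.case_id)
        cases.append(case)
    return cases
```

A plain text file keeps each new case visible as a one-line diff in review, which suits a set that grows one regression at a time.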

Every regression that escapes to production becomes a test case the next morning. That is the only rule.

Scoring functions are the next-hardest part. Substring match works for half of the cases. Regex covers another quarter. The last quarter needs an LLM-graded rubric, run by a model from a different provider than the one being evaluated. We have written about LLM-graded rubrics in an earlier post; the short version is: write the rubric like you’re grading a junior engineer, not like you’re prompting the model.
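The two cheap scorers are small enough to sketch here. This is an illustrative Python version, assuming text outputs and a score of exactly 0 or 1; the case-insensitive substring match and the `SCORERS` registry are assumptions of the sketch. The LLM-graded rubric is left out because it needs a model call.

```python
import re
from typing import Callable, Dict

Scorer = Callable[[str, str], float]


def substring_score(output: str, expected: str) -> float:
    """1.0 if the expected text appears in the output (case-insensitive), else 0.0."""
    return 1.0 if expected.casefold() in output.casefold() else 0.0


def regex_score(output: str, pattern: str) -> float:
    """1.0 if the pattern matches anywhere in the output, else 0.0."""
    return 1.0 if re.search(pattern, output) else 0.0


# Hypothetical registry: a test case names its scorer with a plain string.
SCORERS: Dict[str, Scorer] = {
    "substring": substring_score,
    "regex": regex_score,
}


def score(scorer_name: str, output: str, expected: str) -> float:
    """Look up the named scorer and apply it to one model output."""
    try:
        scorer = SCORERS[scorer_name]
    except KeyError:
        raise ValueError(f"unknown scorer: {scorer_name!r}") from None
    return scorer(output, expected)
```

A rubric-based grader would slot into the same registry under its own name, so the CI job does not need to know which kind of scorer a case uses.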

The dashboard

The dashboard is a single page rendered out of the scoring table. It shows the last fifty runs, plotted as a small-multiple grid: x-axis time, y-axis score, one panel per prompt. Hover over any panel and you get the per-case detail. Drift is impossible to miss in this view. It is hard to miss in a column chart on a spreadsheet too, but spreadsheets are not where your engineers are looking.

[ small-multiples chart placeholder · 50 runs × 8 prompts ]
Fig. 1 — Eval dashboard, simplified. The two horizontal bands are the warn / fail thresholds.
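The data behind such a grid is a per-prompt series of mean scores. The Python sketch below shows one way to shape it, under a loud assumption: each row carries a `run_key` that sorts chronologically (for example the ISO timestamp of the CI job). The table above has no such column, so in practice it would have to be derived from `ts` or added. The `panels` function name is hypothetical as well.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

# (prompt_id, run_key, score). run_key is a hypothetical per-run identifier
# that must sort in chronological order.
Row = Tuple[str, str, float]
Point = Tuple[str, float]  # (run_key, mean score for that run)


def panels(rows: Iterable[Row], last_n: int = 50) -> Dict[str, List[Point]]:
    """Return, for each prompt, its mean score per run for the last_n runs."""
    per_run: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for prompt_id, run_key, score in rows:
        per_run[(prompt_id, run_key)].append(score)

    series: Dict[str, List[Point]] = defaultdict(list)
    # Sorting the (prompt_id, run_key) pairs orders each prompt's runs by run_key.
    for prompt_id, run_key in sorted(per_run):
        series[prompt_id].append((run_key, mean(per_run[(prompt_id, run_key)])))

    return {prompt_id: points[-last_n:] for prompt_id, points in series.items()}
```

Each value in the returned mapping is one panel: a list of points ready to hand to whatever charting library renders the page.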

Why this is enough

The fancy version of this harness uses an MLflow-style experiment tracker, a vector store for inputs, an LLM-as-judge service with its own eval suite, and a Grafana panel that pages an on-call engineer when scores drop. We have built all four. We rarely turn them on for a project under a year of operation, because the simple version — Postgres + CI + a page — catches the same problems for one tenth the engineering cost.

The harness pays for itself the first time it catches a silent provider regression. We have measured this across nine projects. The breakeven moment, on average, is week six.

If you take one thing from this post, take this: write the harness on day one. Not week three. Not after the first regression. Day one. The cost is half a day of engineering and a Postgres table; the alternative is a customer email three months from now that begins with “I think something changed?”

  • A Postgres table — not a CSV file in S3.
  • A CI job — not a manual notebook nobody runs.
  • A small-multiples dashboard — not a Slack message.
  • Curated cases — not synthetic ones from another LLM.

The eval harness is the part of an AI build that earns the system its name. Without it you have a demo.
