The most common bug in a production LLM workflow is the one that nobody notices. Output quality drifts by a percentage point a week, the model provider rolls out a silent update, a prompt gets edited in a hurry, and three months later you discover the support team has been quietly cleaning up the mess that the AI used to clean up for them. The fix is the same fix as for every other piece of production software: tests, run in CI, on every change. We call ours the eval harness.
This post is the smallest version of the harness we ship. Three components, all boring. A scoring table in Postgres. A CI job that runs prompts against a fixed test set and writes results into that table. A dashboard that shows the last fifty runs side by side. The whole thing fits in a weekend.
The scoring table
Every evaluation run produces rows. Each row is a single (prompt × test-case × model-version) triple, with the input, the output, the expected output, a score between 0 and 1, the latency, and the token spend. We keep every row forever. Postgres is happy at one row per execution at our scale; we have never had a customer where this got close to expensive.
-- The table we never regret writing first.
CREATE TABLE eval_run (
    id          uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    ts          timestamptz NOT NULL DEFAULT now(),
    prompt_id   text NOT NULL,
    case_id     text NOT NULL,
    model       text NOT NULL,
    input       jsonb NOT NULL,
    output      jsonb NOT NULL,
    expected    jsonb,
    score       numeric(4,3),
    latency_ms  integer,
    tokens_in   integer,
    tokens_out  integer,
    notes       text
);
CREATE INDEX ON eval_run (prompt_id, ts);
The schema is deliberately wider than you’d think. We have been bitten too many times by an eval table that turned out to be missing the one field that would have let us answer the actual question. Postgres columns are cheap; storage is cheap; replaying eight months of evals after adding a column is not.
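As a rough illustration of how one execution might be mapped onto this table, here is a minimal sketch. The `EvalRow` dataclass, `INSERT_SQL`, and `to_params` are hypothetical names, not part of the harness described here, and the `%s` placeholders assume a DB-API driver with that parameter style (psycopg, for example). The sketch only builds the statement and its parameters; executing it is left to whatever driver you use.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass
class EvalRow:
    """One (prompt, test case, model version) execution. Hypothetical helper type."""
    prompt_id: str
    case_id: str
    model: str
    input: Any
    output: Any
    expected: Any = None
    score: Optional[float] = None
    latency_ms: Optional[int] = None
    tokens_in: Optional[int] = None
    tokens_out: Optional[int] = None
    notes: Optional[str] = None


# id and ts are omitted so the column defaults in the table fill them in.
INSERT_SQL = (
    "INSERT INTO eval_run "
    "(prompt_id, case_id, model, input, output, expected, "
    "score, latency_ms, tokens_in, tokens_out, notes) "
    "VALUES (%s, %s, %s, %s::jsonb, %s::jsonb, %s::jsonb, %s, %s, %s, %s, %s)"
)


def to_params(row: EvalRow) -> Tuple[Any, ...]:
    """Return the parameter tuple for INSERT_SQL, in column order."""
    if row.score is not None and not 0.0 <= row.score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {row.score}")
    return (
        row.prompt_id,
        row.case_id,
        row.model,
        json.dumps(row.input),
        json.dumps(row.output),
        None if row.expected is None else json.dumps(row.expected),
        # numeric(4,3) keeps three decimal places, so round before sending.
        None if row.score is None else round(row.score, 3),
        row.latency_ms,
        row.tokens_in,
        row.tokens_out,
        row.notes,
    )
```

With a driver such as psycopg, the call would look something like `cursor.execute(INSERT_SQL, to_params(row))`.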
The CI job
The harness runs on every pull request that touches a prompt, a tool, or a model gateway config — and nightly on main. It pulls the latest version of every prompt, runs them against the test set, writes a row per execution into the scoring table, and posts a summary back to the PR. If the score on any prompt drops by more than 5%, the check fails.
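The pass/fail rule above can be sketched in a few lines. This is an illustration under stated assumptions rather than the actual CI job: it treats "the score on a prompt" as the mean score across that prompt's cases, reads the 5% as a relative drop, and compares against a baseline such as the most recent nightly run on main. The function names and the row shape are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Optional, Tuple

MAX_RELATIVE_DROP = 0.05  # assumption: the 5% threshold is a relative drop


def mean_by_prompt(rows: Iterable[Tuple[str, Optional[float]]]) -> Dict[str, float]:
    """Average score per prompt from (prompt_id, score) pairs; unscored rows are skipped."""
    buckets = defaultdict(list)
    for prompt_id, score in rows:
        if score is not None:
            buckets[prompt_id].append(float(score))
    return {prompt_id: mean(scores) for prompt_id, scores in buckets.items()}


def regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    max_drop: float = MAX_RELATIVE_DROP,
) -> Dict[str, float]:
    """Return {prompt_id: relative drop} for prompts that fell by more than max_drop."""
    failed = {}
    for prompt_id, base in baseline.items():
        new = candidate.get(prompt_id)
        # Prompts missing from the candidate run, or with a zero baseline, are skipped.
        if new is None or base == 0:
            continue
        drop = (base - new) / base
        if drop > max_drop:
            failed[prompt_id] = drop
    return failed


def gate(baseline: Dict[str, float], candidate: Dict[str, float]) -> int:
    """Exit code for the CI step: 1 if any prompt regressed, 0 otherwise."""
    failed = regressions(baseline, candidate)
    for prompt_id, drop in sorted(failed.items()):
        print(f"FAIL {prompt_id}: score dropped {drop:.1%}")
    return 1 if failed else 0
```

A CI step could end with `sys.exit(gate(baseline, candidate))` so that a regression fails the check. Whether a prompt missing from the candidate run should also fail the check is a policy choice this sketch leaves open.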
The test set
The test set is the hardest part of an eval harness, and the only part you can’t copy from this post. It is a curated list of input cases, each tagged with an expected output and a scoring function. We start each project with ten cases, hand-written by the team. By the time we ship, the set is in the low hundreds, grown one regression at a time.
Every regression that escapes to production becomes a test case the next morning. That is the only rule.
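One possible shape for a test case, and for the act of adding a regression to the set, is sketched below. The `Case` fields, the JSON Lines storage format, and the helper names are all assumptions made for illustration; the post does not prescribe how cases are stored.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, List


@dataclass(frozen=True)
class Case:
    """A curated test case. Field names are hypothetical."""
    case_id: str
    prompt_id: str
    input: Any
    expected: Any
    scorer: str  # name of the scoring function to apply, e.g. "substring"


def add_case(cases: List[Case], new: Case) -> List[Case]:
    """Return a new list with `new` appended, refusing duplicate case ids."""
    if any(c.case_id == new.case_id for c in cases):
        raise ValueError(f"duplicate case_id: {new.case_id}")
    return cases + [new]


def dump_jsonl(cases: List[Case]) -> str:
    """Serialize cases as one JSON object per line, so diffs stay reviewable."""
    return "\n".join(json.dumps(asdict(c), sort_keys=True) for c in cases)


def load_jsonl(text: str) -> List[Case]:
    """Parse the format written by dump_jsonl, ignoring blank lines."""
    return [Case(**json.loads(line)) for line in text.splitlines() if line.strip()]
```

Keeping the cases in a text file under version control means the case added the morning after a regression shows up in a pull request like any other change.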
Scoring functions are the next-hardest part. Substring match works for half of the cases. Regex works for another quarter. The last quarter needs an LLM-graded rubric, run by a model from a different provider than the one being evaluated. We have written about LLM-graded rubrics in an earlier post; the short version is: write the rubric like you’re grading a junior engineer, not like you’re prompting the model.
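The two simple scorers might look something like the sketch below. The function names, the `SCORERS` registry, and the choice to make substring matching case-insensitive are assumptions for illustration. The LLM-graded rubric is left out because it needs a model call; it could be registered under a third key with the same signature.

```python
import re
from typing import Callable, Dict

# A scorer takes the model output and the case's expected value, returns 0.0 to 1.0.
Scorer = Callable[[str, str], float]


def substring_score(output: str, expected: str) -> float:
    """1.0 if the expected text appears in the output (case-insensitive, an assumption)."""
    return 1.0 if expected.lower() in output.lower() else 0.0


def regex_score(output: str, pattern: str) -> float:
    """1.0 if the pattern matches anywhere in the output."""
    return 1.0 if re.search(pattern, output) else 0.0


# Hypothetical registry: each test case names its scorer by key.
SCORERS: Dict[str, Scorer] = {
    "substring": substring_score,
    "regex": regex_score,
}


def score_case(kind: str, output: str, expected: str) -> float:
    """Look up the scorer named by the test case and apply it."""
    try:
        scorer = SCORERS[kind]
    except KeyError:
        raise ValueError(f"unknown scorer: {kind}") from None
    return scorer(output, expected)
```

For example, `score_case("regex", "Order #12345 shipped", r"#\d{5}\b")` checks that the output contains a five-digit order number without pinning down which one.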
The dashboard
The dashboard is a single page rendered out of the scoring table. The last fifty runs, plotted as a small-multiple grid: x-axis time, y-axis score, one panel per prompt. Hover over any panel and you get the per-case detail. Drift is impossible to miss in this view. It is also hard to miss in a column chart in a spreadsheet, but a spreadsheet is not where your engineers are looking.
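The data behind such a grid can be prepared with a small grouping step, sketched here. It rests on one assumption worth stating loudly: the table has no explicit run identifier, so this sketch treats rows that share a `ts` as belonging to one run. That holds if each run inserts its rows in a single transaction, since Postgres `now()` returns the transaction start time. `build_panels` is a hypothetical name, and the plotting itself is left to whatever charting library you prefer.

```python
from collections import defaultdict
from statistics import mean
from typing import Any, Dict, Iterable, List, Optional, Tuple


def build_panels(
    rows: Iterable[Tuple[str, Any, Optional[float]]],
    last_n: int = 50,
) -> Dict[str, List[Tuple[Any, float]]]:
    """Turn (prompt_id, ts, score) rows into one time series per prompt.

    Each point is (ts, mean score of that run); each series keeps the last_n runs.
    """
    # Assumption: rows with the same (prompt_id, ts) belong to the same run.
    by_run = defaultdict(list)
    for prompt_id, ts, score in rows:
        if score is not None:
            by_run[(prompt_id, ts)].append(float(score))

    panels = defaultdict(list)
    for (prompt_id, ts), scores in by_run.items():
        panels[prompt_id].append((ts, mean(scores)))

    # Sort each panel by time and keep only the most recent runs.
    return {p: sorted(points)[-last_n:] for p, points in panels.items()}
```

Each key in the result corresponds to one panel in the grid, with time on the x-axis and the run's mean score on the y-axis.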
Why this is enough
The fancy version of this harness uses an MLflow-style experiment tracker, a vector store for inputs, an LLM-as-judge service with its own eval suite, and a Grafana panel that pages an on-call engineer when scores drop. We have built all four. We rarely turn them on for a project under a year of operation, because the simple version — Postgres + CI + a page — catches the same problems for one tenth the engineering cost.
If you take one thing from this post, take this: write the harness on day one. Not week three. Not after the first regression. Day one. The cost is half a day of engineering and a Postgres table; the alternative is a customer email three months from now that begins with “I think something changed?”
- A Postgres table — not a CSV file in S3.
- A CI job — not a manual notebook nobody runs.
- A small-multiples dashboard — not a Slack message.
- Curated cases — not synthetic ones from another LLM.
The eval harness is the part of an AI build that earns the system its name. Without it you have a demo.