2026-04-09

Why every Struvo agent ships with an eval set

You don't ship an agent. You ship an agent and its eval set. The eval is the contract.

#evals #struvo #agents #quality

The rule

If a Struvo pipeline doesn't have an eval set, it doesn't ship. No exceptions.

Transcription accuracy. Entity extraction. Report quality. Classification. RAG retrieval. Each one has an eval-datasets/<pipeline>/ folder with golden inputs, expected outputs, and a scoring rubric.

Why this rule exists

Without evals, every model swap is a guess. Gemini 3.5 vs Claude Sonnet vs GPT-5.4 — you can't compare them on real data without a golden set. You're left with vibes.

I refuse to ship "vibes" deployments to customer-facing pipelines. Construction GCs are not in the business of running A/B tests on their inspection reports. They want stability. Evals give us stability we can prove.

The schema

Each eval set has:

  • golden/ — input/output pairs hand-graded by Lucas or my cofounder
  • rubric.md — what counts as correct (with edge cases)
  • score.ts — runs a candidate model against the golden set, returns a score
  • regression.sql — flags any score drop > 5% on previously-shipped models
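
To make the score.ts piece concrete, here is a minimal sketch of what one could look like. It is not the actual Struvo file: the GoldenCase shape, the Candidate and Grader types, and the exact-match grader are all assumptions for illustration. A real rubric.md will usually need a grader with more nuance than string equality.

```typescript
// Hypothetical shape: the post does not say how golden pairs are stored.
interface GoldenCase {
  input: string;
  expected: string;
}

// Assumption: a candidate model is anything that maps an input to an output.
type Candidate = (input: string) => Promise<string>;

// Assumption: a grader returns a value in [0, 1] for a single case.
type Grader = (actual: string, expected: string) => number;

// Simplest possible grader: 1 on an exact match (ignoring outer whitespace), else 0.
const exactMatch: Grader = (actual, expected) =>
  actual.trim() === expected.trim() ? 1 : 0;

// Runs the candidate over every golden case and returns the mean grade (0 to 1).
async function scoreGoldenSet(
  cases: GoldenCase[],
  candidate: Candidate,
  grade: Grader = exactMatch,
): Promise<number> {
  if (cases.length === 0) {
    throw new Error("golden set is empty");
  }
  let total = 0;
  for (const c of cases) {
    const actual = await candidate(c.input);
    total += grade(actual, c.expected);
  }
  return total / cases.length;
}
```

Keeping the grader as a parameter means the same runner can score classification with exact match and report quality with something rubric-driven, without touching the loop.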

Total time to build the first eval set per pipeline: about 4 hours. Time saved later: about 40 hours per pipeline per year.

What you can steal

If you're shipping anything that uses an LLM in production:

  1. Pick the 5 most common inputs your pipeline sees. Hand-grade the outputs. That's your golden set v1.
  2. Write a score.ts that runs the candidate model and returns a number from 0 to 1.
  3. Block deploys on score regression. CI gate, no override.
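
Step 3 can be as small as one comparison. A rough sketch, assuming you keep the last shipped score somewhere CI can read it, and assuming "regression" means a relative drop of more than 5% against that baseline (you may prefer absolute points). The names isRegression and gateExitCode are made up for this example.

```typescript
// Assumption: a regression is a relative drop greater than maxDrop
// (default 5%) against the score of the last shipped model.
function isRegression(
  baseline: number,
  candidate: number,
  maxDrop: number = 0.05,
): boolean {
  if (baseline <= 0) {
    // Nothing to regress from; treat a zero baseline as "no regression".
    return false;
  }
  const relativeDrop = (baseline - candidate) / baseline;
  return relativeDrop > maxDrop;
}

// Returns the exit code a CI step would use: 0 lets the deploy through,
// 1 blocks it. In a Node script you would pass this to process.exit.
function gateExitCode(baseline: number, candidate: number): number {
  return isRegression(baseline, candidate) ? 1 : 0;
}
```

For example, a baseline of 0.90 and a candidate of 0.80 is a drop of about 11%, so the gate returns 1 and the deploy stops. A candidate that matches or beats the baseline returns 0.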

You don't need a perfect eval set. You need ANY eval set. The bar is "better than vibes."