Why every Struvo agent ships with an eval set
You don't ship an agent. You ship an agent and its eval set. The eval is the contract.
The rule
If a Struvo pipeline doesn't have an eval set, it doesn't ship. No exceptions.
Transcription accuracy. Entity extraction. Report quality. Classification. RAG retrieval. Each one has an eval-datasets/<pipeline>/ folder with golden inputs, expected outputs, and a scoring rubric.
Why this rule exists
Without evals, every model swap is a guess. Gemini 3.5 vs Claude Sonnet vs GPT-5.4 — you can't compare them on real data without a golden set. You're left with vibes.
I refuse to ship "vibes" deployments to customer-facing pipelines. Construction GCs are not in the business of running A/B tests on their inspection reports. They want stability. Evals give us stability we can prove.
The schema
Each eval set has:
- golden/: input/output pairs hand-graded by Lucas or my cofounder
- rubric.md: what counts as correct (with edge cases)
- score.ts: runs a candidate model against the golden set, returns a score
- regression.sql: flags any score drop > 5% on previously-shipped models
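To make the score.ts idea concrete, here is a minimal sketch of what a scorer of this shape might look like. It is not the actual Struvo file: the GoldenCase, Candidate, and Grader types, the exactMatch grader, and both function names are assumptions for illustration. A real rubric would likely need a richer grader than exact string match.

```typescript
// Hypothetical shapes; the real golden/ files and score.ts may differ.
interface GoldenCase {
  input: string;
  expected: string;
}

// A candidate model is anything that maps an input to an output.
type Candidate = (input: string) => Promise<string>;

// A grader returns a value in [0, 1]: 1 for correct, 0 for incorrect.
type Grader = (actual: string, expected: string) => number;

// Simplest possible grader (an assumption): exact match after
// trimming whitespace and lowercasing.
const exactMatch: Grader = (actual, expected) =>
  actual.trim().toLowerCase() === expected.trim().toLowerCase() ? 1 : 0;

// Pure scoring step: mean grade over the golden set, in [0, 1].
function scoreOutputs(
  golden: GoldenCase[],
  outputs: string[],
  grade: Grader = exactMatch,
): number {
  if (golden.length === 0) {
    throw new Error("golden set is empty");
  }
  if (outputs.length !== golden.length) {
    throw new Error("one output is required per golden case");
  }
  let total = 0;
  golden.forEach((c, i) => {
    total += grade(outputs[i] ?? "", c.expected);
  });
  return total / golden.length;
}

// Runs the candidate on every golden input, then scores the outputs.
async function scoreCandidate(
  golden: GoldenCase[],
  candidate: Candidate,
  grade: Grader = exactMatch,
): Promise<number> {
  const outputs: string[] = [];
  for (const c of golden) {
    outputs.push(await candidate(c.input));
  }
  return scoreOutputs(golden, outputs, grade);
}
```

Keeping the scoring step separate from the model call means the grader can be tested on its own, without hitting any model API.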
Total time to build the first eval set per pipeline: about 4 hours. Time saved later: about 40 hours per pipeline per year.
What you can steal
If you're shipping anything that uses an LLM in production:
- Pick the 5 most common inputs your pipeline sees. Hand-grade the outputs. That's your golden set v1.
- Write a score.ts that runs the candidate model and returns a number 0-1.
- Block deploys on score regression. CI gate, no override.
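The deploy gate in the last step can be very small. The sketch below is one way to do it, not a description of any particular CI setup: checkRegression, GateResult, and MAX_DROP are hypothetical names, and it assumes a "regression" means the candidate scores more than 0.05 below the last shipped score on the 0-1 scale (an absolute drop, not a relative one). Pick whichever reading you like, but write it down.

```typescript
interface GateResult {
  pass: boolean; // true if the deploy may proceed
  drop: number;  // baseline minus candidate; negative means improvement
}

// Assumption: the threshold is an absolute drop of 0.05 on the 0-1 scale.
const MAX_DROP = 0.05;

// Small tolerance so a drop of exactly 0.05 is not failed by
// floating-point rounding (0.9 - 0.85 is slightly above 0.05).
const EPSILON = 1e-9;

// Compares a candidate score to the last shipped score.
function checkRegression(
  baseline: number,
  candidate: number,
  maxDrop: number = MAX_DROP,
): GateResult {
  for (const s of [baseline, candidate]) {
    if (!Number.isFinite(s) || s < 0 || s > 1) {
      throw new RangeError("scores must be finite numbers in [0, 1]");
    }
  }
  const drop = baseline - candidate;
  return { pass: drop <= maxDrop + EPSILON, drop };
}

// Maps the gate result to a process exit code for a CI step:
// 0 lets the deploy continue, 1 blocks it.
function gateExitCode(result: GateResult): number {
  return result.pass ? 0 : 1;
}
```

In a CI job, the script would compute the candidate score, call checkRegression against the stored baseline, and exit with gateExitCode. A non-zero exit fails the step, which is what makes it a gate rather than a dashboard.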
You don't need a perfect eval set. You need ANY eval set. The bar is "better than vibes."