Why this exists
RAG regressions often appear after prompt edits, model swaps, index rebuilds, or chunking changes. The target is to measure the effect of those changes before merge.
Next checks
- golden-set retrieval recall and faithfulness suites
- per-run latency and cost budget
- score records keyed by commit and dataset hash
- GitHub Actions gate that compares a pull request to a baseline branch
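
A minimal sketch of how the checks above could fit together, not a committed design. Every name here (`recall_at_k`, `dataset_hash`, `ScoreRecord`, `gate`) and every threshold is a hypothetical placeholder, and faithfulness scoring is left out because it depends on a judge that has not been chosen yet.

```python
import hashlib
import json
from dataclasses import dataclass


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant document IDs found in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def dataset_hash(examples: list[dict]) -> str:
    """SHA-256 over a canonical JSON dump, so a score is tied to exact data."""
    payload = json.dumps(examples, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


@dataclass(frozen=True)
class ScoreRecord:
    # One evaluation run, keyed by commit and dataset hash.
    commit: str
    dataset_hash: str
    recall: float
    latency_ms: float
    cost_usd: float


def gate(
    baseline: ScoreRecord,
    candidate: ScoreRecord,
    max_recall_drop: float = 0.02,      # assumed tolerance, not a project value
    latency_budget_ms: float = 2000.0,  # assumed budget
    cost_budget_usd: float = 0.05,      # assumed budget
) -> list[str]:
    """Return the reasons a candidate fails; an empty list means it passes."""
    failures = []
    if baseline.dataset_hash != candidate.dataset_hash:
        failures.append("dataset hash mismatch: scores are not comparable")
    if baseline.recall - candidate.recall > max_recall_drop:
        failures.append(
            f"recall dropped from {baseline.recall:.3f} to {candidate.recall:.3f}"
        )
    if candidate.latency_ms > latency_budget_ms:
        failures.append(f"latency {candidate.latency_ms:.0f} ms exceeds budget")
    if candidate.cost_usd > cost_budget_usd:
        failures.append(f"cost ${candidate.cost_usd:.4f} exceeds budget")
    return failures
```

A CI step could call `gate` with the baseline branch's record and the pull request's record, print the returned reasons, and exit non-zero when the list is not empty. Returning reasons instead of a boolean keeps the failure message readable in the job log.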
Public line
The project goes public once a repo, a reproducible benchmark, and one evaluation run exist.